Operations Guide
Monitoring

It is important to monitor Ultra Messaging to ensure smooth operation. By tracking the changes in UM statistics over time, you may be able to predict and avoid future overloads. When contacting support to report anomalous behavior, recording UM statistics can greatly assist the support engineers' root cause analysis.


Monitoring Transport Statistics  <-

Monitoring the activity on your UM transport sessions is the most important component of your UM monitoring effort. UM provides the following four methods to monitor your UM activities.

  • Use UM API function calls within your applications to retrieve statistics and deliver them to your monitoring application.
  • Use the UM Monitoring API to more easily retrieve and send statistics to your monitoring application.
  • Use Automatic Monitoring to easily employ the UM Monitoring API to monitor UM activity at the UM context level.
  • Use the Ultra Messaging SNMP Agent and MIB (purchased separately) to monitor statistics through a Network Management System. See The Ultra Messaging SNMP Agent for detailed information.

Automatic Monitoring is the easiest method to implement, using configuration options or environment variables. Because multiple topics can share a single transport session, UM Monitoring does not provide transport information for individual topics. From an Operations point of view, however, the health and behavior of your transport sessions correlate more closely with system performance. Although UM Monitoring also provides statistics on event queues, these are specific to a single application rather than an indication of system-wide health.
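
For example, a minimal sketch of a UM configuration file that turns on Automatic Monitoring might look like the following. The monitor_interval and monitor_transport options are standard Automatic Monitoring context options; the 30-second interval is only illustrative, and the environment-variable names shown in the comment are assumptions to verify against your UM Configuration Guide.

    # UM configuration file fragment: enable Automatic Monitoring for this context.
    # Transport statistics for every transport session in the context are published
    # automatically at the configured interval via the built-in "lbm" monitoring transport.
    context monitor_interval 30
    context monitor_transport lbm

    # The same behavior can typically be requested without editing a configuration
    # file by setting environment variables before starting the application
    # (names assumed; verify against your release), for example:
    #   LBM_MONITOR_INTERVAL=30
    #   LBM_MONITOR_TRANSPORT=lbm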

The interval for collecting statistics should be as short as practical. A too-long interval can hide microbursts of traffic. However, a too-short interval can lead to massive amounts of statistical data which needs to be stored and processed.

Note that certain statistics are initialized to the maximum unsigned value for the fields, i.e. all bits set (-1 if printed signed). This special value indicates that the field has not yet been calculated. This is used for the "min" statistic in a "minimum / maximum" statistics pair. For example, nak_tx_min is initialized to the maximum unsigned long, while nak_tx_max is initialized to zero.
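
As a minimal illustration of handling this sentinel in a monitoring application (plain unsigned long values are used here; the actual statistics structures and field types are defined by the UM API):

    #include <stdio.h>

    /* Report a minimum/maximum statistics pair such as nak_tx_min / nak_tx_max,
     * treating an all-bits-set "min" value as "not yet calculated". */
    static void report_min_max(const char *name, unsigned long min_val, unsigned long max_val)
    {
        if (min_val == (unsigned long)-1) {
            /* The "min" field still holds its initialization sentinel: no samples yet. */
            printf("%s: no samples yet (max=%lu)\n", name, max_val);
        } else {
            printf("%s: min=%lu max=%lu\n", name, min_val, max_val);
        }
    }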

This section lists some of the more important transport statistics to monitor, organized by transport type.


LBT-RM and LBT-RU Receiver Statistics  <-

Essentially, aside from msgs_rcved and bytes_rcved, if any receiver statistics increment, a problem may exist. The following lists the most important statistics.

  • naks_sent - indicates the transport detected a gap in sequence numbers; the resulting loss may be recoverable or unrecoverable.
  • unrecovered_txw and unrecovered_tmo - loss statistics indicating retransmissions that were not delivered to the receiver. (The receiving application will have received an LBM_MSG_UNRECOVERABLE_LOSS or LBM_MSG_UNRECOVERABLE_LOSS_BURST event via its receive callback; the corresponding log message should be found in the streaming or API log file.)
  • lbm_msgs_no_topic_rcved - indicates that receivers may be doing too much topic filtering (wasting CPU resource) because they are processing messages in which they have no interest. If this statistic is greater than 25% of msgs_rcved, a problem may exist, or topics may need to be distributed to different transport sessions (see the sketch following this list).
  • dgrams_dropped_* - indicates the reception of invalid datagrams, e.g. a non-UM datagram or a datagram from an incompatible version.
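
The following sketch shows the kind of checks a monitoring application might apply to these counters each time a statistics sample is collected. The field values are passed in as plain numbers, so the sketch is independent of how you retrieve them (UM Monitoring API, Automatic Monitoring output, or your own collector); the 25% threshold is the rule of thumb from the list above.

    #include <stdio.h>

    /* Receiver transport statistics of interest for one sample (names follow the
     * LBT-RM/LBT-RU receiver statistics discussed above). */
    struct rcv_sample {
        unsigned long msgs_rcved;
        unsigned long lbm_msgs_no_topic_rcved;
        unsigned long naks_sent;
        unsigned long unrecovered_txw;
        unsigned long unrecovered_tmo;
        unsigned long dgrams_dropped;     /* sum of the dgrams_dropped_* counters */
    };

    /* Compare the current sample against the previous one and flag anything that
     * incremented; aside from msgs_rcved and bytes_rcved, increments usually
     * deserve attention. */
    static void check_rcv_sample(const struct rcv_sample *prev, const struct rcv_sample *cur)
    {
        if (cur->naks_sent > prev->naks_sent)
            printf("WARNING: naks_sent increased (%lu -> %lu): sequence gap detected\n",
                   prev->naks_sent, cur->naks_sent);

        if (cur->unrecovered_txw > prev->unrecovered_txw ||
            cur->unrecovered_tmo > prev->unrecovered_tmo)
            printf("ERROR: unrecoverable loss reported on this transport session\n");

        if (cur->dgrams_dropped > prev->dgrams_dropped)
            printf("WARNING: invalid datagrams were dropped\n");

        /* Rule of thumb from above: filtered (no-topic) messages above 25 percent of
         * all messages received suggests redistributing topics across transport sessions. */
        if (cur->msgs_rcved > 0 && cur->lbm_msgs_no_topic_rcved * 4 > cur->msgs_rcved)
            printf("NOTICE: more than 25%% of received messages carried no subscribed topic\n");
    }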

For additional information, see Monitoring Receiving Statistics.


LBT-RM and LBT-RU Source Statistics  <-

The most important source statistics to monitor mirror the receiver statistics above: NAK-related counters indicate that receivers are requesting retransmissions and therefore experiencing loss.

For additional information, see Monitoring Sending Statistics.


TCP Statistics  <-

Receiver statistic lbm_msgs_no_topic_rcved indicates that receivers may be doing too much topic filtering (wasting CPU resource) because they are processing messages in which they have no interest. If this statistic is greater than 25% of the total messages received, a problem may exist, or topics may need to be distributed to different transport sessions.


LBT-IPC Statistics  <-

Receiver statistic lbm_msgs_no_topic_rcved indicates that receivers may be doing too much topic filtering (wasting CPU resource) because they are processing messages in which they have no interest. If this statistic is greater than 25% of the total messages received, a problem may exist, or topics may need to be distributed to different transport sessions.


Monitoring Event Queues  <-

The following lists the most important statistics.

  • data_msgs & events - total data messages and events currently enqueued; check that these do not grow beyond pre-defined bounds.
  • age_mean & age_max - if an application uses a receive-side event queue for message delivery rather than direct callbacks, these indicate the average and longest times messages wait on that queue before the application starts processing them.
  • data_msgs_svc_mean & data_msgs_svc_max - indicate the average and longest times the application spends processing each event-queued message (see the sketch following this list).
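
A sketch along the same lines for event queue statistics follows. The field names match the statistics listed above; the thresholds are placeholders to replace with your own pre-defined bounds, and the times are assumed here to be in microseconds (check the event queue statistics reference for the actual units).

    #include <stdio.h>

    /* Illustrative thresholds; replace with the bounds agreed with your UM
     * development team. Times are assumed to be in microseconds. */
    #define MAX_ENQUEUED        10000UL
    #define MAX_WAIT_USEC       500000UL    /* 500 ms */

    static void check_event_queue(unsigned long data_msgs, unsigned long events,
                                  unsigned long age_max, unsigned long data_msgs_svc_max)
    {
        if (data_msgs > MAX_ENQUEUED || events > MAX_ENQUEUED)
            printf("WARNING: event queue depth growing (data_msgs=%lu events=%lu)\n",
                   data_msgs, events);

        if (age_max > MAX_WAIT_USEC)
            printf("WARNING: a message waited %lu us on the queue before processing\n", age_max);

        if (data_msgs_svc_max > MAX_WAIT_USEC)
            printf("WARNING: slow callback: longest per-message service time was %lu us\n",
                   data_msgs_svc_max);
    }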

For more information, see Monitoring Event Queue Statistics.


Monitoring Application Log Messages  <-

UM returns log messages to your application when conditions warrant. Your applications can decide which messages to collect and log. Most UM development teams are concerned with the efficiency and performance of their applications and therefore log any messages that UM returns to them. It may be helpful to meet with your UM development team to learn exactly what they log and how best to monitor the log files. Ideally, your UM development team includes a timestamp when logging a message, which is important when correlating disparate data, such as CPU information with transport statistics.
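
For example, a minimal sketch of such a logging callback, registered with the UM C API's lbm_log() function, might look like the following (POSIX time functions are used for the timestamp; consult lbm.h for the exact callback signature in your release):

    #include <stdio.h>
    #include <time.h>
    #include <lbm/lbm.h>

    /* UM log callback that prefixes each message with a wall-clock timestamp so that
     * UM log entries can later be correlated with other data such as CPU statistics.
     * (localtime_r is POSIX; use localtime_s or equivalent on Windows.) */
    static int app_log_cb(int level, const char *message, void *clientd)
    {
        char tbuf[32];
        time_t now = time(NULL);
        struct tm tm_now;

        localtime_r(&now, &tm_now);
        strftime(tbuf, sizeof(tbuf), "%Y-%m-%d %H:%M:%S", &tm_now);
        fprintf(stderr, "%s UM[%d]: %s\n", tbuf, level, message);
        return 0;
    }

    /* Register the callback early in application startup, before creating contexts:
     *     lbm_log(app_log_cb, NULL);
     */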

See the UM Log Messages section for individual messages and descriptions.

UM daemons (lbmrd, umestored, tnwgd) automatically log messages to the log files specified in their XML configuration files.


Monitoring the Persistent Store Daemon (umestored)  <-

With the UMP/UMQ products, the Persistent Store provides persistence services to UM sources and receivers. Multiple stores are typically configured in Quorum/Consensus. Monitor every store process.

Monitor the following for all stores.

  • Store log files
  • Application events and log files
  • Store's internal transport statistics
  • Persistent Store daemon web monitor


Monitoring Store Log File  <-

The store generates log messages that are used to monitor its health and operation. There is some flexibility on where those log messages are written; see Store Log File. Each store daemon should have its own log file.

To prevent unbounded disk file growth, the Persistent Store log file can be configured to automatically roll. See Store Rolling Logs for more information.

The following lists critical things to monitor in a store log file:

  • aio_warnings - may indicate a problem with the disk (disk full, cannot write, etc.)
  • Proxy source creation - indicates that a source 'went away'. This may be fine, but could also indicate an error condition. Discuss with your UM development team when this event is safe and when it indicates a problem.
  • Rapid log file growth - Log files growing rapidly, or much more rapidly than normal, may indicate a problem. Look at what types of messages are being written to the log at higher-than-normal rates to see where the issue might be.

In application log files, look for LBM_SRC_EVENT_UME_REGISTRATION_ERROR messages. These can indicate many different problems that will prevent message persistence. See the UM Log Messages section for details.
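
As a sketch of where such a log line comes from, an application's source event callback might handle the event like this (the callback signature follows the UM C API source event callback; the handling shown is illustrative and ignores the event data):

    #include <stdio.h>
    #include <lbm/lbm.h>

    /* Source event callback passed to lbm_src_create(). Logging the registration
     * error here is what makes it visible in the application log file. */
    static int src_event_cb(lbm_src_t *src, int event, void *ed, void *cd)
    {
        switch (event) {
        case LBM_SRC_EVENT_UME_REGISTRATION_ERROR:
            /* Event data ('ed') carries details; see the UM API docs for its format. */
            fprintf(stderr, "UME registration error reported for this source\n");
            break;
        default:
            break;
        }
        return 0;
    }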


Monitoring a Store's Internal Transport Statistics  <-

Since umestored is a proprietary UM application developed with the UM API library, you can configure the daemon with automatic monitoring and then access transport statistics for the daemon's internal sources and receivers. To accomplish this, follow the procedure below.

  1. Enable Automatic Monitoring in the UM configuration file cited in the umestored XML configuration file's "<daemon>" element.
  2. For each store configured in the umestored XML configuration file, add a "<context-name>" element. Automatic Monitoring then maintains complete transport statistics for each store at the interval set in the UM configuration file. (A sketch of both pieces follows this procedure.)
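
As a sketch of what this looks like, using a UM configuration file such as the one shown under Automatic Monitoring above (only the <daemon> and <context-name> elements come from the procedure; the surrounding element and attribute names are assumptions to verify against the umestored configuration reference):

    <!-- umestored XML configuration file (fragment) -->
    <daemon>
      <!-- cite the UM configuration file that enables Automatic Monitoring
           (element name assumed) -->
      <lbm-config>/path/to/monitoring.cfg</lbm-config>
    </daemon>

    <stores>   <!-- enclosing element name assumed -->
      <store name="store1" port="14567">
        <!-- name this store's context so its transport statistics can be identified -->
        <context-name>store1-ctx</context-name>
      </store>
    </stores>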


Persistent Store Web Monitoring  <-

For information about umestored statistics see Store Web Monitor.

The web address of the Store Web Monitor is configured in the store XML configuration file. See <daemon>.

You can monitor the following information on the umestored Web Monitor:

  • List of stores the daemon is running.
  • List of topics and wildcard topic patterns for each store, along with registration IDs for the sources sending on the topics.
  • Source and receiver information for each topic.
  • Ultra Messaging statistics or transport-level statistics for underlying receivers in the store. These are similar to the transport statistics mentioned earlier; however, they indicate how the store is communicating with its sources for a given topic. For example, a non-zero naks_sent value indicates that the store is experiencing some loss.

TIP: You can build a script that executes the Linux wget command at a 5-second interval to capture a snapshot of a web monitor page and save it to a directory or file.


Persistent Store Daemon Statistics  <-

The Persistent Store daemon has a simple web server which provides operational information. However, while the web-based presentation is convenient for manual, on-demand monitoring, it is not suitable for automated collection and recording of operational information for historical analysis.

Starting with UM version 6.11, a feature called "Daemon Statistics" has been added to the Store daemon. This feature supports the background publishing of the daemon's operational information via UM messages. System designers can now subscribe to this information for their own automated monitoring systems.

See Store Daemon Statistics for general information on Daemon Statistics, followed by specific information regarding the Store.


Detecting Persistent Store Failures  <-

You can detect the loss of a store with the following.

  • Loss of the Persistent Store's Process ID (PID)
  • Application log messages stating the loss of connection to the store

Stores can also be "too busy" to service sending and receiving applications. Sources declare a store inactive with the LBM_SRC_EVENT_UME_STORE_UNRESPONSIVE event when the store's activity timeout expires. This can be caused by the following.

  • The disk is too busy (or the system starts swapping)
  • The store is processing an overly-large load of recovery traffic. You may want to recommend that UM administrators consider a larger quorum / consensus group size.


Monitoring the UM Router Daemon (tnwgd)  <-

The Ultra Messaging UM Router links disjoint topic resolution domains by forwarding multicast and/or unicast topic resolution traffic, ensuring that receivers on the "other" side of the UM Router receive the topics to which they subscribe. See the UM Dynamic Routing Guide for more details.

Understand UM Router (tnwgd) output traffic and WAN impacts - especially the use of rate limiters.

  • WAN overrun is the number one source of UM Router problems
  • Test WAN link throughput to determine the real limits of the UM Router and environment
  • Make sure WAN bandwidth can cope with UM and any other traffic

Review and understand loss conditions unique to using a UM Router. Collaborate with your UM development team to ensure the correct tuning and configurations are applied for your messaging system. Also routinely monitor latency across the UM Router with the UM sample application lbmpong and review its output.

Monitor the following for UM Routers.

  • UM Router log files
  • Application events and log files
  • UM Router internal transport statistics
  • UM Router daemon web monitor


Monitoring UM Router Log File  <-

The UM router generates log messages that are used to monitor its health and operation. There is some flexibility on where those log messages are written; see UM Router Log Messages. Each UM router should have its own log file.

To prevent unbounded disk file growth, the UM Router log file can be configured to automatically roll. See UM Router Rolling Logs for more information.

The following are important UM Router (tnwgd) log messages.

Connection Failure Messages to Monitor:

  • peer portal [name] failed to connect to peer at [IP:port] via [interface]

  • peer portal [name] failed to accept connection (accept) [err]: reason

Lost Connection Messages to Monitor:

  • peer portal [name] lost connection to peer at [IP:port] via [interface]
  • peer portal [name] connection destroyed due to socket failure
  • peer portal [name] detected dropped inbound connection (read) [err]: reason
  • peer portal [name] detected dropped inbound connection (zero-len read)

Peer Messages to Monitor:

Dual TCP:

  • peer portal [name] received connection from [IP:port]
  • peer portal [name] connected to [IP:port]

Single TCP:

  • Acceptor: peer portal [name] received connection from [IP:port]
  • Initiator: peer portal [name] connected to [IP:port]


UM Router Transport Statistics  <-

Using the "<monitor>" element in a UM Router's XML configuration file, you can monitor the transport activity between the UM Router and its Topic Resolution Domain. The configuration also provides Context and Event Queue statistics. The statistics output identifies individual portals by name.


UM Router Web Monitoring  <-

The UM Router web monitor provides access to a UM Router's portal and general statistics and information. The UM Router XML configuration file contains the location of the gateway web monitor. The default port is 15305.

A UM Router Web Monitor provides a web page for each endpoint and peer portal configured for the UM Router. Peer portals connect UM Routers and communicate only with other peer portals. Endpoint portals communicate with topic resolution domains. Each statistic displays a value for units (messages or fragments) and bytes.

Important statistics you can monitor on the tnwgd Web Monitor include the following.

Endpoint Send Statistics

Increases in the Endpoint Send Statistics values indicate errors and problems. A separate statistic appears for each of the three topic message types: transport topic, immediate topic, and immediate topicless.

  • Fragments/bytes dropped due to blocking - Indicates inability to send due to a transport's rate controller. Message rates on other portals probably exceed the rate controller limit on the monitored portal. The UM Router's XML configuration file may need to be adjusted.
  • Fragments/bytes dropped due to error - Indicates a possible network socket or memory failure.
  • Fragments/Bytes Dropped Due To Fragment Size Error - Indicates a configuration error which should be corrected. Maximum datagram size for all transports must be the same throughout the network. Nonzero indicates fragments were received which were larger than the egress portal's maximum datagram size.
  • Current/maximum data bytes enqueued - Indicates how much data is currently queued, and the maximum amount that has been queued, because the incoming rate exceeded what the TCP connection could handle. Queuing results in a latency penalty. The queue size is limited, so if the limit is exceeded, messages are dropped due to blocking.

Peer Send Statistics

Increases in the Peer Send Statistics values indicate errors and problems.

  • Fragments/bytes (or messages/bytes) dropped (blocking) - The result of attempting to send too much data via the peer link.
  • Fragments/bytes (or messages/bytes) dropped (not operational) - Peer connection not yet fully established. The UM Router peer could be down or starting up.
  • Current/maximum data bytes enqueued - Indicates how much data is currently queued, and the maximum amount that has been queued, because the incoming rate exceeded what the TCP connection could handle. Queuing results in a latency penalty. The queue size is limited, so if the limit is exceeded, messages are dropped due to blocking.
  • Messages or bytes Received / Fragments or bytes Forwarded - Increasing counters indicate communicating peers. Stagnant counters indicate a lack of traffic flow: a sender could be down, receivers on the remote side could have no interest in the topics, or the peer connection could have failed.


UM Router Daemon Statistics  <-

The UM Router daemon has a simple web server which provides operational information. However, while the web-based presentation is convenient for manual, on-demand monitoring, it is not suitable for automated collection and recording of operational information for historical analysis.

Starting with UM version 6.11, a feature called "Daemon Statistics" has been added to the UM Router. This feature supports the background publishing of the daemon's operational information via UM messages. System designers can now subscribe to this information for their own automated monitoring systems.

See UM Router Daemon Statistics for general information on Daemon Statistics, followed by specific information regarding the UM Router.


Detecting UM Router Failures  <-

You can detect the loss of a UM Router by the following.

  • Loss of the UM Router's Process ID (PID)
  • Loss of the UM Router's Web Monitor (you can poll the UM Router's Web Monitor to be sure it is accessible.)
  • Monitoring the performance of applications sending messages through the UM Router.
    • Are applications receiving the appropriate volume of data?
    • Do you see a high number of retransmissions?
    • Are applications generating the expected number of actions? Understanding the expected flow and actions is critical and requires collaboration with your UM development team.
  • Monitoring network performance and behavior in and out of the UM Router. Understanding your network topology and the expected network traffic through the UM Router is critical and requires collaboration with your UM development team.


Monitoring Messaging System Resources  <-

In addition to monitoring UM activity, you must also consider the health and activity of your system resources.

  • CPU usage
  • Memory Usage
  • UDP loss (netstat -s)
  • Latency


Persistent Store System Considerations  <-

Consider the following system issues regarding Persistent Store monitoring.

  • Make sure that the environment in which a Persistent Store daemon (umestored) is started has enough available file descriptors for the number of sources in your environment. UM uses a minimum of 2 file descriptors per UM source in addition to normal UM file descriptors for transports and other objects. You can use ulimit on Linux to check the file descriptor limit, and Process Explorer on Microsoft® Windows® to monitor file handles.

    Note: The reduced-fd repository type uses 5 File Descriptors for the entire store, regardless of the number of topics, in addition to normal UM file descriptors for transports and other objects. Use of this repository type may impact performance.

  • Monitor system resources (CPU usage, memory, disk space, wait%, memory swapping).
  • If the system is about to start swapping, your resources are insufficient for the required system performance. Reconfiguration and/or additional resources will be required.


Sources of Latency  <-

The following are common sources of latency.

  • Loss and recovery
  • Slow receivers
  • Wildcard receivers with overly broad interest patterns
  • High resource utilization
  • 'Busy' applications - messages backed up in event queues. Your UM Development Team can tell you if your UM applications use event queues.


Runtime Diagnostics  <-

Use the following to validate a healthy system.

  • UM monitoring metrics are active as a sign of liveness
  • Pre-defined thresholds are not breached in the monitoring systems
  • Application logs are clear of errors/warnings
  • Required processes are running, e.g. lbmrd
  • General system resources are within pre-defined bounds, e.g. CPU, memory, network statistics (specific to the applications)
  • Operating system statistics, e.g. UDP buffer drops, for loss detection

Use the following to validate the system is operating within acceptable limits.

  • Monitor memory usage and growth over time.
    • Applications with increasing memory could indicate a future problem
    • Could indicate apps are misconfigured for required scalability
    • Event queue growth (also UM metrics)
    • Theoretical memory limits for 32-bit/64-bit processes, dependent on OS and language choice.
  • Spikes in CPU usage across multiple systems indicate a system wide event and could be an indication of a "crybaby" receiver causing source retransmissions or a rogue wildcard receiver.
  • Monitor network activity across the environment.
    • Switch failures / unplugged cable
    • Network Interface Card (NIC) failures
    • Symptoms of NIC bonding failure
    • Significant changes in overall network traffic could indicate a problem such as loss (discussed later)
  • Look for correlated activity. Do CPU spikes and network spikes lead or lag each other?
  • Build thresholds based on an established business as usual (BAU) baseline.
  • These diagnostics and UM metrics could indicate a general problem with the applications, network or underlying hardware.