Ultra Messaging Knowledge Base

Monitoring Notes

• Monitoring Notes
• Introduction
• Library Statistics
    • Context Statistics
    • Receiver Statistics
        • Transport LBT-RM Receiver
        • Transport LBT-RU Receiver
    • Source Statistics
        • Transport LBT-RM Source
        • Transport LBT-RU Source
• Store Statistics
• DRO Statistics
• SRS Statistics

Introduction

Best practices for monitoring an Ultra Messaging network are discussed at length here. To summarize:

You should monitor your hosts (memory, CPU, network utilization) and your network (packet drops). UM doesn't help you with this; there are many third-party tools available.
As part of network monitoring, many customers run "always on" packet capture, like Corvil.
UM components, like the Store and DRO, generate log files. These should be monitored for problems.
When an application uses UM, that UM instance can also generate logs. Your application has a responsibility to save those logs somewhere. Some customers save them to a database, others just write them to a disk file.
Your application is also delivered events from UM. Most of these events, including unexpected events, should be logged; for example, receiver BOS and EOS events.
The UM library maintains internal statistics. These should be collected and saved. The recommended method of doing this is to configure your applications to publish their statistics periodically to a central collector. UM components, like the Store and DRO, also run on top of the UM library, and they should also be configured to publish their statistics to the central statistics collector.
UM components, like the Store and DRO, also generate "daemon statistics", which is information specific to the component type. For example, the Store publishes Store statistics, and the DRO publishes DRO statistics. These should also be published to the central statistics collector.

Library Statistics

The UM Library maintains statistics related to contexts, sources, receivers, and event queues that the application creates. Note that the statistics are organized by transport session, not individual topic.

There are a large number of counters maintained for each transport session being used. Many of those fields will be of direct interest to customers (e.g. "lost"), while many others are typically only interesting to UM Support engineers. We request that you save all of the fields, not just the ones you are especially interested in. (Some of our customers write the whole record in the form of a BLOB, and specific fields of special interest as separate columns.)
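
The "whole record as a BLOB, plus selected columns" approach could be sketched as follows. This is only an illustration, assuming a collector that has already decoded a statistics record into a Python dict. The table name rm_rcv_stats, its column layout, the session label, and the use of JSON as the blob encoding are all hypothetical choices, not part of UM.

```python
import json
import sqlite3
import time

def open_stats_db(path=":memory:"):
    """Open a database with one row per statistics sample (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS rm_rcv_stats (
            sample_time REAL NOT NULL,
            session     TEXT NOT NULL,   -- label for the transport session
            msgs_rcved  INTEGER,         -- fields of special interest
            lost        INTEGER,
            record      BLOB NOT NULL    -- the complete record, every field
        )
        """
    )
    return db

def save_sample(db, session, record, sample_time=None):
    """Store the full record as a blob and copy key fields into columns."""
    if sample_time is None:
        sample_time = time.time()
    blob = json.dumps(record, sort_keys=True).encode("utf-8")
    db.execute(
        "INSERT INTO rm_rcv_stats VALUES (?, ?, ?, ?, ?)",
        (sample_time, session, record.get("msgs_rcved"), record.get("lost"), blob),
    )
    db.commit()

def load_records(db, session):
    """Return every saved record for a session, decoded back into dicts."""
    rows = db.execute(
        "SELECT record FROM rm_rcv_stats WHERE session = ? ORDER BY sample_time",
        (session,),
    )
    return [json.loads(row[0]) for row in rows]
```

The separate columns make alerting queries cheap, while the blob keeps the fields that UM Support engineers may ask for later.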

The fields referenced below are of particular interest to customers. We recommend that customers trigger alerts to operations staff if any of the field values are abnormal.

Please click each link to the documentation for information about the field. For example, the documentation will tell if the field is "cumulative" (most of them) or a "snapshot" (a few).

The documentation links below are for the C API. The same statistics and descriptions apply for Java and .NET, although the field names will be different. For example, the C field tr_dgrams_dropped_ver is the same as the Java/.NET getter topicResolutionDatagramsDroppedVersion().

Context Statistics

The following context statistics should be proactively monitored by customers:

tr_dgrams_dropped_ver, tr_dgrams_dropped_type, tr_dgrams_dropped_malformed - None of these counters should be greater than zero; alert the operator if they are. If any of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.
tr_rcv_unresolved_topics - This field is a snapshot of the number of topic subscriptions that have not discovered any sources. This value can be non-zero in normal operation during the initial topic resolution phase, but it typically should not remain non-zero for any significant time. A long-lasting non-zero value might indicate topic deafness, possibly due to a failure of topic resolution, and should trigger an alert to the operator.
lbtrm_unknown_msgs_rcved, lbtru_unknown_msgs_rcved - Messages received over a transport session that was not subscribed to. For LBT-RM, it typically means that a multicast address:port is overloaded across multiple publishers. Ideally these counters should either be zero or be growing only slowly. Fast growth typically means sub-optimal configuration of publishers' transport sessions, and should trigger an alert.
fragments_unrecoverably_lost - This is supposed to be a count of unrecoverable loss seen at the delivery controller. However, as of UM version 6.16, this field will under-count the unrecoverable loss. In fact, it can be zero even though multiple unrecoverable loss events have been delivered. Nevertheless, if this value is greater than zero, it should trigger an investigation.
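
Because tr_rcv_unresolved_topics is a snapshot rather than a cumulative counter, the useful test is how long it has stayed non-zero, not how much it has grown. A minimal sketch of that check follows. The class name, the 30-second limit in the usage note, and the idea of feeding it one sample at a time are assumptions for illustration; choose a limit that comfortably exceeds your normal initial topic resolution time.

```python
class PersistentNonZero:
    """Flags a snapshot value that stays non-zero longer than max_seconds."""

    def __init__(self, max_seconds):
        self.max_seconds = max_seconds
        self.nonzero_since = None  # time of the first non-zero sample in the current run

    def update(self, value, now):
        """Record one sample; return True if an alert should be raised."""
        if value == 0:
            # Everything resolved: reset the timer.
            self.nonzero_since = None
            return False
        if self.nonzero_since is None:
            self.nonzero_since = now
        return (now - self.nonzero_since) > self.max_seconds
```

For example, with watch = PersistentNonZero(30), a value that is non-zero at startup does not alert, but one that is still non-zero more than 30 seconds later does.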

Receiver Statistics

Transport types LBT-RM and LBT-RU require the most proactive monitoring. Other transport types typically do not require active monitoring.

Transport LBT-RM Receiver

msgs_rcved - This shows how many datagrams have been received on the transport session since its creation. Note that it is an aggregate across all topics mapped to that transport session. There is no abnormal value, but tracking message load over time can be useful for detecting increases in traffic, which can eventually lead to overload and data loss.
lost - The number of datagrams on the transport session not received when they were supposed to be. See Packet Loss for an analysis of why packets are lost. Note that a lost packet that is successfully recovered via NAK/retransmission will still be counted as a lost packet. It should be an aspirational goal to keep this counter at zero, but few of our customers can fully achieve that. Since this is a cumulative statistic, we advise calculating a loss rate based on the previous value, and alerting if the rate is above a "normal" threshold. That threshold should match your application requirements, and might be anywhere from a few per hour to a few per second.
unrecovered_txw, unrecovered_tmo - Datagrams that were lost and could not be recovered by NAK/retransmission. (Note that OTR might still recover messages that could not be recovered from the transport.) Neither of these counters should be greater than zero; alert the operator if they are. If either of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.
lbm_msgs_no_topic_rcved - Number of messages received for topics that the receiving application has not subscribed to. An application might subscribe to a topic that is carried on a transport session with ten topics mapped to it. The nine topics on the transport session that the application is not subscribed to will increment this counter. If this value is a significant percentage of the lbm_msgs_rcved value, the topic/transport mappings should be examined.
dgrams_dropped_size, dgrams_dropped_type, dgrams_dropped_version, dgrams_dropped_hdr, dgrams_dropped_other - These represent received datagrams that do not parse correctly. None of these counters should be greater than zero; alert the operator if they are. If any of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.
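
The loss-rate calculation suggested for the cumulative lost counter can be sketched as below. This is an illustrative helper, not part of UM: the function name and the sample values are hypothetical, and the handling of a counter that goes backwards (treated here as a new transport session that started again from zero) is an assumption you should adapt to your collector.

```python
def counter_rate(prev_value, prev_time, cur_value, cur_time):
    """Return events per second between two samples of a cumulative counter."""
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = cur_value - prev_value
    if delta < 0:
        # Assumption: the counter restarted from zero (for example, a new
        # transport session), so everything counted so far is new.
        delta = cur_value
    return delta / elapsed

def rate_exceeds(rate_per_second, threshold_per_second):
    """True if the observed rate is above the site-specific 'normal' threshold."""
    return rate_per_second > threshold_per_second
```

For instance, if lost was 100 at one sample and 160 a minute later, counter_rate(100, 0.0, 160, 60.0) gives one lost datagram per second, which would exceed a threshold of a few per hour but might be acceptable at a site that tolerates a few per second.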

Transport LBT-RU Receiver

msgs_rcved - This shows how many datagrams have been received on the transport session since its creation. Note that it is an aggregate across all topics mapped to that transport session. There is no abnormal value, but tracking message load over time can be useful for detecting increases in traffic, which can eventually lead to overload and data loss.
lost - The number of datagrams on the transport session not received when they were supposed to be. See Packet Loss for an analysis of why packets are lost. Note that a lost packet that is successfully recovered via NAK/retransmission will still be counted as a lost packet. It should be an aspirational goal to keep this counter at zero, but few of our customers can fully achieve that. Since this is a cumulative statistic, we advise calculating a loss rate based on the previous value, and alerting if the rate is above a "normal" threshold. That threshold should match your application requirements, and might be anywhere from a few per hour to a few per second.
unrecovered_txw, unrecovered_tmo - Datagrams that were lost and could not be recovered by NAK/retransmission. (Note that OTR might still recover messages that could not be recovered from the transport.) Neither of these counters should be greater than zero; alert the operator if they are. If either of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.
lbm_msgs_no_topic_rcved - Number of messages received for topics that the receiving application has not subscribed to. An application might subscribe to a topic that is carried on a transport session with ten topics mapped to it. The nine topics on the transport session that the application is not subscribed to will increment this counter. If this value is a significant percentage of the lbm_msgs_rcved value, the topic/transport mappings should be examined.
dgrams_dropped_size, dgrams_dropped_type, dgrams_dropped_version, dgrams_dropped_hdr, dgrams_dropped_sid, dgrams_dropped_other - These represent received datagrams that do not parse correctly. None of these counters should be greater than zero; alert the operator if they are. If any of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.
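
The comparison of lbm_msgs_no_topic_rcved against lbm_msgs_rcved described above amounts to a simple percentage. A minimal sketch follows; the function names and the 50 percent limit are hypothetical, since what counts as "significant" depends on your deployment.

```python
def no_topic_percentage(lbm_msgs_no_topic_rcved, lbm_msgs_rcved):
    """Return no-topic messages as a percentage of lbm_msgs_rcved."""
    if lbm_msgs_rcved == 0:
        return 0.0  # nothing received yet, so nothing to compare
    return 100.0 * lbm_msgs_no_topic_rcved / lbm_msgs_rcved

def mappings_need_review(lbm_msgs_no_topic_rcved, lbm_msgs_rcved, limit_percent=50.0):
    """True if the unwanted share is above a site-chosen limit (hypothetical default)."""
    return no_topic_percentage(lbm_msgs_no_topic_rcved, lbm_msgs_rcved) > limit_percent
```

With 900 no-topic messages against an lbm_msgs_rcved value of 1000, the result is 90 percent, which suggests the receiver is discarding most of what the transport session delivers and the topic/transport mappings deserve a look.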

Source Statistics

Transport types LBT-RM and LBT-RU require the most proactive monitoring. Other transport types typically do not require active monitoring.

Transport LBT-RM Source

msgs_sent - This shows how many datagrams have been sent on the transport session since its creation. Note that it is an aggregate across all topics mapped to that transport session. There is no abnormal value, but tracking message load over time can be useful for detecting increases in traffic, which can eventually lead to overload and data loss.
naks_rcved - Number of individual datagrams that were NAKed by receivers. See Packet Loss for an analysis of why packets are lost. It should be an aspirational goal to keep this counter at zero, but few of our customers can fully achieve that. Since this is a cumulative statistic, we advise calculating a NAK rate based on the previous value, and alerting if the rate is above a "normal" threshold. That threshold should match your application requirements, and might be anywhere from a few per hour to a few per second.
naks_ignored, naks_shed, naks_rx_delay_ignored - Numbers of different types of NCFs sent. NCFs are sent when the publisher is required to refuse a retransmission request. None of these counters should be greater than zero; alert the operator if they are. If any of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.

Transport LBT-RU Source

msgs_sent - This shows how many datagrams have been sent on the transport session since its creation. Note that it is an aggregate across all topics mapped to that transport session. There is no abnormal value, but tracking message load over time can be useful for detecting increases in traffic, which can eventually lead to overload and data loss.
naks_rcved - Number of individual datagrams that were NAKed by receivers. See Packet Loss for an analysis of why packets are lost. It should be an aspirational goal to keep this counter at zero, but few of our customers can fully achieve that. Since this is a cumulative statistic, we advise calculating a NAK rate based on the previous value, and alerting if the rate is above a "normal" threshold. That threshold should match your application requirements, and might be anywhere from a few per hour to a few per second.
naks_ignored, naks_shed, naks_rx_delay_ignored - Numbers of different types of NCFs sent. NCFs are sent when the publisher is required to refuse a retransmission request. None of these counters should be greater than zero; alert the operator if they are. If any of these counts increases, it is usually due to interference from a non-UM packet source. Contact UM Support.
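
Several of the fields above share the same rule: any value greater than zero should alert the operator. One way a collector might apply that rule is sketched below. It assumes each statistics record has already been decoded into a Python dict keyed by field name; the function names are hypothetical, and the list of fields to watch is supplied by the caller.

```python
def nonzero_counters(record, fields):
    """Return the watched fields whose value is greater than zero."""
    return {name: record[name] for name in fields if record.get(name, 0) > 0}

def increased_counters(prev_record, cur_record, fields):
    """Return the watched fields that grew since the previous sample."""
    grown = {}
    for name in fields:
        before = prev_record.get(name, 0)
        after = cur_record.get(name, 0)
        if after > before:
            grown[name] = after - before
    return grown
```

Checking for growth between samples, in addition to a plain non-zero test, avoids repeating the same alert on every sample once a cumulative counter has moved off zero.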

Store Statistics

TBD

DRO Statistics

TBD

SRS Statistics

The SRS also publishes statistics related to its operation. These normally do not need to be proactively monitored. If you have problems with TCP-based topic resolution, the statistics might be useful for diagnosing them.

See Notices for important information.