Operations Guide
Messaging systems often employ real-time monitoring and rapid human intervention to prevent the system from becoming unstable. The design of UM encourages stable operation by allowing you to pre-configure how UM will use resources under all traffic and network conditions. Hence manual intervention is not required when those conditions occur.
Monitoring UM still fills important roles other than maintaining stable operation. Chief among these are capacity planning and a better understanding of the latency added by UM as it recovers from loss. Collecting accumulated statistics from all sources and all receivers once per day is generally adequate for these purposes.
Monitoring can aid different groups within an organization.
Before discussing the monitoring statistics that are built into UM, we mention two things that are probably more important to monitor: connectivity and latency. UM provides some assistance for monitoring these, but the final responsibility rests with your applications.
UM provides the following methods to monitor your UM activities.
Use the Ultra Messaging System Monitoring Option to monitor the components of an Ultra Messaging deployment, such as application hosts, transports, topic resolution domains, and application instances. The System Monitoring Option uses its own user interface. You purchase the Ultra Messaging System Monitoring Option separately.
The UM API contains functions that retrieve various statistics for a context, event queue, source or receiver. This section lists the functions and constructors you can use to retrieve statistics, along with the data structures UM uses to deliver the statistics. Refer to the UM API documentation (UM C API, UM Java API or UM .NET API) for specific information about the functions and constructors. Links to the data structures appear in the tables to provide quick access to the specific statistics available.
Context statistics help you monitor topic resolution activity, along with the number of unknown messages received and the number of sends and responses that were blocked or returned EWOULDBLOCK. Context statistics also contain transport statistics for Multicast Immediate Messaging (MIM) activity and transport statistics for all the sources or receivers in a context.
C API Function | Java or .NET API Constructor | Data Structure |
---|---|---|
lbm_context_retrieve_stats() | LBMContextStatistics() | lbm_context_stats_t |
lbm_context_retrieve_rcv_transport_stats() | LBMReceiverStatistics() | lbm_rcv_transport_stats_t |
lbm_context_retrieve_src_transport_stats() | LBMSourceStatistics() | lbm_src_transport_stats_t |
lbm_context_retrieve_im_rcv_transport_stats() | LBMMIMReceiverStatistics() | lbm_rcv_transport_stats_t |
lbm_context_retrieve_im_src_transport_stats() | LBMMIMSourceStatistics() | lbm_src_transport_stats_t |
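As a quick sketch of how these retrieval functions are used, the following prints a couple of context statistics. It assumes ctx is an existing lbm_context_t; the field names shown are drawn from the statistics described above and should be verified against lbm_context_stats_t in the API documentation.

```c
#include <stdio.h>
#include <lbm/lbm.h>   /* header path may vary by installation */

void print_context_stats(lbm_context_t *ctx)
{
    lbm_context_stats_t stats;

    /* Retrieve the context's accumulated statistics. */
    if (lbm_context_retrieve_stats(ctx, &stats) != 0) {
        fprintf(stderr, "lbm_context_retrieve_stats: %s\n", lbm_errmsg());
        return;
    }
    printf("TR datagrams sent: %lu\n", (unsigned long)stats.tr_dgrams_sent);
    printf("Sends returning EWOULDBLOCK: %lu\n", (unsigned long)stats.send_would_block);
}
```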
Event Queue statistics help you monitor the number of events currently on the queue, how long it takes to service them (maximum, minimum and mean service times) and the total number of events for the monitoring period. These statistics are available for the following types of events.
To collect Event Queue statistics, you must enable the event queue configuration options queue_age_enabled (event_queue), queue_count_enabled (event_queue), and queue_service_time_enabled (event_queue). UM disables these options by default, so no event queue statistics are produced unless you enable them.
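In a UM configuration file, enabling them looks something like this (a sketch; check the configuration option reference for exact syntax):

```
event_queue queue_age_enabled 1
event_queue queue_count_enabled 1
event_queue queue_service_time_enabled 1
```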
C API Function | Java or .NET API Constructor | Data Structure |
---|---|---|
lbm_event_queue_retrieve_stats() | LBMEventQueueStatistics() | lbm_event_queue_stats_t |
You can retrieve transport statistics for different types of transports. In addition, you can limit these transport statistics to a specific source sending on a particular transport, or to a specific receiver receiving messages over the transport. LBT-RM source statistics, for example, include the number of message datagrams sent and the number of retransmissions sent. LBT-RM receiver statistics include, for example, the number of message datagrams received and the number of UM messages received.
C API Function | Java or .NET API Constructor | Data Structure |
---|---|---|
lbm_rcv_retrieve_transport_stats() | LBMReceiverStatistics() | lbm_rcv_transport_stats_t |
lbm_rcv_retrieve_all_transport_stats() | LBMReceiverStatistics() | lbm_rcv_transport_stats_t |
lbm_src_retrieve_transport_stats() | LBMSourceStatistics() | lbm_src_transport_stats_t |
The UM Monitoring API (see lbmmon.h, or the LBMMonitor classes in the Java API and the .NET API) provides a framework to easily gather UM transport statistics and send them to a monitoring or reporting application. You can monitor the transport sessions for individual sources and receivers, or all transport sessions in a given context. This API can be implemented in one of two ways.
An application requesting transport monitoring is called a monitor source, and an application accepting statistics is a monitor receiver. These monitoring objects deal only with transport session statistics and should not be confused with UM sources and UM receivers, which deal with UM messages. Statistics for both UM sources and UM receivers can be forwarded by a monitor source application.
A monitor source and a monitor receiver each comprise three modules:
You can substitute format and transport modules of your own choosing or creation. UM Monitoring provides the following sample modules:
To view the source code for all LBMMON transport modules, see LBMMON Example Source Code found on the Related Pages tab in the C Application Programmer's Interface.
The overall process flow appears in the diagram below.
Your application calls only the functions in the controller modules, which in turn call the appropriate functions in the transport and format modules.
The segregation of UM Monitoring into control, format, and transport modules provides flexibility for monitor receivers in two ways.
As an example, assume you have a Perl application which currently gathers statistics from other network applications (or, you are simply most comfortable working in Perl for such tasks). There is no Perl binding for UM. However, Perl can handle UDP packets very nicely, and can pick apart CSV data easily. By implementing a UDP transport module to be used by the monitor sources, your Perl application can read the UDP packets and process the statistics.
If you can answer the following questions, you're already on your way.
The following sections present more discussion and sample source code about starting monitor sources, monitor receivers and the LBMMON format and transport modules.
The following examples demonstrate how to use the UM Monitoring API to enable monitoring in your application.
First, create a monitoring source controller:
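The following is a minimal sketch based on the C API in lbmmon.h; error handling is abbreviated, and exact signatures should be verified against the header.

```c
#include <stdio.h>
#include <stdlib.h>
#include <lbmmon.h>   /* UM monitoring API; header path may vary */

lbmmon_sctl_t *monctl;

/* Create a monitoring source controller using the default CSV format
 * module and the default UM (LBM) transport module. */
if (lbmmon_sctl_create(&monctl,
                       lbmmon_format_csv_module(), NULL,
                       lbmmon_transport_lbm_module(), NULL) != 0)
{
    fprintf(stderr, "lbmmon_sctl_create() failed\n");
    exit(EXIT_FAILURE);
}
```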
This and the following examples tacitly assume that the ctx, src, and rcv variables have been previously assigned via the appropriate UM API calls.
The monitoring source controller object must be passed to subsequent calls to reference a specific source controller. One implication of this is that it is possible to have multiple monitoring source controllers within a single application, each perhaps monitoring a different set of objects.
In the above example, the default CSV format module and default UM transport module are specified via the provided module functions lbmmon_format_csv_module() and lbmmon_transport_lbm_module().
Once a monitoring source controller is created, the application can monitor a specific context using:
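A sketch, assuming the parameter order (controller, object, application ID string, interval in seconds):

```c
/* Gather and send statistics for all transport sessions on ctx every
 * 10 seconds; NULL omits the optional application ID string. */
if (lbmmon_context_monitor(monctl, ctx, NULL, 10) != 0)
{
    fprintf(stderr, "lbmmon_context_monitor() failed\n");
}
```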
The above example indicates that statistics for all transports on the specified context will be gathered and sent every 10 seconds.
A UM source can be monitored using:
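Following the same (assumed) pattern:

```c
/* Monitor the transport session carrying src, every 10 seconds. */
if (lbmmon_src_monitor(monctl, src, NULL, 10) != 0)
{
    fprintf(stderr, "lbmmon_src_monitor() failed\n");
}
```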
Finally, a UM receiver can be monitored using:
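And for the receiver side:

```c
/* Monitor the transport session(s) feeding rcv, every 10 seconds. */
if (lbmmon_rcv_monitor(monctl, rcv, NULL, 10) != 0)
{
    fprintf(stderr, "lbmmon_rcv_monitor() failed\n");
}
```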
The two above examples also request that statistics for all transports on the specified source or receiver be gathered and sent every 10 seconds.
Statistics can also be gathered and sent in an on-demand manner. Passing 0 for the Seconds parameter to lbmmon_context_monitor(), lbmmon_src_monitor(), or lbmmon_rcv_monitor() prevents the automatic gathering and sending of statistics. To trigger the gather/send process, use:
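A sketch using lbmmon_sctl_sample() from lbmmon.h:

```c
/* Perform one gather/send pass over all objects registered as
 * on-demand (interval of 0). */
if (lbmmon_sctl_sample(monctl) != 0)
{
    fprintf(stderr, "lbmmon_sctl_sample() failed\n");
}
```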
Such a call will perform a single gather/send action on all monitored objects (contexts, sources, and receivers) which were registered as on-demand.
As part of application cleanup, the created monitoring objects should be destroyed. Each individual object can be de-registered using lbmmon_context_unmonitor(), lbmmon_src_unmonitor(), or lbmmon_rcv_unmonitor(). Finally, the monitoring source controller can be destroyed using:
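For example:

```c
/* Destroy the controller; objects still registered are de-registered
 * automatically. */
if (lbmmon_sctl_destroy(monctl) != 0)
{
    fprintf(stderr, "lbmmon_sctl_destroy() failed\n");
}
```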
Any objects which are still registered will be automatically de-registered by lbmmon_sctl_destroy().
To make use of the statistics, an application must be running that receives the monitor data. This application creates a monitoring receive controller and specifies callback functions, which are called upon receipt of source or receiver statistics data.
Use the following to create a monitoring receive controller:
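The sketch below registers source and receiver statistics callbacks through an attribute object. The attribute functions and option constants shown (lbmmon_rctl_attr_create(), LBMMON_RCTL_SOURCE_CALLBACK, LBMMON_RCTL_RECEIVER_CALLBACK) are our best reading of lbmmon.h; treat them as assumptions and confirm against the header.

```c
lbmmon_rctl_t *monctl;
lbmmon_rctl_attr_t *attr;
lbmmon_src_statistics_func_t src_cb = { handle_src_stats };  /* defined below */
lbmmon_rcv_statistics_func_t rcv_cb = { handle_rcv_stats };

lbmmon_rctl_attr_create(&attr);
lbmmon_rctl_attr_setopt(attr, LBMMON_RCTL_SOURCE_CALLBACK, &src_cb, sizeof(src_cb));
lbmmon_rctl_attr_setopt(attr, LBMMON_RCTL_RECEIVER_CALLBACK, &rcv_cb, sizeof(rcv_cb));

if (lbmmon_rctl_create(&monctl,
                       lbmmon_format_csv_module(), NULL,
                       lbmmon_transport_lbm_module(), NULL,
                       attr) != 0)
{
    fprintf(stderr, "lbmmon_rctl_create() failed\n");
    exit(EXIT_FAILURE);
}
lbmmon_rctl_attr_delete(attr);
```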
As in the earlier example, the default CSV format module and default UM transport module are specified via the provided module functions lbmmon_format_csv_module() and lbmmon_transport_lbm_module().
As an example of minimal callback functions, consider the following example:
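A minimal sketch; the callback signature (attribute block, statistics pointer, client data) follows the lbmmon convention and should be checked against the typedefs in lbmmon.h.

```c
void handle_src_stats(const void *attributes, const lbm_src_transport_stats_t *stats, void *clientd)
{
    /* A real handler would switch on the transport type in *stats and
     * record fields of interest (e.g., naks_rcved, rxs_sent). */
    printf("received source transport statistics\n");
}

void handle_rcv_stats(const void *attributes, const lbm_rcv_transport_stats_t *stats, void *clientd)
{
    printf("received receiver transport statistics\n");
}
```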
Upon receipt of a statistics message, the appropriate callback function is called. The application can then use the statistics data as desired, for example writing it to a file or database, or performing calculations.
Beyond the actual statistics, several additional pieces of data are sent with each statistics packet. These data are stored in an attribute block and are accessible via the lbmmon_attr_get_*() functions. Currently, these data include the IPv4 address of the machine that sent the statistics data, the timestamp (as a time_t) at which the statistics were generated, and the application ID string supplied by the sending application at the time the object was registered for monitoring. See lbmmon_attr_get_ipv4sender(), lbmmon_attr_get_timestamp(), and lbmmon_attr_get_appsourceid() for more information.
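Inside a callback, the attribute block might be queried as follows. The parameter types of these accessors are assumptions for illustration; check the API documentation before relying on them.

```c
lbm_uint_t sender;     /* IPv4 address of the sending machine */
time_t     timestamp;  /* when the statistics were generated */
char       app_id[256];

if (lbmmon_attr_get_ipv4sender(attributes, &sender) == 0)
    printf("sender: 0x%08lx\n", (unsigned long)sender);
if (lbmmon_attr_get_timestamp(attributes, &timestamp) == 0)
    printf("generated: %s", ctime(&timestamp));
if (lbmmon_attr_get_appsourceid(attributes, app_id, sizeof(app_id)) == 0)
    printf("application ID: %s\n", app_id);
```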
The lbmmon library comes with three pre-written transport modules: LBM, UDP, and SNMP, described in the following sections.
The LBM transport module understands several options which may be used to customize your use of the module. The options are passed via the TransportOptions parameter to the lbmmon_sctl_create() and lbmmon_rctl_create() functions, as a null-terminated string containing semicolon-separated name/value pairs.
The lbmmon.c example program's command-line option "--transport-opts" is used for the TransportOptions string for the LBM transport module.
The following name/value pairs are available:
If no topic is specified, the default topic /29west/statistics is used. As an example, assume your application needs to use a special configuration file for statistics. The following call allows your application to customize the UM transport module using the configuration file stats.cfg.
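A sketch, reusing the creation call from earlier with the options string in the transport-options parameter:

```c
rc = lbmmon_sctl_create(&monctl,
                        lbmmon_format_csv_module(), NULL,
                        lbmmon_transport_lbm_module(),
                        "config=stats.cfg");
```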
If your application also needs to use a specific topic for statistics, the following code specifies that, in addition to the configuration file, the topic StatisticsTopic be used for statistics.
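Multiple options are joined with semicolons; topic is assumed to be the option name, per the name/value convention above.

```c
rc = lbmmon_sctl_create(&monctl,
                        lbmmon_format_csv_module(), NULL,
                        lbmmon_transport_lbm_module(),
                        "config=stats.cfg;topic=StatisticsTopic");
```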
It is important to use the same topic and configuration for both monitor sources and receivers. Otherwise your applications may send the statistics, but the monitor receiver won't be able to receive them.
To view the source code for the LBM transport module, see Source code for lbmmontrlbm.c.
The UDP transport module understands several options which may be used to customize your use of the module. The options are passed via the TransportOptions parameter to the lbmmon_sctl_create() and lbmmon_rctl_create() functions, as a null-terminated string containing semicolon-separated name/value pairs.
The lbmmon.c example program's command-line option "--transport-opts" is used for the TransportOptions string for the UDP transport module.
The UDP module supports sending and receiving via UDP unicast, UDP broadcast, and UDP multicast. The following name/value pairs are available:
To view the source code for the UDP transport module, see Source code for lbmmontrudp.c.
The SNMP transport module operates in identical fashion to the LBM transport module. Some configuration options are hard-coded to be compatible with the UM SNMP Agent component.
To view the source code for the SNMP transport module, see Source code for lbmmontrlbmsnmp.c.
The lbmmon library comes with one pre-written format module: the CSV format module.
The CSV format module sends the statistics as simple comma-separated values. Options are passed via the FormatOptions parameter to the lbmmon_sctl_create() and lbmmon_rctl_create() functions, as a null-terminated string containing semicolon-separated name/value pairs.
The lbmmon.c example program's command-line option "--format-opts" is used for the FormatOptions string for the CSV format module.
The following name/value pairs are available:
To view the source code for the CSV format module, see Source code for lbmmonfmtcsv.c.
Instead of building a monitoring capability into your application with the UM Monitoring API, you can use automatic monitoring, which produces the same monitoring statistics when you set a few simple UM configuration options. Automatic monitoring does not require any changes to your application. See Automatic Monitoring Options for more information.
You can enable Automatic Monitoring for either or both of the following.
You can also set environment variables to turn on automatic monitoring for all UM contexts (transports and event queues).
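For illustration, automatic context monitoring might be enabled in a UM configuration file along these lines; the option names follow the Automatic Monitoring Options reference and should be verified there.

```
context monitor_interval 10
context monitor_transport lbm
context monitor_transport_opts config=stats.cfg;topic=StatisticsTopic
```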
This section demonstrates the use of the two UM monitoring example applications described in the C examples. We present advice based on what we have seen customers productively monitor and on our own knowledge of which transport statistics might be of interest. Of course, what you choose to monitor depends on your needs, so weigh these suggestions against your own requirements to determine what is best for you.
The example application lbmmon.c acts as a Monitor Receiver and is provided in both executable and source form. It writes monitoring statistics to the screen and can be used in conjunction with other example applications (which act as the Monitor Sources). The following procedure uses lbmrcv and lbmsrc to create messaging traffic, and uses a configuration file to specify the LBT-RM transport instead of the TCP default. (The LBT-RM transport displays more statistics than TCP.)
Since UM does not generate monitoring statistics by default, you must activate monitoring in your application. For the example applications, use the --monitor-ctx=n option, where n is the number of seconds between reports. The following procedure activates monitoring on the receiver, specifying the context (ctx) to create a complete set of receiver statistics. You could activate monitoring in a similar fashion on the source to create source statistics.
To use lbmmon to view statistics from sample application output:

1. Run lbmmon: lbmmon --transport-opts="config=LBTRM.cfg"
2. Run lbmrcv: lbmrcv -c LBTRM.cfg --monitor-ctx="5" Arizona
3. Run lbmsrc: lbmsrc -c LBTRM.cfg Arizona
After lbmsrc completes, the final output for lbmmon should closely resemble the following:
Notes:

- The --transport-opts command-line option is a string passed to the desired transport module. The format of this string depends on the specific transport module selected (defaults to "lbm"). See Monitoring Transport Modules for details.
- The --format-opts command-line option is a string passed to the desired format module. The format of this string depends on the specific format module selected (defaults to "csv"). See Monitoring Format Modules for details.
- This procedure activates monitoring on the receiver's context (--monitor-ctx) to create a complete set of receiver transport statistics. You could activate monitoring in a similar fashion on the source and create source statistics. Each set of statistics shows one side of the transmission. For example, source statistics contain information about NAKs received by the source (ignored, shed, retransmit delay, etc.), where receiver statistics contain data about NCFs received. Each view can be helpful.
- To monitor an individual receiver or source rather than its context, use --monitor-rcv or --monitor-src.
The example application lbmmonudp.c receives UM statistics and forwards them as CSV data over a UDP transport. The Perl script lbmmondiag.pl can read the UDP packets and process the statistics, reporting Severity 1 and Severity 2 events. This script reports only on LBT-RM transports.
To run lbmmonudp with lbmmondiag.pl, use the following procedure.

1. Run lbmmonudp: lbmmonudp -a 127.0.0.1 --transport-opts="config=LBTRM.cfg"
2. Run lbmrcv: lbmrcv -c LBTRM.cfg --monitor-ctx="5" Arizona
3. Run lbmsrc: lbmsrc -c LBTRM.cfg Arizona
4. Run lbmmondiag.pl.
The following sections discuss some of the possible results of this procedure. Your results will vary depending on conditions in your network and on whether you run the procedure on a single machine.
Severity 1 — Monitoring Unrecoverable Loss
The most severe system problems are often due to unrecoverable datagram loss at the reliable transport level. These are reported as severity 1 events by the lbmmondiag.pl example script. Many of the scalability and latency benefits of UM come from the use of reliable transport protocols like LBT-RM and LBT-RU. These protocols provide loss detection, retransmission, and recovery up to the limits specified by an application. Unrecoverable loss is reported by the transport when loss repair is impossible within the specified limits.
Unrecoverable transport loss often (but not always) leads to unrecoverable message loss so it is very significant to applications that benefit from lossless message delivery.
Unrecoverable loss can be declared by receivers when the transport_lbtrm_nak_generation_interval (receiver) has ended without receipt of repair. Each such loss event is recorded by incrementing the unrecovered_tmo field in lbm_rcv_transport_stats_t. Output from lbmmondiag.pl might look like this:
Unrecoverable loss can also be triggered at receivers by notice from a source that the lost datagram has passed out of the source's transmission window. Each such loss event is recorded by incrementing the unrecovered_txw field in lbm_rcv_transport_stats_t. Output from lbmmondiag.pl might look like this:
Severity 2 — Monitoring Rate Controller Activity
The data and retransmission rate controllers built into LBT-RM provide for stable operation under all traffic conditions. These rate controllers introduce some latency at the source since that is generally preferable to the alternative of NAK storms or other unstable states. The lbmmondiag.pl example script reports this activity as a severity 2 event since latency is normally the only effect of their operation.
Activity of the rate controller indicates that a source tried to send faster than the configured transport_lbtrm_data_rate_limit (context). Normally, this limit is set to the speed of the fastest receivers. Sending faster than this rate would induce loss in all receivers so it is generally best to add latency at the source or avoid sending in such situations.
The current number of datagrams queued by the rate controller is given in the rctlr_data_msgs field in lbm_src_transport_stats_t. No more than 10 datagrams are ever queued. Output from lbmmondiag.pl might look like this:
Activity of the retransmission rate controller indicates that receivers have requested retransmissions in excess of the configured transport_lbtrm_retransmit_rate_limit (context). Retransmissions beyond the limit are delayed, which bounds the latency that retransmissions can add to messages being sent for the first time. This behavior avoids NAK storms.
The current number of datagrams queued by the retransmission rate controller is given in the rctlr_rx_msgs field in lbm_src_transport_stats_t. No more than 101 datagrams are ever queued. Output from lbmmondiag.pl might look like this:
Severity 2 — Monitoring Loss Recovery Activity for a Receiver
It is important to monitor loss recovery activity because it always adds latency if the loss is successfully repaired. UM defaults generally provide for quite a bit of loss recovery activity before loss would become unrecoverable. Statistics on such activity are maintained at both the source and receiver. Unrecoverable loss will normally be preceded by a burst of such activity.
UM receivers measure the amount of time required to repair each loss detected. For each transport session, an exponentially weighted moving average is computed from repair times and the maximum and minimum times are tracked.
The total number of losses detected appears in the lost field in lbm_rcv_transport_stats_t. It may be multiplied by the average repair time given in the nak_stm_mean field in lbm_rcv_transport_stats_t to estimate the amount of latency that was added to repair loss. This is probably the single most important metric to track for those interested in minimizing repair latency. The lbmmondiag.pl script reports this whenever the lost field changes and the average repair time is nonzero. Output might look like this:
See the ordered_delivery (receiver) option to control this.
In addition to counting losses detected, UM reliable receivers also count the number of NAKs generated in the naks_sent field in lbm_rcv_transport_stats_t. Output from lbmmondiag.pl might look like this:
Those who are new to reliable multicast protocols are sometimes surprised to learn that losses detected do not always lead to NAK generation. If a datagram is lost in the network close to the source, it is common for many receivers to detect loss simultaneously when a datagram following the loss arrives. Scalability would suffer if all receivers that detected loss reported it by generating a NAK at the same time. To improve scalability, a random delay is added to NAK generation at each receiver. Since retransmissions are multicast, often only one NAK is generated to repair the loss for all receivers. Thus it is common for the number of losses detected to be much larger than the number of NAKs sent, especially when there are many receivers with similar loss patterns.
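To make the lost * nak_stm_mean arithmetic above concrete, here is a sketch for one LBT-RM transport session. The union member name (transport.lbtrm) and the type constant are assumptions to verify in lbm.h, and the estimate is in whatever time unit nak_stm_mean is documented to use.

```c
#include <stdio.h>
#include <lbm/lbm.h>   /* header path may vary */

void report_repair_latency(const lbm_rcv_transport_stats_t *stats)
{
    if (stats->type == LBM_TRANSPORT_STAT_LBTRM) {
        const lbm_rcv_transport_stats_lbtrm_t *rm = &stats->transport.lbtrm;

        /* losses detected x mean repair time ~= latency added by repair */
        double est = (double)rm->lost * (double)rm->nak_stm_mean;
        printf("lost=%lu naks_sent=%lu estimated repair latency=%.0f\n",
               (unsigned long)rm->lost, (unsigned long)rm->naks_sent, est);
    }
}
```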
Severity 2 — Monitoring Loss Recovery Activity for a Source
For sources, the principal concern is often understanding how much the retransmission of messages already sent at least once slowed down the source. Obviously, bandwidth and CPU time spent servicing retransmission requests cannot be used to send new messages. This is the way that lossy receivers add latency for lossless receivers.
UM sources track the number of NAKs received in the naks_rcved field in lbm_src_transport_stats_t. The number of datagrams that they retransmit to repair loss is recorded in the rxs_sent field in lbm_src_transport_stats_t.
The number of retransmitted datagrams may be multiplied by the average datagram size and divided by the wire speed to estimate the amount of latency added to new messages by retransmission. Output from the example lbmmondiag.pl script might look like this:
LBT-RM sources maintain many statistics that can be useful in diagnosing reliable multicast problems. See the UM API documentation lbm_src_transport_stats_lbtrm_t Structure Reference for a description of the fields. The remainder of this section gives advice on interpreting these statistics; a short sketch following the list shows how to compute the suggested ratios.
Divide naks_rcved by msgs_sent to find the likelihood that sending a message resulted in a NAK being received. Expect no more than a few percent on a network with reasonable loss levels.
Divide rxs_sent by msgs_sent to find the ratio of retransmissions to new data. Many NAKs arriving at a source will cause many retransmissions.
Divide naks_shed by naks_rcved to find the likelihood that excessive NAKs were ignored. Consider reducing loss to avoid NAK generation.
Divide nak_pckts_rcved by naks_rcved to find whether NAKs tended to arrive individually (a value near 1) or grouped in single packets (a value near 0). Individual NAKs often indicate sporadic loss, while grouped NAKs often indicate burst loss.
Divide naks_rx_delay_ignored by naks_ignored to find the likelihood that NAKs arrived during the ignore interval following a retransmission. The configuration option transport_lbtrm_ignore_interval (source) controls the length of this interval.
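A sketch that computes the ratios above; the field names are the ones cited in this section, and the guards avoid division by zero.

```c
#include <stdio.h>
#include <lbm/lbm.h>   /* header path may vary */

void diagnose_lbtrm_source(const lbm_src_transport_stats_lbtrm_t *rm)
{
    if (rm->msgs_sent > 0) {
        printf("NAKs received per msg sent:   %.4f\n",
               (double)rm->naks_rcved / (double)rm->msgs_sent);
        printf("Retransmissions per msg sent: %.4f\n",
               (double)rm->rxs_sent / (double)rm->msgs_sent);
    }
    if (rm->naks_rcved > 0) {
        printf("NAKs shed ratio:              %.4f\n",
               (double)rm->naks_shed / (double)rm->naks_rcved);
        /* near 1: NAKs arrived individually; near 0: grouped in packets */
        printf("NAK packets per NAK:          %.4f\n",
               (double)rm->nak_pckts_rcved / (double)rm->naks_rcved);
    }
    if (rm->naks_ignored > 0) {
        printf("Ignore-interval NAK share:    %.4f\n",
               (double)rm->naks_rx_delay_ignored / (double)rm->naks_ignored);
    }
}
```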