Operations Guide
It is important to monitor Ultra Messaging to ensure smooth operation. By tracking the changes in UM statistics over time, you may be able to predict and avoid future overloads. When contacting support to report anomalous behavior, recording UM statistics can greatly assist the support engineers' root cause analysis.
Monitoring the activity on your UM transport sessions is the most important component of your UM monitoring effort. UM provides the following four methods to monitor your UM activities.
Automatic Monitoring is the easiest method to implement, using configuration options or environment variables. Since many topics can use multiple transport sessions, UM Monitoring doesn't provide transport information for individual topics. From an Operations point of view, however, the health and behavior of your transport sessions are more closely correlated with system performance. Although UM Monitoring also provides statistics on event queues, these statistics are more specific to a single application and are not a system-wide health indication.
The interval for collecting statistics should be as short as practical. An interval that is too long can hide microbursts of traffic. However, an interval that is too short can produce massive amounts of statistical data that must be stored and processed.
Note that certain statistics are initialized to the maximum unsigned value for the fields, i.e. all bits set (-1 if printed signed). This special value indicates that the field has not yet been calculated. This is used for the "min" statistic in a "minimum / maximum" statistics pair. For example, nak_tx_min is initialized to the maximum unsigned long, while nak_tx_max is initialized to zero.
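The special "not yet calculated" value described above is easy to mishandle in a monitoring script, where it can show up as an enormous minimum. The following is a minimal sketch of one way to filter it out. It assumes the statistic is a 64-bit unsigned long (pass bits=32 where the field is 32 bits wide); the function names are illustrative and are not part of the UM API.

```python
def unset_sentinel(bits: int = 64) -> int:
    """All bits set: the value a "min" statistic holds before it is calculated."""
    return (1 << bits) - 1


def normalize_min(value: int, bits: int = 64):
    """Return None if the "min" statistic has not been calculated yet.

    Assumption: the field width is `bits` (64 by default).
    """
    # Mask first so a value that was printed as signed (-1) is also recognized.
    masked = value & unset_sentinel(bits)
    return None if masked == unset_sentinel(bits) else masked


print(normalize_min(-1))  # None (sentinel printed as signed)
print(normalize_min(3))   # 3
```

A collector can then skip or blank out a "min" value of None instead of charting it.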
This section lists some of the more important transport statistics to monitor, organized by transport type.
Essentially, aside from msg_rcved and bytes_rcved, if any receiver statistics increment, a problem may exist. The following lists the most important statistics.
For additional information, see Monitoring Receiving Statistics.
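The rule of thumb above (any receiver statistic other than msg_rcved and bytes_rcved incrementing may signal a problem) can be applied by comparing two successive statistics samples. The sketch below assumes each sample has already been collected into a dictionary of counter names to values. The counter names "lost" and "naks_sent" in the example are used for illustration only, and the statistic names msg_rcved and bytes_rcved are spelled as they appear in the text above.

```python
# Counters that are expected to grow during normal operation.
EXPECTED_TO_GROW = {"msg_rcved", "bytes_rcved"}


def suspicious_increments(previous: dict, current: dict) -> dict:
    """Return {statistic: delta} for counters that grew between two samples
    but are not in the expected-to-grow set."""
    deltas = {}
    for name, value in current.items():
        if name in EXPECTED_TO_GROW:
            continue
        delta = value - previous.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


# Example samples (illustrative values and counter names).
before = {"msg_rcved": 100, "bytes_rcved": 5000, "lost": 0, "naks_sent": 2}
after = {"msg_rcved": 250, "bytes_rcved": 9000, "lost": 3, "naks_sent": 2}
print(suspicious_increments(before, after))  # {'lost': 3}
```

Working with deltas rather than absolute values keeps an old, already investigated increment from raising the same alert on every sample.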
The following lists the most important statistics.
For additional information, see Monitoring Sending Statistics.
Receiver statistic lbm_msgs_no_topic_rcvd indicates that receivers may be doing too much topic filtering (wasting CPU resource) because they are processing messages in which they have no interest. If this statistic is greater than 25% of msgs_rcvd, a problem may exist or topics may need to be distributed to different transport sessions.
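The 25% guideline above is a simple ratio check. The following sketch shows one way to compute it, assuming the two counters have already been read from a statistics sample; the function names are illustrative and are not part of the UM API.

```python
def no_topic_ratio(lbm_msgs_no_topic_rcvd: int, msgs_rcvd: int) -> float:
    """Fraction of received messages for which the receiver had no interest."""
    if msgs_rcvd == 0:
        return 0.0  # nothing received yet, so nothing to flag
    return lbm_msgs_no_topic_rcvd / msgs_rcvd


def too_much_filtering(lbm_msgs_no_topic_rcvd: int, msgs_rcvd: int,
                       threshold: float = 0.25) -> bool:
    """True when the no-topic share exceeds the threshold (25% by default)."""
    return no_topic_ratio(lbm_msgs_no_topic_rcvd, msgs_rcvd) > threshold


print(too_much_filtering(30, 100))  # True
print(too_much_filtering(25, 100))  # False (not greater than 25%)
```

Applying the check to the change in each counter over the last interval, rather than to lifetime totals, makes it more responsive to a recent change in topic distribution.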
The following lists the most important statistics.
For more information, see Monitoring Event Queue Statistics.
UM returns log messages to your application when conditions warrant. Your applications can decide what messages to collect and log. Most UM development teams are concerned with the efficiency and performance of their applications and therefore will log any messages returned to their applications from UM. It may be helpful to meet with your UM development team to learn exactly what they log and how best to monitor the log files. Ideally your UM development team includes a timestamp when they log a message, which can be important for comparison of disparate data, such as CPU information to transport statistics.
See the UM Log Messages section for individual messages and descriptions.
UM daemons (lbmrd, umestored, tnwgd) automatically log messages to the log files specified in their XML configuration files.
With the UMP/UMQ products, the Persistent Store provides persistence services to UM sources and receivers. Multiple stores are typically configured in Quorum/Consensus. Monitor every store process.
Monitor the following for all stores.
The store generates log messages that are used to monitor its health and operation. There is some flexibility on where those log messages are written; see Store Log File. Each store daemon should have its own log file.
To prevent unbounded disk file growth, the Persistent Store log file can be configured to automatically roll. See Store Rolling Logs for more information.
The following lists critical things to monitor in a store log file:
In application log files, look for LBM_SRC_EVENT_UME_REGISTRATION_ERROR messages. These can indicate many different problems that will prevent message persistence. See the UM Log Messages section for details.
Since umestored is a proprietary UM application developed with the UM API library, you can configure the daemon with automatic monitoring and then access transport statistics for the daemon's internal sources and receivers. To accomplish this, follow the procedure below.
For information about umestored statistics see Store Web Monitor.
The web address of the Store Web Monitor is configured in the store XML configuration file. See <daemon>.
You can monitor the following information on the umestored Web Monitor:
TIP: You can build a script that runs the Linux wget command at a 5-second interval to capture a snapshot of a web monitor page and save it to a directory or file.
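As an alternative to a wget loop, the same periodic capture can be sketched in Python using only the standard library. This is a minimal illustration, not a supported tool: the function names, the file naming scheme, and the URL in the usage comment are all hypothetical, and the real address is whatever is configured in your store XML configuration file.

```python
import time
import urllib.request
from datetime import datetime, timezone
from pathlib import Path


def fetch_page(url: str) -> bytes:
    """Download one page; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def save_snapshots(url, out_dir, interval_s=5.0, count=None,
                   fetch=fetch_page, sleep=time.sleep):
    """Save `count` snapshots of `url` into `out_dir` (run forever if count is None).

    Returns the number of snapshots written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    seq = 0
    while count is None or seq < count:
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        # The sequence number keeps file names unique within the same second.
        path = out / f"webmon-{stamp}-{seq:06d}.html"
        path.write_bytes(fetch(url))
        seq += 1
        if count is None or seq < count:
            sleep(interval_s)
    return seq


# Usage (hypothetical host and port):
# save_snapshots("http://store-host:8080/", "webmon-snapshots", interval_s=5.0)
```

The timestamp in each file name makes it possible to line up a snapshot with log messages and transport statistics collected at the same time.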
The Persistent Store daemon has a simple web server which provides operational information. However, while the web-based presentation is convenient for manual, on-demand monitoring, it is not suitable for automated collection and recording of operational information for historical analysis.
Starting with UM version 6.11, a feature called "Daemon Statistics" has been added to the Store daemon. This feature supports the background publishing of the daemon's operational information via UM messages. System designers can now subscribe to this information for their own automated monitoring systems.
See Store Daemon Statistics for general information on Daemon Statistics, followed by specific information regarding the Store.
You can detect the loss of a store with the following.
Stores can also be "too busy" and therefore cannot service source and receiving applications. Sources declare a store inactive with the LBM_SRC_EVENT_UME_STORE_UNRESPONSIVE event when the store's activity timeout expires. This can be caused by the following.
The Ultra Messaging UM Router links disjoint topic resolution domains by forwarding multicast and/or unicast topic resolution traffic, ensuring that receivers on the "other" side of the UM Router receive the topics to which they subscribe. See the UM Dynamic Routing Guide for more details.
Understand UM Router (tnwgd) output traffic and WAN impacts - especially the use of rate limiters.
Review and understand loss conditions unique to using a UM Router. Collaborate with your UM development team to ensure the correct tuning and configurations are applied for your messaging system. Also routinely measure latency across the UM Router with the UM sample application lbmpong, and monitor its output.
Monitor the following for UM Routers.
The UM router generates log messages that are used to monitor its health and operation. There is some flexibility on where those log messages are written; see UM Router Log Messages. Each UM router should have its own log file.
To prevent unbounded disk file growth, the UM Router log file can be configured to automatically roll. See UM Router Rolling Logs for more information.
The following are important UM Router (tnwgd) log messages.
Connection Failure Messages to Monitor:
peer portal [name] failed to connect to peer at [IP:port] via [interface]
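A log-scanning script can pick out the connection failure message above with a regular expression. The sketch below is an assumption-laden illustration: the bracketed fields in the message are treated as placeholders, so the pattern accepts values with or without literal brackets, and it assumes none of the values contain spaces. The sample log line is made up for the example, and the function name is hypothetical.

```python
import re

# Matches: peer portal <name> failed to connect to peer at <IP:port> via <interface>
# Assumption: each value is a single token, optionally wrapped in [ ].
CONNECT_FAILURE = re.compile(
    r"peer portal \[?(?P<portal>[^\]\s]+)\]? failed to connect to peer at "
    r"\[?(?P<peer>[^\]\s]+)\]? via \[?(?P<interface>[^\]\s]+)\]?"
)


def find_connect_failures(lines):
    """Return one dict (portal, peer, interface) per matching log line."""
    failures = []
    for line in lines:
        match = CONNECT_FAILURE.search(line)
        if match:
            failures.append(match.groupdict())
    return failures


# Made-up sample line for illustration only.
sample = ["peer portal [P1] failed to connect to peer at [10.0.0.1:26123] via [10.0.0.2]"]
print(find_connect_failures(sample))
```

Verify the pattern against real tnwgd log output before relying on it, since the exact layout of the message fields is not shown here.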
Lost Connection Messages to Monitor:
Peer Messages to Monitor: Dual TCP:
Single TCP:
Using the "<monitor>" element in a UM Router's XML configuration file, you can monitor the transport activity between the UM Router and its Topic Resolution Domain. The configuration also provides Context and Event Queue statistics. The statistics output identifies individual portals by name.
The UM Router web monitor provides access to a UM Router's portal and general statistics and information. The UM Router XML configuration file contains the location of the gateway web monitor. The default port is 15305.
A UM Router Web Monitor provides a web page for each endpoint and peer portal configured for the UM Router. Peer portals connect UM Routers and communicate only with other peer portals. Endpoint portals communicate with topic resolution domains. Each statistic displays a value for units (messages or fragments) and bytes.
Important statistics you can monitor on the tnwgd Web Monitor include the following.
Endpoint Send Statistics
Increases in the Endpoint Send Statistics values indicate errors and problems. A separate statistic appears for each of the three types of topic message: transport topic, immediate topic, immediate topicless.
Peer Send Statistics
Increases in the Peer Send Statistics values indicate errors and problems.
Messages or bytes Received / Fragments or bytes Forwarded - Increasing counters indicate communicating peers. Stagnant counters indicate a lack of traffic flow: a sender could be down, receivers on the remote side could have no interest in the topics, or the peer connection could have failed.
The UM Router daemon has a simple web server which provides operational information. However, while the web-based presentation is convenient for manual, on-demand monitoring, it is not suitable for automated collection and recording of operational information for historical analysis.
Starting with UM version 6.11, a feature called "Daemon Statistics" has been added to the UM Router. This feature supports the background publishing of the daemon's operational information via UM messages. System designers can now subscribe to this information for their own automated monitoring systems.
See Store Daemon Statistics for general information on Daemon Statistics, followed by specific information regarding the UM Router.
You can detect the loss of a UM Router by the following.
In addition to monitoring UM activity, you must also consider the health and activity of your system resources.
Consider the following system issues regarding Persistent Store monitoring.
Make sure that the environment in which a Persistent Store daemon (umestored) is started has enough available file descriptors for the number of sources in your environment. UM uses a minimum of 2 file descriptors per UM source in addition to normal UM file descriptors for transports and other objects. You can use ulimit in Linux and Process Explorer on Microsoft® Windows® to monitor file handles.
Note: The reduced-fd repository type uses 5 file descriptors for the entire store, regardless of the number of topics, in addition to normal UM file descriptors for transports and other objects. Use of this repository type may impact performance.
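The file descriptor guidance above can be turned into a quick pre-start check. The sketch below estimates the minimum descriptors needed from the 2-per-source figure and compares it with the soft limit of the current process, using Python's Unix-only resource module. The function names and the other_fds allowance (for transports and other objects) are illustrative assumptions, and the check is only meaningful when run as the same user and in the same environment that starts umestored.

```python
import resource  # Unix-only standard library module

FDS_PER_SOURCE = 2  # minimum per UM source, as stated above


def min_fds_needed(num_sources: int, other_fds: int = 0) -> int:
    """Lower bound on descriptors: 2 per source plus an allowance for
    transports and other objects (other_fds is an estimate you supply)."""
    return FDS_PER_SOURCE * num_sources + other_fds


def fd_headroom(num_sources: int, other_fds: int = 0, soft_limit=None):
    """Soft limit minus the estimated need; negative means the limit is too low.

    Returns None when the soft limit is unlimited.
    """
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return soft_limit - min_fds_needed(num_sources, other_fds)


print(min_fds_needed(1000))                            # 2000
print(fd_headroom(100, other_fds=24, soft_limit=1024)) # 800
```

A negative result suggests raising the limit (for example with ulimit -n) before starting the store. The 2-per-source figure is a minimum, so leave a margin.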
The following are common sources of latency.
Use the following to validate a healthy system.
Use the following to validate the system is operating within acceptable limits.