Operations Guide
Monitoring

For an Ultra Messaging deployment, "monitoring" is the process of overseeing the operation of UM and the resources it uses to determine its health and performance.

Informatica strongly recommends that users of Ultra Messaging actively monitor its operation.


Monitoring Introduction  <-

Many Ultra Messaging deployments operate without any problem. Why monitor a system that is working fine? Monitoring lets you:

  • Evaluate the health of the messaging system.
  • Identify inefficiencies that can cause latency and limit capacity.
  • Track utilization trends and prevent future overload.
  • Provide forensic data to help diagnose system trouble. (If you wait to turn on monitoring until after you have a problem, it's too late.)

UM deployments are often large-scale, loosely-coupled systems, used by many diverse end-user groups who may have little knowledge of each other's usage patterns or future plans. Especially when there are resources shared by these groups, it is important to monitor UM's operation and performance characteristics. This can help you avoid problems, and when problems do arise, having monitoring data can be critical to diagnosing the root causes.

At a high level, Informatica strongly recommends:

  • Network and Host Equipment Monitoring.
  • UM Monitoring.


Network and Host Equipment Monitoring  <-

An effective monitoring strategy should include monitoring of the host and network resources that UM depends on. Informatica requests that you monitor and record data from the following sources:

  • Network equipment (switches, routers, firewalls). Includes packet drops at ports and trunks. Often also includes always-running packet capture (for example, Corvil).
  • Hosts. Includes CPU load, memory usage, hardware errors, and packet drops (kernel and user sockets). Where possible, monitor resource usage per-process.

There are many vendors of monitoring technology that do a good job of recording and visualizing network and host statistics and logged events. Your operations staff should have fast and easy access to this monitoring data. Ideally, your monitoring tools will raise alerts when problems are detected.

Detailed discussion of host and network monitoring is beyond the scope of the Ultra Messaging documentation.


Monitoring UM  <-

Broadly speaking, there are three kinds of UM operational data that should be monitored and recorded:

  • UM events. This can be from either user applications or UM daemons (Store, DRO, etc). Events are typically logged and saved to local log files.
  • UM library statistical data. This can be from user applications or from UM daemons, and mostly contains information about UM contexts and transports.
  • UM daemon statistical data. This is daemon-specific operational data (Store, DRO, etc).


Monitoring UM Applications  <-

These are application programs that you, the UM user, have developed. As you write your code, Informatica requests that all applications do the following:

  1. Record all failed API calls with their informational strings, typically to a log file.
  2. Record all non-message events delivered to the application, typically to a log file.
  3. Record all asynchronous UM log messages (strings), typically to a log file.
  4. Record UM library statistics, typically using UM automatic monitoring.

Items 1-3 typically involve writing to a log file on the local host. See Application Log File.

Item 4 typically involves configuring the UM library to publish its statistics to a Centralized Collector. The statistics are sampled and published using UM's Automatic Monitoring feature. (There are alternatives; see Legacy Monitoring.)

Informatica recommends use of the Monitoring Collector Service (MCS), a relatively recent addition to UM. The MCS can collect stats from all UM 5.x and 6.x versions, so it is not necessary to upgrade applications to take advantage of the latest UM monitoring capabilities.


Application Log File  <-

During the design and development of user applications, Informatica requests that users implement the following logging items:

  1. The UM library delivers events to the application. Informatica requests that applications log all informational and exceptional events. Logs are classified by severity, which is useful for automatic log file scanners to detect potential problems and raise alerts.

    • Informatica requests that logs be written with a time stamp of at least second resolution (preferably millisecond), the log severity level, and the descriptive text of the event being logged.

    Be aware that different threads might need to write logs asynchronously to each other. Your logger code must be thread-safe.

  2. Some UM events represent the delivery of user data messages; these do not need to be logged. Other events are informational and are delivered during normal operation, like Beginning of Session (BOS). These should be logged because they can be essential to diagnosing problems.

    • Informatica requests that all informational (non-message) events be logged.

  3. Many UM event callbacks can deliver a variety of different event types. For example, a receiver callback can be passed user data messages, Beginning of Session (BOS) events, etc. The application code typically handles the events it is written to recognize and ignores the others.

    • Informatica requests that unrecognized events be logged with their event types.

  4. The UM library can sometimes detect internal conditions that are not associated with deliverable events or API calls. These are delivered to the application with the "logger callback" (lbm_log() for C, LBM.setLogger() for Java and .NET).

    • Informatica requests that all logger callbacks be logged.

  5. The application calls UM API functions. Each function can return a failed status. In C, the failure is almost always indicated by a return value of -1. In Java and .NET, the failure is almost always indicated by throwing an exception. In either case, UM provides a text string that describes the failure (see lbm_errmsg() for C, the exception's toString() method for Java, or the exception's Message property for .NET). A minimal code sketch covering items 3 through 5 appears after this list.

    • Informatica requests that all failed API calls be logged with their descriptive text.

  6. Log files are typically "rolled" (saved off and re-created) on a periodic basis (usually daily), and are kept for some period of time before being purged.

    • Informatica requests that full logs be retained for at least 1 week.

  7. Recording and saving all this information is necessary to diagnose many user-visible problems. However, errors can appear in the log that are not associated with any user-visible problem. Users should have tools and procedures that alert operators to abnormal logs. This can prevent small problems from becoming big problems.

    • Informatica requests the use of a log file analysis tool that scans the live log files, reports unusual or exceptional logs, and raises alerts that will be noticed by the operations staff.

    Ideally, log file monitoring would support the following:

    • Archive all log messages for all applications for at least a week, preferably a month.
    • Provide rapid access to operations staff to view the latest log messages from an application.
    • Periodically scan the log files to detect errors and raise alerts to operations staff.

    There are many third party real-time log file analysis tools available. A discussion of possible tools is beyond the scope of UM documentation.
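
To make items 3 through 5 above concrete, here is a minimal C sketch of a thread-safe application logger used for the UM logger callback, for unrecognized receiver events, and for failed API calls. It is a sketch only, not Informatica-supplied code: the file name, severity formatting, and the app_log() helper are illustrative, and the UM calls shown (lbm_log(), lbm_errmsg(), lbm_context_create()) should be checked against your UM version's C API documentation.

/* app_logging_sketch.c - illustrative sketch only; app_log() and
 * "myapplication.log" are not part of the UM API. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <pthread.h>
#include <lbm/lbm.h>    /* UM C API header; install path may differ */

static FILE *log_fp;    /* opened in main() */
static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Item 1 format: millisecond time stamp, severity level, descriptive text.
 * The mutex makes this safe to call from UM callback threads and app threads. */
static void app_log(int severity, const char *text)
{
    struct timeval tv;
    struct tm tm_buf;
    char tstamp[32];

    gettimeofday(&tv, NULL);
    localtime_r(&tv.tv_sec, &tm_buf);
    strftime(tstamp, sizeof(tstamp), "%Y-%m-%d %H:%M:%S", &tm_buf);

    pthread_mutex_lock(&log_lock);
    fprintf(log_fp, "%s.%03d sev=%d %s\n", tstamp,
            (int)(tv.tv_usec / 1000), severity, text);
    fflush(log_fp);
    pthread_mutex_unlock(&log_lock);
}

/* Item 4: logger callback registered with lbm_log(). */
static int um_logger_cb(int level, const char *message, void *clientd)
{
    app_log(level, message);
    return 0;
}

/* Item 3: receiver callback that processes data and logs everything else,
 * including event types it does not recognize. */
static int rcv_cb(lbm_rcv_t *rcv, lbm_msg_t *msg, void *clientd)
{
    char buf[256];

    switch (msg->type) {
    case LBM_MSG_DATA:
        /* process the message; no logging needed */
        break;
    case LBM_MSG_BOS:
        snprintf(buf, sizeof(buf), "BOS: topic=%s src=%s",
                 msg->topic_name, msg->source);
        app_log(LBM_LOG_INFO, buf);
        break;
    default:
        snprintf(buf, sizeof(buf), "unrecognized receiver event type %d",
                 msg->type);
        app_log(LBM_LOG_WARNING, buf);
        break;
    }
    return 0;
}

int main(void)
{
    lbm_context_t *ctx;

    log_fp = fopen("myapplication.log", "a");
    if (log_fp == NULL) { perror("fopen"); return 1; }

    lbm_log(um_logger_cb, NULL);

    /* Item 5: a failed API call (-1 in C) is logged with lbm_errmsg() text. */
    if (lbm_context_create(&ctx, NULL, NULL, NULL) == -1) {
        app_log(LBM_LOG_ERR, lbm_errmsg());
        return 1;
    }
    /* ... create receivers with rcv_cb, create sources, run the application ... */
    lbm_context_delete(ctx);
    return 0;
}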


Monitoring UM Daemons  <-

Monitoring the UM daemons (Stores, DROs, etc) is similar to monitoring user applications.

For an overview of Informatica monitoring recommendations, see Monitoring.

For most UM daemons, there are three types of monitoring data that should be collected and monitored:

  • Log files.
  • UM Library Statistics. These are stats for contexts, transports, etc.
  • Daemon Statistics. These are stats specific to each daemon (Store, DRO, etc).

Informatica requests that all UM daemons being used in a deployment be monitored for all three types of data.

The different UM daemons have different methods of enabling monitoring; see each daemon's documentation for details.


Automatic Monitoring  <-

UM applications and some UM daemons (Store, DRO) make use of the UM library, which maintains a rich set of statistics. The library statistics consist of data about the health and operation of UM contexts, transport sessions, and event queues.

The recommended method for monitoring UM library statistics is the automatic monitoring feature. This should be enabled via configuration options on user applications and on Store daemons. See Automatic Monitoring Options. When enabled, a background context samples and sends the monitoring data. Note that a single monitoring context is created and configured with the monitoring options; if your application has multiple contexts, it is not possible to have different automatic monitoring settings for each context. For more information on monitoring applications with multiple contexts, see https://knowledge.informatica.com/s/article/151305.

Enabling the automatic monitoring feature will create a monitoring context that is configured for reduced system resource usage (sockets and memory). This monitoring context periodically wakes up and samples the library statistics in the current process and publishes them. The context is created with the name "29west_statistics_context" to simplify configuring the context with an XML configuration file.

Informatica recommends setting up a separate Topic Resolution Domain for monitoring data. If possible, that TRD should be hosted on a separate network from the main production network (perhaps an administrative "command and control" network). This minimizes the impact of monitoring on application throughput and latency.

To ease the deployment of monitoring data on an alternate network, Informatica recommends not using multicast. Topic resolution should use Unicast UDP TR or TCP TR (or both), and the monitoring data should be sent via the tcp transport.


Automatic Monitoring Sample  <-

Here is an excerpt from a sample application configuration file that shows how the above recommendations are implemented:

<?xml version="1.0" encoding="UTF-8" ?>
<um-configuration version="1.0">
  <templates>
    ...
    <template name="automonitor">
      <options type="context">
        <option name="monitor_format" default-value="pb"/>
        <option name="monitor_interval" default-value="600"/>
        <option name="monitor_transport" default-value="lbm"/>
      </options>
    </template>
    ...
    <template name="mon_ctx">
      <options type="context">
        <option name="resolver_unicast_daemon" default-value="10.29.3.101:12801"/>
        <option name="default_interface" default-value="10.29.3.0/24"/>
        <option name="mim_incoming_address" default-value="0.0.0.0"/>
        ...
      </options>
      <options type="source">
        <option name="transport" default-value="tcp"/>
      </options>
      ...
    </template>
    ...
  </templates>
  <applications>
    ...
    <application name="myapplication">
      <contexts>
        <context name="mycontext" template="mytemplate,automonitor">
          <sources/>
        </context>
        <context name="29west_statistics_context" template="mon_ctx">
          <sources/>
        </context>
      </contexts>
    </application>
    ...

Here is a pair of "flat" configuration files to do the same thing:

myapplication.cfg:

context monitor_format pb
context monitor_interval 600
context monitor_transport lbm
context monitor_transport_opts config=mon.cfg
...

mon.cfg:

context resolver_unicast_daemon 10.29.3.101:12801
context default_interface 10.29.3.0/24
context mim_incoming_address 0.0.0.0
source transport tcp

Notes:

  1. The monitor_format (context) value "pb" selects the protobuf format and is available for the Store in UM version 6.14 and beyond. For applications running on earlier versions, omit monitor_format (context) (the format will be "csv"). UM supports a mixture of different versions, with the centralized collector accepting both "csv" and "pb".

  2. For applications that use event queues, the corresponding "event_queue monitor_..." options should be added (see the sketch after these notes).

  3. For a list of possible protobuf messages for UM library monitoring, see the "ums_mon.proto" file at Example ums_mon.proto.

  4. The application context named "mycontext" is configured with the "automonitor" template, which sets the automatic monitoring options. The monitor_interval (context) option enables automatic monitoring and defines the statistics sampling period. In the above example, 600 seconds (10 minutes) is chosen somewhat arbitrarily. Shorter times produce more data, but not much additional benefit. However, UM networks with many thousands of applications may need a longer interval (perhaps 30 or 60 minutes) to maintain a reasonable load on the network and monitoring data storage.

  5. When automatic monitoring is enabled, it creates a context named "29west_statistics_context". It is configured with the "mon_ctx" template, which sets options for the monitoring data TRD. (Alternatively, you can configure the monitoring context using monitor_transport_opts (context).) When possible, Informatica recommends directing monitoring data to an administrative network, separate from the application data network. This prevents monitoring data from interfering with application data latency or throughput. In this example, the monitoring context is configured to use an interface matching 10.29.3.0/24.

  6. The monitoring data is sent out via UM using the TCP transport.

  7. These settings were chosen to conform to the recommendations above.
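
As referenced in note 2, here is a hedged sketch of what the event queue monitoring options might look like in a flat configuration file. It assumes the event_queue-scoped monitor options mirror the context-scoped options shown above; check the Automatic Monitoring Options documentation for the exact option names supported by your UM version.

event_queue monitor_format pb
event_queue monitor_interval 600
event_queue monitor_transport lbm
event_queue monitor_transport_opts config=mon.cfg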

For a full demonstration of monitoring, see: https://github.com/UltraMessaging/mcs_demo


Centralized Collector  <-

Once you have applications and daemons publishing statistics, you need an independent program (the monitoring collector) that subscribes to the statistics and records them for subsequent analysis and display.

UM supports two approaches to centralized monitoring:

  • Monitoring Collector Service (MCS). Informatica's monitoring program that collects library and daemon statistics for storage in a database of your choosing.
  • User-Developed Collector. Your own centralized monitoring program that collects library and daemon statistics for storage and analysis.


Monitoring Collector Service (MCS)  <-

The MCS is an independent monitoring data collector program. It subscribes to UM library and daemon statistics and writes the data to a database.

MCS requires Java 9 or greater.

The MCS needs to be configured. It has an XML configuration file; see MCS Configuration File. In this configuration file, the MCS Element "<config-file>" references an LBM configuration file. This is used to configure the UM library so that it can subscribe to published monitoring data.

The MCS also needs its database to be created prior to starting it.

See Man Pages for MCS for information on running the MCS.

As of UM 6.14, the MCS supports writing to SQLite (the default) or HDFS (Apache Hadoop), selectable by configuration. Note that UM does not ship with a copy of either database software package. The user must install a copy of the desired database on the host where MCS will run.

SQLite

To use SQLite, you must install it yourself on the host where MCS is intended to run. You should have some familiarity with administering the system. SQLite training is beyond the scope of UM documentation.

MCS is written to convert incoming protobuf messages into JSON, and write the JSON to SQLite using the SQLite JDBC driver. This allows flexible querying of the data.

You will find the MCS executable in the UM package's "MCS/bin" directory. Also in that directory, you will find the file "ummon_db.sql", which can be used to create the database. For example, enter the following command from the Linux shell:

sqlite3 /mcs/mcs.db <$HOME/UMP_6.15/MCS/bin/ummon_db.sql

Here are the contents of "ummon_db.sql":

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE umsmonmsg(message json);
CREATE TABLE umpmonmsg(message json);
CREATE TABLE dromonmsg(message json);
CREATE TABLE srsmonmsg(message json);
COMMIT;
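
As an illustration of the flexible querying mentioned above, here is a hedged example using SQLite's built-in JSON functions. The json_extract() path shown is hypothetical; the actual field names depend on the JSON that your MCS version writes.

-- Count the stored UM library monitoring messages.
SELECT COUNT(*) FROM umsmonmsg;

-- Pull one (hypothetical) field out of each JSON message.
SELECT json_extract(message, '$.context.application_id') FROM umsmonmsg LIMIT 10;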

To use SQLite, set the MCS Element "<connector>" in the MCS's XML configuration file to "sqlite". Also set MCS Element "<properties-file>" to a file that contains properties that the MCS's "sqlite" connector expects:

  • sqlite_database_path - file path to the SQLite database file. If on Windows, you may use forward slashes or escaped (double) back slashes to separate directories.

Example sqlite properties file:

sqlite_database_path=/mcs/ummon.db

For a full demonstration of monitoring with the MCS and sqlite, see: https://github.com/UltraMessaging/mcs_demo

Listing best practices for SQLite deployments is beyond the scope of UM documentation. However, this paragraph from https://www.sqlite.org/lockingv3.html did catch our eye:

SQLite uses POSIX advisory locks to implement locking on Unix. On Windows it uses the LockFile(), LockFileEx(), and UnlockFile() system calls. SQLite assumes that these system calls all work as advertised. If that is not the case, then database corruption can result. One should note that POSIX advisory locking is known to be buggy or even unimplemented on many NFS implementations (including recent versions of Mac OS X) and that there are reports of locking problems for network filesystems under Windows. Your best defense is to not use SQLite for files on a network filesystem.

(emphasis ours)

See https://www.sqlite.org/ for full information on SQLite.

HDFS

To use HDFS, you must install it yourself on the host where MCS is intended to run. You should have some familiarity with administering the system. HDFS training is beyond the scope of UM documentation.

MCS is written to write the protocol buffers directly to HDFS. This allows flexible querying of the data.

You will find the MCS executable in the UM package's "bin" directory, along with the other daemon binaries.

To use HDFS, set MCS Element "<connector>" to "hdfs". Also set MCS Element "<properties-file>" to a file which contains properties that the MCS's "hdfs" connector expects:

  • hdfs_core_site_file_path - path name of XML resource file containing read-only configuration defaults for hadoop.
  • hdfs_hdfs_site_file_path - path name of XML resource file containing site-specific configuration for a given hadoop installation.

See the Hadoop documentation for details on the resource files.

Example hdfs properties file:

hdfs_core_site_file_path=/mcs/hdfs_core.xml
hdfs_hdfs_site_file_path=/mcs/hdfs_site.xml


User-Developed Collector  <-

If you have taken either the Example lbmmon.c or the Example lbmmon.java source file and modified it to create a monitoring application, you may want to enhance your program to take advantage of new capabilities introduced in UM 6.14. Note that there is also a .NET example, Example lbmmon.cs, which is still functional but is no longer being enhanced. Informatica requests that users write their collectors in C/C++ or Java.

Prior to UM version 6.14, different types of published data required different methods for receiving and accessing them. UM library statistics were received using the "lbmmon" library, Store and DRO daemon stats were received as C-style binary data structures, and SRS stats were received as JSON.

Starting with 6.14, library, Store, and DRO stats have been unified with Google Protocol Buffers. Starting with 6.15, the SRS joined them in supporting protocol buffers. See Monitoring UM Daemons.

Also, the lbmmon library is enhanced with a "passthrough" mode that gives the monitoring application direct access to the protocol buffer messages. (The SRS daemon statistics are not yet available as protocol buffers.)

The older methods are still functional, but they may not be further enhanced in the future. Informatica recommends migrating to the passthrough mode and Google Protocol Buffers. To take full advantage of its new capabilities, lbmmon should be run with the command-line options:

  --format=pb --format-opts="passthrough=convert"

This allows lbmmon to receive both protobuf-formatted data and CSV-formatted data. CSV data is converted to protobuf before delivery to the monitoring application.
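
For example, to subscribe on the monitoring TRD configured in "mon.cfg" above, a hedged invocation might look like the following; the --transport and --transport-opts options are assumptions here, so check the lbmmon man page for the option names supported by your UM version:

lbmmon --transport=lbm --transport-opts="config=mon.cfg" --format=pb --format-opts="passthrough=convert"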

See Example lbmmon.c and/or Example lbmmon.java for details on using both the new and older lbmmon APIs.

You may notice that Example lbmmon.java no longer contains print lines of the individual fields of the monitoring data objects. These accesses were moved into the lbmmon Jar file as displayString() methods for each statistics data object to simplify the example source code. For convenience, those methods' source code has been collected into an example file, Example displayString.java.


Monitoring Formats  <-

The UM library and UM daemons publish their statistics using several different message formats, depending on the type of data being sent. Starting with UM version 6.14, most of those formats are deprecated in favor of a unified message format based on Google Protocol Buffers.

The data formats are:

  • Protocol Buffers. Option for UM library statistics, and Store, DRO, and SRS daemon statistics. Recommended.
  • CSV - comma-separated values. Option for UM library statistics. Deprecated, but retained for backward compatibility.
  • Binary C-style data structures. Option for Store and DRO daemon statistics. Deprecated, but retained for backward compatibility.
  • JSON. Option for SRS daemon statistics. Deprecated, but retained for backward compatibility.

For applications and UM daemons, Informatica recommends protocol buffers; users of the binary or JSON formats should migrate to protocol buffers.


Protocol Buffer Format  <-

You can view and use the protobuf definition files at Example Protocol Files. Note that most integer fields are defined as "uint64", a 64-bit unsigned value. However, within the UM library, almost all of the integer statistics are maintained as C "unsigned long int" values, and different compilers interpret "unsigned long int" in different ways.

Platform         Unsigned Long Size
Linux 64-bit     64 bits
Linux 32-bit     32 bits
Windows 64-bit   32 bits
Windows 32-bit   32 bits

This can make a difference for statistics that increase in value quickly, like lbm_src_transport_stats_lbtrm_t_stct::bytes_sent. On platforms where a "long int" is 32 bits, a high-rate transport session running for a few hours can reach that field's maximum value (4,294,967,295) and then "wrap" back to zero. For example, at a sustained 400 kilobytes per second, a 32-bit bytes_sent counter wraps after roughly three hours.

Also note that a few statistics represent the "minimum value seen", and are initialized to the maximum possible value. For example, lbm_rcv_transport_stats_lbtrm_t_stct::nak_tx_min is initialized to the maximum value for "unsigned long int". Thus, GCC on a 32-bit build or the Microsoft compiler will initialize it to 4,294,967,295 (2**32 - 1). GCC on a 64-bit build will initialize it to 18,446,744,073,709,551,615 (2**64 - 1).

To see the underlying data types and descriptions for UM library statistics, look at the C language definitions of the statistics structures. See: