Operations Guide
For an Ultra Messaging deployment, "monitoring" is the process of overseeing the operation of UM and the resources it uses to determine its health and performance.
Informatica strongly recommends that users of Ultra Messaging actively monitor its operation.
Many Ultra Messaging deployments operate without any problem, so why monitor a system that is working fine?
UM deployments are often large-scale, loosely-coupled systems, used by many diverse end-user groups who may have little knowledge of each other's usage patterns or future plans. Especially when there are resources shared by these groups, it is important to monitor UM's operation and performance characteristics. This can help you avoid problems, and when problems do arise, having monitoring data can be critical to diagnosing the root causes.
At a high level, Informatica strongly recommends:
An effective monitoring strategy should include monitoring of the host and network resources that UM depends on. Informatica requests that you monitor and record data from the following sources:
There are many vendors of monitoring technology that do a good job of recording and visualizing network and host statistics and logged events. Your operations staff should have fast and easy access to this monitoring data. Ideally, your monitoring tools will raise alerts when problems are detected.
Detailed discussion of host and network monitoring is beyond the scope of the Ultra Messaging documentation.
Broadly speaking, there are three kinds of UM operational data that should be monitored and recorded:
These are application programs that you, the UM user, have developed. As you write your code, Informatica requests that all applications do the following:
Items 1-3 typically involve writing to a log file on the local host. See Application Log File.
Item 4 typically involves configuring the UM library to publish its statistics to a Centralized Collector. The statistics are sampled and published using UM's Automatic Monitoring feature. (There are alternatives; see Legacy Monitoring.)
Note that Informatica recommends the use of the Monitoring Collector Service (MCS), which is a relatively recent addition to UM. Be aware that the MCS is able to collect stats from all UM versions 5.x and 6.x. It is not necessary to upgrade applications to take advantage of the latest UM monitoring capabilities.
During the design and development of user applications, Informatica requests that users implement the following logging items:
The UM library delivers events to the application. Informatica requests that applications log all informational and exceptional events. Log messages are classified by severity, which is useful for automatic log file scanners to detect potential problems and raise alerts.
Be aware that different threads might need to write logs asynchronously to each other. Your logger code must be thread-safe.
Some UM events represent the delivery of user data messages. These events do not need to be logged. Other events are informational and are delivered during normal operation, like Beginning of Session (BOS). These should be logged because they can be essential to diagnosing problems.
Many UM event callbacks can deliver a variety of different event types. For example, a receiver callback can be delivered user data messages, Beginning of Session (BOS) events, and other event types. The application code typically handles the events it is written to recognize and ignores the others.
The UM library can sometimes detect internal conditions that are not associated with deliverable events or API calls. These are delivered to the application with the "logger callback" (lbm_log() for C, LBM.setLogger() for Java and .NET).
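For example, here is a minimal C sketch of a thread-safe logger callback registered with lbm_log() (the log file handling and locking scheme are illustrative; check lbm.h for the exact callback signature in your UM version):

#include <stdio.h>
#include <pthread.h>
#include <lbm/lbm.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
static FILE *log_file;    /* opened by app_init_logging() */

/* Logger callback; UM may call this from multiple threads, so serialize writes. */
int app_logger_cb(int level, const char *message, void *clientd)
{
    pthread_mutex_lock(&log_lock);
    fprintf(log_file, "UM log (severity %d): %s\n", level, message);
    fflush(log_file);
    pthread_mutex_unlock(&log_lock);
    return 0;
}

/* Call early in application initialization, before creating contexts. */
void app_init_logging(const char *path)
{
    log_file = fopen(path, "a");
    lbm_log(app_logger_cb, NULL);
}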
The application calls UM API functions. Each function can return a failed status. In C, the failure is almost always indicated by a return value of -1. In Java and .NET, the failure is almost always indicated by throwing an exception. In either case, UM provides a text string that describes the failure (see lbm_errmsg() for C, the exception's toString() method for Java, or the exception's Message property for .NET).
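For example, here is a minimal C sketch of checking an API return status (the specific API call and error handling policy are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <lbm/lbm.h>

void app_create_context(lbm_context_t **ctx)
{
    /* Most C API functions return -1 on failure; lbm_errmsg() describes the error. */
    if (lbm_context_create(ctx, NULL, NULL, NULL) == -1) {
        fprintf(stderr, "lbm_context_create failed: %s\n", lbm_errmsg());
        exit(1);
    }
}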
Log files are typically "rolled" (saved off and re-created) on a periodic basis (usually daily), and are kept for some period of time before being purged.
Recording and saving all this information is necessary to diagnose many user-visible problems. However, errors that are not (yet) associated with any user-visible problem can also be written to the log. Users should have tools and procedures that alert operators to abnormal log messages. This can prevent small problems from becoming big problems.
Ideally, log file monitoring would support the following:
There are many third party real-time log file analysis tools available. A discussion of possible tools is beyond the scope of UM documentation.
Monitoring the UM daemons (Stores, DROs, etc) is similar to monitoring user applications.
For an overview of Informatica monitoring recommendations, see Monitoring.
For most UM daemons, there are three types of monitoring data that should be collected and monitored:
Informatica requests that all UM daemons being used in a deployment be monitored for all three types of data.
The different UM daemons have different methods of enabling monitoring. See:
UM applications and some UM daemons (Store, DRO) make use of the UM library, which maintains a rich set of statistics. The library statistics consist of data about the health and operation of UM contexts, transport sessions, and event queues.
The recommended method for monitoring UM library statistics is the automatic monitoring feature. This should be enabled via configuration options on user applications and on Store daemons. See Automatic Monitoring Options. When enabled, a background context samples and sends the monitoring data. Note that only a single monitoring context is created and configured with the monitoring options. For example, if your application has multiple contexts, it is not possible to have different automatic monitoring settings for each context. For more information on monitoring applications with multiple contexts, see https://knowledge.informatica.com/s/article/151305.
Enabling the automatic monitoring feature will create a monitoring context that is configured for reduced system resource usage (sockets and memory). This monitoring context periodically wakes up, samples the library statistics in the current process, and publishes them. The context is created with the name "29west_statistics_context" to simplify configuring the context with an XML configuration file.
Informatica recommends setting up a separate Topic Resolution Domain for monitoring data. If possible, that TRD should be hosted on a separate network from the main production network (perhaps an administrative "command and control" network). This minimizes the impact of monitoring on application throughput and latency.
To ease the deployment of monitoring data on an alternate network, Informatica recommends not using multicast. Topic resolution should use Unicast UDP TR or TCP TR (or both), and the monitoring data should be sent via the tcp transport.
Here is an excerpt from a sample application configuration file that shows how the above recommendations are implemented:
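(The following is a sketch, assuming the standard UM XML configuration schema; the application, context, and template names match the notes below, but verify element names and values against the XML Configuration documentation for your UM version.)

<?xml version="1.0" encoding="UTF-8" ?>
<um-configuration version="1.0">
  <templates>
    <template name="automonitor">
      <options type="context">
        <option name="monitor_format" default-value="pb"/>
        <option name="monitor_interval" default-value="600"/>
        <option name="monitor_transport" default-value="lbm"/>
      </options>
    </template>
    <template name="mon_ctx">
      <options type="context">
        <option name="resolver_unicast_daemon" default-value="10.29.3.101:12801"/>
        <option name="default_interface" default-value="10.29.3.0/24"/>
        <option name="mim_incoming_address" default-value="0.0.0.0"/>
      </options>
      <options type="source">
        <option name="transport" default-value="tcp"/>
      </options>
    </template>
  </templates>
  <applications>
    <application name="myapplication">
      <contexts>
        <context name="mycontext" template="automonitor"/>
        <context name="29west_statistics_context" template="mon_ctx"/>
      </contexts>
    </application>
  </applications>
</um-configuration>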
Here is a pair of "flat" configuration files to do the same thing:
myapplication.cfg:
context monitor_format pb
context monitor_interval 600
context monitor_transport lbm
context monitor_transport_opts config=mon.cfg
...
mon.cfg:
context resolver_unicast_daemon 10.29.3.101:12801
context default_interface 10.29.3.0/24
context mim_incoming_address 0.0.0.0
source transport tcp
Notes:
The monitor_format (context) value "pb" selects the protobuf format and is available for the Store in UM version 6.14 and beyond. For applications running on earlier versions, omit monitor_format (context) (the format will be "csv"). UM supports a mixture of different versions, with the centralized collector accepting both "csv" and "pb".
For applications that use event queues, the corresponding "event_queue monitor_..." options should be added.
For a list of possible protobuf messages for UM library monitoring, see the "ums_mon.proto" file at Example ums_mon.proto.
The application context named "mycontext" is configured with the "automonitor" template, which sets the automatic monitoring options. The monitor_interval (context) option enables automatic monitoring and defines the statistics sampling period. In the above example, 600 seconds (10 minutes) is chosen somewhat arbitrarily. Shorter times produce more data, but not much additional benefit. However, UM networks with many thousands of applications may need a longer interval (perhaps 30 or 60 minutes) to maintain a reasonable load on the network and monitoring data storage.
When automatic monitoring is enabled, it creates a context named "29west_statistics_context". It is configured with the "mon_ctx" template, which sets options for the monitoring data TRD. (Alternatively, you can configure the monitoring context using monitor_transport_opts (context).) When possible, Informatica recommends directing monitoring data to an administrative network, separate from the application data network. This prevents monitoring data from interfering with application data latency or throughput. In this example, the monitoring context is configured to use an interface matching 10.29.3.0/24.
The monitoring data is sent out via UM using the TCP transport.
These settings were chosen to conform to the recommendations above.
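For example, a C application could load the flat configuration file shown above at startup with lbm_config(), before creating any contexts (a minimal sketch; the error handling policy is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <lbm/lbm.h>

int main(void)
{
    /* Load the flat configuration file before creating any contexts,
     * so the automatic monitoring options take effect. */
    if (lbm_config("myapplication.cfg") == -1) {
        fprintf(stderr, "lbm_config: %s\n", lbm_errmsg());
        exit(1);
    }
    /* ... create contexts, sources, and receivers as usual ... */
    return 0;
}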
For a full demonstration of monitoring, see: https://github.com/UltraMessaging/mcs_demo
Once you have applications and daemons publishing statistics, you need an independent program (the monitoring collector) that subscribes to the statistics and records them for subsequent analysis and display.
UM supports two approaches to centralized monitoring:
The MCS is an independent monitoring data collector program. It subscribes to UM library and daemon statistics and writes the data to a database.
MCS requires Java 9 or greater.
The MCS needs to be configured. It has an XML configuration file; see MCS Configuration File. In this configuration file, the MCS Element "<config-file>" references an LBM configuration file. This is used to configure the UM library so that it can subscribe to published monitoring data.
The MCS also needs its database to be created prior to starting it.
See Man Pages for MCS for information on running the MCS.
As of UM 6.14, the MCS supports writing to SQLite (the default) or HDFS (Apache Hadoop), selectable by configuration. Note that UM does not ship with a copy of either database software package. The user must install a copy of the desired database on the host where MCS will run.
To use SQLite, you must install it yourself on the host where the MCS is intended to run. You should have some familiarity with administering SQLite; SQLite training is beyond the scope of the UM documentation.
MCS is written to convert incoming protobuf messages into JSON and write the JSON to SQLite using a SQLite JDBC driver. This allows flexible querying of the data.
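For example, once data is being collected, the stored JSON can be examined and queried with SQLite's JSON functions (a sketch; the actual JSON field structure depends on the protobuf-to-JSON conversion, so inspect a stored message before writing field-level queries):

-- Count the UM library monitoring messages collected so far.
SELECT COUNT(*) FROM umsmonmsg;

-- Inspect one stored message to see its JSON structure.
SELECT message FROM umsmonmsg LIMIT 1;

-- Fields can then be extracted with json_extract(); the path below is illustrative.
-- SELECT json_extract(message, '$.some.field') FROM umsmonmsg;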
You will find the MCS executable in the UM package's "MCS/bin" directory. Also in that directory, you will find the file "ummon_db.sql", which can be used to create the database. For example, enter the following command from the Linux shell:
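(This assumes the sqlite3 command-line tool is installed and uses the database path from the properties file example below; adjust the path for your deployment.)

sqlite3 /mcs/ummon.db < ummon_db.sql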
Here are the contents of "ummon_db.sql":
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE umsmonmsg(message json);
CREATE TABLE umpmonmsg(message json);
CREATE TABLE dromonmsg(message json);
CREATE TABLE srsmonmsg(message json);
COMMIT;
To use SQLite, set the MCS XML configuration file's element "<connector>" to "sqlite". Also set MCS Element "<properties-file>" to a file that contains the properties that the MCS's "sqlite" connector expects:
Example sqlite properties file:
sqlite_database_path=/mcs/ummon.db
For a full demonstration of monitoring with the MCS and sqlite, see: https://github.com/UltraMessaging/mcs_demo
Listing best practices for SQLite deployments is beyond the scope of UM documentation. However, this paragraph from https://www.sqlite.org/lockingv3.html did catch our eye:
SQLite uses POSIX advisory locks to implement locking on Unix. On Windows it uses the LockFile(), LockFileEx(), and UnlockFile() system calls. SQLite assumes that these system calls all work as advertised. If that is not the case, then database corruption can result. One should note that POSIX advisory locking is known to be buggy or even unimplemented on many NFS implementations (including recent versions of Mac OS X) and that there are reports of locking problems for network filesystems under Windows. Your best defense is to not use SQLite for files on a network filesystem.
(emphasis ours)
See https://www.sqlite.org/ for full information on SQLite.
To use HDFS, you must install it yourself on the host where the MCS is intended to run. You should have some familiarity with administering HDFS; HDFS training is beyond the scope of the UM documentation.
MCS is written to write the protocol buffers directly to HDFS. This allows flexible querying of the data.
You will find the MCS executable in the UM package's "bin" directory, along with the other daemon binaries.
To use HDFS, set MCS Element "<connector>" to "hdfs". Also set MCS Element "<properties-file>" to a file which contains properties that the MCS's "hdfs" connector expects:
See the Hadoop documentation for details on the resource files.
Example hdfs properties file:
hdfs_core_site_file_path=/mcs/hdfs_core.xml
hdfs_hdfs_site_file_path=/mcs/hdfs_site.xml
If you have taken either the Example lbmmon.c or the Example lbmmon.java source file and modified it to create a monitoring application, you may want to enhance your program to take advantage of new capabilities introduced in UM 6.14. Note that there is also a .NET example, Example lbmmon.cs, which is still functional but is no longer being enhanced. Users are requested to write their collectors using C/C++ or Java.
Prior to UM version 6.14, different types of published data required different methods for receiving and accessing them. UM library statistics were received using the "lbmmon" library, Store and DRO daemon stats were received as C-style binary data structures, and SRS stats were received as JSON.
Starting with 6.14, library, Store, and DRO stats have been unified with Google Protocol Buffers. Starting with 6.15, the SRS joined them in supporting protocol buffers. See Monitoring UM Daemons.
Also, the lbmmon library has been enhanced with a "passthrough" mode that gives the monitoring application direct access to the protocol buffer messages. (Prior to UM 6.15, the SRS daemon statistics are not available as protocol buffers.)
The older methods are still functional, but they may not be further enhanced in the future. Informatica recommends migrating to the passthrough mode and Google Protocol Buffers. To take full advantage of the new capabilities, lbmmon should be run with the command-line options:
--format=pb --format-opts="passthrough=convert"
This allows lbmmon to receive both protobuf-formatted data and CSV-formatted data. CSV data is converted to protobuf before delivery to the monitoring application.
See Example lbmmon.c and/or Example lbmmon.java for details on using both the new and older lbmmon APIs.
You may notice that Example lbmmon.java no longer contains print lines of the individual fields of the monitoring data objects. These accesses were moved into the lbmmon Jar file as displayString() methods for each statistics data object to simplify the example source code. For convenience, those methods' source code has been collected into an example file, Example displayString.java.
The UM library and UM daemons publish their statistics using several different message formats, depending on the type of data being sent. Starting with UM version 6.14, most of those formats are deprecated in favor of a unified message format based on Google Protocol Buffers.
The data formats are:
For applications and UM daemons, Informatica recommends protocol buffers. Users of binary or JSON formats are recommended to migrate to protocol buffers.
You can view and use the protobuf definition files at Example Protocol Files. Note that most integer fields are defined as "uint64", a 64-bit unsigned value. However, within the UM library, almost all of the integer statistics are maintained as C "unsigned long int" values. Note that different compilers interpret "unsigned long int" in different ways:
Platform | Unsigned Long Size
---|---
Linux 64-bit | 64 bits |
Linux 32-bit | 32 bits |
Windows 64-bit | 32 bits |
Windows 32-bit | 32 bits |
This can make a difference for statistics that increase in value quickly, like lbm_src_transport_stats_lbtrm_t_stct::bytes_sent. On platforms where a "long int" is 32 bits, a high-rate transport session running for a few hours can reach that field's maximum value (4,294,967,295) and then "wrap" back to zero.
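For example, a monitoring collector that computes per-interval deltas from such counters can allow for a single wrap between samples (a minimal sketch, assuming the publishing host's counter is known to be 32 bits wide):

#include <stdint.h>

/* Delta between two successive samples of a 32-bit counter, allowing for
 * at most one wrap past 4,294,967,295 between samples. */
uint64_t counter_delta_32(uint32_t prev, uint32_t curr)
{
    if (curr >= prev) {
        return (uint64_t)curr - (uint64_t)prev;
    }
    /* The counter wrapped; add back the 2**32 range. */
    return ((uint64_t)curr + 4294967296ULL) - (uint64_t)prev;
}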
Also note that a few statistics represent the "minimum value seen", and are initialized to the maximum possible value. For example, lbm_rcv_transport_stats_lbtrm_t_stct::nak_tx_min is initialized to the maximum value for "unsigned long int". Thus, GCC on a 32-bit build or the Microsoft compiler will initialize it to 4,294,967,295 (2**32 - 1). GCC on a 64-bit build will initialize it to 18,446,744,073,709,551,615 (2**64 - 1).
To see the underlying data types and descriptions for UM library statistics, look at the C language definitions of the statistics structures. See: