Operations Guide
Startup/Shutdown Procedures

In a multicast environment, only the applications and monitoring tools need to be started. If using Persistence, the Store daemon (umestored) also needs to be started. Likewise, use of the DRO requires starting the DRO daemon (tnwgd).

In a unicast-only environment, one or more resolver daemons (lbmrd) are typically required. It is recommended that you start the lbmrd before starting the applications.

Informatica recommends that you shut down applications using UM sources and receivers cleanly, even though UM is able to cope with the ungraceful shutdown and restart of applications and UM daemons.

A failed assertion could lead to immediate application shutdown. If opting to restart a UM client or lbmrd, no other components need be restarted. Failed assertions should be logged with Informatica support.


Topic Resolution

Topic Resolution (TR) is the process by which subscribers discover the transport session information of a topic's publisher. The subscriber uses this information to join the source's transport session, which then delivers published messages to the subscriber. With Multicast UDP TR, topic resolution does not need to be started or shut down. But TR does require network resources, and if not configured properly it can lead to temporary or permanent deafness in receivers, as well as other problems. See Topic Resolution Overview for more detailed information.

Applications cannot deliver messages until topic resolution completes. UM monitoring statistics are active before all topics resolve. In a large topic space (e.g. 10,000 topics) topic resolution messages may be 'staggered' or rate controlled, taking potentially several seconds to complete.

For example, 10,000 topics at the default value of 1,000 for resolver_initial_advertisements_per_second (context) will take 10 seconds to send out an advertisement for every topic. If all receiving applications have been started first, fully resolving all topics may not take much more than 10 seconds. The rate of topic resolution can also be controlled with the resolver_initial_advertisement_bps (context) configuration option. Topic advertisements contain the topic string and approximately 110 bytes overhead. Topic queries from receivers contain no overhead, only the topic string.
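The arithmetic above can be sketched as follows. This is a rough planning aid, not part of the UM API: the function names are hypothetical, the default of 1,000 advertisements per second and the approximately 110 bytes of overhead are taken from the text, and the sketch ignores queries, retransmissions, and any limit imposed by resolver_initial_advertisement_bps (context).

```python
def initial_advertisement_seconds(topic_count, ads_per_second=1000):
    """Seconds needed to send one initial advertisement for every topic."""
    if ads_per_second <= 0:
        raise ValueError("ads_per_second must be positive")
    return topic_count / ads_per_second


def advertisement_bytes(topic_names, overhead_bytes=110):
    """Approximate bytes needed to advertise each topic once.

    Each advertisement is modeled as the topic string plus a fixed
    overhead (about 110 bytes according to the text above).
    """
    return sum(len(name.encode("utf-8")) + overhead_bytes for name in topic_names)
```

For 10,000 topics at the default rate, initial_advertisement_seconds(10000) gives 10.0, matching the example above.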

Your UM development or administration team should anticipate the time and bandwidth required to resolve all topics when all applications initially start. This team should also establish any restarting restrictions.

Operations staff typically don't have any operational tasks related to topic resolution aside from monitoring hosts' CPU and bandwidth usage.


UM Applications

Your UM development team should provide you with the application names, resident machines and startup parameters, along with a sequence of application/daemon startups and shutdowns.

The following lists typical application startup errors.

  • Lack of resources
  • License not configured - LOG Level 3: CRITICAL: LBM license invalid [LBM_LICENSE_FILENAME nor LBM_LICENSE_INFO are set]
  • Cannot bind port - lbm_context_create: could not find open TCP server port in range.

    Too many applications may be running using the UM context's configured port range on this machine. This possibility should be escalated to your UM development team.

    Application is possibly already running. It is possible to start more than one instance of the same UM application.

  • Invalid network interface name / mask - lbm_config: line 1: no interfaces matching criteria
  • Multiple interfaces detected - LOG Level 5: WARNING: Host has multiple multicast-capable interfaces; going to use [en1][10.10.10.102]

    This message appears on multi-homed machines when UM is not explicitly configured to use a single interface. It may not cause a problem, but it warrants a configuration review by your UM development team.


Indications of Possible Application Shutdown

A UM application shutdown may not be obvious immediately, especially if you are monitoring scores of applications. The following lists events that may indicate an application has shut down.

  • The Process ID disappears. Consider a method to monitor all process IDs (PIDs).
  • You notice the existence of a core dump file on the machine.
  • UM statistics appear to reduce in volume or stop flowing.
  • In an Application Log, one or more End Of Session (EOS) events signal the cessation of a transport session. This may indicate that a source application has shut down. Your UM development team must explicitly log LBM_MSG_EOS events. Some EOS events may be delayed for some transports.
  • In an Application Log, disconnect events (LBM_SRC_EVENT_DISCONNECT) for unicast transports (if implemented) indicate that UM receiver applications have shut down.
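One possible way to monitor PIDs, as suggested in the first item above, is sketched below. It is an illustration only and assumes a POSIX host (on Windows, signal value 0 has a different meaning and this check must not be used). The function names and the mapping of application names to PIDs are hypothetical; how you record each application's PID at startup is up to your team.

```python
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but is owned by another user
    return True


def missing_applications(expected):
    """expected maps an application name to its recorded PID.

    Returns the sorted names of applications whose PID no longer exists.
    """
    return sorted(name for name, pid in expected.items() if not pid_is_running(pid))
```

A monitoring job could call missing_applications() periodically and raise an alert for any name it returns.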


Unicast Topic Resolver (lbmrd)

If not using multicast topic resolution, one or more instances of lbmrd must be started prior to starting applications. Unicast resolver daemons require an XML configuration file, and multiple resolver daemons can be specified by your UM development team for resiliency. For more information on Unicast Topic Resolution, see Unicast UDP TR.

Execute the following command on the appropriate machine to start a unicast topic resolver (lbmrd) from the command line:

lbmrd --interface=ADDR -L daemon_logfile.out -p PORT lbmrd.cfg

For more information on the lbmrd command-line, see Lbmrd Man Page.

To stop an lbmrd that is running as a Windows service, use the Windows service control panel to stop it. Otherwise, kill the PID. If an lbmrd terminates, you need to restart it.

Observe the lbmrd logfile for errors and warnings.

To make the lbmrd a Windows Service, see UM Daemons as Windows Services.

If running multiple lbmrds and an lbmrd in the list becomes inactive, the following message appears in the clients' log files:

unicast resolver "<ip>:<port>" went inactive

If all unicast resolver daemons become inactive, the following message appears in the clients' log files:

No active resolver instances, sending via inactive instance
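A client log can be scanned for the two messages above with a few lines of code. The sketch below is a minimal illustration: the function name is hypothetical, and it assumes the messages appear in the log exactly as quoted here, possibly preceded by a timestamp or other prefix.

```python
import re

# Matches: unicast resolver "<ip>:<port>" went inactive
INACTIVE_RE = re.compile(r'unicast resolver "([^"]+)" went inactive')
ALL_INACTIVE = "No active resolver instances, sending via inactive instance"


def scan_resolver_health(lines):
    """Return (inactive_resolvers, all_inactive_seen) for the given log lines."""
    inactive = []
    all_inactive_seen = False
    for line in lines:
        match = INACTIVE_RE.search(line)
        if match:
            inactive.append(match.group(1))
        if ALL_INACTIVE in line:
            all_inactive_seen = True
    return inactive, all_inactive_seen
```

The first element of the result lists the "<ip>:<port>" of each resolver reported inactive; the second is True if the client reported that no resolver instance was active.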

After all topics are resolved, daemons do not strictly need to be running unless you restart applications. Resolver daemons do not cache or persist state and do not require other shutdown maintenance.


Running Persistent Stores (umestored)

Stores can operate in disk-backed or memory-only mode, as specified in the Store's XML configuration file. Disk-backed Stores are subject to the limitations of the disk hardware. For high performance applications, Informatica recommends solid-state disks local to the physical host running the Store. Those solid-state disks should be optimized for writing and should be dedicated to UM Store usage only (don't share them with other databases or disk-intensive programs).

There are a few scenarios where you need to start a Store:

  1. Bringing a whole system up from scratch (applications, Stores, etc.). In this case, you typically want to Eliminate Past State.
  2. Restarting a Store after a problem (unplanned shutdown, crash, power outage, etc.). In this case, you typically want to Retain Past State.
  3. Restarting a Store after a planned maintenance window. In this case, you might Eliminate Past State or Retain Past State; the choice is yours.

For details on starting a Store, see Starting a Store.

Similarly, there are a few scenarios where you need to shut down a Store:

  1. Regular system shutdown, as for a planned maintenance window.
  2. A problem is detected and you have decided restarting the Stores will help.

For details on shutting down a Store, see Shutting Down a Store.

When you shut down a Store with the intention of restarting it, or if a Store fails, you should wait longer than the time specified by the ume_store_activity_timeout (source) option (defaults to 10 seconds) before restarting it. That is, you should wait for the publishers to time out the old Store instance before you bring up the new instance.


Eliminate Past State

Sometimes when you start a Store, you want to eliminate past state. That is, you don't want any subscribers to recover older messages. This is the case, for example, when bringing a whole system up from scratch. Many users also do a clean start after every maintenance window.

In this case, you must cleanly restart all publishers, subscribers, and Stores. It is recommended to have every component in the down state simultaneously before restarting any of them. Otherwise you risk having a newly restarted component learning old state from a not-yet-restarted component.

Part of cleanly restarting all Stores is deleting their state and cache files. If you fail to delete the state and cache files from one or more Stores, subscribers might start up and recover (replay) old messages that they had already processed. Or subscribers can be temporarily deaf to some (potentially large) set of newly published messages. Please ensure proper deletion of state and cache files prior to restarting Stores when you intend to eliminate past state.
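Deleting the files can be scripted. The sketch below is one hedged way to do it: the function name is hypothetical, and it assumes that the directories you pass in are the Store's configured state and cache directories and contain nothing except that Store's files. It defaults to a dry run so that the list of files can be reviewed before anything is removed, and it must only be run while the Store is down.

```python
from pathlib import Path


def clear_store_files(directories, dry_run=True):
    """List (and optionally delete) every regular file in the given directories.

    Returns the paths that were, or in a dry run would be, deleted.
    Subdirectories are left untouched.
    """
    removed = []
    for directory in directories:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file():
                removed.append(path)
                if not dry_run:
                    path.unlink()
    return removed
```

Calling clear_store_files([state_dir, cache_dir]) first, and only then repeating the call with dry_run=False, reduces the risk of pointing the script at the wrong directory.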


Retain Past State

Sometimes when you start a Store, you want to retain past state. That is, you want to restart Stores in such a way that applications can pick up where they left off, and recover lost messages if necessary. For example, you would do this when restarting a Store after a problem.

You should restart the Store with the state and cache files intact. When the Store restarts, it will read and restore the previous state. It will then attempt to recover as many missed messages from the sources as possible. Then the Store resumes normal operation.

Note that if the cache files are very large, it might take a significant amount of time for the Store to initialize. If this time is objectionable, you may choose to limit the initial restore; see Limit Initial Restore with Restore-Last.

There are times when it is not possible to retain the past state, for example after a failure of the disk holding the state and cache files, or after a failure of the Store's host that requires restarting the Store on a new host. In that case, care must be taken to ensure that the applications will properly restore as much of their previous state as possible.

For example, let's say that two Stores out of a three-Store QC group fail and lose their state and cache files. You have no choice but to restart those Stores cleanly, without state and cache. Note that in this example, since only one Store is running (the "stateful" Store), the applications lose quorum and pause their execution until quorum is restored.

In this dual-failure case, it is important *not* to restart both failed Stores at the same time. To do so could result in applications registering with the two cleaned Stores first, and resetting their own internal states. This could lead to subscriber deafness, or replay of previously processed messages.

Instead, you must restart one of the cleaned Stores and allow the applications to register with the clean Store and the stateful Store. Allow some time for the cleaned Store to collect some state from the applications (this typically requires several seconds). Then the other cleaned Store can be restarted.


Starting a Store

Unix users can start a Store with the following command from a shell prompt or script:

umestored config-file.xml

See Umestored Man Page for details on the umestored command-line.

Windows users typically install the Store as a Windows service (see UM Daemons as Windows Services). Windows has a control panel for starting services.

Memory-only Stores do not save state in disk files; they always start up without past state.

Disk-based Stores create two types of state files:

  • Cache file - contains the actual persisted messages, and can grow to be very large over time. Each cache file created corresponds to a specific publisher source. It is important to ensure that there is enough disk space to record the desired amount of persisted data. See Topic Option "repository-disk-file-size-limit".
  • State file - contains information about the current state of each client connection. These files are much smaller.

A Store signals it has completed initialization by logging a message of the form:

Store-5688-5546: Store "StoreName" ready to accept registrations

If a Store process is configured to have more than one Store instance, this message will be logged for each configured Store instance (see Store Processes and Instances).
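A startup script can check the Store's log for this message before declaring the Store ready. The following is a minimal sketch: the function names are hypothetical, and it assumes the log contains the message text in the form shown above.

```python
import re

# Matches: Store "StoreName" ready to accept registrations
READY_RE = re.compile(r'Store "([^"]+)" ready to accept registrations')


def ready_stores(lines):
    """Return the set of Store instance names that have logged readiness."""
    names = set()
    for line in lines:
        match = READY_RE.search(line)
        if match:
            names.add(match.group(1))
    return names


def all_ready(lines, expected_names):
    """True once every expected Store instance has logged the ready message."""
    return set(expected_names) <= ready_stores(lines)
```

Passing the names of all Store instances configured in the Store process to all_ready() covers the case where one process hosts several instances.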

The time required for a Store to complete initialization depends on the size of the cache file. For very large cache files, the initialization time can become objectionably long. See Limit Initial Restore with Restore-Last for a solution.


Shutting Down a Store

If the Store is running as a Windows service, use the Service Manager to stop the Windows service.

Otherwise kill the PID. You can find the PID in the configured PID file; see UMP Element "<pidfile>".

Warning
Do not perform a "forced kill" (in Unix "kill -9"). This can cause corruption of the state and/or cache files.
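On Unix, a plain "kill" sends SIGTERM, which is the non-forced signal. A hedged sketch of a scripted shutdown follows. The function names are hypothetical, and the sketch assumes the configured PID file contains the Store's PID as decimal text; verify that against your own PID file before relying on it.

```python
import os
import signal
from pathlib import Path


def read_pid(pidfile):
    """Read a PID from a file assumed to contain it as decimal text."""
    return int(Path(pidfile).read_text().strip())


def request_shutdown(pidfile):
    """Send SIGTERM (never SIGKILL) to the process named in the PID file.

    Returns the PID that was signaled.
    """
    pid = read_pid(pidfile)
    os.kill(pid, signal.SIGTERM)  # same signal as a plain "kill PID"
    return pid
```

The script deliberately has no fallback to SIGKILL, in line with the warning above.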


Common Startup and Shutdown Issues

  • Cache and state directories don't exist. It is the user's responsibility to pre-create those directories. See Store Option "disk-state-directory" and Store Option "disk-cache-directory".
  • Disk space - Cache files contain the actual persisted messages, and can grow to be very large over time. It is important to ensure that there is enough disk space to record the appropriate amount of persisted data. See Topic Option "repository-disk-file-size-limit".
  • Configuration error - UM parses a Store's XML configuration file at startup, reporting errors to standard error.
  • Configuration error - UM reports other configuration errors in the Store's log file.
  • Missing license details.


DRO (tnwgd)

When a DRO starts, it discovers all sources and receivers in the topic resolution domains to which it connects. This results in a measurable increase in the overall volume of topic resolution traffic and can take some time to complete, depending upon the number of sources, receivers, and topics. The rate limits set on topic resolution also affect the time to resolve all topics.

See also Topic Resolution.


Starting a DRO

Execute the following command on the appropriate machine to start a DRO (tnwgd) from the command line:

tnwgd config-file.xml

Informatica recommends:

  • Record tnwgd PID to monitor process presence for failure detection.
  • Monitor the tnwgd logfile for errors and warnings.

For more information on the tnwgd command-line, see Tnwgd Man Page. To make the DRO a Windows Service, see UM Daemons as Windows Services.


Restarting a DRO

Perform the following procedure to restart a DRO.

  1. If the DRO is still running as a Windows service, use the Windows service control panel to stop the process. Otherwise, kill the PID.
  2. Wait 20-30 seconds to let timeouts expire. After a restart, new proxy sources and receivers must be created on the DRO. Applications will not use the new proxies until the transport timeout setting expires for the old connections. Until this happens, applications may appear to be deaf, since they still consider themselves connected to the "old" DRO proxies. Therefore, do not rapidly restart the DRO.
  3. Run the command: tnwgd config-file.xml
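The procedure above can be wrapped in a small helper so that the pause is never skipped. This is only a sketch: the function name is hypothetical, and the stop and start actions are supplied by the caller (for example, functions that kill the PID and run tnwgd config-file.xml), since they differ between Unix and Windows service deployments.

```python
import time


def restart_with_pause(stop, start, wait_seconds=30, sleep=time.sleep):
    """Run stop(), wait for timeouts to expire, then run start().

    wait_seconds defaults to the upper end of the 20-30 second range
    given in the procedure; shorter waits are rejected.
    """
    if wait_seconds < 20:
        raise ValueError("wait at least 20 seconds before restarting the DRO")
    stop()
    sleep(wait_seconds)
    start()
```

Rejecting waits under 20 seconds makes a rapid restart an explicit error rather than a silent mistake.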


UM Daemons as Windows Services

On the Microsoft Windows platform, the UM daemons can be used either from the command line or as Windows Services.

Attention
Do not use the task manager or the "kill" command to stop a UM daemon running as a Windows service. Use the Windows service control panel to stop the service. In particular, if the persistent Store is killed non-gracefully, it can leave its files in an inconsistent state.

The UM daemons available as Windows Services are:

Executable File   Description                        Service Display Name                      Man Page
lbmrds.exe        UDP-based Unicast Topic Resolver   "LBMR Store Daemon"                       man page
srsds.exe         TCP-based Topic Resolver           "UM Stateful Topic Resolution Service"    man page
mcsds.exe         Monitoring Collector Service       "UM Monitoring Collector Service"         man page
umestoreds.exe    Persistent Store                   "UME Store Daemon"                        man page
tnwgds.exe        Dynamic Router (DRO)               "Ultra Messaging Gateway"                 man page

Note that the Ultra Messaging Manager daemon ("ummd") is not offered as a Windows Service at this time.

Also note that the UM daemons were not designed to run multiple instances of the service on the same host. See known limitation 9337.

As of UM version 6.12, the above UM daemons work similarly with respect to running as a Windows Service. See the individual man pages for differences.

For each service, the executable file (e.g. "umestoreds.exe") is used for two purposes: running the daemon as a Windows Service, and configuring the UM Windows Service.

The second purpose, configuring the UM Windows Service, consists of running the executable with one or more command-line options to store desired operational parameters into the Windows registry. This makes those parameters available to the service when Windows starts the service.

First, make sure that your UM license key is provided in a way that the service can access it. In particular, if you are using an environment variable to set the license key, it must be a system environment variable, not a user environment variable.

Once your license key is ready, there are 4 overall steps to running a UM daemon as a Windows Service:

  1. Install the Windows Service
  2. Configure the Daemon
  3. Configure the Windows Service
  4. Start the Windows Service

All 4 steps must be completed before the Service can be used.


Install the Windows Service

There are two ways to install a UM daemon as a Windows Service:

  • Product package installer.
  • Command line.

Product package installer

When installing the product using the package installer, the dialog box titled "Choose Components" provides one or more check boxes for UM daemons to be installed as services. You may check any number of the boxes and proceed with the installation.

Note that for any box not checked, the software for that daemon is still copied onto the machine. This allows for installation as a Windows Service at a later time using the Command Line method.

Command line

If a daemon was not installed as a Windows Service from the product's package installer (possibly because the package installer was not used), daemons can be installed at a later time from the command line.

  1. Open a Windows Command Prompt, enabling Administrator access. (One way to do this is to right-click on the Command Prompt icon and select "More > Run as administrator".)

  2. Run the Service executable with the "-s install" command-line option. For example:
    umestoreds -s install
    (Note: lbmrds uses upper-case "-S".)


Configure the Daemon

UM daemons are configured via XML configuration files. These files must be created and managed by the user. Each individual daemon needs its own separate XML configuration file.

Informatica recommends developing and testing the daemon configuration files interactively, using the command-line interface of each daemon. Do not run the daemon as a Windows Service until the daemon configuration has been validated and tested. This provides the fastest test cycle while the configuration is being developed and finalized.

The configuration files should be located on the hosts that are intended to run the daemons in files/folders of the user's choosing.

For more information on configuring and running the daemons interactively, see:

Executable File   Description                        Configuration Details                    Man Page
lbmrds.exe        UDP-based Unicast Topic Resolver   lbmrd Configuration File                 man page
srsds.exe         TCP-based Topic Resolver           SRS Configuration File                   man page
mcsds.exe         Monitoring Collector Service       MCS Configuration File                   man page
umestoreds.exe    Persistent Store                   Configuration Reference for Umestored    man page
tnwgds.exe        Dynamic Router (DRO)               DRO Configuration Reference              man page


Configure the Windows Service

"Configure the Windows Service" is different from Configure the Daemon. Configuring a UM Service provides Windows-specific operational parameters to the UM Daemon, which are not configurable via the Daemon Configuration. For example, you need to tell the Service where to find the Daemon Configuration file.

At this point, you should have the Daemon XML configuration file(s) prepared and available on the host which is to run the desired daemon(s) (see Configure the Daemon). You should also have tested the configuration by running the daemon interactively to verify its correct operation.

Configuring the Service consists of running the Service executable from a Windows Command Prompt with one or more command-line options to store desired operational parameters into the Windows registry. This makes those parameters available to the service when Windows starts the service.

Attention
You need the Command Prompt window running as Administrator. (One way to do this is to right-click on the Command Prompt icon and select "More > Run as administrator".)

There are several operational parameters that are common across all of the UM Windows Services:

  • Setting up the Windows Event Logger
  • Setting up the Environment Variables file
  • Setting up the Daemon Configuration file
  • Service installation or removal

Each Service has additional parameters that are specific to that Service; see each Service's man page:

Executable File   Description                        Configuration Details                    Man Page
lbmrds.exe        UDP-based Unicast Topic Resolver   lbmrd Configuration File                 man page
srsds.exe         TCP-based Topic Resolver           SRS Configuration File                   man page
mcsds.exe         Monitoring Collector Service       MCS Configuration File                   man page
umestoreds.exe    Persistent Store                   Configuration Reference for Umestored    man page
tnwgds.exe        Dynamic Router (DRO)               DRO Configuration Reference              man page

Daemon Configuration File

The UM daemons require a configuration file. You must configure the Windows Service with the path to the daemon's configuration file. This is done with the "-s config" command-line option. For example:

umestoreds -s config c:\UM\store_config.xml

This saves the file path into the Windows registry. Subsequently, each time the UME Store Windows Service starts, it will read that file to configure the Store daemon. (Note: lbmrds uses upper-case "-S".)

Windows Event Logger

The UM daemons write their log messages to their log files. The UM Windows Services have the option of also writing the log messages to the Windows Event Logger.

UM Log messages are categorized into different severity levels: "info", "notice", "warning", "err", "alert", "emerg". By default, the UM Windows Services will write log messages of category "warning" and above to the Windows Event Log.

If desired, the category can be configured with the "-e" command-line option. For example:

umestoreds -e notice

This saves the "notice" category into the Windows registry. Subsequently, each time the UME Store Windows Service starts, it will log messages of "notice" and above to the Windows Event Log.

Be aware that setting the severity level below "warning" can result in very many messages being written to the Windows Event Log. Also be aware that messages of all severity levels are written to the daemon's log file, independent of the "-e" setting.
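The effect of the "-e" threshold can be illustrated with a short sketch. It is not UM code: the function name is hypothetical, and it assumes that the severity levels listed above are given in order from lowest to highest.

```python
# Severity levels as listed in this section, assumed ordered lowest to highest.
SEVERITIES = ["info", "notice", "warning", "err", "alert", "emerg"]


def event_log_levels(threshold="warning"):
    """Return the severities written to the Windows Event Log for a threshold."""
    if threshold not in SEVERITIES:
        raise ValueError(f"unknown severity: {threshold}")
    return SEVERITIES[SEVERITIES.index(threshold):]
```

With the default threshold the result is "warning", "err", "alert", and "emerg"; with "notice" the list also includes "notice", which is why lower thresholds produce more Event Log traffic.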

Environment Variables File

The UM daemons occasionally can have useful features enabled through the use of environment variables. Most of the UM Windows Services allow the use of an optional disk file containing environment variable assignments. Each time the Service starts, that file is read and the environment variables are set for that daemon process. The SRS Windows service does not currently support this.

If desired, the Environment Variable file path can be configured with the "-E" command-line option. For example:

umestoreds -E c:\UM\store_env.txt

This saves the file path into the Windows registry. Subsequently, each time the UME Store Windows Service starts, it will read that file and set its environment variables.

The format of that file is shown by this example:

# Environment Variable File for UME Store
LBM_DEBUG_MASK="0xC384"
LBM_DEBUG_FILENAME="c:\temp\store_debug.txt"

The quote marks are required.
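If the file is generated by a deployment script, the quoting rule is easy to enforce in code. The sketch below is illustrative only: the function name is hypothetical, and it assumes the format is exactly what the example shows (optional "#" comment lines, then one NAME="value" assignment per line). Values containing a quote mark are rejected because this section does not describe any escaping.

```python
def render_env_file(variables, comment=None):
    """Render environment variable assignments with the required quote marks."""
    lines = []
    if comment:
        lines.append(f"# {comment}")
    for name, value in variables.items():
        if '"' in value:
            raise ValueError(f"value for {name} must not contain a quote mark")
        lines.append(f'{name}="{value}"')
    return "\n".join(lines) + "\n"
```

For instance, render_env_file({"LBM_DEBUG_MASK": "0xC384"}) produces the single line LBM_DEBUG_MASK="0xC384".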

If you want to stop using an environment variable file, you must remove its entry from the Windows registry with the "-U" command-line option. For example:

umestoreds -U

This removes the environment variable file path from the Windows registry. The next time the UME Store Windows Service starts, it will not set its environment. Note that this operation does not remove the environment file.


Start the Windows Service

Windows Services are controlled by the "Services" control panel. See your Windows documentation for information on controlling Windows Services.


Remove the Windows Service

There are two ways to remove the UM daemons as Windows Services:

  • Use the Windows uninstaller.
  • Manually remove the service using the daemon's executable program.

Windows Uninstaller

If the UM package was installed by the Package Installer, using the normal Windows "Add or Remove Programs" control panel removes the Windows Service as well as the installed files.

Manual Service Removal

You can remove a UM Windows Service manually using the "-s remove" command-line option. For example:

umestoreds -s remove

This removes the UM Store as a Windows Service. (Note: lbmrds uses upper-case "-S".)

Note that this does not uninstall any of the UM software. It only removes the daemon as a Windows Service.


UM Analysis Tools

The following tools are available to analyze UM activity and performance.


Packet Capture Tools

  • Wireshark is an open-source network packet analysis tool, for which Informatica provides 'dissectors' describing our packet formats. It is used to open and sift through packet capture files, which can be gathered by a variety of both software and hardware tools.
  • Tshark is a command-line version of Wireshark.
  • Tcpdump is the primary software method for gathering packet capture data from a specific host. It is available on most Unix-based systems, though generally gathering packet captures with the tool requires super-user permissions.

For more information about Wireshark please visit https://www.wireshark.org/. (The UM plugins are part of the current release.)


Resource Monitors

  • Top is a system resource monitor available on Linux/Unix that presents a variety of useful data, such as CPU use (both average and per-CPU), including time spent in user mode, system mode, time processing interrupts, time spent waiting on I/O, etc.
  • Microsoft Windows System Resource Manager manages Windows Server 2008 processor and memory usage with built-in or custom resource policies.
  • prstat is a resource manager for Solaris that provides similar CPU and memory usage information.


Process Analysis Tools

  • pstack dumps a stack trace for a process (pid). If the process named is part of a thread group, then pstack traces all the threads in the group.

  • gcore generates a core dump for a Solaris, Linux, and HP-UX process. The process continues after core has been dumped. Thus, gcore is especially useful for taking a snapshot of a running process.


Network Tools

  • netstat provides network statistics for a computer's configured network interfaces. This extensive command-line tool is available on Linux/Unix based systems and Windows operating systems.
  • wget is a Linux tool that captures content from a web interface, such as a UM daemon web monitor. Its features include recursive download, conversion of links for off-line viewing of local HTML, support for proxies, and more.
  • netsh is a Windows utility that allows local or remote configuration of network devices such as the interface.


UM Tools

  • lbmmoncache is a utility that monitors both source notification and source/receiver statistics. Contact UM Support for more information about this utility.
  • lbtrreq restarts the topic resolution process. Contact UM Support for more information about this utility.


UM Debug Flags

The use of UM debug flags requires the assistance of UM Support. Also refer to the following Knowledge Base articles for more information about using debug flags.