Concepts Guide
Topic Resolution ("TR") is a set of protocols and algorithms used internally by Ultra Messaging to establish and maintain shared state. Here are the basic functions of TR:
UM performs TR automatically; there are no API functions specific to normal TR operation. However, you can influence topic resolution by configuration. Moreover, you can set configuration options differently for individual topics, either by using XML Configuration Files (the <topic> element), or by using the API functions for setting configuration options programmatically (e.g. lbm_rcv_topic_attr_setopt() and lbm_src_topic_attr_setopt()). See UDP Topic Resolution Configuration Options for details.
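For illustration, a per-topic receiver option set via an XML configuration file might look like the following sketch (the topic name and option value are arbitrary examples; see XML Configuration Files for the exact schema):

```xml
<topics>
  <topic topicname="EXAMPLE.TOPIC">
    <options type="receiver">
      <!-- Example: a per-topic TR option applied to receivers of this topic. -->
      <option name="resolution_number_of_sources_query_threshold" default-value="1"/>
    </options>
  </topic>
</topics>
```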
An important design point of Topic Resolution is that information related to sources is distributed to all contexts in a UM network. This is done so that when a receiver object is created within a context, it can discover sources for the topic and join those sources. In support of this discovery process, each context maintains a memory-based "resolver cache", which stores source information. The TR protocols and algorithms are largely in support of maintaining each context's resolver cache.
Topic Resolution also occurs across a DRO, which means between Topic Resolution Domains (TRDs). A receiver in one TRD will discover a source in a different TRD, potentially across many DRO hops. In this case, the DROs actively assist in TR. I.e. the sources and receivers in different TRDs do not exchange TR with each other directly, but rather with the assistance of the DRO.
There are three different possible protocols used to provide Topic Resolution:
- Multicast UDP-based TR
- Unicast UDP-based TR (using the "lbmrd" service)
- TCP-based TR (using the "SRS" service)
Of those three, Multicast UDP and Unicast UDP are mutually exclusive. It is not possible to configure UM to use both within a single TRD. Multicast is generally preferred over Unicast, with Unicast being selected when there are policy or environment reasons to avoid Multicast (e.g. cloud computing).
TCP-based TR (with "SRS" service) is a more recent addition to UM. It supports source discovery and tracking of receivers.
TCP-based TR is often paired with one of the UDP-based TR protocols (Multicast or Unicast). The two protocols run in parallel, with the UDP-based protocol supporting interoperability with pre-6.12 versions of UM and supplying the TR functionality not yet available in TCP-based TR.
When all of your UM components are upgraded to UM 6.14 and beyond, you can use the resolver_disable_udp_topic_resolution (context) configuration option to turn off UDP-based TR.
The advantage of TCP-based TR is greater reliability and reduced network and CPU load. UDP-based TR is susceptible to deafness issues due to transient network failures. Avoiding those deafness issues requires configuring UDP-based TR to use significant network and CPU resources. In contrast, TCP-based TR is designed to be reliable with much less network and CPU load, even in the face of transient network failures.
Independent of the TR protocol used, a context maintains two resolver caches: a source-side cache and a receiver-side cache.
The source-side cache holds information about sources that the application created. It is used primarily by the context to respond to TR Queries.
The receiver-side cache holds information about all sources in the UM network.
Thus, the receiver-side cache can become large. In very large deployments, it may be necessary to increase the size of the receiver cache using resolver_receiver_map_tablesz (context).
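For example, a sketch of increasing the cache's hash table size (the value here is an arbitrary illustration; hash table sizes are typically chosen to be prime):

```
context resolver_receiver_map_tablesz 131071
```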
A context's receiver-side cache also holds information about receivers created by the current application. This is used by the context when TR Advertisements are received to assist in completing subscriptions.
Be aware that with UDP-based Topic Resolution, receiver-side cache entries are typically not removed when the corresponding sources are deleted, unless they are subscribed. I.e. if a source is created (but not subscribed to), then deleted, and then re-created, there will be two entries in the cache: one for the old source and one for the new. For system designs that feature short-lived sources, the cache can grow over time without bound. In contrast, with TCP-based Topic Resolution, receiver-side cache entries typically are removed when the sources are deleted, even if not subscribed.
Multicast UDP-based Topic Resolution is the default protocol.
Advantages:
Disadvantages:
See UDP-Based Topic Resolution Details for more information.
Unicast UDP-based Topic Resolution is functionally identical to Multicast UDP. It is used as a replacement for Multicast UDP in environments where the use of multicast is not possible (e.g. the cloud) or is against policy. The "lbmrd" service simulates multicast by simply forwarding all TR traffic to all contexts registered in a TRD. Note that the "lbmrd" service does not maintain state about the sources and receivers. It simply fans out Unicast TR.
Advantages:
Disadvantages:
See UDP-Based Topic Resolution Details for more information.
TCP-based Topic Resolution is a newer implementation of a service-based distribution of source and receiver information. It is available as of UM version 6.12, in which it provides a subset of the total TR functionality. In a future UM version, TCP-based TR will provide all TR functionality, at which point it can be used to the exclusion of UDP-based TR. Until that time, TCP-based TR is typically paired with UDP-based TR (either Multicast or Unicast).
Advantages:
Disadvantages:
Most users who combine UDP and TCP TR should be able to gradually reduce the CPU and network load from UDP-based TR as their applications are upgraded to UM 6.12 and beyond.
When all of your UM components are upgraded to UM 6.14 and beyond, you can use the resolver_disable_udp_topic_resolution (context) configuration option to turn off UDP-based TR.
See TCP-Based Topic Resolution Details for more information.
TCP-based TR was introduced in UM version 6.12 to address shortcomings in UDP-based TR:
TCP-based TR differs from UDP-based TR in two important ways:
- It uses the reliable TCP protocol rather than UDP.
- It uses a stateful service (the SRS) that maintains knowledge of the sources and receivers in the TRD.
The basic approach used by TCP-based TR is as follows: Each context in a TRD is configured with the address of one or more SRS instances (up to 5). For fault-tolerance, two or three is typical. When the context is created, it connects to the configured SRSes. When the connection is successful, the context and SRSes exchange TR information. They normally do this without involving the other contexts in the TRD.
Then, as an application creates or deletes sources, its context informs the SRSes of the change, which in turn inform the other contexts in the TRD. In addition (as of UM 6.13), as an application creates or deletes receivers, the SRSes track that receiver interest. The SRSes do not distribute receiver interest to other applications, but rather use it to optimize the distribution of source information. An SRS only informs a context of sources that the context is interested in (has a receiver for).
There are periodic handshakes between each context and the SRSes to ensure that connectivity is maintained and that state is valid. This removes the need to re-send TR information that has already been sent.
If an application loses its connection with an SRS (perhaps due to an extended network outage, or due to failure of the SRS), the context will repeatedly try to reconnect. Once successful, the process of exchanging TR information is repeated.
Note that much of the difficulty of configuring UDP-based TR is related to controlling the repeated transmission of the same TIRs and TQRs. With TCP-based TR, that repetition is eliminated, making both the configuration and the operation more straightforward.
A note about the term "stateful" in relation to the SRS. Even though Unicast UDP TR uses a service called "lbmrd", that service does not maintain the topic information. The "lbmrd" is not "stateful". Instead, it merely forwards TR datagrams it receives, essentially simulating Multicast.
In contrast, the SRS maintains knowledge of all sources and receivers in the TRD (hence the "Stateful" in SRS). For a newly-started receiving application to discover an existing source, the SRS can send the information without the source getting involved.
With TCP-based TR, source advertisement messages are called "SIRs" (Source Information Records). This term is used elsewhere in the documentation.
For configuration information, see TCP-Based Resolver Operation Options.
As of UM version 6.13 and beyond, TCP-based TR supports redundancy. This is accomplished by starting two or more instances of the Stateful Resolver Service (SRS), typically on separate physical hosts, and configuring application and daemon contexts to connect to all of them. Although up to 5 SRSes can be configured, 2 or 3 are typical.
A context uses the resolver_service (context) option to configure the desired SRSes. Each context will establish TCP connections to all of the configured SRSes. The SRSes are used "hot/hot", with all information being sent to all configured SRSes, so there is no loss of Topic Resolution service if an SRS fails.
As of UM version 6.13, the SRS tracks topic interest of contexts. If an application creates a receiver for topic "XYZ", the context informs the SRS that it is interested in that topic. This allows the SRS to filter the TR traffic it sends to contexts, which greatly increases the scalability of TR.
The SRS only sends source advertisements to contexts that are interested in that source's topic. Contexts also inform the SRS of wildcard receivers, in which case the SRS will send source advertisements for all sources that match the topic pattern.
TCP-based TR was first introduced in UM version 6.12. To maintain interoperability with pre-6.12, applications can be configured to use both TCP and UDP-based TR, in parallel.
This can make it difficult to gain all the benefits of TCP-based TR. Because pre-6.12 applications still depend on UDP-based TR to avoid deafness problems, even applications that have upgraded to 6.12 and beyond must keep UDP-based TR enabled, usually with extended sustaining phases, often set to infinity.
Ideally, all applications within a TRD can be upgraded to 6.14 and beyond, eliminating the need for UDP-based TR, but this is often not practical. How can the TR load be reduced in a step-wise fashion while an organization is upgrading applications gradually, over a long period of time?
Fortunately, you can set configuration options differently for individual topics, either by using XML Configuration Files (the <topic> element), or by using the API functions for setting configuration options programmatically (e.g. lbm_rcv_topic_attr_setopt() and lbm_src_topic_attr_setopt()).
Some helpful strategies might be:
When all of your UM components are upgraded to UM 6.14 and beyond, you can use the resolver_disable_udp_topic_resolution (context) configuration option to turn off UDP-based TR.
A UM context is configured to use TCP-based TR with the option resolver_service (context), which tells how to connect to the SRS. For example:
context resolver_service 10.29.3.41:12000
A DNS host name can be used instead of an IP address:
context resolver_service test1.informatica.com:12000
For fault tolerance, more than one running SRS instance can be configured:
context resolver_service test1.informatica.com:12000,test2.informatica.com:12000
These examples assume that an SRS service is running at each configured address:port.
If interoperability with UDP-based TR is not needed, UM should be configured to use ONLY the SRS for topic resolution using the option resolver_disable_udp_topic_resolution (context).
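Putting these options together, a minimal TCP-only TR configuration might look like this sketch (reusing the example SRS addresses above; this assumes all components in the TRD support TCP-based TR):

```
context resolver_service test1.informatica.com:12000,test2.informatica.com:12000
context resolver_disable_udp_topic_resolution 1
```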
The SRS service is a daemon process which must be run to provide TCP-based TR for a TRD.
See Man Pages for SRS for details on running the SRS service.
All the contexts in the TRD must be configured to connect to the SRS with the option resolver_service (context). After connecting, each context exchanges TR information with the SRS.
As applications create and delete sources, the SRS is informed, and the SRS informs all connected contexts. This includes proxy sources from a DRO. In addition, a periodic "keepalive" handshake is performed between the SRS and all connected contexts.
If a network failure causes the context's connection to the SRS to be broken, the context will periodically retry the connection. Since most network failures are brief, the context will soon successfully re-establish a connection to the SRS. Even though this is a resumption of the same context's earlier connection, the context and SRS still exchange full TR information to make sure that any changes during the disconnected period are reflected.
The SRS also supports the publishing of operational and status information via the Daemon Statistics feature. For full details, see SRS Daemon Statistics.
If an application exits abnormally, the SRS will detect that the TCP connection is broken. However, the SRS must not assume that the application has failed; it might be a network problem that forced the disconnection.
So the SRS flags all sources owned by that context as "potentially down" and starts a "source state lifetime" timer (see <source-state-lifetime>). If the context has not failed and reconnects within that period, the SRS unflags any "potentially down" sources during the initial exchange of TR information. However, in the case of application failure, when the state lifetime expires, all "potentially down" sources are deleted, and all connected contexts are informed of those deletions.
Note that as of UM version 6.13, the SRS also tracks application interest (topics for which the context has receivers). This interest is also remembered by the SRS if the connection is broken, and has its own "interest state lifetime" timer (see <interest-state-lifetime>). If the context has not failed and reconnects within that period, the SRS unflags any "potentially down" receiver interest during the initial exchange of TR information. However, in the case of application failure, when the state lifetime expires, all "potentially down" receiver interest is deleted.
To maintain compatibility with 6.12 configurations, the SRS Element "<state-lifetime>" is maintained, and is used as the default value for both source and interest state lifetimes.
Note that if an application fails and then restarts, its connection to the SRS is not considered to be a resumption of the previous connection. It is considered to be a new context, and any sources created are new sources. The previous application instance's sources will remain in the "potentially down" state, and will time out with the state lifetime.
If a network outage lasts longer than the configured state lifetime, the SRS gives up on the context and deletes sources and interest. These deletions are communicated to all connected contexts. When the network outage is repaired and the context reconnects, the exchange of TR information with the SRS will re-create the context's sources and interest in the SRS, and communicate them to other contexts. This restores normal operation.
As of UM version 6.14, the SRS also tracks routing information sent by the DRO. That information is handled in much the same way as client information.
The SRS generates log messages that are used to monitor its health and operation. You can configure these to be directed to "console" (standard output) or a specified log "file", via the <log> configuration element. Normally "console" is only used during testing; a persistent log file should be used for production. The SRS does not overwrite its log file on startup, but instead appends to it.
To prevent unbounded disk file growth, the SRS supports rolling log files. When the log file rolls, the file is renamed according to the model:

CONFIGUREDNAME_PID.DATE.SEQNUM

where CONFIGUREDNAME is the configured log file name, PID is the SRS process ID, DATE is the date of the roll, and SEQNUM is a sequence number.

For example: srs.log_9867.2017-08-20.2
The user can configure when the log file is eligible to roll over by either or both of two criteria: size and frequency. The size criterion is in millions of bytes. The frequency criterion can be daily or hourly. Once one or both criteria are met, the next message written to the log will trigger a roll operation. These criteria are supplied as attributes to the <log> configuration element.
If both criteria are supplied, then the first one to be reached will trigger a roll. For example, consider a setting with a size criterion of 23 (million bytes) and a frequency criterion of "daily".
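A <log> element expressing a 23-million-byte size criterion and a daily frequency criterion might look like this sketch (the attribute names and layout here are assumptions; see the <log> element reference for the exact schema):

```xml
<log type="file" size="23" frequency="daily">srs.log</log>
```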
Let's say that the log file grows at 1 million bytes per hour (VERY unlikely for an SRS, but let's assume for illustration purposes). At 11:00 pm, the log file will reach 23 million bytes, and will roll. Then, at 12:00 midnight, the log file will roll again, even though it is only 1 million bytes in size.
In addition, the SRS supports automatic deletion of log files based on either or both of two criteria: max history, and total size cap. The max history refers to the number of archived log files, and the total size cap refers to the sum of the sizes of the archived files in millions of bytes. When either or both criteria are met, one or more of the oldest log files are removed until the criteria no longer apply.
For more information, see the <log> configuration element.
See Monitoring for an overview of monitoring an Ultra Messaging network.
It is important to the health and stability of a UM network to monitor the operation of SRSes. This monitoring should include real-time automated detection of problems that will produce a timely alert to operations staff.
Two types of data should be monitored: the SRS log file, and SRS daemon statistics.
Ideally, log file monitoring would support the following:
Regarding log file scanning, messages in the SRS's log file contain a severity indicator in square brackets. For example:
Tue Nov 1 13:29:37 CDT 2022 [INFO]: SRS-10385-99: TCP Disconnect received for client[10.29.3.101:47205] sessionID[0x8295ab9a]
Informatica recommends alerting operations staff for messages of severity [WARNING], [ERROR], [CRITICAL], [ALERT], and [EMERGENCY].
It would also be useful to have a set of exceptions for specific messages you wish to ignore.
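As an illustrative sketch of such scanning (not part of UM; the regular expression matches the severity format shown above, and the ignore list is a hypothetical example):

```python
import re

# Severities that should alert operations staff, per the recommendation above.
ALERT_SEVERITIES = {"WARNING", "ERROR", "CRITICAL", "ALERT", "EMERGENCY"}

# Hypothetical exception list: substrings of messages to ignore.
IGNORE_PATTERNS = [
    "TCP Disconnect received",  # e.g. expected during planned restarts
]

# SRS log lines carry the severity in square brackets, e.g. "... [INFO]: ...".
SEVERITY_RE = re.compile(r"\[([A-Z]+)\]")

def should_alert(line):
    """Return True if this SRS log line warrants an operator alert."""
    m = SEVERITY_RE.search(line)
    if m is None or m.group(1) not in ALERT_SEVERITIES:
        return False
    # Suppress the alert if the message matches an ignore pattern.
    return not any(pat in line for pat in IGNORE_PATTERNS)
```

A monitoring script could apply should_alert() to each new line of the SRS log and page operations staff when it returns True.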
There are many third party real-time log file analysis tools available. A discussion of possible tools is beyond the scope of UM documentation.
There are two data formats for the SRS to send its daemon stats: protocol buffers ("protobufs"), and JSON (deprecated).
For information on the deprecated JSON formatted daemon stats, see SRS Daemon Statistics.
The protobufs format is accepted by the Monitoring Collector Service (MCS) and the "lbmmon" example applications: Example lbmmon.c and Example lbmmon.java.
For example, here is an excerpt from a sample SRS configuration file that shows how its daemon stats can be published:
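The referenced excerpt is not reproduced here; the following sketch is consistent with the notes below, but the element nesting, attribute syntax, and units are assumptions; consult the SRS configuration documentation for the exact schema:

```xml
<daemon-monitor>
  <!-- Select the protobuf daemon stats format (UM 6.15 and beyond). -->
  <monitor-format>pb</monitor-format>
  <!-- Statistics sampling period: 600 seconds (10 minutes) in this example. -->
  <ping-interval>
    <default>600</default>
  </ping-interval>
  <!-- LBM options for the monitoring context (administrative network). -->
  <option scope="context" name="default_interface" value="10.29.3.0/24"/>
  <option scope="context" name="resolver_unicast_daemon" value="10.29.3.101:12001"/>
  <option scope="source" name="transport" value="tcp"/>
</daemon-monitor>
```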
Notes:
The SRS Element "<monitor-format>" value "pb" selects the protobuf format and is available for the SRS in UM version 6.15 and beyond.
For a list of possible protobuf messages for the SRS, see the "srs_mon.proto" file at Example srs_mon.proto.
The SRS Element "<ping-interval>" and SRS Element "<default>" define the statistics sampling period. In the above example, 600 seconds (10 minutes) is chosen somewhat arbitrarily. Shorter times produce more data, but not much additional benefit. However, UM networks with many thousands of applications may need a longer interval (perhaps 30 or 60 minutes) to maintain a reasonable load on the network and monitoring data storage.
The LBM attributes included in the above example within the SRS Element "<daemon-monitor>" element set options for the monitoring data TRD. (Alternatively, you can configure the monitoring context using monitor_transport_opts (context).) When possible, Informatica recommends directing monitoring data to an administrative network, separate from the application data network. This prevents monitoring data from interfering with application data latency or throughput. In this example, the monitoring context is configured to use an interface matching 10.29.3.0/24. There are other ways to specify the interface; see Specifying Interfaces.
In this example, the monitoring data TRD uses Unicast UDP Topic Resolution. The lbmrd daemon is running on host 10.29.3.101, port 12001.
The monitoring data is sent out via UM using the TCP transport.
These settings were chosen to conform to the recommendations in Automatic Monitoring.
For a full demonstration of monitoring, see: https://github.com/UltraMessaging/mcs_demo