Concepts Guide
Topic Resolution Description

Topic Resolution ("TR") is a set of protocols and algorithms used internally by Ultra Messaging to establish and maintain shared state. Here are the basic operations of TR:

  • Receiver discovery of sources.
  • DRO route maintenance and distribution.
  • Persistent Store name resolution.
  • Redundancy.

UM performs TR automatically; there are no API functions specific to normal TR operation. However, you can influence topic resolution by configuration. Moreover, you can set configuration options differently for individual topics, either by using XML Configuration Files (the <topic> element), or by using the API functions for setting configuration options programmatically (e.g. lbm_rcv_topic_attr_setopt() and lbm_src_topic_attr_setopt()). See UDP Topic Resolution Configuration Options for details.

An important design point of Topic Resolution is that information related to sources is distributed to all contexts in a UM network. This is done so that when a receiver object is created within a context, it can discover sources for the topic and join those sources. In support of this discovery process, each context maintains a memory-based "resolver cache", which stores source information. The TR protocols and algorithms are largely in support of maintaining each context's resolver cache.

Topic Resolution also occurs across a UM Router, which means between Topic Resolution Domains (TRDs). A receiver in one TRD will discover a source in a different TRD, potentially across many UM Router hops. In this case, the UM Routers actively assist in TR. I.e. the sources and receivers in different TRDs do not exchange TR with each other directly, but rather with the assistance of the UM Router.

Note
With the UMQ product, Topic Resolution does not apply to brokered queuing sources, receivers, or the brokers themselves. However, ULB queuing does make use of topic resolution.

There are three different possible protocols used to provide Topic Resolution:

  • Multicast UDP (default),
  • Unicast UDP (with "lbmrd" service),
  • TCP (with "SRS" service).

Of those three, Multicast UDP and Unicast UDP are mutually exclusive. It is not possible to configure UM to use both within a single TRD. Multicast is generally preferred over Unicast, with Unicast being selected when there are policy or environment reasons to avoid Multicast (e.g. cloud computing).

TCP-based TR (with "SRS" service) is a more recent addition to UM. It is available as of UM version 6.12, in which it provides a subset of the total TR functionality. Specifically, it supports receivers discovering sources. However, TCP-based TR does not yet support DRO route maintenance and distribution, resolution of Persistent Store names, or redundancy. (These functions will be supported by TCP-based TR in future UM releases.)

For UM version 6.12, TCP-based TR is typically paired with one of the two UDP-based TR protocols. This is done to supply missing TR functionality, and to support interoperability with pre-6.12 versions of UM. The two protocols run in parallel, with the UDP-based TR protocol supplying the missing functionality and providing redundancy to the more-reliable TCP-based TR.

The advantage of TCP-based TR is greater reliability and reduced network and CPU load. UDP-based TR is susceptible to "deafness" issues due to transient network failures. Avoiding those deafness issues requires configuring UDP-based TR to use significant network and CPU resources. In contrast, TCP-based TR is designed to be reliable with much less network and CPU load, even in the face of transient network failures.


TR Protocol Comparison


Multicast UDP TR

Multicast UDP-based Topic Resolution is the default protocol.

Advantages:

  • Very fast source discovery for small deployments.
  • Simplicity – no independent service required.
  • Highly fault tolerant. No independent services are needed for TR delivery. The internal network infrastructure provides redundancy.

Disadvantages:

  • As the number of topics grows, the speed of source discovery degrades and resource consumption increases (network bandwidth and CPU load). This resource consumption can introduce significant latency outliers.
  • Since UDP is not a reliable protocol, Multicast UDP TR relies on repetition to ensure delivery of TR information.
  • To effectively avoid deafness issues, resources must be consumed over the long term (TR must be configured to run "forever"). Latency outliers can be a long-term problem.
  • As deployments change and grow, TR performance should be monitored and analyzed for possible reconfiguration to strike the right balance between speed of source discovery vs. resource consumption.
  • By default, when sources are deleted, receivers are not informed unless all sources on a given transport session are deleted. Even if "final advertisements" are enabled, their delivery is best effort and not guaranteed.


Unicast UDP TR

Unicast UDP-based Topic Resolution is functionally identical to Multicast UDP. It is used as a replacement for Multicast UDP in environments where the use of multicast is not possible (e.g. the cloud) or is against policy. The "lbmrd" service simulates multicast by simply forwarding all TR traffic to all contexts registered in a TRD. Note that the "lbmrd" service does not maintain state about the sources and receivers. It simply fans out Unicast TR.

Advantages:

  • Does not use multicast.
  • Supports redundant "lbmrd" services, which provides fault tolerance and load balancing.

Disadvantages:

  • All the same disadvantages of Multicast UDP.
  • Requires one or more independent "lbmrd" services, which should be monitored for failure and restarted.
  • Due to fan-out, puts a greater load on network hardware.
  • By default, when sources are deleted, receivers are not informed unless all sources on a given transport session are deleted. Even if "final advertisements" are enabled, their delivery is best effort and not guaranteed.


TCP TR

TCP-based Topic Resolution is a newer implementation of a service-based distribution of source and receiver information. It is available as of UM version 6.12, in which it provides a subset of the total TR functionality. In a future UM version, TCP-based TR will provide all TR functionality, at which point it can be used to the exclusion of UDP-based TR. Until that time, TCP-based TR is typically paired with UDP-based TR (either Multicast or Unicast).

Advantages:

  • Can allow UDP-based TR to be "dialed-back". I.e. its configuration can be adjusted to consume fewer CPU and network resources. See TCP-Based TR Version Interoperability.
  • Since TCP is a reliable protocol, TCP-based TR does not need to repeatedly send the same information to ensure its reception.
  • It is not necessary to consume resources over the long term to avoid deafness issues.
  • If a source is deleted, that deletion is reliably communicated to all contexts in the TRD.

Disadvantages:

  • For TRDs containing UM versions both before and after UM 6.12, TCP-based TR must be combined with UDP-based TR to support inter-version interoperability.
  • For UM version 6.12, TCP-based TR does not fulfill the TR functions of DRO route maintenance, Persistent Store name resolution, and redundancy. For users who require one or more of those functions, TCP-based TR must be combined with UDP-based TR to supply the missing functionality.

Most users who combine UDP and TCP TR should be able to gradually reduce the CPU and Network load from UDP-based TR as the applications are upgraded to UM 6.12 and beyond.


UDP-Based Topic Resolution Details

The following diagram illustrates UDP-based Topic Resolution. The diagram references multicast configuration options, but the concepts apply equally to unicast.

TopicResolution.png

By default, Ultra Messaging relies on UDP-based Topic Resolution. UDP-based TR uses queries (TQRs) and advertisements (TIRs) to resolve topics. These TQRs and TIRs are sent in UDP datagrams, typically with more than one TIR or TQR in a given datagram.

UDP-based topic resolution traffic can benefit from hardware acceleration. See Transport Acceleration Options for more information.

For Multicast UDP, TR datagrams are sent to an IP multicast group and UDP port configured with the Ultra Messaging configuration options resolver_multicast_address (context) and resolver_multicast_port (context).

For Unicast UDP, TR datagrams are sent to the IP address and port of the "lbmrd" daemon. See the UM configuration option resolver_unicast_daemon (context).

Note that if both Multicast and Unicast are configured, the Unicast has higher precedence, and Multicast will not be used.

UDP-based Topic Resolution occurs in the following phases:

  • Initial Phase - Period that allows you to resolve a topic aggressively. This phase can be configured to run differently from the defaults or completely disabled.
  • Sustaining Phase - Period that allows new receivers to resolve a topic after the Initial Phase. Can also be the primary period of topic resolution if you disable the Initial Phase. This phase can also be configured to run differently from the defaults or completely disabled.
  • Quiescent Phase - The quiet phase where Topic Resolution datagrams are no longer sent in an unsolicited way. This reduces the CPU and network resources consumed by TR, and also reduces latency outliers. However, in large deployments, especially those that include wide-area networks, the Quiescent Phase is sometimes disabled, by configuring the Sustaining Phase to continue forever. This is done to avoid deafness issues.

The phases of topic resolution are specific to individual topics. A single context can have some topics in each of the three phases running concurrently.
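As a rough illustration of per-topic phases, the sketch below classifies a few topics by the time since their source was created. It assumes the default durations given in this guide (a 5000 ms Initial Phase followed by a 1 minute Sustaining Phase) and ignores rate limiting and query-triggered restarts. The function name and topic names are hypothetical and are not part of the UM API.

```python
def phase_at(elapsed_ms, initial_ms=5000, sustain_ms=60000):
    """Classify a topic's TR phase from the time since its source was created."""
    if elapsed_ms < initial_ms:
        return "initial"
    if elapsed_ms < initial_ms + sustain_ms:
        return "sustaining"
    return "quiescent"

# Hypothetical topics in one context, keyed by source age in milliseconds.
topic_ages_ms = {"orders.new": 1200, "quotes.IBM": 30000, "news.all": 120000}

# One context can hold topics in all three phases at the same moment.
phases = {topic: phase_at(age) for topic, age in topic_ages_ms.items()}
print(phases)
```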


Sources Advertise

For UDP-based TR, Sources use Topic Resolution in the following ways:

  • Unsolicited advertisement of active sources. When a source is first created, it enters the Initial Phase of TR. During the Initial, and subsequent Sustaining phases, the source sends Topic Information Record datagrams (TIRs) to all the other contexts in the TRD. The source does this in an unsolicited manner; it advertises even if there are no receivers for its topic.

  • Respond to Topic Queries. When a receiver is first created, it enters the Initial phase of TR. During the Initial, and subsequent Sustaining phases, the receiver sends Topic Query Records (TQRs) to all other contexts in the TRD. When a source receives a TQR for its topic, it will restart its Sustaining Phase of advertising to ensure that the receiver discovers the source.

A TIR contains all the information that the receiver needs to join the topic's Transport Session. The TIR datagram sent unsolicited is identical to the TIR sent in response to a TQR. Depending on the transport type, a TIR will contain one of the following groups of information:

  • For a TCP transport, the source address, TCP port and Session ID.
  • For an LBT-RM transport, the source address, the multicast group address, the UDP destination port, LBT-RM Session ID, and the unicast UDP port to which NAKs are sent.
  • For an LBT-RU transport, the source address, UDP port and Session ID.
  • For an LBT-IPC transport, the Host ID, LBT-IPC Session ID and Transport ID.
  • For an LBT-SMX transport, the Host ID, LBT-SMX Session ID and Transport ID.

See UDP-Based Resolver Operation Options for more information.


Receivers Query

For UDP-based TR, when an application creates a receiver within a context, the new receiver first checks the context's resolver cache for any matching sources that the context has already discovered. Those will be joined immediately.

In addition, the receiver normally initiates a process of sending Topic Query Records (TQRs). This triggers sources for the receiver's topic to advertise, if they are not already. This allows sources which are in their Quiescent Phase to be discovered by new receivers.

A TQR consists primarily of the topic string.


Wildcard Receiver Topic Resolution

For UDP-based TR, UM Wildcard Receivers use Topic Resolution in conceptually the same ways as a single-topic receiver, although some of the details are different. Instead of searching the resolver cache for a specific topic, a new wildcard receiver object searches for all sources that match the wildcard pattern.

Also, the TQRs contain the wildcard pattern, and all sources matching the pattern will advertise.

Finally, wildcard receivers omit the Sustaining Phase for sending Queries. They only support Initial and Quiescent Phases.
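The difference between a single-topic lookup and a wildcard lookup in the resolver cache can be sketched as follows. This is only a conceptual model: the cache layout, the source descriptor strings, and the function names are hypothetical, and Python's re module stands in for whatever pattern matcher the wildcard receiver is configured to use.

```python
import re

# Hypothetical resolver cache: topic string -> list of source descriptors.
resolver_cache = {
    "quotes.NYSE.IBM": ["tcp:10.1.1.5:14371"],
    "quotes.NYSE.GE": ["tcp:10.1.1.6:14371"],
    "orders.new": ["tcp:10.1.1.7:14372"],
}

def lookup_topic(cache, topic):
    """Single-topic receiver: exact match on the topic string."""
    return cache.get(topic, [])

def lookup_wildcard(cache, pattern):
    """Wildcard receiver: every cached topic that matches the pattern."""
    rx = re.compile(pattern)
    return {topic: srcs for topic, srcs in cache.items() if rx.search(topic)}

matches = lookup_wildcard(resolver_cache, r"^quotes\.NYSE\.")
print(sorted(matches))
```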

See Wildcard Receiver Options for more information.


Initial Phase

For UDP-based TR, the initial topic resolution phase for a topic is an aggressive phase that can be used to resolve all topics before sending any messages. During the initial phase, network traffic and CPU utilization might be higher than in the later phases. You can completely disable this phase, if desired. See Disabling Aspects of Topic Resolution for more information.

Advertising in the Initial Phase

For the initial phase default settings, the resolver issues the first advertisement as soon as the scheduler can process it. The resolver issues the second advertisement 10 ms later, as set by resolver_advertisement_minimum_initial_interval (source). For each subsequent advertisement, UM doubles the interval between advertisements: the source waits 20 ms, 40 ms, 80 ms, 160 ms, and 320 ms, and finally 500 ms, the limit set by resolver_advertisement_maximum_initial_interval (source). These 8 advertisements require a total of 1130 ms. The interval between advertisements then remains at the maximum 500 ms, resulting in 7 more advertisements before the total duration of the initial phase reaches 5000 ms, as set by resolver_advertisement_minimum_initial_duration (source). This concludes the initial advertisement phase for the topic.

Resolver_Initial_Phase_TIR.png
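The default advertisement timing described above can be reproduced with a few lines of arithmetic. The sketch below is a simplified model, not UM code: the function name is hypothetical, rate limiting is ignored, and it assumes that an advertisement is sent only if it falls within the minimum initial duration.

```python
def initial_phase_schedule(min_interval_ms=10, max_interval_ms=500,
                           min_duration_ms=5000):
    """Return advertisement send times in ms since the source was created."""
    times = [0]                      # first advertisement is sent immediately
    interval = min_interval_ms
    while times[-1] + interval <= min_duration_ms:
        times.append(times[-1] + interval)
        # Double the interval until it reaches the configured maximum.
        interval = min(interval * 2, max_interval_ms)
    return times

times = initial_phase_schedule()
ramp_up = [t for t in times if t <= 1130]   # advertisements while the interval grows
print(times)
print(len(ramp_up), len(times) - len(ramp_up))
```

With the defaults, the ramp-up portion contains 8 advertisements ending at 1130 ms, followed by 7 more at 500 ms spacing.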

The initial phase for a topic can take longer than the resolver_advertisement_minimum_initial_duration (source) if many topics are in resolution at the same time. The configuration options, resolver_initial_advertisements_per_second (context) and resolver_initial_advertisement_bps (context) enforce a rate limit on topic advertisements for the entire UM context. A large number of topics in resolution - in any phase - or long topic names may exceed these limits.

If a source advertising in the initial phase receives a topic query, it responds with a topic advertisement. UM recalculates the next advertisement interval from that point forward as if the advertisement was sent at the nearest interval.

Querying in the Initial Phase

Querying activity by receivers in the initial phase operates in similar fashion to advertising activity, although with different interval defaults. The resolver_query_minimum_initial_interval (receiver) default is 20 ms. Subsequent intervals double in length until the interval reaches 200 ms, or the resolver_query_maximum_initial_interval (receiver). The query interval remains at 200 ms until the initial querying phase reaches 5000 ms, or the resolver_query_minimum_initial_duration (receiver).

Resolver_Initial_Phase_TQR.png

The initial query phase completes when it reaches the resolver_query_minimum_initial_duration (receiver). The initial query phase also has UM context-wide rate limit controls (resolver_initial_queries_per_second (context) and resolver_initial_query_bps (context)) that can result in the extension of a phase's duration in the case of a large number of topics or long topic names.


Sustaining Phase

For UDP-based TR, the sustaining topic resolution phase follows the initial phase and can be a less active phase in which a new receiver resolves its topic. It can also act as the sole topic resolution phase if you disable the initial phase. The sustaining phase defaults use fewer network resources than the initial phase and can also be modified or disabled completely. See Disabling Aspects of Topic Resolution in the UM Configuration Guide.

Advertising in the Sustaining Phase

For the sustaining phase defaults, a source sends an advertisement every second (resolver_advertisement_sustain_interval (source)) for 1 minute (resolver_advertisement_minimum_sustain_duration (source)). When this duration expires, the sustaining phase of advertisement for a topic ends. If a source receives a topic query, the sustaining phase resumes for the topic and the source completes another duration of advertisements.

Resolver_Sustain_Phase_TIR.png
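The way a received query extends or resumes the sustaining advertisement phase can be modeled with a small simulation. This is a sketch under simplifying assumptions: time advances in whole seconds, the phase starts at t=0, rate limits are ignored, and the function name is hypothetical.

```python
def sustaining_advert_seconds(query_times, interval=1, duration=60, horizon=300):
    """Return the seconds at which a source advertises in the sustaining phase.

    Each received query restarts the sustaining duration from that moment.
    """
    queries = set(query_times)
    phase_end = duration   # the sustaining phase initially covers [0, duration)
    next_ad = 0
    sent = []
    for t in range(horizon):
        if t in queries:
            if t >= phase_end:
                next_ad = t            # resume advertising after a quiet period
            phase_end = t + duration   # run another full duration
        if t < phase_end and t >= next_ad:
            sent.append(t)
            next_ad = t + interval
    return sent

no_queries = sustaining_advert_seconds([])
late_query = sustaining_advert_seconds([100])   # query arrives after the phase ended
early_query = sustaining_advert_seconds([30])   # query arrives mid-phase
print(len(no_queries), len(late_query), len(early_query))
```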

The sustaining advertisement phase has UM context-wide rate limit controls (resolver_sustain_advertisements_per_second (context) and resolver_sustain_advertisement_bps (context)) that can result in the extension of a phase's duration in the case of a large number of topics or long topic names.

Querying in the Sustaining Phase

Default sustaining phase querying operates the same as advertising. Unresolved receivers query every second (resolver_query_sustain_interval (receiver)) for 1 minute (resolver_query_minimum_sustain_duration (receiver)). When this duration expires, the sustaining phase of querying for a topic ends.

Resolver_Sustain_Phase_TQR.png

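Sustaining phase queries stop when one of the following events occurs:

  • The receiver discovers a number of sources equal to resolution_number_of_sources_query_threshold (receiver).
  • The sustaining query phase reaches the resolver_query_minimum_sustain_duration (receiver).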

The sustaining query phase also has UM context-wide rate limit controls (resolver_sustain_queries_per_second (context) and resolver_sustain_query_bps (context)) that can result in the extension of a phase's duration in the case of a large number of topics or long topic names.


Quiescent Phase

For UDP-based TR, this phase is the absence of topic resolution activity for a given topic. It is possible that some topics may be in the quiescent phase at the same time other topics are in initial or sustaining phases of topic resolution.

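This phase ends if either of the following occurs:

  • A new receiver sends a query.
  • Your application calls lbm_context_topic_resolution_request(), which provokes the sending of topic queries for any receiver or wildcard receiver in this state.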


Store (context) Name Resolution

For UDP-based TR, with the UMP/UMQ products, topic resolution facilitates the resolution of Persistent Store names to a DomainID:IPAddress:Port.

Topic Resolution resolves store (or context) names by sending context name queries and context name advertisements over the topic resolution channel. A store name resolves to the store's DomainID:IPAddress:Port. You configure the store's name and IPAddress:Port in the store's XML configuration file. See Identifying Persistent Stores for more information.

If you do not use the UM Router, the DomainID is zero. Otherwise, the DomainID represents the Topic Resolution Domain where the store resides. Stores learn their DomainID by listening to Topic Resolution traffic.

Via the Topic Resolution channel, sources query for store names and stores respond with an advertisement when they see a query for their own store name. The advertisement contains the store's DomainID:IPAddress:Port.

For a new source configured to use store names (ume_store_name (source)), the resolver issues the first context name query as soon as the scheduler can process it. The resolver issues the second query 100 ms later, as set by resolver_context_name_query_minimum_interval (context). For each subsequent query, UM doubles the interval between queries: the source waits 200 ms, 400 ms, and 800 ms, and finally 1000 ms, the limit set by resolver_context_name_query_maximum_interval (context). The interval between queries remains at the maximum 1000 ms until the total time querying for a store (context) name equals resolver_context_name_query_duration (context). The default for this duration is 0 (zero), which means the resolver continues to send queries until the name resolves. After a store name resolves, the resolver stops sending queries.
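Because the default query duration is unlimited, the context name query intervals are naturally modeled as an endless sequence. The following sketch assumes the default minimum and maximum intervals stated above; the generator name is hypothetical and the model ignores the point at which the name resolves.

```python
from itertools import islice

def context_name_query_intervals(min_ms=100, max_ms=1000):
    """Yield the wait between successive context name queries, forever."""
    interval = min_ms
    while True:
        yield interval
        interval = min(interval * 2, max_ms)   # double, capped at the maximum

first_six = list(islice(context_name_query_intervals(), 6))
print(first_six)
```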

If a source sees advertisements from multiple stores with the same name, or a store sees an advertisement that matches its own store name, the source issues a warning log message. The source also issues an informational log message whenever it detects that a resolved store (context) name changes to a different DomainID:IPAddress:Port.


UDP Topic Resolution Configuration Options

See the following sections in UM Configuration Guide for more information:

Assigning Different Configuration Options to Individual Topics

You can set configuration options differently for individual topics, either by using XML Configuration Files (the <topic> element), or by using the API functions for setting configuration options programmatically (e.g. lbm_rcv_topic_attr_setopt() and lbm_src_topic_attr_setopt()).


Unicast Topic Resolution

By default UM expects multicast connectivity between all sources and receivers. When only unicast connectivity is available, you may configure all sources and receivers to use unicast topic resolution. This requires that you run one or more instances of the UM unicast topic resolution daemon (lbmrd), which perform the same topic resolution activities as multicast topic resolution. You configure your applications to use the lbmrd daemons with resolver_unicast_daemon (context).

See Lbmrd Man Page for details on running the lbmrd daemon.

The lbmrd can run on any machine, including the source or receiver. Of course, sources will also have to select a transport protocol that uses unicast addressing (e.g. TCP, TCP-LB, or LBT-RU). The lbmrd maintains a table of clients (address and port pairs) from which it has received a topic resolution message, which can be any of the following:

  • Topic Information Records (TIR) - also known as topic advertisements
  • Topic Query Records (TQR)
  • keepalive messages, which are only used in unicast topic resolution

After lbmrd receives a TQR or TIR, it forwards it to all known clients. If a client (i.e. source or receiver) is not sending either TIRs or TQRs, it sends a keepalive message to lbmrd according to the resolver_unicast_keepalive_interval (context). This registration with the lbmrd allows the client to receive advertisements or queries from lbmrd. lbmrd maintains no state about topics, only about clients.
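The client-table behavior described above can be pictured with a toy model. This is not the lbmrd implementation: the class and method names are hypothetical, messages are plain tuples, and the model forwards to every known client (including the sender), since the text does not say whether the sender is excluded.

```python
class LbmrdModel:
    """Toy model of lbmrd: it tracks clients only, never topics."""

    def __init__(self):
        self.clients = set()   # (address, port) pairs seen so far

    def receive(self, sender, kind, payload=None):
        """Register the sender, then fan out TIRs and TQRs to every known client."""
        self.clients.add(sender)
        if kind == "keepalive":
            return []          # a keepalive only registers the client
        return [(client, kind, payload) for client in sorted(self.clients)]

daemon = LbmrdModel()
receiver = ("10.0.0.2", 40001)   # hypothetical client addresses
source = ("10.0.0.3", 40002)

daemon.receive(receiver, "keepalive")                 # idle receiver registers
forwarded = daemon.receive(source, "TIR", "quotes.IBM")
print(forwarded)
```

Note that the model stores nothing about "quotes.IBM" after forwarding it, which mirrors the statement that lbmrd keeps state about clients only.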

LBMRD with the UM Router Best Practice

If you're using the lbmrd for topic resolution across a UM Router, you may want all of your domains discovered and all routes to be known before creating any topics. If so, change the UM configuration option, resolver_unicast_force_alive (context), from the default setting to 1 so your contexts start sending keepalives to lbmrd immediately. This makes your startup process cleaner by allowing your contexts to discover the other Topic Resolution Domains and establish the best routes. The trade off is a little more network traffic every 5 seconds.

Unicast Topic Resolution Resilience

Running multiple instances of lbmrd allows your applications to continue operation in the face of an lbmrd failure. Your applications' sources and receivers send topic resolution messages as usual. However, rather than sending every message to each lbmrd instance, UM directs messages to lbmrd instances in a round-robin fashion. Since the lbmrd does not maintain any resolver state, as long as one lbmrd instance is running, UM continues to forward LBMR packets to all connected clients. UM switches to the next active lbmrd instance every 250-750 ms.
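A simplified picture of the round-robin behavior is sketched below. It assumes that the dwell time on each instance is drawn uniformly from the 250-750 ms range and that inactive instances are simply skipped; both are modeling assumptions, and the function name and daemon addresses are hypothetical.

```python
import random
from itertools import cycle

def daemon_timeline(daemons, active, total_ms, rng):
    """Return (start_ms, daemon) pairs showing which lbmrd receives TR traffic."""
    timeline = []
    if not any(d in active for d in daemons):
        return timeline                 # no instance is reachable
    ring = cycle(daemons)
    t = 0
    while t < total_ms:
        daemon = next(ring)
        if daemon not in active:
            continue                    # skip instances that are down
        timeline.append((t, daemon))
        t += rng.randint(250, 750)      # dwell before moving to the next instance
    return timeline

daemons = ["10.0.0.10:15380", "10.0.0.11:15380", "10.0.0.12:15380"]
active = {"10.0.0.10:15380", "10.0.0.12:15380"}   # the middle instance has failed
timeline = daemon_timeline(daemons, active, 5000, random.Random(1))
print(timeline[:4])
```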


Network Address Translation (NAT)

For UDP-based TR, if your network architecture includes LANs that are bridged with Network Address Translation (NAT), UM receivers will not be able to connect directly to UM sources across the NAT. Sources send Topic Resolution advertisements containing their local IP addresses and ports, but receivers on the other side of the NAT cannot access those sources using those local addresses/ports. They must use alternate addresses/ports, which the NAT forwards according to the NAT's configuration.

The recommended method of establishing UM connectivity across a NAT is to run a pair of UM Routers connected with a single TCP peer link. In this usage, the LANs on each side of the NAT are distinct Topic Resolution Domains.

Alternatively, if the NAT can be configured to allow two-way UDP traffic between the networks, the lbmrd can be configured to modify Topic Resolution advertisements according to a set of rules defined in an XML configuration file. Those rules allow a source's advertisements forwarded to local receivers to be sent as-is, while advertisements forwarded to remote receivers are modified with the IP addresses and ports that the NAT expects. In this usage, the LANs on each side of the NAT are combined into a single Topic Resolution domain.

Warning
Using an lbmrd NAT configuration severely limits the UM features that can be used across the NAT. Normal source-to-receiver traffic is supported, but a number of more-advanced UM features are not. Late Join, sending to sources, and OTR can be made to work if applications are configured to use the default value (0.0.0.0) for request_tcp_interface (context). This means that you cannot use default_interface (context). Be aware that the UM Router requires that a valid interface be specified for request_tcp_interface (context). Thus, lbmrd NAT support for Late Join, Request/Response, and OTR is not compatible with UM topologies that contain the UM Router.


Example NAT Configuration

In this example, there are two networks, A and B, that are interconnected via a NAT firewall. Network A has IP addresses in the 10.1.0.0/16 range, and B has IP addresses in the 192.168.1/24 range. The NAT is configured such that hosts in network B have no visibility into network A, and can send TCP and UDP packets to only a single host in A (10.1.1.50) via the NAT's external IP address 192.168.1.1, ports 12000 and 12001. I.e. packets sent from B to 192.168.1.1:12000 are forwarded to 10.1.1.50:12000, and packets from B to 192.168.1.1:12001 are forwarded to 10.1.1.50:12001. Hosts in network A have full visibility of network B and can send TCP and UDP packets to hosts in B by their local 192 addresses and ports. Those packets have their source addresses changed to 192.168.1.1.

Since hosts in network A have full visibility into network B, receivers in network A should be able to use source advertisements from network B without any changes. However, receivers in network B will not be able to use source advertisements from network A unless those advertisements' IP addresses are transformed.

The lbmrd is configured for NAT using its XML configuration file:

<?xml version="1.0" encoding="UTF-8" ?>
<lbmrd version="1.0">
  <daemon>
    <interface>10.1.1.50</interface>
    <port>12000</port>
  </daemon>
  <domains>
    <domain name="Net-NYC">
      <network>10.1.0.0/16</network>
    </domain>
    <domain name="Net-NJC">
      <network>192.168.1/24</network>
    </domain>
  </domains>
  <transformations>
    <transform source="Net-NYC" destination="Net-NJC">
      <rule>
        <match address="10.1.1.50" port="*"/>
        <replace address="192.168.1.1" port="*"/>
      </rule>
    </transform>
  </transformations>
</lbmrd>

The lbmrd must be run on 10.1.1.50.

The application on 10.1.1.50 should be configured with:

context resolver_unicast_daemon 10.1.1.50:12000
source transport_tcp_port 12001

The applications in the 192 network should be configured with:

context resolver_unicast_daemon 192.168.1.1:12000
source transport_tcp_port 12100

With this, the application on 10.1.1.50 is able to create sources and receivers that communicate with applications in the 192 network.

See lbmrd Configuration File for full details of the XML configuration file.
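To make the effect of the transformation rule concrete, the sketch below models how an advertised address might be rewritten depending on which domain the receiving client is in. It is an interpretation of the example configuration, not lbmrd code: the function and table names are hypothetical, the second network is written as 192.168.1.0/24 because Python's ipaddress module requires the full form, and port="*" is assumed to mean "any port, left unchanged".

```python
import ipaddress

# Domains from the example configuration.
DOMAINS = {
    "Net-NYC": ipaddress.ip_network("10.1.0.0/16"),
    "Net-NJC": ipaddress.ip_network("192.168.1.0/24"),
}

# (source domain, destination domain) -> list of (match address, replace address).
RULES = {("Net-NYC", "Net-NJC"): [("10.1.1.50", "192.168.1.1")]}

def domain_of(address):
    """Return the name of the domain containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in DOMAINS.items():
        if ip in network:
            return name
    return None

def advertised_address(source_addr, port, receiver_addr):
    """Return the (address, port) a given receiver would see in the advertisement."""
    key = (domain_of(source_addr), domain_of(receiver_addr))
    for match, replace in RULES.get(key, []):
        if source_addr == match:
            return replace, port   # assumed: wildcard port is passed through
    return source_addr, port       # no rule applies: forwarded as-is

print(advertised_address("10.1.1.50", 12001, "192.168.1.20"))  # remote receiver
print(advertised_address("10.1.1.50", 12001, "10.1.2.3"))      # local receiver
```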


UDP-Based Topic Resolution Strategies

Configuring UDP-based TR frequently involves a process of weighing the costs and benefits of different goals. The most common goals involved are:

  • Avoid "deafness". Deafness is when there is a source and a receiver for a topic, but the receiver does not discover the source. This is usually a very high priority goal.
  • Minimize the delay before a transport session is joined. This is especially important when a new source is created and the application wants to wait until all existing receivers have fully joined the transport session before sending messages.
  • Minimize impact on the system. Sending and receiving TR datagrams consumes CPU and network bandwidth, and can introduce latency outliers on active data transports.
  • Maximize scalability and flexibility. Some deployments are tightly-coupled, carefully controlled, and well-defined. In those cases, scalability and flexibility might not be high-priority goals. Other deployments are loosely-coupled, and consist of many different application groups that do not necessarily coordinate their use of UM with each other. In those cases, scalability and flexibility can be important.
  • Fault tolerance. Some environments, especially those that include Wide Area Networks, can have periodic degradation or loss of network connectivity. It is desired that after a given network problem is resolved, UM will quickly and automatically reestablish normal operation without deafness.

The right TR strategy for a given deployment can depend heavily on the relative importance of these and other goals. It is impossible to give a "one size fits all" solution. Most users work with Informatica engineers to design a custom configuration.

Most users employ a variation on a few basic strategies. Note that, for the most part, these strategies do not depend on the specific UDP protocol (Multicast vs. Unicast). Normally Multicast is chosen, except where network or policy restrictions forbid it.


Default TR

The main characteristics of UM's default TR settings are:

  • Multicast UDP.
  • Three phases enabled (Initial, Sustaining, Quiescent). Unsolicited TIRs and TQRs nominally last for 65 seconds, although that number can grow as the number of sources or receivers in a context increases.

The default settings can be fine for reasonably small, static deployments, typically not including Wide Area Networks. (A "static" deployment is one where sources and receivers are, for the most part, created during system startup and deleted during system shutdown. Contrast with a "dynamic" system where applications come and go during normal operation, with sources and receivers being created and deleted at unpredictable times.)

Advantages:

  • Simplicity.
  • In a network where sources and receivers are relatively static, the consumption of resources by TR stops reasonably quickly.

Disadvantages:

  • As the numbers of contexts, sources, and receivers grow, the traffic load during the initial phase can be very intense, leading to packet loss and potential deafness issues. In these cases, the initial phase can be configured to be less aggressive, or disabled altogether.
  • If a network outage lasts longer than 65 seconds, it is possible for new sources and receivers to be deaf to each other, due to entering their quiescent phases. In these cases, the sustaining phase can be configured for longer durations.


Query-Centric TR

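The main characteristics of Query-centric TR are:

  • Unsolicited advertisements (TIRs) are disabled, so sources advertise only in response to queries.
  • Receiver queries (TQRs) are extended, often indefinitely.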

Query-centric TR can be useful for large-scale, dynamic systems, especially those that may have many sources for which there are no receivers during normal operation. For example, in some market data distribution architectures, many tens of thousands of sources are created, but a fairly small percentage of them have receivers at any given time. In that case, it is unnecessary to advertise sources on topics that have no receivers.

Note that this strategy does not prevent advertisements. Each TQR will trigger one or more sources to send a TIR in response.

Advantages:

  • For some deployments, can result in significantly reduced TR loading due to removal of TIRs for topics with no receivers.

Disadvantages:

  • To avoid deafness issues, the Query sustaining phase is usually extended, often to infinity. This consumes CPU and Network bandwidth, and can introduce latency outliers.
  • For topics that have receivers, both TQR and TIR traffic are present. (In contrast, an Advertise-Centric TR strategy removes the TQRs, but at the expense of advertising all sources, even those that have no receivers.)


Known Query Threshold TR

In a special case of Query-centric TR, certain classes of topics have a specific number of sources. For example, in point-to-point use cases, a particular topic has exactly one source. As another example, some market data distribution architectures have two sources for each topic, a primary and a warm standby.

For those topics where it is known how many sources there should be, the configuration option resolution_number_of_sources_query_threshold (receiver) can be combined with Query-centric TR to great benefit.

For example, consider a market data system with a primary and warm standby source for each topic. Unsolicited advertisements are disabled (see Disabling Aspects of Topic Resolution), and resolution_number_of_sources_query_threshold (receiver) is set to 2. The receiver will query until it has discovered two sources, at which point it will stop sending queries. If a source fails, the receiver resumes sending queries until it again has two sources.
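The query decision described above can be sketched as a toy model. This is only an illustration of the rule "query while fewer sources are known than the threshold"; the class and method names are hypothetical and are not part of the UM API, and the real receiver also applies rate limits and phases that are ignored here.

```python
from dataclasses import dataclass, field


@dataclass
class ThresholdReceiver:
    """Toy model of one receiver; not the UM implementation."""
    # Stands in for resolution_number_of_sources_query_threshold (receiver).
    query_threshold: int
    sources: set = field(default_factory=set)

    def should_query(self) -> bool:
        # Keep querying while fewer sources are known than the threshold.
        return len(self.sources) < self.query_threshold

    def on_advertisement(self, source_id: str) -> None:
        # A TIR arrived in response to a query: remember the source.
        self.sources.add(source_id)

    def on_source_lost(self, source_id: str) -> None:
        # A known source failed: forget it, which re-enables querying.
        self.sources.discard(source_id)


rcv = ThresholdReceiver(query_threshold=2)
states = [rcv.should_query()]        # no sources yet
rcv.on_advertisement("primary")
states.append(rcv.should_query())    # one of two sources known
rcv.on_advertisement("standby")
states.append(rcv.should_query())    # threshold reached
rcv.on_source_lost("primary")
states.append(rcv.should_query())    # back below the threshold
print(states)                        # [True, True, False, True]
```

The receiver stops querying only while it holds the expected number of sources, which is why the sustaining phase does not need to be extended indefinitely in this model.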

The advantage here is that it is no longer necessary to extend the Sustaining phase forever to avoid deafness.

NOTE: wildcard receivers do not fit well with this model of TR. Wildcard receivers have their own query mechanism; see Wildcard Receiver Topic Resolution. In particular, there is no wildcard equivalent to the number of sources query threshold. In a query-centric model, wildcard queries must be extended to avoid potential deafness issues. However, in most deployments, the number of wildcard receiver objects is small compared to the number of regular single-topic receivers, so using the Known Query Threshold TR model can still be beneficial.


Advertise-Centric TR

The main characteristics of Advertise-centric TR are:

Advertise-centric TR can be useful for large-scale, dynamic systems, especially those that may have very few sources for which there are no receivers. For example, most order management and routing systems use messaging in a point-to-point fashion, and every source should have a receiver. In that case, it is unnecessary to extend queries.

Advantages:

  • For some deployments, can result in moderately reduced TR loading due to the reduction of TQRs.

Disadvantages:

  • To avoid deafness issues, the Advertising sustaining phase is usually extended, often to infinity. This consumes CPU and Network bandwidth, and can introduce latency outliers.
  • For topics that have no receivers, TIR traffic is present. (In contrast, a Query-Centric TR strategy removes the TIRs for topics that have no receivers, but at the expense of introducing both TQRs and TIRs.)
  • In a deployment that includes the UM Router, some number of TQRs are necessary to inform the Router that the context is interested in the topic. To avoid deafness issues, it is recommended to extend the Querying Sustaining Phase, although at a reduced rate.


TCP-Based Topic Resolution Details

TCP-based TR was introduced in UM version 6.12 to address shortcomings in UDP-based TR:

  • Limit on scaling. It is difficult to configure UDP-based TR to scale to many hundreds of thousands of topics. Too many topics typically results in unacceptable CPU and network load, and latency outliers. Intense TR bursts can cause packet loss, retransmissions, and deafness.
  • Deafness issues. As deployments grow in size and complexity, UDP-based TR typically requires greatly extended Sustaining Phases, often to infinity. This results in significant CPU and network resources over the long term, and introduces latency outliers.
  • High time to resolve. To reduce the CPU and network load, and to avoid packet loss, UDP-based TR is usually strongly rate limited. This can greatly extend the time required to resolve topics, sometimes into the tens of minutes.

TCP-based TR differs from UDP-based TR in two important ways:

  • With TCP-based TR, the TCP protocol ensures reliable transmission of information. TCP also makes use of congestion control algorithms to avoid packet loss.
  • With TCP-based TR, topic information is maintained in the Stateful Resolution Service (SRS).

The basic approach used by TCP-based TR is as follows: Each context in a TRD is configured with the address of an SRS service. When the context is created, it connects to the SRS service. When the connection is successful, the context and SRS exchange TR information. They normally do this without involving the other contexts in the TRD.

Then, as an application creates or deletes sources, its context informs the SRS of the change, which in turn informs the other contexts in the TRD.

There are periodic handshakes between each context and the SRS to ensure that connectivity is maintained and that state is valid. This removes the need to re-send TR information that has already been sent.

If an application loses its connection with the SRS (perhaps due to an extended network outage, or due to failure of the SRS service), the context will repeatedly try to reconnect. Once successful, the process of exchanging TR information is repeated.

Note that much of the difficulty of configuring UDP-based TR is related to controlling the repeated transmission of the same TIRs and TQRs. With TCP-based TR, that repetition is eliminated, making both the configuration and the operation more straightforward.

UM version 6.12 provides an initial implementation of TCP-based TR, and is able to significantly reduce CPU and network loading, reduce latency outliers, and avoid TR-induced packet loss. However, future versions of UM will enhance TCP-based TR significantly, leading to even greater increases in scaling and reductions in load.

A note about the term "stateful" in relation to the SRS: even though Unicast UDP TR uses a service called "lbmrd", that service does not maintain the topic information. Instead, the "lbmrd" service merely forwards the TR datagrams it receives, essentially simulating Multicast. For a newly-started receiving application to discover an existing source, that source must send a new TIR to the "lbmrd", which in turn forwards it to the new receiver.

In contrast, the SRS maintains knowledge of all sources in the TRD (hence the "Stateful" in SRS). For a newly-started receiving application to discover an existing source, the SRS can send the information without the source getting involved.
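The difference between a forwarding service and a stateful one can be illustrated with two toy classes. This is a conceptual sketch only: the class names, the in-memory "inbox" lists, and the source strings are hypothetical, and neither class reflects the actual "lbmrd" or SRS protocols.

```python
class ForwardingRelay:
    """Toy stand-in for a stateless forwarder: relays, remembers nothing."""

    def __init__(self):
        self.clients = []

    def connect(self, inbox: list) -> None:
        self.clients.append(inbox)

    def advertise(self, source: str) -> None:
        # Forward to whoever is connected right now; keep no record.
        for inbox in self.clients:
            inbox.append(source)


class StatefulRegistry:
    """Toy stand-in for a stateful service: remembers every source."""

    def __init__(self):
        self.clients = []
        self.sources = []

    def connect(self, inbox: list) -> None:
        # Replay the known sources to the newcomer, then track it.
        inbox.extend(self.sources)
        self.clients.append(inbox)

    def advertise(self, source: str) -> None:
        self.sources.append(source)
        for inbox in self.clients:
            inbox.append(source)


relay, registry = ForwardingRelay(), StatefulRegistry()
relay.advertise("topicA@src1")        # source exists before any receiver
registry.advertise("topicA@src1")

late_via_relay, late_via_registry = [], []
relay.connect(late_via_relay)         # late joiner hears nothing
registry.connect(late_via_registry)   # late joiner gets the stored source
print(late_via_relay, late_via_registry)   # [] ['topicA@src1']
```

With the relay, the late joiner learns about the source only if the source advertises again; with the registry, the stored state is enough.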


TCP-Based TR and Fault Tolerance

A limitation of UM version 6.12 TCP-based TR is its lack of redundancy in the SRS service. Many users will want a backup to the SRS; in UM 6.12 that backup is UDP-based TR.

Note that in a single TRD environment with no UM Router, failure of TR only affects resolution of new sources and receivers. Existing data streams will continue uninterrupted. So some users may opt to run their system without UDP-based TR to gain the full benefits of TCP-based TR. These users simply ensure that a failed SRS is restarted in a reasonably timely way.

However, many users want to eliminate single points of failure, and will therefore need to run TCP-based TR in parallel with UDP-based TR.

Fortunately, the benefits of TCP-based TR can still be largely gained by reducing the amount of UDP-based traffic. The principle behind this is as follows: any form of redundancy is intended to provide a backup service if the primary service fails. It is highly unlikely that both the primary and backup will fail at the same time. In the case of UDP-based TR, the extended sustaining phase is intended to handle various UDP failure scenarios. With TCP-based TR as the primary and UDP-based TR as the secondary, there is no need to extend the sustaining phase since it is highly unlikely that the SRS will fail at the same time that UDP fails.


TCP-Based TR Version Interoperability

TCP-based TR was first introduced in UM version 6.12. To maintain interoperability between pre-6.12 applications and applications running 6.12 or later, TCP-based TR must be combined with UDP-based TR.

This makes it difficult to gain the benefits of TCP-based TR. Since pre-6.12 applications still need to avoid the problems of deafness, even applications that have upgraded to 6.12 and beyond need to enable UDP-based TR, usually with extended sustaining phases, often to infinity.

Ideally, all applications within a TRD can be upgraded, but this is often not possible. How can the TR load be reduced in a step-wise fashion while an organization is upgrading applications gradually, over a long period of time?

Fortunately, you can set configuration options differently for individual topics, either by using XML Configuration Files (the <topic> element), or by using the API functions for setting configuration options programmatically (e.g. lbm_rcv_topic_attr_setopt() and lbm_src_topic_attr_setopt()).

Some helpful strategies might be:

  • Identify those topics or classes of topics that have limited application interest. If topic X has sources and receivers in upgraded applications, the UDP-based TR for that topic can be reduced (e.g. sustaining phase greatly reduced).
  • Identify those TRDs that have small numbers of applications. When a given TRD's applications have all been upgraded, the UDP-based TR for all topics in that TRD can be reduced. If practical, applications can be moved between TRDs to enable some TRDs to be populated by UM version 6.12 and beyond. Also, a TRD can be sub-divided, separating pre-upgraded from post-upgraded.


TCP-Based TR Configuration

A UM context is configured to use TCP-based TR with the option resolver_service (context), which tells how to connect to the SRS service. For example:

context resolver_service 10.29.3.41:12000

A DNS host name can be used instead of an IP address:

context resolver_service test1.informatica.com:12000

This assumes that an SRS service is running at that address:port.
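When generating or validating configuration files with a script, it can help to check that a resolver_service line has the HOST:PORT shape shown above. The following is a minimal sketch, assuming only that form; the function name is hypothetical, and the real option may accept forms that this check rejects.

```python
def parse_resolver_service(line: str) -> tuple:
    """Split 'context resolver_service HOST:PORT' into (host, port)."""
    scope, option, value = line.split()
    if (scope, option) != ("context", "resolver_service"):
        raise ValueError(f"not a resolver_service line: {line!r}")
    # Split on the last colon so the host part is kept whole.
    host, sep, port = value.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected HOST:PORT, got {value!r}")
    return host, int(port)


by_ip = parse_resolver_service("context resolver_service 10.29.3.41:12000")
by_name = parse_resolver_service(
    "context resolver_service test1.informatica.com:12000")
print(by_ip, by_name)
```

The check confirms only the syntax of the line; it does not verify that an SRS is actually listening at that address.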


SRS Service

The SRS service is a daemon process which must be run to provide TCP-based TR for a TRD.

See Man Pages for SRS for details on running the SRS service.

All the contexts in the TRD must be configured to connect to the SRS with the option resolver_service (context). After connecting, each context exchanges TR information with the SRS.

As applications create and delete sources, the SRS is informed, and the SRS informs all connected contexts. This includes proxy sources from a UM Router. In addition, a periodic "keepalive" handshake is performed between the SRS and all connected contexts.

If a network failure causes the context's connection to the SRS to be broken, the context will periodically retry the connection. Since most network failures are brief, the context will soon successfully re-establish a connection to the SRS. Even though this is a resumption of the same context's earlier connection, the context and SRS still exchange full TR information to make sure that any changes during the disconnected period are reflected.

The SRS also supports the publishing of operational and status information via the Daemon Statistics feature. For full details, see SRS Daemon Statistics.

SRS State Lifetime

If an application exits abnormally, the SRS will detect that the TCP connection is broken. However, the SRS must not assume that the application has failed; it might be a network problem that forced the disconnection.

So the SRS flags all sources owned by that context as "potentially down", and starts a "state lifetime" timer (see <state-lifetime>). If the context has not failed and reconnects within that period, the SRS unflags any "potentially down" sources during the initial exchange of TR information. However, in the case of application failure, when the state lifetime expires, all "potentially down" sources are deleted. All connected contexts are informed of those deletions.

Note that if an application fails and then restarts, its connection to the SRS is not considered to be a resumption of the previous connection. It is considered to be a new context, and any sources created are new sources. The previous application instance's sources will remain in the "potentially down" state, and will time out with the state lifetime.

If a network outage lasts longer than the configured state lifetime, the SRS gives up on the context's sources and deletes them. These deletions are communicated to all connected contexts. When the network outage is repaired and the context reconnects, the exchange of TR information with the SRS will re-create the context's sources in the SRS, and communicate them to other contexts. This restores normal operation.
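The flag, unflag, and expire behavior described in this section can be sketched as a small bookkeeping model. This is an illustration under simplifying assumptions, not the SRS implementation: the class, its methods, the context identifiers, and the use of plain seconds for the lifetime are all hypothetical.

```python
class SrsStateModel:
    """Toy model of the "potentially down" bookkeeping; not the SRS code."""

    def __init__(self, state_lifetime: float):
        # Seconds; stands in for the <state-lifetime> setting.
        self.state_lifetime = state_lifetime
        self.sources = {}      # context_id -> set of source names
        self.flagged_at = {}   # context_id -> time the connection broke

    def register(self, context_id: str, names: set) -> None:
        # Exchange of TR information from a connected context.
        self.sources[context_id] = set(names)
        self.flagged_at.pop(context_id, None)   # unflag on reconnect

    def disconnect(self, context_id: str, now: float) -> None:
        # TCP connection broke: flag the sources, but do not delete yet.
        if context_id in self.sources:
            self.flagged_at[context_id] = now

    def expire(self, now: float) -> list:
        # Delete sources whose context stayed away past the lifetime.
        deleted = []
        for ctx, since in list(self.flagged_at.items()):
            if now - since >= self.state_lifetime:
                deleted.extend(sorted(self.sources.pop(ctx)))
                del self.flagged_at[ctx]
        return deleted


srs = SrsStateModel(state_lifetime=30.0)
srs.register("ctx1", {"quotes"})
srs.register("ctx2", {"orders"})
srs.disconnect("ctx1", now=0.0)
srs.disconnect("ctx2", now=0.0)
srs.register("ctx1", {"quotes"})   # ctx1 reconnects within the lifetime
deleted = srs.expire(now=31.0)     # ctx2 never came back
print(deleted, sorted(srs.sources))   # ['orders'] ['ctx1']
```

A restarted application would appear in this model as a new context identifier, so the old identifier's sources would still expire with the lifetime, as the text describes.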

SRS Log File

The SRS generates log messages that are used to monitor its health and operation. You can configure these to be directed to "console" (standard output) or a specified log "file", via the <log> configuration element. Normally "console" is only used during testing; a persistent log file should be used for production. The SRS does not overwrite its log file on startup, but instead appends to it.

SRS Rolling Logs

To prevent unbounded disk file growth, the SRS supports rolling log files. When the log file rolls, the file is renamed according to the model:
  CONFIGUREDNAME_PID.DATE.SEQNUM
where:

  • CONFIGUREDNAME - Root name of log file, as configured by user.
  • PID - Process ID of the SRS daemon process.
  • DATE - Date that the log file was rolled, in YYYY-MM-DD format.
  • SEQNUM - Sequence number, starting at 1 when the process starts, and incrementing each time the log file rolls.

For example: srs.log_9867.2017-08-20.2
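The naming model can be reproduced with a short helper, for example when writing a script that locates archived logs. This is a sketch based only on the pattern stated above; the function name is hypothetical.

```python
from datetime import date


def rolled_log_name(configured_name: str, pid: int,
                    rolled_on: date, seqnum: int) -> str:
    """Build CONFIGUREDNAME_PID.DATE.SEQNUM with DATE as YYYY-MM-DD."""
    return f"{configured_name}_{pid}.{rolled_on.isoformat()}.{seqnum}"


name = rolled_log_name("srs.log", 9867, date(2017, 8, 20), 2)
print(name)   # srs.log_9867.2017-08-20.2
```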

The user can configure when the log file is eligible to roll over by either or both of two criteria: size and frequency. The size criterion is in millions of bytes. The frequency criterion can be daily or hourly. Once one or both criteria are met, the next message written to the log will trigger a roll operation. These criteria are supplied as attributes to the <log> configuration element.

If both criteria are supplied, then the first one to be reached will trigger a roll. For example, consider the setting:

<log type="file" size="23" frequency="daily">srs.log</log>

Suppose the log file starts empty at midnight and grows at 1 million bytes per hour (VERY unlikely for an SRS, but assume it for illustration purposes). At 11:00 pm, the log file reaches 23 million bytes and rolls. Then, at 12:00 midnight, the log file rolls again, even though it is only 1 million bytes in size.
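The timing in this example can be checked with a small simulation. It is a simplified sketch, not the SRS logic: it assumes the log is empty at hour 0 (midnight), grows by a fixed amount each hour, is examined once per hour, and that the "daily" frequency fires at each midnight. The function name is hypothetical.

```python
def simulate_rolls(size_limit: int, growth_per_hour: int, hours: int) -> list:
    """Return the hours, counted from midnight, at which the log rolls.

    Sizes are in millions of bytes, matching the size attribute.
    """
    rolls = []
    size = 0
    for hour in range(1, hours + 1):
        size += growth_per_hour
        # Roll on whichever criterion is reached first: size or midnight.
        if size >= size_limit or hour % 24 == 0:
            rolls.append(hour)
            size = 0
    return rolls


rolls = simulate_rolls(size_limit=23, growth_per_hour=1, hours=48)
print(rolls)   # [23, 24, 47, 48]
```

Hour 23 corresponds to 11:00 pm (the size criterion) and hour 24 to midnight (the frequency criterion), after which the pattern repeats on the second day.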

In addition, the SRS supports automatic deletion of log files based on either or both of two criteria: max history, and total size cap. The max history refers to the number of archived log files, and the total size cap refers to the sum of the sizes of the archived files in millions of bytes. When either or both criteria are met, one or more of the oldest log files are removed until the criteria no longer apply.
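The deletion rule can be sketched as a loop that drops the oldest archive until neither limit is exceeded. This is an illustrative model only: the function and parameter names are hypothetical, the archive list is supplied by the caller, and the sketch assumes a limit is exceeded only when the count or total is strictly greater than it, which the text does not specify.

```python
def prune_archives(archives: list, max_history: int,
                   total_size_cap: int) -> list:
    """archives: (name, size_in_millions) tuples, oldest first.

    Returns the archives that survive pruning.
    """
    kept = list(archives)
    while kept and (len(kept) > max_history
                    or sum(size for _, size in kept) > total_size_cap):
        kept.pop(0)   # remove the oldest archived log
    return kept


archives = [
    ("srs.log_9867.2017-08-18.1", 23),
    ("srs.log_9867.2017-08-19.2", 23),
    ("srs.log_9867.2017-08-20.3", 10),
    ("srs.log_9867.2017-08-20.4", 5),
]
kept = prune_archives(archives, max_history=3, total_size_cap=30)
print([name for name, _ in kept])
```

In this run the first archive is removed because four files exceed the history limit of three, and the second is removed because the remaining 38 million bytes still exceed the 30 million byte cap.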

For more information, see the <log> configuration element.