Configuration Guide
Interrelated Configuration Options

Some Ultra Messaging configuration options are related in ways that might not be immediately apparent. Changing the value for one option without adjusting its related option can cause problems such as NAK storms, tail loss, etc. This section identifies these relationships and recommends a best practice for setting the interrelated options.

The following sections discuss configuration option relationships.


Preventing NAK Storms with NAK Intervals

The NAK generation interval should be sufficiently longer than the NAK backoff interval so that the source, after receiving the first NAK from a receiver, has time to retransmit the missing datagram and prevent a NAK storm from all receivers. LBTRM, LBTRU, and MIM all use NAK generation and backoff intervals. The NAK behavior for all transports is the same.

Interrelated Options:
Recommendation:
Set the NAK generation interval to at least 2x the NAK backoff interval.
Example:
#
# To avoid NAK storms, set NAK generation interval to at least 2x the
# NAK backoff interval.
#
receiver transport_lbtrm_nak_backoff_interval 200       # .2 seconds
receiver transport_lbtrm_nak_generation_interval 10000  # 10 seconds
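As a rough illustration, the 2x rule can be checked mechanically before a configuration is deployed. The sketch below is not part of UM; the function name nak_intervals_ok is hypothetical, and both arguments are assumed to be in milliseconds, as in the example above.

```python
def nak_intervals_ok(backoff_ms: int, generation_ms: int, factor: int = 2) -> bool:
    """Return True if the NAK generation interval is at least `factor`
    times the NAK backoff interval (the recommendation in this section)."""
    return generation_ms >= factor * backoff_ms


# Values from the example above: 200 ms backoff, 10000 ms generation.
print(nak_intervals_ok(200, 10000))
```

The same check applies to whichever transport's backoff and generation options are in use, since the NAK behavior is described as the same for all of them.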
See also:
Transport LBT-RM Reliability Options
Transport LBT-RU Reliability Options
Multicast Immediate Messaging Reliability Options


Preventing Tail Loss With TSNI and NAK Interval Options

Tail Loss refers to the situation where a receiver (subscriber) does not receive the last few (tail) messages sent by a source (publisher). Because of implicit batching, a single transport-level sequence number can contain multiple topic-level messages. As a result, when unrecoverable loss occurs on a transport, a receiver does not know which particular messages were unrecoverable until later messages arrive (revealing earlier gaps in the topic-level sequence numbers) or until it receives the Topic Sequence Number Information (TSNI) records that a publisher sends periodically. The receiver needs this topic-level knowledge of sequence gaps before it can deliver event callbacks telling the application that unrecoverable loss has occurred, because those event callbacks are per-receiver (topic-level). A TSNI active threshold that is too small relative to the TSNI and/or NAK generation interval may prevent the reporting of tail loss to the application, especially with ordered delivery.

Interrelated Options:
Recommendation:
Set the source's transport_topic_sequence_number_info_active_threshold (source) to at least 4x the transport_topic_sequence_number_info_interval (source) plus the receiver's NAK generation interval for the transport in use (for example, transport_lbtrm_nak_generation_interval (receiver) or transport_lbtru_nak_generation_interval (receiver)), all divided by 1000 to get seconds.
Example:
#
# NOTE: transport_topic_sequence_number_info_active_threshold is in seconds.
#
source   transport_topic_sequence_number_info_interval 5000
receiver transport_lbtrm_nak_generation_interval       10000
# (5000*4 + 10000)/1000 = 30
source   transport_topic_sequence_number_info_active_threshold 30
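The arithmetic in the example can be written as a small helper. This is only a sketch of the recommendation above, not a UM API: the function name min_tsni_active_threshold_s is hypothetical, and rounding up to a whole number of seconds is an assumption made here so that the result never falls below the recommended minimum.

```python
def min_tsni_active_threshold_s(tsni_interval_ms: int, nak_generation_ms: int) -> int:
    """Smallest whole number of seconds that is at least
    (4 * TSNI interval + NAK generation interval) / 1000."""
    total_ms = 4 * tsni_interval_ms + nak_generation_ms
    # Integer ceiling division: round up to the next whole second.
    return (total_ms + 999) // 1000


# Values from the example above: (5000*4 + 10000)/1000 = 30
print(min_tsni_active_threshold_s(5000, 10000))  # prints 30
```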
See also:
Preventing Undetected Unrecoverable Loss
Transport LBT-RM Reliability Options
Transport LBT-RU Reliability Options


Preventing Undetected Unrecoverable Loss

The UM UDP-based protocols are generally able to recover from packet loss. However, there can be cases where UM is not able to recover the lost packets, leading to Unrecoverable Loss.

With the default settings, there is a type of unrecoverable loss which can remain unreported to the application for an unbounded period of time.

For example:

  1. A sudden burst of data from a source overloads a receiver, resulting in the last few packets being lost.
  2. The source sends one more data message and then exits.
  3. The receiver's Delivery Controller gets the last message and sees the sequence number gap. So it buffers the last message and waits for the transport layer to recover the missing messages. But since the source no longer exists, there is no recovery.
  4. The NAK generation interval lapses. Thus, the gapped messages are considered unrecoverable. However, due to UM's design, a receive event is needed to deliver the unrecoverable loss event and the buffered message. But since the source is deleted, no more receive events will happen. The delivery controller is in a "stale loss" state.
  5. Finally, the transport session times out and the delivery controller is deleted, delivering EOS to the application, but not the unrecoverable loss event or the buffered message.

In this scenario, not only is the unrecoverable loss not delivered, but the buffered message which was successfully received is also never delivered. Note that this kind of Tail Loss is rare, but can happen.

This result can be avoided by enabling the "loss check interval" feature on the delivery controller. For example:

receiver delivery_control_loss_check_interval 2500

This starts a timer that wakes up every 2.5 seconds and scans UM's internal list of all topic receivers, looking for delivery controllers in the "stale loss" state. For each one it finds, it generates the unrecoverable loss event to the application's receiver callback, and also delivers the subsequently buffered message.

However, for applications that have large numbers of receivers, the cost of scanning every receiver can become significant, introducing regular latency outliers. For latency-sensitive applications, an alternate method to avoid the unreported loss is to make sure transport_topic_sequence_number_info_interval (source) is non-zero, and have the publisher delay for two of those intervals plus the NAK generation interval (default: 2*5+60=70 seconds) before deleting a source that is no longer needed. The TSNI messages will serve as receive events to force delivery. See Preventing Tail Loss With TSNI and NAK Interval Options.
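The delay described above is simple arithmetic, sketched here for illustration. The function name source_delete_delay_ms is hypothetical and is not a UM call; the inputs are assumed to be in milliseconds, and the sample values are the figures quoted in the paragraph above.

```python
def source_delete_delay_ms(tsni_interval_ms: int, nak_generation_ms: int) -> int:
    """Time a publisher would wait before deleting a source:
    two TSNI intervals plus the NAK generation interval."""
    if tsni_interval_ms <= 0:
        # The alternate method relies on TSNI messages being sent at all.
        raise ValueError("TSNI interval must be non-zero for this method")
    return 2 * tsni_interval_ms + nak_generation_ms


# Figures from the paragraph above: 2*5 + 60 = 70 seconds.
print(source_delete_delay_ms(5000, 60000) / 1000)
```

A publisher following this approach would stop sending, wait for at least this long, and only then delete the source.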

Be aware that the delivery_control_loss_check_interval (receiver) can interact with other interval configurations.

Interrelated Options:
Recommendation, if using loss check interval:
For LBT-RM, set the transport activity timeout to a value greater than the sum of the delivery control loss check interval and the NAK generation interval. Also, set the NAK generation interval to at least 4x the delivery control loss check interval.
For LBT-RU, set the transport activity timeout to a value greater than the delivery control loss check interval.
For UMP, always enable the delivery control loss check interval and set it accordingly when configuring a store.
Example:
#
# To avoid undetected or unreported loss, set NAK generation to 4x the delivery
# control check interval, and ensure that these two combined are less than the
# transport activity timeout
#
receiver delivery_control_loss_check_interval 2500
receiver transport_lbtrm_activity_timeout 60000
receiver transport_lbtrm_nak_generation_interval 10000
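The two LBT-RM conditions can be checked together. The following is a minimal sketch under stated assumptions: the function name check_lbtrm_loss_check is hypothetical, all values are in milliseconds, and the checks mirror only the LBT-RM recommendation in this section.

```python
from typing import List


def check_lbtrm_loss_check(loss_check_ms: int,
                           nak_generation_ms: int,
                           activity_timeout_ms: int) -> List[str]:
    """Return a list of violated LBT-RM recommendations (empty if none)."""
    problems = []
    # NAK generation interval should be at least 4x the loss check interval.
    if nak_generation_ms < 4 * loss_check_ms:
        problems.append("NAK generation interval is less than 4x the loss check interval")
    # Activity timeout should exceed loss check interval + NAK generation interval.
    if activity_timeout_ms <= loss_check_ms + nak_generation_ms:
        problems.append("activity timeout does not exceed loss check + NAK generation")
    return problems


# Values from the example above: 2500, 10000, 60000.
print(check_lbtrm_loss_check(2500, 10000, 60000))  # prints []
```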
See also:
Delivery Control Options


Preventing Undetected Late Join Loss

If a transport times out during a Late Join operation while a receiver is requesting retransmission of missing messages, lost messages can go undetected and likely become unrecoverable.

Interrelated Options:
Recommendations:
Set the Late Join retransmit request interval to a value less than its transport's activity timeout value.
Example:
#
# To avoid a transport inactivity timeout while requesting Late Join
# retransmissions, set the Late Join retransmit request interval to a value
# less than its transport's activity timeout.
#
receiver retransmit_request_generation_interval 10000
receiver transport_lbtrm_activity_timeout 60000
See also:
Late Join Options


Preventing IPC Receiver Deafness With Keepalive Options

With an LBT-IPC transport, an activity timeout that is too small relative to the session message interval may cause receiver deafness. If a timeout is too short, the keepalive messages might not be received in time to prevent the receiver from being deleted or disconnecting because the source appears to be gone.

Interrelated Options:
Recommendations:
Set the activity timeout to at least 2x the session message interval.
Example:
#
# To avoid receiver deafness:
# - set client activity timeout to at least 2x the acknowledgement interval.
# - set activity timeout to at least 2x the session message interval.
#
receiver transport_lbtipc_activity_timeout 60000
source   transport_lbtipc_sm_interval      10000
See also:
Transport LBT-IPC Operation Options


Preventing Erroneous LBT-RM/LBT-RU Session Timeouts

An LBT-RM or LBT-RU receiver-side quiescent timeout may delete a transport session that a source is still active on. This can happen if the timeout is too short relative to the source's interval between session messages (which serve as a session keepalive).

Interrelated Options:
Recommendations:
Set the receiver LBT-RM or LBT-RU activity timeout to at least 3x the source session message maximum interval.
Example:
#
# To avoid erroneous session timeouts, set receiver transport activity
# timeout to at least 3x the source session message maximum interval.
#
receiver  transport_lbtrm_activity_timeout    60000
source    transport_lbtrm_sm_maximum_interval 10000
receiver  transport_lbtru_activity_timeout    60000
source    transport_lbtru_sm_maximum_interval 10000
See also:
Transport LBT-RM Operation Options
Transport LBT-RU Operation Options


Preventing Errors Due to Bad Multicast Address Ranges

Sometimes it is easy to accidentally reverse the low and high values for LBT-RM multicast addresses, which actually creates a very large range. Aside from excluding intended addresses, this can cause error conditions.

Interrelated Options:
Recommendations:
Ensure that the intended low and high values for LBT-RM multicast addresses are not reversed.
Example:
#
# To avoid incorrect LBT-RM multicast address ranges, ensure that you have not
# reversed the low and high values.
#
context transport_lbtrm_multicast_address_low 239.101.4.10
context transport_lbtrm_multicast_address_high 239.101.4.14
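A reversed or malformed range can be caught before the configuration is loaded. The sketch below uses Python's standard ipaddress module; the function name lbtrm_range_ok is hypothetical and the addresses are illustrative only.

```python
import ipaddress


def lbtrm_range_ok(low: str, high: str) -> bool:
    """Return True if both values are IPv4 multicast addresses and
    low does not exceed high. Malformed addresses raise ValueError."""
    low_addr = ipaddress.IPv4Address(low)
    high_addr = ipaddress.IPv4Address(high)
    return low_addr.is_multicast and high_addr.is_multicast and low_addr <= high_addr


print(lbtrm_range_ok("239.101.4.10", "239.101.4.14"))  # intended order
print(lbtrm_range_ok("239.101.4.14", "239.101.4.10"))  # reversed values
```

Because ipaddress.IPv4Address rejects anything that is not a four-octet dotted address, the same check also flags typing mistakes such as an extra octet.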
See also:
Transport LBT-RM Network Options


Preventing Store Timeouts

When using Persistence, a store may be erroneously declared unresponsive if its activity timeout expires before it has had adequate opportunity to verify it is still active via activity check intervals.

Interrelated Options:
Recommendations:
Set the store activity timeout to at least 5x the activity check interval.
Example:
#
# To avoid erroneous store activity timeouts, set the activity
# timeout to at least 5x the activity check interval.
#
source ume_store_activity_timeout 3000
source ume_store_check_interval 500


Preventing ULB Timeouts

When using ULB queuing, a ULB source or receiver may be erroneously declared unresponsive if its activity timeout expires before it has had adequate opportunities, via activity check intervals, to attempt to re-register when the source appears to be inactive. It is also possible for sources to attempt to reassign messages that have already been processed.

Interrelated Options:
Recommendations:
Set the ULB source activity timeout to at least 5x the ULB source activity check interval.
Set the ULB application set message reassignment timeout to at least 5x the ULB check interval.
Set the ULB receiver activity timeout to at least 5x the ULB check interval.
Example:
#
# To avoid erroneous ULB source, receiver or application set message activity
# timeouts, set the activity timeout to at least 5x the activity check interval.
#
receiver umq_ulb_source_activity_timeout 10000
receiver umq_ulb_source_check_interval   1000
source   umq_ulb_application_set_message_reassignment_timeout 50000
source   umq_ulb_application_set_receiver_activity_timeout    10000
source   umq_ulb_check_interval 1000
See also:
Ultra Messaging Queuing Options


Preventing Unicast Resolver Daemon Timeouts

A unicast resolver daemon may be erroneously declared inactive if its activity timeout expires before it has had adequate opportunity to verify that it is still alive.

Interrelated Options:
Recommendations:
Set the unicast resolver daemon activity timeout to at least 5x the activity check interval. Or, if activity notification is not desired, set both options to 0.
Example:
#
# To avoid erroneous unicast resolver daemon timeouts, set the activity
# timeout to at least 5x the activity check interval.
#
context resolver_unicast_activity_timeout 1000
context resolver_unicast_check_interval 200
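The store, ULB, and unicast resolver recommendations all have the same shape: a timeout that should be at least 5x a check interval. The sketch below shows one way to check them against a configuration file. It is an illustration only, not a UM tool: the names RULES, parse_flat_config, and check_ratios are hypothetical, and the parser assumes the flat "scope option value" form with "#" comments used in the examples in this section.

```python
# (timeout option, interval option, minimum factor), taken from the
# recommendations in the store, ULB, and unicast resolver sections.
RULES = [
    ("ume_store_activity_timeout", "ume_store_check_interval", 5),
    ("umq_ulb_source_activity_timeout", "umq_ulb_source_check_interval", 5),
    ("umq_ulb_application_set_message_reassignment_timeout", "umq_ulb_check_interval", 5),
    ("umq_ulb_application_set_receiver_activity_timeout", "umq_ulb_check_interval", 5),
    ("resolver_unicast_activity_timeout", "resolver_unicast_check_interval", 5),
]


def parse_flat_config(text):
    """Map option name -> value for lines of the form 'scope option value'.
    Keys ignore the scope, which is a simplification made for this sketch."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        parts = line.split()
        if len(parts) == 3:  # skip anything this sketch does not understand
            _scope, name, value = parts
            options[name] = value
    return options


def check_ratios(options, rules=RULES):
    """Return a message for each rule whose two options are both present
    and whose timeout is smaller than factor * interval."""
    problems = []
    for timeout_opt, interval_opt, factor in rules:
        if timeout_opt in options and interval_opt in options:
            timeout, interval = int(options[timeout_opt]), int(options[interval_opt])
            if timeout < factor * interval:
                problems.append(f"{timeout_opt} ({timeout}) < {factor}x {interval_opt} ({interval})")
    return problems


sample = """
context resolver_unicast_activity_timeout 1000
context resolver_unicast_check_interval 200
"""
print(check_ratios(parse_flat_config(sample)))  # prints []
```

Setting both resolver options to 0 also passes this check, which matches the recommendation for disabling activity notification.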
See also:
UDP-Based Resolver Operation Options


Preventing Store Registration Hangs

The following configuration options come into play when sources register with stores in a lossy environment:

Interrelated Options:

The ume_sri_request "interval" and "maximum" options multiply to define a duration over which the receiver requests Source Registration Information (SRI) messages from the source. Similarly, the transport_topic_sequence_number_info_request "interval" and "maximum" options multiply to define a duration over which the receiver requests Transport Topic Sequence Number Info (TSNI) messages from the source.

Recommendations:

The two request durations should each be twice the value of the appropriate transport activity timeout.

Example:
#
# To avoid hung store registration, set the durations of the SRI and TSNI
# requests to 2x the transport activity timeout.
# 
receiver transport_lbtrm_activity_timeout 60000
receiver ume_sri_request_maximum 120
receiver ume_sri_request_interval 1000
receiver transport_topic_sequence_number_info_request_maximum 120
receiver transport_topic_sequence_number_info_request_interval 1000
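The durations in this example are products of an interval and a maximum count, which the following sketch makes explicit. The function names are hypothetical, intervals are assumed to be in milliseconds, and treating "twice" as a minimum (rather than an exact value) is an assumption made here.

```python
def request_duration_ms(interval_ms: int, maximum: int) -> int:
    """Total time over which requests are sent: interval * maximum."""
    return interval_ms * maximum


def registration_requests_ok(activity_timeout_ms: int,
                             sri_interval_ms: int, sri_maximum: int,
                             tsni_interval_ms: int, tsni_maximum: int,
                             factor: int = 2) -> bool:
    """Return True if both the SRI and TSNI request durations are at
    least `factor` times the transport activity timeout."""
    needed_ms = factor * activity_timeout_ms
    return (request_duration_ms(sri_interval_ms, sri_maximum) >= needed_ms
            and request_duration_ms(tsni_interval_ms, tsni_maximum) >= needed_ms)


# Values from the example above: 120 * 1000 ms = 120000 ms = 2 * 60000 ms.
print(registration_requests_ok(60000, 1000, 120, 1000, 120))
```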
Warning
As of this version of UM, the default values for these options do not satisfy this recommendation. Users are advised to double the values for ume_sri_request_maximum (receiver) and transport_topic_sequence_number_info_request_maximum (receiver).