Configuration Guide
Some Ultra Messaging configuration options are related in ways that might not be immediately apparent. Changing the value for one option without adjusting its related option can cause problems such as NAK storms, tail loss, etc. This section identifies these relationships and recommends a best practice for setting the interrelated options.
The following sections discuss configuration option relationships.
The NAK generation interval should be sufficiently longer than the NAK backoff interval so that the source, after receiving the first NAK from a receiver, has time to retransmit the missing datagram and prevent a NAK storm from all receivers. LBTRM, LBTRU, and MIM all use NAK generation and backoff intervals. The NAK behavior for all transports is the same.
    #
    # To avoid NAK storms, set NAK generation interval to at least 2x the
    # NAK backoff interval.
    #
    receiver transport_lbtrm_nak_backoff_interval 200        # .2 seconds
    receiver transport_lbtrm_nak_generation_interval 10000   # 10 seconds
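The relationship between the two NAK intervals can be checked mechanically before a configuration is deployed. The following Python sketch is not part of Ultra Messaging: `parse_um_config` and `check_nak_intervals` are hypothetical helper names, the parser assumes the flat-file form used in this section (scope, option name, value, optional `#` comment), and the `transport` parameter assumes that the LBT-RU options follow the same naming pattern as the LBT-RM ones.

```python
def parse_um_config(text):
    """Parse 'scope option value' lines into {(scope, option): value}.

    Assumes one option per line, with '#' starting a comment that runs
    to the end of the line. A line with fewer than three fields raises
    ValueError.
    """
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comment and whitespace
        if not line:
            continue
        scope, name, value = line.split(None, 2)
        options[(scope, name)] = value.strip()
    return options


def check_nak_intervals(options, transport="lbtrm", factor=2):
    """Return True if NAK generation >= factor * NAK backoff (both in ms)."""
    backoff = int(options[("receiver", f"transport_{transport}_nak_backoff_interval")])
    generation = int(options[("receiver", f"transport_{transport}_nak_generation_interval")])
    return generation >= factor * backoff


EXAMPLE = """
# NAK generation interval at least 2x the NAK backoff interval
receiver transport_lbtrm_nak_backoff_interval 200        # .2 seconds
receiver transport_lbtrm_nak_generation_interval 10000   # 10 seconds
"""

print(check_nak_intervals(parse_um_config(EXAMPLE)))
```

The default `factor=2` mirrors the "at least 2x" recommendation; a stricter site policy can pass a larger factor.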
Tail Loss refers to the situation where a receiver (subscriber) does not receive the last few (tail) messages sent by a source (publisher). When unrecoverable loss occurs on a transport, due to the possibility of multiple topic-level messages being contained in a single transport-level sequence number (due to implicit batching), a receiver does not know which particular messages were unrecoverable until the arrival of later messages (revealing earlier gaps in topic-level sequence number) or until the arrival of Topic Sequence Number Information (TSNI) records sent periodically by a publisher. Specific topic-level knowledge of sequence gaps is a prerequisite for the receiver to deliver event callbacks to the application indicating that unrecoverable loss has occurred, because those event callbacks are per-receiver (topic-level). A TSNI active threshold that is too small relative to the TSNI and/or NAK generation interval may prevent the reporting of tail loss to the application, especially with ordered delivery.
    #
    # NOTE: transport_topic_sequence_number_info_active_threshold is in seconds.
    #
    source transport_topic_sequence_number_info_interval 5000
    receiver transport_lbtrm_nak_generation_interval 10000
    # (5000*4 + 10000)/1000 = 30
    source transport_topic_sequence_number_info_active_threshold 30
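The arithmetic in the comment above, four TSNI intervals plus the NAK generation interval converted from milliseconds to seconds, can be written as a small helper. This is a sketch only: the function name is hypothetical, and rounding up to a whole second is an assumption made here because the threshold is expressed in seconds.

```python
import math


def min_tsni_active_threshold_s(tsni_interval_ms, nak_generation_ms, tsni_count=4):
    """Active threshold (seconds) covering tsni_count TSNI intervals
    plus one NAK generation interval, rounded up to a whole second."""
    total_ms = tsni_count * tsni_interval_ms + nak_generation_ms
    return math.ceil(total_ms / 1000)


# Values from the example above: (5000*4 + 10000)/1000 = 30
print(min_tsni_active_threshold_s(5000, 10000))
```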
The UM UDP-based protocols are generally able to recover from packet loss. However, there can be cases where UM is not able to recover the lost packets, leading to Unrecoverable Loss.
With the default settings, there is a type of unrecoverable loss which can remain unreported to the application for an unbounded period of time.
For example:
In this scenario, not only is the unrecoverable loss not delivered, but the buffered message which was successfully received is also never delivered. Note that this kind of Tail Loss is rare, but can happen.
This result can be avoided by enabling the "loss check interval" feature on the delivery controller. For example:
receiver delivery_control_loss_check_interval 2500
This starts a timer that wakes up every 2.5 seconds and scans UM's internal list of all topic receivers, looking for delivery controllers in the "stale loss" state. For each one it finds, it generates the unrecoverable loss event to the application's receiver callback, and also delivers the subsequently buffered message.
However, for applications that have large numbers of receivers, the cost of scanning every receiver can become significant, introducing regular latency outliers. For latency-sensitive applications, an alternate method to avoid the unreported loss is to make sure transport_topic_sequence_number_info_interval (source) is non-zero, and to have the publisher delay for two of those intervals plus the NAK generation interval (default: 2*5+60=70 seconds) before deleting a source that is no longer needed. The TSNI messages will serve as receiver events to force delivery. See Preventing Tail Loss With TSNI and NAK Interval Options.
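The delay-before-delete approach can be sketched as follows. This is an illustration, not UM API usage: `drain_delay_s` and `delete_source_after_drain` are hypothetical names, and `delete_source` stands for whatever call the application uses to delete its source. The `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time


def drain_delay_s(tsni_interval_s, nak_generation_s):
    """Two TSNI intervals plus the NAK generation interval, in seconds."""
    return 2 * tsni_interval_s + nak_generation_s


def delete_source_after_drain(delete_source, tsni_interval_s, nak_generation_s,
                              sleep=time.sleep):
    """Wait for the drain delay, then call the caller-supplied delete function.

    Returns the delay that was applied, in seconds.
    """
    delay = drain_delay_s(tsni_interval_s, nak_generation_s)
    sleep(delay)       # let TSNI messages reach receivers and force delivery
    delete_source()    # application-specific source deletion (hypothetical)
    return delay


# Figures quoted in the text above: 2*5 + 60 = 70 seconds
print(drain_delay_s(5, 60))
```

A blocking sleep is the simplest form; an application with its own event loop would more likely schedule the deletion on a timer.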
Be aware that the delivery_control_loss_check_interval (receiver) can interact with other interval configurations.
    #
    # To avoid undetected or unreported loss, set NAK generation to 4x the delivery
    # control check interval, and ensure that these two combined are less than the
    # transport activity timeout
    #
    receiver delivery_control_loss_check_interval 2500
    receiver transport_lbtrm_activity_timeout 60000
    receiver transport_lbtrm_nak_generation_interval 10000
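Both conditions in the comment above can be tested together. The helper below is a hypothetical sketch; it reads "4x" as a minimum rather than an exact value, which is an assumption, and it takes all three intervals in milliseconds.

```python
def check_loss_check_interval(loss_check_ms, nak_generation_ms, activity_timeout_ms):
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    if nak_generation_ms < 4 * loss_check_ms:
        problems.append(
            "NAK generation interval is less than 4x the loss check interval")
    if loss_check_ms + nak_generation_ms >= activity_timeout_ms:
        problems.append(
            "loss check interval plus NAK generation interval is not less "
            "than the transport activity timeout")
    return problems


# Values from the example above: 10000 >= 4*2500 and 2500 + 10000 < 60000
print(check_loss_check_interval(2500, 10000, 60000))
```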
If a transport times out during a Late Join operation while a receiver is requesting retransmission of missing messages, the lost messages can go undetected and are likely to become unrecoverable.
    #
    # To avoid a transport inactivity timeout while requesting Late Join
    # retransmissions, set the Late Join retransmit request interval to a value
    # less than its transport's activity timeout.
    #
    receiver retransmit_request_generation_interval 10000
    receiver transport_lbtrm_activity_timeout 60000
With an LBT-IPC transport, an activity timeout that is too small relative to the session message interval may cause receiver deafness. If a timeout is too short, the keepalive messages might not be received in time to prevent the receiver from being deleted or disconnecting because the source appears to be gone.
    #
    # To avoid receiver deafness:
    # - set client activity timeout to at least 2x the acknowledgement interval.
    # - set activity timeout to at least 2x the session message interval.
    #
    receiver transport_lbtipc_activity_timeout 60000
    source transport_lbtipc_sm_interval 10000
An LBT-RM or LBT-RU receiver-side quiescent timeout may delete a transport session that a source is still active on. This can happen if the timeout is too short relative to the source's interval between session messages (which serve as a session keepalive).
    #
    # To avoid erroneous session timeouts, set receiver transport activity
    # timeout to at least 3x the source session message maximum interval.
    #
    receiver transport_lbtrm_activity_timeout 60000
    source transport_lbtrm_sm_maximum_interval 10000
    receiver transport_lbtru_activity_timeout 60000
    source transport_lbtru_sm_maximum_interval 10000
It is easy to accidentally reverse the low and high values for LBT-RM multicast addresses, which actually creates a very large range. Aside from excluding the intended addresses, this can cause error conditions.
    #
    # To avoid incorrect LBT-RM multicast address ranges, ensure that you have not
    # reversed the low and high values.
    #
    context transport_lbtrm_multicast_address_low 239.101.4.10
    context transport_lbtrm_multicast_address_high 239.101.4.14
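A reversed or malformed address pair can be caught with Python's standard `ipaddress` module. The function name below is hypothetical and the addresses are illustrative; the check only verifies that both values are valid IPv4 multicast addresses and that low does not exceed high. It does not model how UM itself interprets the range.

```python
import ipaddress


def check_multicast_range(low, high):
    """Validate a low/high multicast pair and return the number of addresses.

    Raises ValueError if either address is malformed, is not multicast,
    or if the pair is reversed.
    """
    lo = ipaddress.IPv4Address(low)    # AddressValueError is a ValueError
    hi = ipaddress.IPv4Address(high)
    if not (lo.is_multicast and hi.is_multicast):
        raise ValueError("both addresses must be IPv4 multicast addresses")
    if lo > hi:
        raise ValueError("low address is greater than high address (reversed?)")
    return int(hi) - int(lo) + 1


print(check_multicast_range("239.101.4.10", "239.101.4.14"))
```

An address with the wrong number of octets fails at the `IPv4Address` constructor, so typing errors are reported before the range comparison runs.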
When using Persistence, a store may be erroneously declared unresponsive if its activity timeout expires before it has had adequate opportunity to verify it is still active via activity check intervals.
    #
    # To avoid erroneous store activity timeouts, set the activity
    # timeout to at least 5x the activity check interval.
    #
    source ume_store_activity_timeout 3000
    source ume_store_check_interval 500
When using ULB queuing, a ULB source or receiver may be erroneously declared unresponsive if its activity timeout expires before it has had adequate opportunities to attempt to re-register via activity check intervals if the source appears to be inactive. It is also possible for sources to attempt to reassign messages that have already been processed.
    #
    # To avoid erroneous ULB source, receiver or application set message activity
    # timeouts, set the activity timeout to at least 5x the activity check interval.
    #
    receiver umq_ulb_source_activity_timeout 10000
    receiver umq_ulb_source_check_interval 1000
    source umq_ulb_application_set_message_reassignment_timeout 50000
    source umq_ulb_application_set_receiver_activity_timeout 10000
    source umq_ulb_check_interval 1000
A unicast resolver daemon may be erroneously declared inactive if its activity timeout expires before it has had adequate opportunity to verify that it is still alive.
    #
    # To avoid erroneous unicast resolver daemon timeouts, set the activity
    # timeout to at least 5x the activity check interval.
    #
    context resolver_unicast_activity_timeout 1000
    context resolver_unicast_check_interval 200
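The timeout-to-interval ratios recommended for LBT-IPC, LBT-RM, LBT-RU, stores, ULB, and the unicast resolver all have the same shape: a timeout that must be at least some multiple of an interval. The sketch below collects them into one table. `RATIO_RULES` and `find_violations` are hypothetical names, and the pairing of the three ULB timeouts with the two ULB check intervals is an assumption based on option scope; the text above does not state it.

```python
# (timeout option, interval option, minimum factor); keys are "scope option"
RATIO_RULES = [
    ("receiver transport_lbtipc_activity_timeout",
     "source transport_lbtipc_sm_interval", 2),
    ("receiver transport_lbtrm_activity_timeout",
     "source transport_lbtrm_sm_maximum_interval", 3),
    ("receiver transport_lbtru_activity_timeout",
     "source transport_lbtru_sm_maximum_interval", 3),
    ("source ume_store_activity_timeout",
     "source ume_store_check_interval", 5),
    # ULB pairings below are assumed, not taken from the text
    ("receiver umq_ulb_source_activity_timeout",
     "receiver umq_ulb_source_check_interval", 5),
    ("source umq_ulb_application_set_receiver_activity_timeout",
     "source umq_ulb_check_interval", 5),
    ("source umq_ulb_application_set_message_reassignment_timeout",
     "source umq_ulb_check_interval", 5),
    ("context resolver_unicast_activity_timeout",
     "context resolver_unicast_check_interval", 5),
]


def find_violations(config, rules=RATIO_RULES):
    """Return the rules broken by config, a dict of "scope option" -> int (ms).

    A rule is skipped when either option is absent from config, because
    the effective default value is not known to this sketch.
    """
    violations = []
    for timeout_opt, interval_opt, factor in rules:
        if timeout_opt not in config or interval_opt not in config:
            continue
        if config[timeout_opt] < factor * config[interval_opt]:
            violations.append((timeout_opt, interval_opt, factor))
    return violations


print(find_violations({
    "source ume_store_activity_timeout": 2000,   # less than 5 * 500
    "source ume_store_check_interval": 500,
}))
```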
The following configuration options come into play when sources register with stores in a lossy environment:
The ume_sri_request "interval" and "maximum" options multiply to define a duration over which the receiver requests Source Registration Information (SRI) messages from the source. Similarly, the transport_topic_sequence_number_info_request "interval" and "maximum" options multiply to define a duration over which the receiver requests Transport Topic Sequence Number Info (TSNI) messages from the source.
Each of the two request durations should be twice the value of the applicable transport activity timeout.
    #
    # To avoid hung store registration, set the durations of the SRI and TSNI
    # requests to 2x the transport activity timeout.
    #
    receiver transport_lbtrm_activity_timeout 60000
    receiver ume_sri_request_maximum 120
    receiver ume_sri_request_interval 1000
    receiver transport_topic_sequence_number_info_request_maximum 120
    receiver transport_topic_sequence_number_info_request_interval 1000
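Because each request duration is the product of an interval and a maximum count, it is easy to change one of the two and silently shorten the duration. The following sketch recomputes both durations and compares them with the transport activity timeout. The function names are hypothetical, and treating "2x" as a minimum rather than an exact target is an assumption.

```python
def request_duration_ms(interval_ms, maximum):
    """Total time over which requests are sent: interval * maximum."""
    return interval_ms * maximum


def check_request_durations(activity_timeout_ms, sri, tsni, factor=2):
    """sri and tsni are (interval_ms, maximum) pairs.

    Returns {"sri": bool, "tsni": bool}; True means the duration is at
    least factor * activity_timeout_ms.
    """
    required = factor * activity_timeout_ms
    return {
        "sri": request_duration_ms(*sri) >= required,
        "tsni": request_duration_ms(*tsni) >= required,
    }


# Values from the example above: 1000 ms * 120 = 120000 ms = 2 * 60000 ms
print(check_request_durations(60000, sri=(1000, 120), tsni=(1000, 120)))
```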