Ultra Messaging Knowledge Base
Overview of Ultra Messaging's lost packet recovery protocol.
• Lost Packet Recovery
• Introduction
• Packet Loss Prevention
• Recovery Protocols
• UDP-Based Transport Recovery
• Delivery Controller Recovery
• Proper Configuration
This article gives an overview of Ultra Messaging's lost packet recovery protocol. It assumes you are familiar with the basics of Ultra Messaging's messaging paradigm and the basics of network data communication. If any of the following are unfamiliar, contact Support for assistance.
Related articles:
The first goal should always be to reduce or eliminate packet loss as much as possible. See Packet Loss.
However, it is usually not possible to eliminate all possibility of packet loss. In most networks, packet loss is the most common failure: a publishing application sends a message, and the packets never reach the subscribing application.
Packet loss usually occurs in one of two ways:
Packet loss generally happens in "incidents" that are limited in duration. Sporadic packet loss might last from a few tens of milliseconds to many seconds. Connectivity failures are typically longer, lasting anywhere from several seconds to many hours.
Note that TCP implements its own lost packet recovery, independent of UM, and usually recovers from sporadic loss. However, a connectivity failure that lasts too long results in a TCP disconnect. When connectivity is restored, TCP does not automatically recover the messages sent during the incident.
In contrast, UDP-based protocols (like UM's LBT-RU and LBT-RM) implement their own reliability protocol that performs the same function as TCP in recovering lost packets. Note that UM's UDP-based reliability protocols can provide lower latency and higher average throughput than TCP in networks where low levels of sporadic loss are endemic.
UM's lost packet recovery protocols are divided into two layers: transport protocol and delivery controller.
The transport protocol's recovery is targeted primarily at sporadic packet loss. The details of the recovery protocols vary depending on the source protocol selected. For example, the LBT-TCP protocol simply leverages TCP's recovery, while LBT-RM and LBT-RU use a NAK-based recovery protocol.
The delivery controller recovery layer generally functions when the transport layer fails to recover. For example:
The LBT-RU and LBT-RM transports have almost identical lost packet recovery protocols. This section applies to both.
Application messages are inserted into a UM datagram that contains a transport sequence number specific to the transport session that the message's topic is mapped to. This is not the sequence number that is visible to the application; it is only visible in packet captures.
A subscriber detects loss as a gap in sequence numbers. That is, it must have successfully received messages both before and after the loss incident to see the gap.
UM's transport recovery does not detect or attempt to recover from head loss or tail loss, although the Delivery Controller recovery can.
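The gap-based detection described above can be illustrated with a minimal sketch. This is not UM's implementation; the class name `GapDetector` and its method are hypothetical, and sequence number wraparound is ignored for simplicity. The sketch shows why loss at the very start of a stream (nothing received before it) or at the very end (nothing received after it) produces no visible gap.

```python
from typing import List, Optional


class GapDetector:
    """Hypothetical receiver-side tracker of transport sequence numbers."""

    def __init__(self) -> None:
        self.highest_seen: Optional[int] = None

    def on_datagram(self, seq: int) -> List[int]:
        """Return the sequence numbers newly detected as missing."""
        if self.highest_seen is None:
            # First datagram received: there is nothing earlier to compare
            # against, so datagrams lost before this one go unnoticed.
            self.highest_seen = seq
            return []
        if seq <= self.highest_seen:
            # Duplicate, retransmission, or out-of-order arrival: no new gap.
            return []
        # Everything between the previous highest and this datagram is a gap.
        missing = list(range(self.highest_seen + 1, seq))
        self.highest_seen = seq
        return missing


detector = GapDetector()
# Datagrams 12 and 13 are lost; the gap only becomes visible when 14 arrives.
gaps = [detector.on_datagram(seq) for seq in (10, 11, 14, 15)]
```

If datagrams 16 and 17 were then lost and nothing further arrived, `on_datagram` would never be called again and no gap would be reported, which is the tail loss case.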
When a subscriber detects a sequence number gap, here is the sequence of operations performed by the transport recovery protocol:
In both LBT-RM and LBT-RU, message retransmissions and NCFs are sent to all subscribers, not just the one(s) that sent NAKs. (In contrast, the delivery controller recovery protocols only send recovery to the subscriber(s) that request it.)
Additional description of this protocol is available at Transport LBT-RM Reliability Options.
NOTES:
[a]: In step 1, the publisher saves transmitted messages in the transport session's transmission window. This is a fixed-size data structure that saves previously sent messages up to a size limit. When the size limit is reached, the oldest messages are removed to make room for the newer messages. Each message that is sent contains in its header the current lowest and highest sequence numbers in the transmission window.
[b]: The initial NAK backoff was not originally part of LBT-RU. It was added in UM version 6.10. We encourage all users of pre-6.10 versions to upgrade.
[c]: In step 3, the missing datagrams might arrive before the initial NAK timer expires. This can happen because another subscriber had the same loss, set a shorter randomized timer, sent its NAK first, and triggered the retransmission. It can also happen because networks and operating systems sometimes deliver UDP packets out of order. When UM receives packets out of order, it interprets the sequence number discontinuity as loss, which is repaired immediately.
[d]: In step 4, the receiver assembles a NAK datagram. The most recent successfully received datagram for that transport session contains the current lowest and highest sequence numbers in the transmission window. If the receiver detects that one or more of its missing sequence numbers are NOT in the source's transmission window range, it declares those sequence numbers unrecoverable loss and does not include them in the NAK datagram. This does not necessarily cause an unrecoverable loss event to be delivered to the subscribing application, since the delivery controller may succeed in recovering the missing messages.
[e]: In step 5, the source might determine that it cannot retransmit a requested datagram. This might be because the datagram is no longer in its transmission window, in which case the request is ignored and no NCF is sent. Or it might have reached its retransmit rate limit, in which case it sends an NCF. Or it might have already sent a retransmission within its suppression interval, in which case it sends an NCF. Also, note that LBT-RM's NCF algorithm changed in UM version 6.10 to reduce the number of NCFs sent during periods of heavy loss. We encourage all users of pre-6.10 versions to upgrade.
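Notes [a], [d], and [e] can be tied together in a small sketch. It is an illustration under stated assumptions, not UM's code: all names (`TransmissionWindow`, `split_missing`, `handle_nak`, `Action`) are hypothetical, the size limit is assumed to be counted in payload bytes, and the rate limit and suppression interval are reduced to boolean inputs.

```python
from collections import OrderedDict
from enum import Enum
from typing import Iterable, List, Optional, Tuple


class TransmissionWindow:
    """Note [a]: saves sent datagrams, evicting the oldest past a size limit."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes  # assumption: limit measured in bytes
        self.used_bytes = 0
        self.saved: "OrderedDict[int, bytes]" = OrderedDict()

    def save(self, seq: int, payload: bytes) -> None:
        self.saved[seq] = payload
        self.used_bytes += len(payload)
        # Remove the oldest datagrams to make room for the newest one.
        while self.used_bytes > self.max_bytes and len(self.saved) > 1:
            _, oldest = self.saved.popitem(last=False)
            self.used_bytes -= len(oldest)

    def bounds(self) -> Optional[Tuple[int, int]]:
        """Lowest and highest saved sequence numbers (sent in each header)."""
        if not self.saved:
            return None
        return next(iter(self.saved)), next(reversed(self.saved))


def split_missing(
    missing: Iterable[int], low: int, high: int
) -> Tuple[List[int], List[int]]:
    """Note [d]: only sequence numbers still in the window go into the NAK."""
    to_nak: List[int] = []
    unrecoverable: List[int] = []
    for seq in missing:
        (to_nak if low <= seq <= high else unrecoverable).append(seq)
    return to_nak, unrecoverable


class Action(Enum):
    IGNORE = "ignore"          # no retransmission and no NCF
    SEND_NCF = "send NCF"
    RETRANSMIT = "retransmit"


def handle_nak(
    seq: int,
    window: TransmissionWindow,
    rate_limit_reached: bool,
    recently_retransmitted: bool,
) -> Action:
    """Note [e]: the source's choices for one requested datagram."""
    bounds = window.bounds()
    if bounds is None or not (bounds[0] <= seq <= bounds[1]):
        return Action.IGNORE
    if rate_limit_reached or recently_retransmitted:
        return Action.SEND_NCF
    return Action.RETRANSMIT


window = TransmissionWindow(max_bytes=300)
for seq in range(1, 6):
    window.save(seq, b"x" * 100)   # five 100-byte datagrams; 1 and 2 age out

low, high = window.bounds()
to_nak, unrecoverable = split_missing([2, 4], low, high)
actions = [
    handle_nak(2, window, False, False),  # outside the window
    handle_nak(4, window, True, False),   # retransmit rate limit reached
    handle_nak(4, window, False, True),   # inside the suppression interval
    handle_nak(4, window, False, False),  # normal case
]
```

In this example the window ends up holding sequence numbers 3 through 5, so a receiver missing 2 and 4 would NAK only 4 and treat 2 as unrecoverable at the transport layer.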
There are two forms of delivery controller recovery:
Note that both late join and OTR use the same underlying protocol, which is TCP-based. This protocol functions between a source and a receiver in a "streaming" (non-persisted) use case, or between a Store and a receiver in a persisted use case.
Proper configuration of sources and receivers will reduce the chances of loss and NAK storms. See Configuring UDP-Based Transports.