Informatica

Ultra Messaging Knowledge Base


Packet Loss

Overview of the causes and treatments for packet loss.

Packet Loss
    • Introduction
    • Causes of Packet Loss
    • Avoiding Packet Loss
        • Decrease Packet Flow through Loss Points
        • Increase Efficiency of Packet Consumers
        • Decrease Packet Rate Using Rate Limiter
            • Many Subscribers to Few Receivers
        • Proper Configuration

Introduction

This article gives an overview of what causes network packet loss and how to treat it. It assumes you are familiar with the basics of Ultra Messaging's messaging paradigm and the basics of network data communication.

Causes of Packet Loss

See Packet Loss Points.

Avoiding Packet Loss

Everybody's goal should be to reduce packet loss as much as possible. There are four methods of avoiding packet loss:

  • Decrease packet flow through loss points.
  • Increase efficiency of packet consumers.
  • Decrease packet rate using rate limiter.
  • Proper configuration.

Decrease Packet Flow through Loss Points

  • Message Batching. At most loss points, the number of packets is usually more important than the sizes of the packets. 100 packets of 60 bytes each is much more burdensome to packet consumers than 10 packets of 600 bytes each. For latency-sensitive applications, consider implementing an Intelligent Batching algorithm.

  • Reduce UM discards. Because of how publishers map topics to Transport Sessions, a receiver may have to discard messages for topics it hasn't subscribed to. The LBT-RM transport statistics structure contains the field "lbm_msgs_no_topic_rcved", which counts the number of data messages discarded. The context statistics structure contains the field "lbtrm_unknown_msgs_rcved", which also counts discarded data messages.

    For the LBT-RU transport type, you can get similar discards. See the LBT-RU transport statistics structure field "lbm_msgs_no_topic_rcved", and the context statistics structure field "lbtru_unknown_msgs_rcved".

    For the TCP or LBT-RU transport types, you can often decrease discards by turning on Source-Side Filtering.

    Source-Side Filtering is not possible with multicast, so it is usually necessary to change the topic-to-transport-session mapping. Ideally, this is done by considering the subscribers' topic interest, but it often means increasing the number of transport sessions. In the case of LBT-RM, increasing the number of multicast groups is preferred, but the number of groups is often limited by the network hardware. In that case, you can increase the number of transport sessions by varying the destination port.

  • Limit the packet transmission rate with the UM rate limiter. With the previous methods, you can reduce the probability of getting loss, but you still might be operating "close to the edge" such that an unexpected burst of traffic will cause an overload and packet loss. The UM rate limiter is intended to reduce the chances of loss close to zero. See Decrease Packet Rate Using Rate Limiter.
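To illustrate the message batching item in the list above, here is a minimal sketch that counts how many packets a stream of small messages needs with and without batching. The function name `count_packets`, the default payload budget, and the greedy packing rule are assumptions made for illustration. This is not UM's batching algorithm, and header overhead is ignored.

```python
def count_packets(message_sizes, max_payload=1400, batching=True):
    """Count the datagrams needed to carry the messages, in order.

    Hypothetical model: without batching, every message is its own packet.
    With batching, messages are packed greedily until the next message
    would push the packet past max_payload bytes. Headers are ignored.
    """
    if not batching:
        return len(message_sizes)

    packets = 0
    used = max_payload + 1  # sentinel: forces the first message to open a packet
    for size in message_sizes:
        if size > max_payload:
            raise ValueError("message exceeds max_payload; fragmentation is not modeled")
        if used + size > max_payload:
            packets += 1   # current packet is full, open a new one
            used = size
        else:
            used += size   # message fits in the current packet
    return packets


# The example from the text: 100 messages of 60 bytes each.
unbatched = count_packets([60] * 100, batching=False)  # 100 packets
batched = count_packets([60] * 100, max_payload=600)   # 10 packets of 600 bytes
```

The byte count is identical in both cases. Only the packet count changes, and the packet count is what burdens most loss points.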

Increase Efficiency of Packet Consumers

Here are some methods for increasing the efficiency of subscribers:

These steps, plus optimizing your own message handling code, will enable your subscribers to increase their message consumption rate without packet loss. However, as long as your publishers can exceed your subscribers' rate, packet loss is still possible.

Decrease Packet Rate Using Rate Limiter

It is usually possible for publishers to outpace subscribers, especially when multiple publishers feed a single subscriber. For example, if multiple client gateways can send to the same order handling process, it is possible that all of them burst at the same time, overloading the order handler. To avoid packet loss, those publishers should be prevented from bursting at dangerous rates.

Note that Smart Sources do not support rate limits. To minimize the chances of packet loss and NAK storms, it becomes the publishing application's responsibility to control its send rate.

  1. Measure your subscribers' maximum sustainable message rate (MSMR). This is usually done empirically by sending test data at progressively higher rates until you start getting loss. See Verifying Loss Detection Tools for information on detecting when that loss happens. With UDP-based transports, you should run for more than a full minute without loss. See notes [a] and [b].

  2. Identify the sources that will be sending messages to the subscriber. Let's say there are two publishers, A and B. Set each publisher's rate limit to 40% of the subscriber's MSMR, so that the combined average send rate never exceeds 80% of the MSMR. Note that the rate limiter will block a sender that attempts to exceed its average rate.

  3. Traffic tends to be bursty. You want to allow brief traffic surges above the MSMR so long as subsequent slower periods bring the average below the MSMR. This is configured using the rate interval. A larger rate interval allows longer, more intense bursts, while a smaller value forces smoother, less-bursty traffic (at the expense of potentially blocking the sender). See notes [c] and [d].
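The allocation in step 2 can be sketched as a small calculation. The function name `per_publisher_rate_limit` and the `headroom` parameter are hypothetical. The 80% default comes from the text, and the result is in whatever unit the MSMR was measured in.

```python
def per_publisher_rate_limit(msmr, num_publishers, headroom=0.8):
    """Split a subscriber's MSMR evenly across its publishers.

    msmr:           subscriber's maximum sustainable rate (any consistent unit)
    num_publishers: number of sources feeding that subscriber
    headroom:       fraction of the MSMR the combined publishers may use
    """
    if num_publishers < 1:
        raise ValueError("need at least one publisher")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")
    return msmr * headroom / num_publishers


# Two publishers, A and B, each get 40% of the MSMR.
limit = per_publisher_rate_limit(msmr=1_000_000, num_publishers=2)
```

With more publishers the per-publisher share shrinks proportionally, which is the limitation discussed under "Many Subscribers to Few Receivers".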

When the protocol operates over a WAN, you may need to set the rate limit below the MSMR. Consider a subscriber that can sustain an input rate of 8 gigabits/sec but is separated from the publisher by a WAN with a total throughput of 10 gigabits/sec. That 10 gigabits/sec must be shared across all networking applications, and allowing this publisher to peak at 8 gigabits/sec would overload the WAN link.

To ensure lossless utilization, you should divide up the available bandwidth and limit each application that sends over the WAN to its share of the bandwidth.
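As a rough sketch of the WAN case, the effective limit is the lower of the subscriber's MSMR and the application's share of the link. The function name `effective_rate_limit` and the 25% share used in the example are assumptions for illustration. How the bandwidth is actually divided is a site-specific decision.

```python
def effective_rate_limit(subscriber_msmr_bps, wan_capacity_bps, wan_share):
    """Return the rate limit (bits/sec) for one application sending over a WAN.

    wan_share is the fraction of the WAN link assigned to this application
    (hypothetical input). The limit is capped by whichever is smaller:
    what the subscriber can consume or what the link share allows.
    """
    if not 0 < wan_share <= 1:
        raise ValueError("wan_share must be in (0, 1]")
    return min(subscriber_msmr_bps, wan_capacity_bps * wan_share)


# Subscriber sustains 8 Gbit/s, WAN carries 10 Gbit/s, this app is assigned 25%.
limit_bps = effective_rate_limit(8e9, 10e9, 0.25)  # 2.5 Gbit/s, set by the WAN share
```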

NOTES:

[a]: Measuring the maximum sustainable message rate (MSMR) can be difficult if you don't have a consistent execution environment. For example, suppose you don't assign your hot threads to specific CPU cores. In that case, your operating system will migrate your threads between cores, even across NUMA zones, preventing optimal cache and memory usage. A rate that is easily handled in one test run can cause packet loss in the next. This measurement can be even harder to characterize if different message types require different amounts of time to consume.

[b]: Your subscriber might handle different message types that require different times to consume. How do you measure the maximum sustainable message rate in this case? You might be able to statistically say that X% of messages are type 1 and Y% are type 2, but unless you can guarantee this breakdown, it is usually safest to assume 100% of your messages are of the type that requires the maximum consumption time.
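The difference between the two approaches in note [b] can be sketched numerically. This is a simplified model, not a replacement for empirical measurement: it assumes a single consuming thread whose rate is limited only by per-message handling time, and the function names and timings are hypothetical.

```python
def conservative_msmr(consume_seconds_by_type):
    """Messages/sec assuming every message is of the slowest type."""
    worst = max(consume_seconds_by_type.values())
    return 1.0 / worst


def mix_based_msmr(consume_seconds_by_type, fraction_by_type):
    """Messages/sec assuming a fixed, guaranteed mix of message types."""
    if abs(sum(fraction_by_type.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    mean = sum(consume_seconds_by_type[t] * fraction_by_type[t]
               for t in fraction_by_type)
    return 1.0 / mean


# Hypothetical timings: type 1 takes 2 microseconds, type 2 takes 10.
times = {"type1": 2e-6, "type2": 10e-6}
safe = conservative_msmr(times)
optimistic = mix_based_msmr(times, {"type1": 0.75, "type2": 0.25})
```

The mix-based figure is always at least as high as the conservative one, and it is only valid while the assumed mix holds.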

[c]: The rate interval should be set explicitly according to your needs. The rate limiter essentially divides each second into N equal periods, each lasting one rate interval (in milliseconds). During each period, the application is allowed to send up to 1/N of the configured rate limit, and it can send that data at the full line rate. You can allow more intense bursts by increasing the rate interval, thus decreasing N.
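The arithmetic in note [c] can be sketched as follows. The function names are hypothetical, and the model assumes the rate interval divides one second evenly. The 10 gigabits/sec line rate in the example is also an assumption.

```python
def intervals_per_second(rate_interval_ms):
    """N: the number of rate-limiter periods in one second."""
    if rate_interval_ms <= 0 or 1000 % rate_interval_ms != 0:
        raise ValueError("this sketch assumes the interval divides 1000 ms evenly")
    return 1000 // rate_interval_ms


def burst_budget_bits(rate_limit_bps, rate_interval_ms):
    """Bits the application may send within a single period."""
    return rate_limit_bps // intervals_per_second(rate_interval_ms)


def burst_duration_ms(budget_bits, line_rate_bps):
    """Time needed to send one period's budget at full line rate."""
    return 1000 * budget_bits / line_rate_bps


# Rate limit of 500 megabits/sec with a rate interval of 100 ms: N = 10.
budget = burst_budget_bits(500_000_000, 100)            # 50,000,000 bits per period
duration = burst_duration_ms(budget, 10_000_000_000)    # 5.0 ms at an assumed 10 Gbit/s
```

Raising the interval to 1000 in this model gives N = 1, so the entire one-second budget can be sent in a single burst.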

[d]: The UM documentation warns you against setting the rate interval to values other than the specific set listed, ranging from 5 to 100. A rate interval of 100 divides each second into 10 periods (N=10). If this value blocks your publisher too much, you can extend it to 200, 500, or possibly even 1000. However, be aware that this allows the publisher to send very intense bursts that might lead to queue overflows at packet loss points with small queues, especially if multiple publishers burst at the same time. You should test your publishers sending "as fast as they can," only constrained by the UM rate limiter. This will produce the worst-case bursts for the duration of the test.

Many Subscribers to Few Receivers

The principle of restricting each of N publishers to 1/N of the subscriber's maximum sustainable message rate (MSMR) may not be practical in some use cases. If N becomes large, the permitted send rates may be too low for your system to function effectively. You may then have to rely on statistical behavior: as N grows, the probability of all publishers bursting simultaneously becomes small. Many users therefore set their rate limiters higher than suggested, confident that the aggregate rate of the N publishers won't exceed the MSMR for longer than the queues can hold.

Just be aware that low-probability events do happen occasionally. UM's Lost Packet Recovery should handle these occasional loss incidents, but you will always have some risk of NAK Storms.
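The statistical argument above can be made concrete with a simple binomial model. This sketch assumes each publisher bursts independently with the same probability at any given instant. That is a strong assumption, because real traffic bursts are often correlated (for example, many publishers reacting to the same event), so treat the result as optimistic. The function name and parameters are hypothetical.

```python
from math import comb


def prob_more_than(n_publishers, burst_prob, max_simultaneous):
    """Probability that more than max_simultaneous publishers burst at once.

    Models each publisher as an independent coin flip with probability
    burst_prob of bursting at a given instant (binomial distribution).
    """
    if not 0 <= burst_prob <= 1:
        raise ValueError("burst_prob must be in [0, 1]")
    return sum(
        comb(n_publishers, k)
        * burst_prob ** k
        * (1 - burst_prob) ** (n_publishers - k)
        for k in range(max_simultaneous + 1, n_publishers + 1)
    )


# 20 publishers, each bursting 5% of the time; the subscriber tolerates 4 at once.
risk = prob_more_than(20, 0.05, 4)
```

A small but nonzero `risk` is the quantitative form of the warning above: the overload will eventually occur, and lost packet recovery has to absorb it.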

Proper Configuration

Proper configuration of sources and receivers will reduce the chances of loss and NAK storms. See Configuring UDP-Based Transports.

