Store Loss Repair <-

The persistent Store uses a normal UM receiver to get messages from the source. The Store is subject to the same potential Packet Loss scenarios as an application subscriber, and uses the same loss repair techniques as a subscriber.

Informatica recommends enabling Off-Transport Recovery (OTR) on the Store and application subscribers.

Persistence Buffer Sizes <-

There are two memory buffers that need to be properly sized to get the desired level of performance and reliability for persistence:

Source retention buffer, also known as the late join buffer.
Store message cache.

This analysis assumes that you are using disk-based Stores.

The sizing depends on how sensitive your publishing application is to being slowed down in the event of packet loss. The Stores are designed to write messages to disk in sequence-number order. So in the event of packet loss, newly received messages must be buffered in the Store's message cache while the Store waits for retransmission of the lost messages. If the source's message rate is high and the time required to repair the packet loss is significant, the Store's cache must be configured to be large.

Note that this affects the flight size configuration also. For SPP Stores, message stability is acknowledged to the source after the message is successfully written to disk. But for messages being kept in the Store's message cache waiting for lost packet retransmission, stability acknowledgement is delayed. All messages sent during this time are said to be "in flight". To avoid the publisher being blocked on flight size, the flight size limit might need to be made large. It is not unusual for our performance-sensitive users to configure the flight size limit to be in the tens of thousands.

Finally, the source's retention buffer (late join buffer) should be sized the same as the flight size limit.
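
To get a feel for how loss repair time drives these sizes, the following is a rough sketch, not a formula from this guide. It assumes the source keeps sending at its average rate while the Store waits for a retransmission, and it uses a hypothetical repair time of 500 milliseconds. The function names are illustrative only.

```python
def messages_in_flight_during_repair(avg_msg_rate, repair_time_ms):
    """Messages sent while the Store waits for a lost packet (rounded up).

    Assumption: the source sends at a steady avg_msg_rate (messages/sec)
    for the whole repair window.
    """
    return -(-(avg_msg_rate * repair_time_ms) // 1000)  # integer ceiling division


def cache_bytes_during_repair(avg_msg_rate, avg_msg_size, repair_time_ms):
    """Approximate bytes the Store's message cache must hold during the repair."""
    return messages_in_flight_during_repair(avg_msg_rate, repair_time_ms) * avg_msg_size


# Hypothetical use case: 10,000 msgs/sec, 1024-byte messages, 500 ms repair time.
msgs = messages_in_flight_during_repair(10_000, 500)
cache = cache_bytes_during_repair(10_000, 1024, 500)
print(msgs, cache)
```

With these hypothetical inputs, about 5,000 messages (roughly 5 MB) accumulate during one repair, which suggests why flight size limits in the thousands or tens of thousands are common.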

Calculating Options for SPP <-

Determine the following for your application:

avg_msg_size - The source's average size of application messages, in bytes.
avg_msg_rate - The source's average message send rate, in datagrams per second.

The following formulas calculate minimum recommended values of some configuration options:

ume_flight_size = 3 * (ume_ack_batching_interval/1000) * avg_msg_rate
ume_flight_size_bytes = ume_flight_size * avg_msg_size
ume_repository_size_threshold = ume_flight_size_bytes
ume_repository_size_limit = 1.2 * ume_repository_size_threshold

If Topic Option "stability-ack-minimum-number" is greater than 1, the value of Topic Option "stability-ack-interval" needs to be added in as follows:

ume_flight_size = 3 * ((ume_ack_batching_interval + stability-ack-interval)/1000)
                  * avg_msg_rate

The other parameters must then be recalculated.

For example:

avg_msg_size = 1024 bytes            (hypothetical use case)
avg_msg_rate = 10,000 datagrams/sec  (hypothetical use case)
ume_ack_batching_interval = 100      (milliseconds, the default)

ume_flight_size = 3 * (100/1000) * 10000 = 3000
ume_flight_size_bytes = 3000 * 1024 = 3,072,000
ume_repository_size_threshold = 3,072,000
ume_repository_size_limit = 1.2 * 3072000 = 3,686,400

Note that application deviations (e.g. bursts) from the averages can result in unexpected disk writes and blocking imposed on the source. To reduce the chances of blocking, the values for avg_msg_size and/or avg_msg_rate can be increased (with the subsequent configuration values recalculated).
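
The SPP formulas above can be sketched as a small calculation helper. This is an illustrative sketch only: the function name, the parameter names, and the choice to round results up to whole numbers are assumptions, not part of the product. Intervals are in milliseconds, and inputs are assumed to be integers.

```python
def ceil_div(a, b):
    """Integer ceiling division, to avoid floating-point rounding surprises."""
    return -(-a // b)


def spp_minimums(avg_msg_size, avg_msg_rate,
                 ack_batching_interval_ms=100, stability_ack_interval_ms=0):
    """Minimum recommended SPP values, following the formulas in this section.

    Pass stability_ack_interval_ms only when stability-ack-minimum-number
    is greater than 1; otherwise leave it at 0.
    """
    interval_ms = ack_batching_interval_ms + stability_ack_interval_ms
    flight_size = ceil_div(3 * interval_ms * avg_msg_rate, 1000)
    flight_size_bytes = flight_size * avg_msg_size
    threshold = flight_size_bytes
    limit = ceil_div(12 * threshold, 10)  # 1.2 * threshold, rounded up
    return {
        "ume_flight_size": flight_size,
        "ume_flight_size_bytes": flight_size_bytes,
        "ume_repository_size_threshold": threshold,
        "ume_repository_size_limit": limit,
    }


# Same hypothetical use case as the worked example above.
spp = spp_minimums(avg_msg_size=1024, avg_msg_rate=10_000)
print(spp)
```

To add headroom for bursts, call the helper again with larger values for avg_msg_size or avg_msg_rate and use the recalculated results.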

RPP Configuration Specifics <-

With RPP, there is a non-obvious interaction between the settings for:

Store configuration option repository-size-limit.
Store configuration option repository-size-threshold.
Store configuration option repository-disk-write-delay.
Source LBM configuration option ume_repository_ack_on_reception (source).
Source LBM configuration options ume_flight_size_bytes (source) and ume_flight_size (source).
Receiver LBM configuration option ume_ack_batching_interval (context).
Store LBM configuration options stability-ack-interval and stability-ack-minimum-number.

The source and Store can enter a state where the sending rate is severely limited by flight size, even though the Store is relatively idle.

To avoid this issue, you need to analyze your usage patterns and set your configuration options appropriately. Determine the following for your application:

avg_msg_size - The source's average size of application messages, in bytes.
avg_msg_rate - The source's average message send rate, in datagrams per second.

Next you need to decide how long to set repository-disk-write-delay. The idea is that you want to avoid writing to disk during normal operation, so you should set it long enough that all normally-operating receivers have a chance to acknowledge consumption of the messages during the write delay time. Remember that when a receiver gets a message, it might delay sending its consumption acknowledgement for up to ume_ack_batching_interval (context) milliseconds (defaults to 100).

You also need to take into account heavy bursts of traffic, where receivers might store significant numbers of messages in their socket buffers, and it can take time for the receivers to work their way through all of them.

But you don't want to make the repository-disk-write-delay larger than necessary because it can lead to very high memory usage in the Store. We generally see repository-disk-write-delay being set to values as low as 1000 (1 sec) and as high as 5000 (5 sec).
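
To illustrate the memory cost of a longer write delay, the sketch below evaluates only the write-delay term of the threshold formula given in this section (avg_msg_size * avg_msg_rate * repository-disk-write-delay/1000) for a few delay values. The message size and rate are hypothetical, and the function name is illustrative.

```python
def write_delay_bytes(avg_msg_size, avg_msg_rate, write_delay_ms):
    """Bytes of messages the Store holds in memory during the write delay.

    Assumption: the source sends steadily at avg_msg_rate (messages/sec).
    """
    return -(-(avg_msg_size * avg_msg_rate * write_delay_ms) // 1000)


# Hypothetical use case: 1024-byte messages at 10,000 msgs/sec.
for delay_ms in (1000, 2500, 5000):
    print(delay_ms, write_delay_bytes(1024, 10_000, delay_ms))
```

For this hypothetical source, moving from a 1-second to a 5-second delay raises the memory held per source from about 10 MB to about 51 MB.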

The following formulas calculate minimum recommended values of some configuration options:

ume_flight_size = 3 * (ume_ack_batching_interval/1000) * avg_msg_rate
ume_flight_size_bytes = ume_flight_size * avg_msg_size
ume_repository_size_threshold =
        (avg_msg_size * avg_msg_rate * (repository-disk-write-delay/1000))
        + ume_flight_size_bytes
ume_repository_size_limit = 1.2 * ume_repository_size_threshold

If Topic Option "stability-ack-minimum-number" is greater than 1, the value of Topic Option "stability-ack-interval" needs to be added in as follows:

ume_flight_size = 3 * ((ume_ack_batching_interval + stability-ack-interval)/1000)
                  * avg_msg_rate

The other parameters must then be recalculated.

For example:

avg_msg_size = 1024 bytes            (hypothetical use case)
avg_msg_rate = 10,000 datagrams/sec  (hypothetical use case)
ume_ack_batching_interval = 100      (milliseconds, the default)
repository-disk-write-delay = 2500   (2.5 sec, chosen by user)

ume_flight_size = 3 * (100/1000) * 10000 = 3000
ume_flight_size_bytes = 3000 * 1024 = 3,072,000
ume_repository_size_threshold = (1024 * 10000 * (2500/1000)) + 3072000 = 28,672,000
ume_repository_size_limit = 1.2 * 28672000 = 34,406,400

Note that application deviations (e.g. bursts) from the averages can result in unexpected disk writes and blocking imposed on the source. To reduce the chances of blocking, the values for avg_msg_size and/or avg_msg_rate can be increased (with the subsequent configuration values recalculated).
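
The RPP formulas above can be sketched the same way. As before, this is an illustrative sketch: the function and parameter names are assumptions, results are rounded up to whole numbers, intervals and delays are in milliseconds, and inputs are assumed to be integers.

```python
def ceil_div(a, b):
    """Integer ceiling division, to avoid floating-point rounding surprises."""
    return -(-a // b)


def rpp_minimums(avg_msg_size, avg_msg_rate, write_delay_ms,
                 ack_batching_interval_ms=100, stability_ack_interval_ms=0):
    """Minimum recommended RPP values, following the formulas in this section.

    Pass stability_ack_interval_ms only when stability-ack-minimum-number
    is greater than 1; otherwise leave it at 0.
    """
    interval_ms = ack_batching_interval_ms + stability_ack_interval_ms
    flight_size = ceil_div(3 * interval_ms * avg_msg_rate, 1000)
    flight_size_bytes = flight_size * avg_msg_size
    # Messages held in memory during the write delay, plus the flight size.
    delay_bytes = ceil_div(avg_msg_size * avg_msg_rate * write_delay_ms, 1000)
    threshold = delay_bytes + flight_size_bytes
    limit = ceil_div(12 * threshold, 10)  # 1.2 * threshold, rounded up
    return {
        "ume_flight_size": flight_size,
        "ume_flight_size_bytes": flight_size_bytes,
        "ume_repository_size_threshold": threshold,
        "ume_repository_size_limit": limit,
    }


# Same hypothetical use case as the worked example above.
rpp = rpp_minimums(avg_msg_size=1024, avg_msg_rate=10_000, write_delay_ms=2500)
print(rpp)
```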