Guide for Persistence
Configuring for Persistence and Recovery

Deployment decisions play a huge role in the success of any persistent system. UM provides a number of configuration options that aid performance, fault recovery, and overall system stability. It is not possible, or at least not wise, to completely divorce configuration from application development in high-performance systems. This is true not only for persistent systems, but for practically all distributed systems. When designing systems, take deployment considerations into account for the following:


Source Considerations  <-

Performance of sources is heavily impacted by:

  • The Retention Policy that the source uses,

  • Streaming methods of the source,

  • The throughput and latency requirements of the data.

Source release settings have a direct impact on memory usage. Retained messages consume memory, which is reclaimed only when the messages are released. Message stability, delivery confirmation, and retention size all interact to determine when messages may be released. UM also provides a hard limit on memory usage; when it is exceeded, UM delivers a Forced Reclamation event, so applications that anticipate forced reclamations can handle them appropriately. See also Source Message Retention and Release.
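
For example, a release policy along these lines could be expressed in an LBM configuration file. The option names shown (retransmit_retention_size_threshold, retransmit_retention_age_threshold, retransmit_retention_size_limit) and the values are illustrative assumptions only; consult the UM Configuration Reference for exact names, units, and defaults.

    # Keep at least ~10 MB of retained messages before considering any for release.
    source retransmit_retention_size_threshold 10485760
    # Age-based release threshold (value illustrative; check units in the
    # Configuration Reference).
    source retransmit_retention_age_threshold 60000
    # Hard limit on retained message memory; exceeding it triggers forced
    # reclamation events delivered to the source application.
    source retransmit_retention_size_limit 52428800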

How the source streams data has a direct impact on latency and throughput. One streaming method sets a maximum count of outstanding messages. Once that count is reached, the source sends no more messages until message stability notifications reduce the number outstanding. The umesrc example program uses this mechanism to limit the speed of a source to something a Store can handle comfortably. It also places an upper bound on the amount of recovery needed, which can simplify handling of streaming source recovery.
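
A sketch of this kind of throttling, assuming the ume_flight_size and ume_flight_size_behavior options (values illustrative):

    # Allow at most 1000 messages to be in flight (sent but not yet stable).
    source ume_flight_size 1000
    # When the limit is reached, block the send call rather than returning an
    # error (behavior value is an assumption; verify it against the
    # Configuration Reference).
    source ume_flight_size_behavior block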

The throughput and latency requirements of the data are normal UM concerns.


Receiver Considerations  <-

In addition to the topics below, persistent receiver performance is subject to the same considerations as any receiver during normal operation.


Receiver Acknowledgement Generation  <-

Persistent receivers send a message consumption acknowledgement to the Stores and to the message source. Some applications may want to control this acknowledgement explicitly themselves; in that case, ume_explicit_ack_only (receiver) can be used.
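
For instance, an LBM configuration file for such an application might contain the line below; the application is then responsible for acknowledging each message once it has truly finished consuming it.

    # Disable automatic acknowledgements; the application acknowledges
    # consumption explicitly.
    receiver ume_explicit_ack_only 1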


Controlling Retransmission  <-

Receivers send retransmission requests, then receive and process the retransmissions. Control over this process is crucial when handling very long recoveries, such as hundreds of thousands or millions of messages. A receiver only sends a limited number of retransmission requests at a time.

This means that a receiver does not request everything at once; the number of simultaneous requests is controlled by retransmit_request_outstanding_maximum (receiver). The value of the low sequence number (see Persistent Receiver Recovery) has a direct impact on how many requests need to be issued. A receiving application can decide to recover only the last X messages, instead of recovering them all, using the option retransmit_request_maximum (receiver). The interval between requests, used when a retransmission does not arrive, is controlled with retransmit_request_interval (receiver). The total time allowed to recover all messages is also configurable.
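
As an illustration only, the options named above might be combined in an LBM configuration file as follows (values are placeholders, not recommendations):

    # At most 100 retransmission requests outstanding at any one time.
    receiver retransmit_request_outstanding_maximum 100
    # Recover at most the last 100,000 messages rather than everything.
    receiver retransmit_request_maximum 100000
    # Re-request a retransmission that has not arrived within this interval
    # (value illustrative; check units in the Configuration Reference).
    receiver retransmit_request_interval 500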


Receiver Recovery Process  <-

Theoretically, receivers can handle up to roughly 2 billion messages during recovery. This limit is implied by the sequence number arithmetic, not by any other limitation. For recovery, the crucial limiting factor is how the receiver processes and handles retransmissions, which arrive as fast as UM can request them and a Store can retransmit them. This may be much faster than the application can handle them. It is also important to realize that while recovery progresses, the source may still be transmitting new data; this data is buffered until recovery is complete and then handed to the application. It is prudent to understand the application's processing load when planning how much recovery will be needed and how it should be configured within UM.
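
One related option, not mentioned above and included here only as an assumption to be verified against the UM Configuration Reference, influences how live messages arriving during recovery are cached:

    # Cache new live messages during recovery when they are within this many
    # sequence numbers of the messages still being recovered (value and
    # semantics illustrative; confirm in the Configuration Reference).
    receiver retransmit_message_caching_proximity 2147483647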


Store Configuration Considerations  <-

Stores have numerous configuration options. See Configuration Reference for Umestored for details.


Configuring Store Usage per Source  <-

A Store handles persisted state on a per-topic, per-source basis. Depending on the load of topics and sources, it may be prudent to spread the topic space, or just the source space, across multiple Stores to handle large loads. Because Store usage is configured per source, this is extremely easy to do. It is equally easy to spread CPU load (via multi-threading) and hard disk usage across Stores: a single Store Process can host a set of Store instances, each with its own thread.
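
As a rough sketch only (element names, attributes, and the topic patterns are hypothetical and recalled from typical umestored configurations; verify them against the Configuration Reference for Umestored), a single Store Process might host two Store instances, each persisting a different part of the topic space:

    <?xml version="1.0"?>
    <ume-store version="1.3">
      <stores>
        <!-- Each store instance runs on its own thread within this process. -->
        <store name="store-orders" port="14567">
          <topics>
            <topic pattern="^ORDERS\." type="PCRE"/>
          </topics>
        </store>
        <store name="store-quotes" port="14568">
          <topics>
            <topic pattern="^QUOTES\." type="PCRE"/>
          </topics>
        </store>
      </stores>
    </ume-store>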


Memory Use by Stores  <-

As mentioned previously in Persistent Store Concept, Stores can be memory based or disk based. Disk Stores can also spread hard disk usage across multiple physical disks by using multiple Store instances within a single Store Process. This gives great flexibility, on a per-source basis, for spreading data reception and persistent data load.

Stores provide settings for controlling memory usage and for caching messages for retransmission both in memory and on disk. Every message in a Store, whether held in memory or on disk, carries a small amount of memory state, roughly 72 bytes per message. For very large message caches this becomes non-trivial; for example, retaining 100 million messages consumes on the order of 7 GB of memory for this per-message state alone.


Activity Timeouts  <-

Stores are NOT archives and are not designed for archival. Stores persist source and receiver state with the aim of providing message recovery in the event of a fault. Central to this is the concept that each source or receiver has an activity timeout attached to it. Once a source or receiver suspends operation or fails, it has a set time before the Store forgets about it. This activity timeout needs to be long enough to satisfy the recovery demands of sources and receivers. However, it cannot and should not be infinite. Each source takes up memory and disk space, so choose a timeout that meets the requirements of recovery but is not so long that the Store's limited resources are exhausted.
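
For example, the LBM options below (assumed here, since the section above does not name specific options; verify them in the UM Configuration Reference) let a source and a receiver declare the activity timeout the Store should apply to their state:

    # Store forgets this source's state 5 minutes after it stops responding
    # (value illustrative, in milliseconds).
    source ume_activity_timeout 300000
    # Store forgets this receiver's state 5 minutes after it goes quiet.
    receiver ume_activity_timeout 300000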


Recommendations for Store Configuration  <-

  • Number of Stores in the QC group. Informatica recommends a minimum of 3 Stores. A publisher defines the Store QC group using the LBM configuration option ume_store (source); this option is specified multiple times to define the desired number of Stores in the QC group (see the example configuration after this list).

  • Flight Size - Maximum number of messages sent but not stable in a quorum of Stores. The publishing application should not exceed the flight size. See Persistence Flight Size for configuration details.

  • Off-Transport Recovery (OTR). Informatica recommends that Stores be configured to use OTR to recover lost messages from the Source. Note that the default for use_otr (receiver) is "2", which does NOT enable OTR for the Store. Informatica recommends setting "use_otr" to 1 in the Store's LBM configuration file.

  • Proactive Retransmissions. Informatica recommends that persistent sources use proactive retransmission to ensure message stability. See ume_message_stability_timeout (source) (on by default).

  • Burst Loss. Informatica strongly recommends disabling "burst loss" by setting the LBM configuration option delivery_control_maximum_burst_loss (receiver) to a very large number, perhaps 10000000. This should be done for both the Store's LBM configuration and for the subscriber's LBM configuration.

  • Persistence Buffer Sizes. Informatica recommends performing an analysis of expected publisher data rates and worst-case data repair times to properly size the Store's retention buffer and the source's retention buffer (late join buffer). See Persistence Buffer Sizes.
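
A minimal LBM configuration sketch that pulls several of these recommendations together is shown below. The addresses and values are placeholders; note also that ume_store belongs in the publisher's configuration, while use_otr and delivery_control_maximum_burst_loss apply to the Store's and subscribers' configurations.

    # Publisher: a QC group of three Stores (addresses are placeholders).
    source ume_store 10.29.3.101:14567
    source ume_store 10.29.3.102:14567
    source ume_store 10.29.3.103:14567

    # Store and subscribers: enable OTR recovery of lost messages.
    receiver use_otr 1

    # Store and subscribers: effectively disable burst-loss declarations.
    receiver delivery_control_maximum_burst_loss 10000000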


Store Configuration Practices to Avoid  <-

Informatica recommends against the following Store configuration practices: