Message Loss Recovery <-

Persistence offers the following message recovery mechanisms:

Method	Product	Transports	Description
Negative Acknowledgments (NAKs)	UMS, UMP, UMQ	LBT-RM, LBT-RU	Recovers lost transport datagrams from the source which may contain many small topic messages or fragments of a large message. Receivers send unicast NAKs to the source for missed transport datagrams. Source retransmits datagrams over the configured UM transport.
Late Join	UMS, UMP, UMQ	All	Retransmits messages via unicast to receivers joining the stream after the messages were originally sent. See Using Late Join.
Durable Receiver Recovery	UMP, UMQ	All	Recovers messages persisted while a durable receiver was offline. UM initiates recovery when a durable receiver joins a persistent stream. The receiver then requests retransmission from the Store starting with the low sequence number, defined as the last message it acknowledged to the Store plus one. The Store unicasts retransmissions. See Persistent Receiver Recovery.
Off Transport Recovery	UMS, UMP, UMQ	All	Recovers lost topic messages. Receiver detects lost sequence number and requests retransmission from the source or Persistent Stores (if applicable). UM unicasts retransmissions. See Off-Transport Recovery (OTR).
Proactive Retransmissions	UMP, UMQ	All	Recovers lost messages never received by the Store or never acknowledged by the Store. Operates independently of any receivers. Source unicasts retransmissions. See Proactive Retransmissions.
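
As an illustration of two rows in the table above, the following minimal Python sketch computes the low sequence number used for durable receiver recovery (the last acknowledged sequence number plus one) and lists the gaps a receiver would ask to have retransmitted. The function names are hypothetical and are not part of the UM API. Sequence number wraparound is ignored for simplicity.

```python
def low_sequence_number(last_acked_sqn):
    """First sequence number to request: last acknowledged message plus one."""
    return last_acked_sqn + 1


def missing_sequence_numbers(received, expected_next):
    """List sequence numbers skipped between expected_next and the highest received.

    `received` is any iterable of sequence numbers seen so far. This is an
    illustrative gap check only, not how UM implements loss detection.
    """
    seen = set(received)
    if not seen:
        return []
    return [sqn for sqn in range(expected_next, max(seen) + 1) if sqn not in seen]
```

For example, a receiver that last acknowledged sequence number 41 would begin recovery at 42.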

Persistence Proxy Sources <-

By default, UM expects persistent sources to be running concurrently with persistent receivers. If a source exits, any persistent receivers will disconnect from that source's transport and will wait for the source to come back. More significantly, if a new receiver starts while the source is absent, the receiver will be unable to discover the Stores where the old source's previous messages are stored. As a result, that late-joining receiver cannot recover messages until the source restarts.

The Proxy Source feature allows you to configure Stores to create a UM source object to take the place of the exited source. This proxy source behaves much like a real source in that it provides all of the necessary information to subscribers so that they can discover and register with the Stores. This allows late-joining receivers to recover messages they missed.

After the real source returns, the Store automatically deletes its proxy source, allowing the real source to resume normal operation.

Some other features of Proxy Sources include:

Requires a Quorum/Consensus Store configuration.
Normal Store failover operation also initiates a new proxy source.
A Store can be running more than one proxy source if more than one source has failed.
A Store can be running multiple proxy sources for the same topic, each one corresponding to a previous instance of a real source.

Note that proxy sources do introduce extra network and CPU loading, so proxy sources should only be enabled if their functionality is needed.

How Proxy Sources Operate <-

The following sequence illustrates the life of a proxy source:

A source configured for Proxy Source sends to receivers and a group of Quorum/Consensus Stores.
The source fails.
The source's ume_activity_timeout (source) or the Store's source-activity-timeout expires.
The Quorum/Consensus Stores elect a single Store to run the proxy source.
The elected Store creates a proxy source and sends topic advertisements.
The failed source reappears.
The Store deletes the proxy source and the original source resumes activity.
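
The sequence above can be read as a small state machine. The following Python sketch is only a model of that sequence for discussion purposes. The state and event names are hypothetical and do not correspond to any UM API or log message. The "proxy_store_fails" transition models the re-election that happens when the Store running the proxy source fails.

```python
from enum import Enum


class ProxyState(Enum):
    SOURCE_ACTIVE = "source active"
    SOURCE_SILENT = "source failed, activity timer running"
    ELECTING = "Stores electing a proxy owner"
    PROXY_RUNNING = "proxy source advertising"


# Allowed transitions, keyed by (current state, event name).
# All names here are illustrative assumptions.
TRANSITIONS = {
    (ProxyState.SOURCE_ACTIVE, "source_fails"): ProxyState.SOURCE_SILENT,
    (ProxyState.SOURCE_SILENT, "activity_timeout"): ProxyState.ELECTING,
    (ProxyState.ELECTING, "store_elected"): ProxyState.PROXY_RUNNING,
    (ProxyState.PROXY_RUNNING, "proxy_store_fails"): ProxyState.ELECTING,
    (ProxyState.PROXY_RUNNING, "source_returns"): ProxyState.SOURCE_ACTIVE,
}


def step(state, event):
    """Return the next state, or raise ValueError for an unmodeled transition."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not modeled in state {state.name}")
```

Feeding the events "source_fails", "activity_timeout", "store_elected", and "source_returns" through step() in order walks the model from an active source, through a running proxy, and back to an active source.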

Note that the implementation of the proxy source involves the Store creating a normal UM source object. As such, the user is responsible for providing the Store with a UM library configuration with appropriate source-scoped options. For most source-scoped configuration options, there is no requirement for the proxy source's settings to match the original source's settings. However, there are a few that should be configured the same:

ume_retention_intergroup_stability_behavior (source) (if configured by the original source).
ume_retention_intragroup_stability_behavior (source) (if configured by the original source).
source-related topic resolution options (e.g. resolver_advertisement_minimum_sustain_duration (source)).

Some UM customers have found reasons to intentionally configure their proxy source differently from the original source. For example, to conserve network resources, some customers choose to configure a different transport and change topic-to-transport session mappings. Feel free to contact UM Support for guidance in configuring your proxy sources.

If the Store running the proxy source fails, the other Stores in the Quorum/Consensus group detect a source failure again and can elect a new Store to initiate a proxy source, subject to the Store Option "proxy-source-repo-quorum-required".

Activity Timeout and State Lifetimes <-

UM provides activity and state lifetime timers for sources and receivers that operate in conjunction with the proxy source option or independently. This section explains how these timers work together and how they work with proxy sources.

Activity Timeout

The Store uses the activity timer to decide whether to allow a new registration with the same registration ID. The Store does not allow two applications to be registered at the same time with the same registration ID. However, if an application exits abnormally, you will want to restart it and have it register with the same registration ID. How does the Store prevent simultaneous registrations while allowing sequential ones? That is, how does the Store decide that an existing registrant has exited? It uses the activity timer.

After registration, the Store expects to hear some kind of activity (message, control, or keepalive) before the activity timer expires. If not, then the Store assumes the source or receiver has been deleted, perhaps by the program cleaning up, or perhaps by crashing. That "releases" the registration ID for use by another application instance.

Setting the activity timeout is somewhat of a balancing act. If you set it too long, then you need to wait a long time before you can restart a crashed application instance. If you set it too short, you risk the Store timing out the application too soon, leaving it vulnerable to having its registration ID "stolen" by another application instance.

Some users maintain tight control over their applications, and choose to set the activity timeout to zero. This results in "weak RegIDs", meaning that the Store does not enforce serialized access to the registration IDs. Other users choose a non-zero activity timeout, and rely on the Store to prevent simultaneous use of a registration ID. This results in "strong RegIDs", meaning that the Store enforces serialized access to the registration IDs.
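
The difference between strong and weak RegIDs can be sketched as a simple registration table. This is a minimal Python model, assuming the Store tracks only the time of the last activity per registration ID. The class and method names are hypothetical, and the real Store logic is more involved.

```python
class RegistrationTable:
    """Toy model of how an activity timeout gates reuse of a registration ID."""

    def __init__(self, activity_timeout):
        # A timeout of 0 models "weak RegIDs": no serialization is enforced.
        self.activity_timeout = activity_timeout
        self.last_activity = {}  # reg_id -> time of last message/control/keepalive

    def note_activity(self, reg_id, now):
        """Record activity from an already registered application."""
        if reg_id in self.last_activity:
            self.last_activity[reg_id] = now

    def try_register(self, reg_id, now):
        """Return True if the registration is accepted, False if the ID is in use."""
        last = self.last_activity.get(reg_id)
        in_use = (
            last is not None
            and self.activity_timeout > 0
            and now - last < self.activity_timeout
        )
        if in_use:
            return False
        self.last_activity[reg_id] = now
        return True
```

With a 30-second timeout, a second registration for the same ID 10 seconds after the first is rejected, while one arriving after 30 seconds of silence is accepted. With a timeout of 0, both are accepted.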

The activity timeouts default to 30 seconds, and can be configured by the application using: ume_activity_timeout (source) and ume_activity_timeout (receiver). They can also be configured by the Store using: Topic Option "source-activity-timeout" and Topic Option "receiver-activity-timeout". (If both the application and the Store configure the same timer, the result varies and is described in the above linked documentation.)

Finally, be aware that if the activity timeout is longer than the state lifetime, then the expiration of the activity timeout also triggers the deletion of state information.

State Lifetime

The state lifetime timer determines how long state information is retained on a Store in the absence of the source or receiver. For example, if a publisher exits, its state and message data are retained for the state lifetime period, and are then discarded.

After registration, the Store expects to hear some kind of activity (message, control, or keepalive) before the state lifetime timer expires. If not, then the Store deletes the state information associated with the source or receiver.

Setting the state lifetime is somewhat of a balancing act. If you set the source state lifetime too long, it can lead to old, stale data being available to subscribers during periods that you don't want it. If you set it too short, you risk the Store timing out the application too soon, potentially leading to undesired message loss.

For short-lived publishers that start, register, perform some function, and exit, a fairly short state lifetime can make sense. For long-lived publishers that might have long-lasting outages, and for which all published messages must be reliably delivered, a long state lifetime is more appropriate.

The state lifetimes default to 0, meaning that an application's state will be deleted immediately after the activity timeout happens. Most UM users set this option to a non-zero value, according to their requirements. The state lifetime can be configured by the application using: ume_state_lifetime (source) and ume_state_lifetime (receiver). It can also be configured by the Store using: Topic Option "source-state-lifetime" and Topic Option "receiver-state-lifetime". (If both the application and the Store configure the same timer, the result varies and is described in the above linked documentation.)
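
The interaction between the two timers described above can be summarized in a few lines of Python. This is a sketch under one stated assumption: both timers are measured from the last activity the Store heard, so state is deleted once the longer of the two has elapsed. The function name is hypothetical, and the time unit is arbitrary as long as all three arguments use the same one.

```python
def registration_status(idle_time, activity_timeout, state_lifetime):
    """Return (reg_id_released, state_deleted) after idle_time without activity.

    Assumes both timers start at the last activity seen by the Store.
    """
    reg_id_released = idle_time >= activity_timeout
    # State survives until the longer of the two timers has expired, so a
    # state lifetime of 0 means deletion at the activity timeout.
    state_deleted = idle_time >= max(activity_timeout, state_lifetime)
    return reg_id_released, state_deleted
```

For instance, with an activity timeout of 30 and a state lifetime of 120, an application that has been silent for 45 has released its registration ID but still has its state on the Store.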

Activity and State Lifetime Timers Together

Proxy Sources

If you have enabled the Proxy Source option, a source activity timeout triggers the creation of the proxy source. The following diagram illustrates this behavior:

Enabling the Proxy Sources <-

You must configure both the source and the Stores to enable the Proxy Source option.

Configure the source in an LBM Configuration File with the source configuration option, ume_proxy_source (source).
Configure the Stores in the Store configuration file with the Store Element Option, allow-proxy-source.

Proxy Source Elections <-

When the Stores configured for proxy source detect the loss of a registered source (expiration of the source's ume_activity_timeout (source)), one of the Stores should create a proxy source. The Stores of a Q/C group perform an election to determine which Store creates the proxy.

Each Store starts by waiting a randomized amount of time based on its proxy-election-interval option setting. The Store creates a proxy source if it has not received a persistent registration request (PREG) from a proxy on a different Store. The proxy source then sends a PREG containing a unique random value to the other Stores. This value determines which Store deletes its proxy source in the case that any two Stores independently determine they should create a proxy source. The nature of the random values ensures that only one Store within the Q/C group or configuration of groups keeps its proxy source.

Note that the Topic Option "source-activity-timeout" value should be set to at least double the Topic Option "keepalive-interval" value.
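
The election described above can be approximated with a short simulation. The following Python sketch rests on several assumptions that the text does not specify: the wait is drawn uniformly between 0 and the election interval, a PREG reaches every other Store after a fixed network delay, and the proxy carrying the largest random value is the one that survives. The function and parameter names are hypothetical.

```python
import random


def run_election(store_names, election_interval, network_delay, rng):
    """Simulate one proxy source election; return (winner, creators).

    `rng` is a random.Random instance so that runs can be reproduced.
    """
    # Each Store waits a randomized time derived from the election interval.
    waits = {name: rng.uniform(0, election_interval) for name in store_names}
    creators = {}  # Store name -> random value carried in its proxy's PREG
    for name in sorted(store_names, key=waits.get):
        # A Store skips proxy creation if another proxy's PREG arrived first.
        heard_preg = any(
            waits[other] + network_delay <= waits[name] for other in creators
        )
        if not heard_preg:
            creators[name] = rng.getrandbits(64)
    # Tie-break (assumption): the proxy with the largest value is kept and
    # every other Store deletes its proxy source.
    winner = max(creators, key=creators.get)
    return winner, sorted(creators)
```

When the network delay is small relative to the election interval, usually only one Store creates a proxy source. When two Stores wake up at nearly the same time, both appear in the creators list and the random value decides which proxy remains.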
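
A configuration review script could check this relationship before a Store is deployed. The following Python sketch is a hypothetical helper, not a UM tool, and it assumes both values are expressed in the same time unit.

```python
def activity_timeout_is_safe(source_activity_timeout, keepalive_interval):
    """Return True if the timeout is at least double the keepalive interval."""
    return source_activity_timeout >= 2 * keepalive_interval
```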

There are two algorithms that the Stores can use when holding a proxy source election:

1. Quorum not required (default).
2. Quorum required (new as of UM version 6.15; set Store Option "proxy-source-repo-quorum-required" to 1).

Informatica recommends that new projects use algorithm 2 (Quorum required). This is not the default and must be explicitly set. Existing projects that use algorithm 1 and do not have problems related to proxy sources do not need to change.

ALGORITHM DETAILS:

A proxy source is specific to a topic/reg-ID (or topic/session-ID). When a source exits (publisher deletes it or crashes), the Stores time the source out and hold an election to determine which Store will create a proxy source.

With algorithm 1 (quorum not required), every running Store in the Q/C group participates in the election.

With algorithm 2 (quorum required), only those Stores that have state for the topic/reg-ID will participate. A proxy source will be elected only if a quorum of Stores participate.

Algorithm 2 was introduced in UM version 6.15 to help customers who need to perform an un-recommended Store restart procedure whereby the state and cache files are deleted before restarting. Informatica recommends retaining the state and cache files over a restart, but we also understand that sometimes it is unavoidable and a Store must be started "clean" (for example, if a disk fails).

Creating a proxy source for a particular topic/reg-ID that does not have a quorum of repositories is contrary to the general design of UM persistence. Selecting algorithm 2 conforms with the UM persistence design.
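
The difference between the two algorithms can be illustrated with a small Python sketch. It assumes that a quorum means a simple majority of the Stores configured in the group, which may not match every group configuration. The function names are hypothetical.

```python
def election_participants(running_stores, stores_with_state, quorum_required):
    """Return the set of Stores that take part in the election."""
    if quorum_required:
        # Algorithm 2: only Stores holding state for the topic/reg-ID take part.
        return set(running_stores) & set(stores_with_state)
    # Algorithm 1: every running Store in the group takes part.
    return set(running_stores)


def can_elect_proxy(group_size, running_stores, stores_with_state, quorum_required):
    """Return True if a proxy source can be elected under the chosen algorithm."""
    participants = election_participants(
        running_stores, stores_with_state, quorum_required
    )
    if not quorum_required:
        return len(participants) > 0
    # Assumption: quorum is a simple majority of the configured group size.
    return len(participants) >= group_size // 2 + 1
```

Consider a three-Store group in which two Stores were restarted "clean" and only one still holds state for the topic/reg-ID. Under algorithm 1 a proxy source can still be elected, while under algorithm 2 it cannot, because only one of the three Stores participates.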

Proactive Retransmissions <-

Proactive Retransmissions, which are enabled by default, address two types of loss:

loss of message data between the source and a Store
loss of stability acknowledgments (ACK) between the Store and the source

The Store sends message stability acknowledgments to the source after the Store persists the message data.

With Proactive Retransmissions, the source maintains an unstable message queue for messages sent but not yet acknowledged by the Store. The source checks this queue at the interval set by ume_message_stability_timeout (source). If a message in this queue exceeds its ume_message_stability_timeout (source), the source retransmits the message and puts it back on the unstable message queue, restarting the message's ume_message_stability_timeout (source).

The source continues to retransmit and check the message's stability timeout until the ume_message_stability_lifetime (source) expires or it receives a stability acknowledgment from the Store. If the source has not received a stability acknowledgment when the ume_message_stability_lifetime (source) expires, the source sends a Store Message Not Stable source event notification to the application. When the Store discards the message because it has not met stability requirements, the Store sends a Store Forced Reclaim source event notification to the application.

To disable Proactive Retransmissions, set ume_message_stability_timeout (source) to 0 (zero). As a result, sources do not create an unstable message queue.
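
The timing behavior described above can be sketched for a single message. The following Python function is a simplified model, not the UM implementation: it assumes the message is sent at time 0, that all times use the same unit, and that the stability lifetime is measured from the original send. The function name and outcome strings are hypothetical.

```python
def simulate_unstable_message(stability_timeout, stability_lifetime, ack_arrives_at=None):
    """Return (retransmission_times, outcome) for one message sent at time 0."""
    if stability_timeout == 0:
        # A timeout of 0 disables Proactive Retransmissions entirely.
        return [], "proactive retransmissions disabled"
    retransmits = []
    deadline = stability_timeout
    while True:
        # A stability ACK that arrives in time removes the message from the queue.
        if ack_arrives_at is not None and ack_arrives_at <= min(deadline, stability_lifetime):
            return retransmits, "stable"
        # Lifetime expired: the application would be notified "not stable".
        if deadline >= stability_lifetime:
            return retransmits, "not stable"
        # Stability timeout expired: retransmit and restart the timeout.
        retransmits.append(deadline)
        deadline += stability_timeout
```

With a stability timeout of 20 and a lifetime of 70, a message that is never acknowledged is retransmitted at 20, 40, and 60 before it is reported as not stable. If the acknowledgment arrives at 45, only the retransmissions at 20 and 40 occur.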

The following applies whether you enable or disable Proactive Retransmissions.

The Store does not discard duplicate messages, but rather always responds to duplicate, retransmitted messages by sending stability acknowledgments even if the message is already stable.
If the Store has marked the message unrecoverably lost and receives a duplicate message from the source, the Store sends the source a negative stability acknowledgment (NAK), which induces the source to remove the message from its unstable message queue. A stability NAK is identical to a stability ACK except that it has a NAK flag set.
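
The two rules above can be modeled with a pair of small functions. This Python sketch represents a stability acknowledgment as a dictionary with a "nak" flag, which is purely illustrative. The function names and data structures are hypothetical and do not reflect the UM wire format.

```python
def store_response(sqn, stable_sqns, lost_sqns):
    """Return the stability acknowledgment the Store sends for a message.

    `stable_sqns` and `lost_sqns` are sets of sequence numbers held by the Store.
    """
    if sqn in lost_sqns:
        # Message already declared unrecoverably lost: same ACK, NAK flag set.
        return {"sqn": sqn, "nak": True}
    # New or duplicate message: always acknowledge, even if already stable.
    stable_sqns.add(sqn)
    return {"sqn": sqn, "nak": False}


def source_on_stability_ack(unstable_queue, ack):
    """Remove the message from the source's unstable queue for an ACK or a NAK."""
    unstable_queue.discard(ack["sqn"])
```

In this model, either kind of acknowledgment stops further retransmissions of the message, because both remove it from the source's unstable message queue.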