Guide for Persistence
Persistence Fault Tolerance


Message Loss Recovery

Persistence offers the following message recovery mechanisms:

Method: Negative Acknowledgments (NAKs)
Product: UMS, UMP, UMQ
Transports: LBT-RM, LBT-RU
Description: Recovers lost transport datagrams from the source which may contain many small topic messages or fragments of a large message. Receivers send unicast NAKs to the source for missed transport datagrams. Source retransmits datagrams over the configured UM transport.

Method: Late Join
Product: UMS, UMP, UMQ
Transports: All
Description: Retransmits messages via unicast to receivers joining the stream after the messages were originally sent. See Using Late Join.

Method: Durable Receiver Recovery
Product: UMP, UMQ
Transports: All
Description: Recovers messages persisted while a durable receiver was off line. UM initiates recovery when a durable receiver joins a persistent stream. The receiver then requests retransmission from the Store starting with the low sequence number, defined as the last message it acknowledged to the Store plus one. The Store unicasts retransmissions. See Persistent Receiver Recovery.

Method: Off Transport Recovery
Product: UMS, UMP, UMQ
Transports: All
Description: Recovers lost topic messages. Receiver detects lost sequence number and requests retransmission from the source or Persistent Stores (if applicable). UM unicasts retransmissions. See Off-Transport Recovery (OTR).

Method: Proactive Retransmissions
Product: UMP, UMQ
Transports: All
Description: Recovers lost messages never received by the Store or never acknowledged by the Store. Operates independently of any receivers. Source unicasts retransmissions. See Proactive Retransmissions.
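The sequence-number bookkeeping behind Durable Receiver Recovery and Off Transport Recovery can be sketched in a few lines. This is an illustrative Python model only, not UM API code; the function names low_sequence_number and missing_sequence_numbers are hypothetical.

```python
def low_sequence_number(last_acked):
    """Low sequence number for durable receiver recovery:
    the last message acknowledged to the Store, plus one."""
    return last_acked + 1


def missing_sequence_numbers(last_delivered, arrived):
    """Sequence numbers skipped between the last delivered message and a
    newly arrived one, i.e. the gap a receiver would ask to have
    retransmitted."""
    return list(range(last_delivered + 1, arrived))


print(low_sequence_number(41))           # 42
print(missing_sequence_numbers(10, 14))  # [11, 12, 13]
```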


Persistence Proxy Sources

By default, UM expects persistent sources to be running concurrently with persistent receivers. If a source exits, any persistent receivers will disconnect from that source's transport and will wait for the source to come back. More significantly, if a new receiver starts while the source is absent, the receiver will be unable to discover the Stores where the old source's previous messages are stored. As a result, that late-joining receiver cannot recover messages until the source restarts.

The Proxy Source feature allows you to configure Stores to create a UM source object to take the place of the exited source. This proxy source behaves much like a real source in that it provides all of the necessary information to subscribers so that they can discover and register with the Stores. This allows late-joining receivers to recover messages they missed.

After the real source returns, the Store automatically deletes its proxy source, allowing the real source to resume normal operation.

Some other features of Proxy Sources include:

  • Requires a Quorum/Consensus Store configuration.

  • Normal Store failover operation also initiates a new proxy source.

  • A Store can be running more than one proxy source if more than one source has failed.

  • A Store can be running multiple proxy sources for the same topic, each one corresponding to a previous instance of a real source.

Note that proxy sources introduce extra network and CPU load, so enable them only if their functionality is needed.


How Proxy Sources Operate

The following sequence illustrates the life of a proxy source:

  1. A source configured for Proxy Source sends to receivers and a group of Quorum/Consensus Stores.

  2. The source fails.

  3. The source's ume_activity_timeout (source) or the Store's source-activity-timeout expires.

  4. The Quorum/Consensus Stores elect a single Store to run the proxy source.

  5. The elected Store creates a proxy source and sends topic advertisements.

  6. The failed source reappears.

  7. The Store deletes the proxy source and the original source resumes activity.
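The sequence above can be read as a small state machine. The following Python sketch is only a model for reasoning about the lifecycle; the state and event names are hypothetical and do not correspond to UM API identifiers.

```python
from enum import Enum, auto


class SourceState(Enum):
    REAL_SOURCE_ACTIVE = auto()  # steps 1 and 7
    SOURCE_FAILED = auto()       # step 2
    TIMED_OUT = auto()           # step 3
    PROXY_ACTIVE = auto()        # steps 4 and 5


# Allowed transitions, keyed by (current state, event).
TRANSITIONS = {
    (SourceState.REAL_SOURCE_ACTIVE, "source_fails"): SourceState.SOURCE_FAILED,
    (SourceState.SOURCE_FAILED, "activity_timeout"): SourceState.TIMED_OUT,
    (SourceState.TIMED_OUT, "store_elected"): SourceState.PROXY_ACTIVE,
    (SourceState.PROXY_ACTIVE, "source_returns"): SourceState.REAL_SOURCE_ACTIVE,
    # If the Store running the proxy fails, the remaining Stores elect again.
    (SourceState.PROXY_ACTIVE, "proxy_store_fails"): SourceState.TIMED_OUT,
}


def step(state, event):
    """Apply one event; events that do not apply leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = SourceState.REAL_SOURCE_ACTIVE
for event in ["source_fails", "activity_timeout", "store_elected", "source_returns"]:
    state = step(state, event)
    print(event, "->", state.name)
```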

Note that the implementation of the proxy source involves the Store creating a normal UM source object. As such, the user is responsible for providing the Store with a UM library configuration with appropriate source-scoped options. For most source-scoped configuration options, there is no requirement for the proxy source's settings to match the original source's settings. However, there are a few that should be configured the same:

Some UM customers have found reasons to intentionally configure their proxy source differently from the original source. For example, to conserve network resources, some customers choose to configure a different transport and change topic-to-transport session mappings. Feel free to contact Informatica support for guidance in configuring your proxy sources.

If the Store running the proxy source fails, the other Stores in the Quorum/Consensus group detect a source failure again and can elect a new Store to initiate a proxy source, subject to the Store Option "proxy-source-repo-quorum-required".


Activity Timeout and State Lifetimes

UM provides activity and state lifetime timers for sources and receivers that operate in conjunction with the proxy source option or independently. This section explains how these timers work together and how they work with proxy sources.

The ume_activity_timeout (source) and ume_activity_timeout (receiver) options determine how long a source or receiver must be inactive before a Store allows another source or receiver to register using that RegID. This prevents a second source or receiver from stealing a RegID from an existing source or receiver. An activity timeout can be configured for the source/receiver with the LBM configuration option cited above or with a topic's UMP Element "<ume-attributes>" in the Store configuration file. The following diagram illustrates the default activity timeout behavior, which uses source-state-lifetime in the Store configuration file.

[Diagram: source_act_timeout_def.png]

In addition to the activity timeout, you can also configure sources and receivers with a state lifetime timer using the following options.

The ume_state_lifetime (source) and ume_state_lifetime (receiver) options, when used in conjunction with the ume_activity_timeout (source) and ume_activity_timeout (receiver) options, determine at what point UM removes the source or receiver state files. UM does not check the state lifetime until the activity timeout expires. The following diagram illustrates this behavior:

[Diagram: source_state_lifetime.png]

If you have enabled the Proxy Source option, the ume_activity_timeout (source) triggers the creation of the proxy source. The following diagram illustrates this behavior:

[Diagram: src_act_and_state_timers.png]
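As a rough illustration of how the two timers interact, the Python sketch below classifies a source by how long it has been inactive. It assumes that both timers are measured from the source's last activity, which this section does not state explicitly, and the timer values in the example are arbitrary rather than UM defaults. The function name source_status is hypothetical.

```python
def source_status(idle_ms, activity_timeout_ms, state_lifetime_ms, proxy_enabled=False):
    """Classify a source by how long it has been inactive.

    Assumption: both timers start at the last activity, and the state
    lifetime is only consulted once the activity timeout has expired.
    """
    if idle_ms < activity_timeout_ms:
        return "active: RegID is protected"
    if idle_ms >= state_lifetime_ms:
        return "expired: state may be removed"
    if proxy_enabled:
        return "inactive: proxy source creation is triggered"
    return "inactive: RegID may be reused"


# Arbitrary example values: 30 s activity timeout, 60 s state lifetime.
for idle in (1_000, 40_000, 70_000):
    print(idle, source_status(idle, 30_000, 60_000, proxy_enabled=True))
```

Note that with a state lifetime shorter than the activity timeout, this model reports the state as removable as soon as the activity timeout expires, since the lifetime is not checked any earlier.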


Enabling the Proxy Sources

You must configure both the source and the Stores to enable the Proxy Source option.

  • Configure the source in an LBM Configuration File with the source configuration option, ume_proxy_source (source).

  • Configure the Stores in the Store configuration file with the Store Element Option, allow-proxy-source.


Proxy Source Elections

When the Stores configured for proxy source detect the loss of a registered source (expiration of the source's ume_activity_timeout (source)), one of the Stores should create a proxy source. The Stores of a Q/C group perform an election to determine which Store creates the proxy.

Each Store starts by waiting a randomized amount of time based on its proxy-election-interval option setting. The Store creates a proxy source if it has not received a persistent registration request (PREG) from a proxy on a different Store. The proxy source then sends a PREG containing a unique random value to the other Stores. This value determines which Store deletes its proxy source in the case that any two Stores independently determine they should create a proxy source. The nature of the random values ensures that only one Store within the Q/C group or configuration of groups keeps its proxy source.

Note that the Topic Option "source-activity-timeout" value should be set to at least double the Topic Option "keepalive-interval" value.
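The randomized wait and the tie-break can be simulated as follows. This is a sketch under stated assumptions, not the Store's actual implementation: the wait is drawn uniformly from zero up to the election interval, a PREG is assumed to reach the other Stores after a fixed delay, and the proxy carrying the highest random value is assumed to be the one that survives. The function run_election and its parameters are hypothetical.

```python
import random


def run_election(store_names, election_interval_ms, preg_delay_ms, rng):
    """Simulate one election round; return (winner, stores_that_created_a_proxy).

    Assumptions (not taken from UM): uniform random wait, fixed PREG
    delivery delay, highest random value keeps its proxy.
    """
    waits = {name: rng.uniform(0, election_interval_ms) for name in store_names}
    first = min(waits.values())
    # A Store creates a proxy unless a PREG from another proxy arrived first.
    created = [n for n in store_names if waits[n] <= first + preg_delay_ms]
    # Each proxy's PREG carries a random value used to resolve duplicates.
    values = {n: rng.getrandbits(64) for n in created}
    winner = max(created, key=lambda n: values[n])
    return winner, created


rng = random.Random(7)
winner, created = run_election(["store-a", "store-b", "store-c"], 5000, 50, rng)
print("created proxies:", created)
print("kept proxy:", winner)
```

A longer election interval relative to the PREG delay makes it less likely that two Stores create a proxy in the same round.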

There are two algorithms that the Stores can use when holding a proxy source election:

  1. Quorum not required (default),
  2. Quorum required (new as of UM version 6.15; set Store Option "proxy-source-repo-quorum-required" to 1).

Informatica recommends that new projects use algorithm 2 (Quorum required). This is not the default and must be explicitly set. Existing projects that use algorithm 1 and do not have problems related to proxy sources do not need to change.

ALGORITHM DETAILS:

A proxy source is specific to a topic/reg-ID (or topic/session-ID). When a source exits (publisher deletes it or crashes), the Stores time the source out and hold an election to determine which Store will create a proxy source.

With algorithm 1 (quorum not required), every running Store in the Q/C group participates in the election.

With algorithm 2 (quorum required), only those Stores that have state for the topic/reg-ID will participate. A proxy source will be elected only if a quorum of Stores participate.
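The difference between the two algorithms comes down to who may participate and whether a quorum of participants is needed. The Python sketch below models that rule; it assumes a quorum is a simple majority of the Stores in the group, and the Store class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Store:
    name: str
    running: bool
    has_state: bool  # holds state for the topic/reg-ID in question


def election_participants(stores, quorum_required):
    running = [s for s in stores if s.running]
    if not quorum_required:
        return running                          # algorithm 1
    return [s for s in running if s.has_state]  # algorithm 2


def can_elect_proxy(stores, quorum_required):
    participants = election_participants(stores, quorum_required)
    if not quorum_required:
        return len(participants) > 0
    # Assumption: a quorum is a simple majority of the group.
    return len(participants) > len(stores) // 2


group = [
    Store("store-a", running=True, has_state=True),
    Store("store-b", running=True, has_state=False),  # restarted "clean"
    Store("store-c", running=False, has_state=True),  # currently down
]
print(can_elect_proxy(group, quorum_required=False))  # True
print(can_elect_proxy(group, quorum_required=True))   # False
```

In this example, algorithm 1 could elect the Store that was restarted without state, while algorithm 2 declines to create a proxy because only one Store with state is available.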

Algorithm 2 was introduced in UM version 6.15 to help customers who need to perform a non-recommended Store restart procedure whereby the state and cache files are deleted before restarting. Informatica recommends retaining the state and cache files over a restart, but we also understand that sometimes it is unavoidable and a Store must be started "clean" (for example, if a disk fails).

Creating a proxy source for a particular topic/reg-ID that does not have a quorum of repositories is contrary to the general design of UM persistence. Selecting algorithm 2 conforms with the UM persistence design.


Proactive Retransmissions

Proactive Retransmissions, which are enabled by default, address two types of loss:

  • loss of message data between the source and a Store

  • loss of stability acknowledgments (ACK) between the Store and the source

The Store sends message stability acknowledgments to the source after the Store persists the message data.

With Proactive Retransmissions, the source maintains an unstable message queue for those messages sent but not yet acknowledged by the Store. The source checks this queue when the ume_message_stability_timeout (source) expires. If a message in this queue exceeds its ume_message_stability_timeout (source), the source retransmits the message and puts it back on the unstable message queue, restarting the message's ume_message_stability_timeout (source).

The source continues to retransmit and check the message's stability timeout until the ume_message_stability_lifetime (source) expires or it receives a stability acknowledgment from the Store. If the source has not received a stability acknowledgment when the ume_message_stability_lifetime (source) expires, the source sends a Store Message Not Stable source event notification to the application. When the Store discards the message because it has not met stability requirements, the Store sends a Store Forced Reclaim source event notification to the application.

To disable Proactive Retransmissions, set ume_message_stability_timeout (source) to 0 (zero). As a result, sources do not create an unstable message queue.
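The source-side behavior described above can be modeled with a small queue class. This is a minimal sketch, not UM code: the class UnstableQueue and its methods are hypothetical, times are plain millisecond integers, the example timer values are arbitrary, and the model assumes the source stops tracking a message once its stability lifetime expires.

```python
class UnstableQueue:
    """Source-side model of messages sent but not yet acknowledged by the Store."""

    def __init__(self, stability_timeout_ms, stability_lifetime_ms):
        self.timeout = stability_timeout_ms
        self.lifetime = stability_lifetime_ms
        self.pending = {}  # sequence number -> (first_sent_ms, last_sent_ms)

    def send(self, seq, now_ms):
        if self.timeout > 0:  # a timeout of 0 means no queue is kept
            self.pending[seq] = (now_ms, now_ms)

    def on_stability_ack(self, seq):
        self.pending.pop(seq, None)

    def check(self, now_ms):
        """Return (retransmit, not_stable) lists of sequence numbers."""
        retransmit, not_stable = [], []
        for seq, (first, last) in list(self.pending.items()):
            if now_ms - first >= self.lifetime:
                not_stable.append(seq)  # would raise "Store Message Not Stable"
                del self.pending[seq]   # assumption: tracking stops here
            elif now_ms - last >= self.timeout:
                retransmit.append(seq)
                self.pending[seq] = (first, now_ms)  # restart the timeout
        return retransmit, not_stable


q = UnstableQueue(stability_timeout_ms=1000, stability_lifetime_ms=2500)
q.send(1, now_ms=0)
q.send(2, now_ms=0)
q.on_stability_ack(2)  # message 2 became stable
for now in (500, 1000, 2000, 2500):
    print(now, q.check(now))
```

In this run, message 1 is retransmitted at 1000 ms and 2000 ms and reported as not stable at 2500 ms, while message 2 is never retransmitted.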

The following applies whether you enable or disable Proactive Retransmissions.

  • The Store does not discard duplicate messages, but rather always responds to duplicate, retransmitted messages by sending stability acknowledgments even if the message is already stable.

  • If the Store has marked the message unrecoverably lost and receives a duplicate message from the source, the Store sends the source a negative stability acknowledgment (NAK), which induces the source to remove the message from its unstable message queue. A stability NAK is identical to a stability ACK except that it has a NAK flag set.
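The two rules above can be expressed as a single decision on the Store side. The following Python sketch is illustrative only; the function store_response and the dictionary layout of the acknowledgment are hypothetical and do not reflect the UM wire format.

```python
def store_response(seq, stable, lost):
    """Stability acknowledgment the Store sends for a message from the source.

    stable: set of sequence numbers already persisted
    lost:   set of sequence numbers marked unrecoverably lost
    """
    if seq in lost:
        # A stability NAK is an ACK with the NAK flag set.
        return {"seq": seq, "nak": True}
    stable.add(seq)  # a duplicate leaves the set unchanged, but is still ACKed
    return {"seq": seq, "nak": False}


stable, lost = set(), {7}
print(store_response(5, stable, lost))  # first arrival
print(store_response(5, stable, lost))  # duplicate, ACKed again
print(store_response(7, stable, lost))  # unrecoverably lost, NAKed
```

Either response lets the source remove the message from its unstable message queue.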