Concepts Guide
Advanced Optimizations

The internal design of UM has many compromises between performance and flexibility. For example, there are critical sections which maintain state information which must be kept internally consistent. Since UM allows the application the flexibility of multi-threaded use, those critical sections are protected with Mutex locks. These locks add very little overhead to UM's execution, but "very little" is not the same as "zero". The use of locks is a compromise between efficiency and flexibility. Similar lines of reasoning explain why UM makes use of dynamic memory (malloc and free), and bus-interlocked read/modify/write operations (e.g. atomic increment).

UM provides configuration options which improve efficiency, at the cost of reduced application design flexibility. Application designers who are able to constrain their programs within certain restrictions can take advantage of improved performance and reduced latency outliers (jitter).

RECEIVE-SIDE

SEND-SIDE

GENERAL (both receivers and senders)


Receive Thread Busy Waiting  <-

Busy looping is a method for reducing latency and especially latency outliers (jitter) by preventing threads from going to sleep. In an event-driven system, if a thread goes to sleep waiting for an event, and the event happens, the operating system needs to wake the thread back up and schedule its execution on a CPU core. This can take several microseconds. Alternatively, if the thread avoids going to sleep and "polls" for the event in a tight loop, it will detect and be able to start processing the event without the operating system scheduling delay.

However, remember that a thread that never goes to sleep will fully consume a CPU core. If you have more busy threads than you have CPU cores in your computer, you can have CPU thrashing, where threads are forced to time-share the cores. This can produce worse latency than sleep-based waits.

Only use busy waiting if there are enough cores to allocate a core exclusively to each busy thread. Also, pinning threads to cores is highly recommended to prevent thread migration across cores, which can introduce latency and significant jitter.
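
For example, on Linux a busy-waiting thread can be pinned to a dedicated core using the standard pthread affinity API. The sketch below is ordinary operating-system code, not a UM API; the function name and the choice of core are purely illustrative.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU core (Linux-specific).
 * Returns 0 on success, or an errno value on failure. */
static int pin_current_thread_to_core(int core)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

Each busy-waiting thread (for example, an application polling thread, or a receive thread pinned from within its first callback) would call such a function once at startup, each with its own dedicated core number.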


Network Socket Busy Waiting  <-

The UM receive code performs socket calls to wait for network events, like received data. By default, UM does sleep-based waiting for events. For example, if there are no packets waiting to be read from any of the sockets, the operating system will put the receive thread to sleep until a packet is received.

However, the file_descriptor_management_behavior (context) configuration option can be used to configure the receive thread not to sleep. Instead, the sockets are checked repeatedly in a tight loop - busy waiting. For most use cases, enabling busy waiting reduces average latency only slightly, but it can significantly reduce latency outliers (jitter).

For network-based transports, a receive thread can either be the main context thread, or it can be an XSP thread. A given application can have more than one context, and a given context can have zero or more XSPs. The threads of each context and XSP can be independently configured to have either busy waiting or sleep waiting.

Note that when creating an XSP, it is not unusual to simply let the XSP inherit the parent context's attributes. However, a common XSP use case is to create a single XSP for user data, and leave the parent context for Topic Resolution and other overhead. In this case, you may want to configure the parent context to use sleep-based waiting ("pend"), and configure the XSP to use busy waiting ("busy-wait"). You will need to pass a context attribute to the lbm_xsp_create() API.
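
As a sketch of this configuration (error handling omitted, and assuming the lbm_xsp_create() calling sequence that accepts a context attribute, as described above; confirm the exact option value strings against the UM Configuration Guide):

lbm_context_attr_t *cattr;
lbm_context_t *ctx;
lbm_xsp_t *xsp;

/* Parent context: sleep-based waiting for Topic Resolution and other overhead. */
lbm_context_attr_create(&cattr);
lbm_context_attr_str_setopt(cattr, "file_descriptor_management_behavior", "pend");
lbm_context_create(&ctx, cattr, NULL, NULL);

/* XSP for user data: busy waiting. The same attribute object is reused with
 * the option changed, then passed to the XSP. */
lbm_context_attr_str_setopt(cattr, "file_descriptor_management_behavior", "busy_wait");
lbm_xsp_create(&xsp, ctx, cattr, NULL);

lbm_context_attr_delete(cattr);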

A Better Alternative

Kernel bypass network drivers typically have a busy waiting mode of operation which happens inside the driver itself. For example, Solarflare's Onload driver can be configured to do busy waiting. This can produce an even greater improvement than UM receive thread busy waiting. When using a busy waiting kernel bypass network driver like Onload, the file_descriptor_management_behavior (context) configuration option should be left at its default, "pend".


IPC Transport Busy Waiting  <-

The Transport LBT-IPC does not use the context or XSP threads for receiving messages. It has its own internal thread which can be configured for busy waiting with the transport_lbtipc_receiver_thread_behavior (context) option.


SMX Transport Busy Waiting  <-

The Transport LBT-SMX does not use the context or XSP threads for receiving messages. It has its own internal thread which always operates in busy waiting mode; it cannot be configured for sleep-based waiting.


Zero Object Delivery  <-

Zero Object Delivery is a set of Java and .NET programming conventions to avoid garbage.

In Java and .NET, garbage collection (GC) can be a very useful feature to simplify programming. However, for low latency applications, GC is a problem. Most JVMs impose milliseconds of latency when GC runs. Also, unnecessary object creation introduces additional overhead.

Most high-performance Java and .NET programs avoid GC by saving objects that are no longer needed and reusing them. Most of this is the responsibility of the application programmer. However, UM application callbacks represent a special case. When UM needs to deliver an event, like a received message, to a Java or .NET program, it typically passes to the callback one or more objects containing the pertinent event data. UM needs to know when the application is finished with those passed-in objects so that UM can re-use them.

This requires the application to follow a set of conventions. Applications that deviate from these conventions will typically suffer from higher latency due to unnecessary object creation and garbage collection. These programming conventions are collectively called "Zero Object Delivery" (ZOD).

The ZOD programming conventions are the same for Java and .NET. (ZOD does not apply to the C API.)

  1. Special data access methods must be used to access the received message's data.
  2. If a message needs to be processed outside of the application receiver callback, it must be "promoted". A UM recycler must be used to ensure no garbage is created.
  3. Every message must be explicitly deleted when the application is finished processing it. This is true for messages fully processed by the receiver callback, and for promoted messages that are processed outside of the receiver callback. This is done by calling the "dispose()" method.

Details of these coding conventions can be found at:

Note
While the coding conventions of ZOD are most often seen with the receiver callback, many of the same conventions exist with other event delivery callbacks. For example, source events should be disposed. Source events must also be promoted before they are passed to another thread for processing, after which they should be disposed and recycled.


Receive Buffer Recycling  <-

By default, the UM receive code base allocates a fresh buffer for each received datagram. This allows the user a great degree of threading and buffer utilization flexibility in the application design.

For transport types RM (reliable multicast), RU (Reliable Unicast), and IPC (shared memory), you can set a configuration option to enable reuse of receive buffers, which can avoid per-message dynamic memory calls (malloc/free). This produces a modest reduction in average latency, but more importantly, can significantly reduce occasional latency outliers (jitter).

See the configuration options:

Note that setting the option does not guarantee the elimination of per-message malloc and free; that is achieved only in fairly restrictive use cases.

Also note that this feature is different from the Java and .NET ZOD recycler. See Zero Object Delivery.


Receive Buffer Recycling Restrictions  <-

There are no hard restrictions on enabling buffer recycling; it is functionally compatible with all use patterns and UM features. However, some use patterns will prevent the recycling of the receive buffer, and therefore will not deliver the benefit, even if the configuration option is set.

  • Event Queues - Event Queues prevent the recycling of receive buffers. When the UM library transfers a received message to an event queue for later processing, it allocates (malloc) a new message receive buffer.
  • Message Object Retention - Message retention prevents the recycling. For context-thread receive message callbacks, the act of retaining a message allocates (mallocs) a new message receive buffer.
  • Persistence - For a persistent receiver, enabling receive buffer recycling will reduce dynamic memory usage (malloc/free), but does not eliminate it. Certain persistence-related features require the use of dynamic memory.
  • Packet Loss - Applications typically use Ordered Delivery. When packets are lost, UM needs to internally retain newly received messages so that they can be delivered after the missing messages are retransmitted. This internal retention prevents the newly received message buffers from being recycled.
  • Message Fragmentation and Reassembly - Large application messages must be split into smaller fragments and sent serially. The receiver must internally retain these fragments so that the original large message can be reassembled and delivered to the application. This internal retention prevents the fragment message buffers from being recycled.

Note that in spite of the restrictions that can prevent recycling of receive message buffers, UM dynamically takes advantage of recycling as much as it can. E.g. if there is a loss event which suppresses recycling while waiting for retransmission, once the gap is filled and pending messages are delivered, UM will once again be able to recycle its receive buffers.

Of specific interest for persistent receivers is the use of Explicit Acknowledgments, either to batch ACKs, or simply defer them. Instead of retaining the messages, which prevents message buffer recycling, you can extract the ACK information from a message and allow the return from the receiver callback to delete and recycle the message buffer without acknowledging it.

See Object-free Explicit Acknowledgments for details.


Single Receiving Thread  <-

This feature optimizes the execution of UM receive-path code by converting certain thread-safe operations to more-efficient thread-unsafe operations. For example, certain bus-locked operations (e.g. atomic increment) are replaced by non-bus-locked equivalents (e.g. non-atomic increment). This can reduce the latency of delivering received messages to the application, but does so at the expense of thread safety.

This feature is often used in conjunction with the Context Lock Reduction feature.

The transport_session_single_receiving_thread (context) configuration option enables this feature.
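
For example, the option can be set programmatically on the context attribute before the context is created (a minimal sketch with error handling omitted; the value "1" to enable the option is an assumption, so confirm against the UM Configuration Guide):

lbm_context_attr_t *cattr;
lbm_context_t *ctx;

lbm_context_attr_create(&cattr);
/* Promise UM that each Transport Session is serviced by a single receiving
 * thread, allowing certain thread-safe operations to be replaced with
 * more-efficient thread-unsafe equivalents. */
lbm_context_attr_str_setopt(cattr, "transport_session_single_receiving_thread", "1");
lbm_context_create(&ctx, cattr, NULL, NULL);
lbm_context_attr_delete(cattr);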

Except as listed in the restrictions below, the Single Receiving Thread feature should be compatible with all other receive-side UM features.


Single Receiving Thread Restrictions  <-

It is very important for applications using this feature to be designed within certain restrictions.

  • Threading - The intended use case is for each received message to be fully processed by the UM thread that delivers the message to the application. Note that the Transport Services Provider (XSP) feature is compatible with the Single Receiving Thread feature.

  • No Event Queues - Event queues cannot be used with Single Receiving Thread.

  • Message Object Retention - Most traditional uses of message retention are related to giving a message to an alternate thread for processing. This is not compatible with Single Receiving Thread feature.

    However, there are some use cases where message retention is viable when used with Single Receiving Thread: when a message must be held for future processing, and that processing will be done by the same thread.

    For example, a persistent application might use Explicit Acknowledgments to delay message acknowledgement until the application completes a handshake with a remote service. As long as it is the same thread which initially receives and retains the message as that which completes the explicit acknowledgement of the message, it is supported to use message retain / message delete.

    Note
    If the Transport Services Provider (XSP) feature is used, care must be taken to ensure that the same XSP thread is used to perform all processing for a received message. I.e. a different XSP or the main context may not be used to complete processing on a deferred retained message. For example, a user-scheduled timer event will be delivered using the main context thread, and therefore cannot complete processing of a retained message.
  • Transport Type - The Single Receiving Thread feature does not enhance the operation of Broker or SMX transport types. These transport types use somewhat different internal buffer handling. Note that these transport types are technically compatible with the Single Receiving Thread feature; they simply don't benefit from it.


Extended Context Process Events  <-

Most developers of UM applications use a multi-threaded approach to their application design. For example, they typically have one or more application threads, and they create a UM context with embedded mode, which creates a separate context thread.

However, there is a model of application design in which a single thread is used for the entire application. In this case, the UM context must be created with Sequential Mode and the application must regularly call the UM event processor API, usually with the msec parameter set to zero. In this design, there is no possibility that application code, UM API code, and/or UM context code will be executing concurrently.

The lbm_context_process_events_ex() API allows the application to enable specialized optimizations. (For Java and .NET use the context object's "processEvents()" method with 2 or more input parameters. See com::latencybusters::lbm::LBMContext::processEvents.)


Context Lock Reduction  <-

The application can improve performance by suppressing the taking of certain mutex locks within the UM context processing code. This can reduce the latency of delivering received messages to the application, but does so at the expense of thread safety.

This feature is often used in conjunction with the Single Receiving Thread feature.

Warning
It is very important for the application to ensure that UM code related to a given context cannot be executed concurrently by multiple threads when this feature is used. This includes UM object creation and send-path API functions. I.e. the application may not call a UM message send API by one thread while another thread is calling lbm_context_process_events_ex(). However, it is permissible for a context thread callback to call a UM message send API, within the restrictions of the send API being used.

To enable this feature, call lbm_context_process_events_ex(), passing in the lbm_process_events_info_t structure with the LBM_PROC_EVENT_EX_FLAG_NO_MAIN_LOOP_MUTEX bit set in the flags field. (Sequential Mode is required for this feature.)
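
A minimal sketch of the resulting single-threaded event loop follows. It assumes ctx is a context created with Sequential Mode, and that zero-initializing the lbm_process_events_info_t structure before setting its flags field is sufficient; error handling is omitted.

lbm_process_events_info_t info;

memset(&info, 0, sizeof(info));
info.flags = LBM_PROC_EVENT_EX_FLAG_NO_MAIN_LOOP_MUTEX;

/* Single-threaded polling loop. No other thread may make UM calls on this
 * context while this loop is running. */
for (;;) {
    lbm_context_process_events_ex(ctx, 0, &info);
    /* ... other application polling work ... */
}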


Context Lock Reduction Restrictions  <-

It is very important for applications using this feature to be designed within certain restrictions.

  • Threading - It is critical that Context Lock Reduction be used only if Sequential Mode is used and there is no possibility of concurrent execution of UM code for a given context.

    It is further strongly advised that the same thread be used for all UM execution within a given context. I.e. it is not guaranteed to be safe if the application has multiple threads which can operate on a context, even if the application guarantees that only one thread at a time will execute the UM code.

    Note that if an application maintains two contexts, it is acceptable for a different thread to be used to operate on each context. However, it is not supported to pass UM objects between the threads.

  • No Transport Services Provider (XSP) - The Context Lock Reduction feature is not compatible with XSP.

  • No Event Queues - Event queues cannot be used with Context Lock Reduction.

  • No SMX or DBL - Context Lock Reduction is not compatible with SMX or DBL transports. This is because these transports create independent threads to monitor their respective transport types.

  • Transport LBT-IPC - Context Lock Reduction was not designed with the IPC transport in mind. By default, IPC creates an independent thread to monitor the shared memory, which is not compatible with Context Lock Reduction. However, in principle, it is possible to specify that the IPC receiver should use sequential mode (see transport_lbtipc_receiver_operational_mode (context)), and then write your application to use the same thread to call the context and IPC event processing APIs. However, be aware that the IPC event processing API does not have an extended form, so IPC will simply continue to take the locks it is designed to take.

  • Message Object Retention - Most traditional uses of Message Object Retention are related to handing a message to an alternate thread for processing. This is not compatible with Context Lock Reduction because the alternate thread is responsible for deleting the message when it is done. This represents two threads making API calls for the same context, which is not allowed for the Context Lock Reduction feature.

    However, there are some use cases where message retention is viable when used with Context Lock Reduction: when a message must be held for future processing, and that processing will be done by the same thread.

    For example, a persistent application might use Explicit Acknowledgments to delay message acknowledgement until the application completes a handshake with a remote service. As long as it is the same thread which initially receives and retains the message as that which completes the explicit acknowledgement of the message, it is supported to use message retain / message delete.

  • No LBM_SRC_BLOCK - All forms of UM send message must be done non-blocking (i.e. with LBM_SRC_NONBLOCK). This is because of the way UM blocks calls that cannot be completed; the context thread explicitly wakes up the blocked call when appropriate. But if the same thread is being used to run the context (via the process events API) and also sending messages, a blocked send call will never be woken up.


Gettimeofday Reduction  <-

UM's main context loop calls gettimeofday() in strategic places to ensure that its internal timers are processed correctly. However, there is a "polling" model of application design in which Sequential Mode is enabled and the context event processing API is called in a fast loop with the msec parameter set to zero. This results in the internal context calls to gettimeofday() happening unnecessarily frequently.

A polling application can improve performance by suppressing the internal context calls to gettimeofday(). This can reduce the latency of delivering received messages to the application.

To enable this feature, call lbm_context_process_events_ex(), passing in the lbm_process_events_info_t structure with the LBM_PROC_EVENT_EX_FLAG_USER_TIME bit set in the flags field. In addition, the application must set the time_val field in lbm_process_events_info_t with the value returned by gettimeofday(). (Sequential Mode is required for this feature.)

Note
The internal UM timers generally use millisecond precision. Users of the gettimeofday() reduction feature typically design their application to fetch a new value for time_val only a few times per millisecond.
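
A minimal sketch of a polling loop using this feature follows. It assumes ctx is a Sequential Mode context and that the time_val field is a struct timeval, as implied by the gettimeofday() description above; the refresh interval shown is illustrative, and error handling is omitted.

lbm_process_events_info_t info;
struct timeval now;
unsigned int loops = 0;

memset(&info, 0, sizeof(info));
info.flags = LBM_PROC_EVENT_EX_FLAG_USER_TIME;
gettimeofday(&now, NULL);

for (;;) {
    /* Refresh the time only occasionally; UM's internal timers use
     * millisecond precision, so a few refreshes per millisecond suffice. */
    if ((loops++ & 0xFF) == 0) {
        gettimeofday(&now, NULL);
    }
    info.time_val = now;                           /* Must never go backwards. */
    lbm_context_process_events_ex(ctx, 0, &info);  /* msec must be 0. */
}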


Gettimeofday Reduction Restrictions  <-

  • Monotonically Increasing Time - The application is responsible for ensuring that each call to lbm_context_process_events_ex() has a time_val field value which is greater than or equal to the previous time_val.
  • "msec" must be 0 - You must call lbm_context_process_events_ex() with the "msec" parameter set to zero. You can't tell lbm_context_process_events_ex() to loop for a period of time, and also tell it not to call gettimeofday(); UM won't see the passage of time.


Receive Multiple Datagrams  <-

A UM receiver for UDP-based protocols normally retrieves a single UDP datagram from the socket with each socket read. Setting multiple_receive_maximum_datagrams (context) to a value greater than zero directs UM to retrieve up to that many datagrams with each socket read. When receive socket buffers accumulate multiple messages, this feature improves CPU efficiency, which reduces the probability of loss, and also reduces total latency for those buffered datagrams. Note that UM does not need to wait for that many datagrams to be received before processing them; if fewer datagrams are in the socket's receive buffer, only the available datagrams are retrieved.

The multiple_receive_maximum_datagrams (context) configuration option defaults to 0 so as to retain previous behavior, but users are encouraged to set this to a value between 2 and 10. (Having too large a value during a period of overload can allow starvation of low-rate Transport Sessions by high-rate Transport Sessions.)
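
For example, the option might be set as follows (a sketch; error handling omitted):

lbm_context_attr_t *cattr;
lbm_context_t *ctx;

lbm_context_attr_create(&cattr);
/* Retrieve up to 10 datagrams per socket read (Linux only; default is 0). */
lbm_context_attr_str_setopt(cattr, "multiple_receive_maximum_datagrams", "10");
lbm_context_create(&ctx, cattr, NULL, NULL);
lbm_context_attr_delete(cattr);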

Note that in addition to increasing efficiency, setting multiple_receive_maximum_datagrams (context) greater than one can produce changes in the dynamic behavior across multiple sockets. For example, let's say that a receiver is subscribed to two Transport Sessions, A and B. Let's further say that Transport Session A is sending messages relatively quickly and has built up several datagrams in its socket buffer. But in this scenario, B is sending slowly. If multiple_receive_maximum_datagrams (context) is zero or one, the two sockets will compete equally for UM's attention. I.e. B's socket will still have a chance to be read after each A datagram is read and processed.

However, if multiple_receive_maximum_datagrams (context) is 10, then UM can process up to 10 of A's messages before giving B a chance to be read. This is desirable if low message latency is equally important across all Transport Sessions; the efficiency improvement derived by retrieving multiple datagrams with each read operation results in lower overall latency. However, if different transport sessions' data have different priorities in terms of latency, then processing 10 messages of a low priority transport session can unnecessarily delay processing of messages from a higher priority transport session.

In this case, the Transport Services Provider (XSP) feature can be used to prioritize different transport sessions differently and prevent low-priority messages from delaying high-priority messages.

Note that UM versions prior to 6.13 could see occasional increases in latency outliers when this feature was used. As of UM version 6.13, those outliers have been fixed (see bug10726).


Receive Multiple Datagrams Compatibility  <-

The Receive Multiple Datagrams feature modifies the behavior of the UDP-based transport protocols: LBT-RM and LBT-RU.

(Note: prior to UM version 6.13, the Receive Multiple Datagrams feature also affected MIM and UDP-based Topic Resolution. But this could introduce undesired latencies, so starting with UM 6.13, MIM and Topic Resolution no longer use Receive Multiple Datagrams.)


Receive Multiple Datagrams Restrictions  <-

The Receive Multiple Datagrams feature does not apply to the following:

  • Non-UDP Transport Protocols (TCP, IPC, SMX).
  • MIM (as of UM version 6.13).
  • UDP-based Topic Resolution (as of UM version 6.13).
  • All TCP-based features (Unicast Immediate Message, Late Join, Persistent Store Recovery, UM Response messages).
  • Non-Linux platforms.
  • Linux with kernel versions prior to 2.6.33, or glibc versions prior to 2.12 (released in May, 2010).


Transport Demultiplexer Table Size  <-

A UM Transport Session can have multiple sources (topics) mapped to it. For example, if a publishing application creates two sources with the same multicast address and destination port, both sources will be carried on the same transport session. A receiver joined to that transport session must separate (demultiplex) the topics, either for delivery to the proper receiver callback, or for discarding.

The demultiplexing of the topics is managed by a hash table (not the same kind of hash table that manages the Topic Resolution cache). As a result, the processing of received messages can be made more efficient by optimally sizing the hash table. This is done using the configuration option transport_demux_tablesz (receiver).

Unlike many hash tables, the transport demultiplexer needs to have a number of buckets which is a power of two. The demultiplexing code will be most efficient if the number of buckets is equal to or greater than the number of sources created on the transport session. In that case, the hash function is "perfect", which is to say that there will never be any collisions. Note that if the number of buckets is smaller than the number of sources, the collision resolution process is O(log N) where N is the length of the collision chain.

The only disadvantage of increasing the size of the hash table is memory usage (each bucket is 16 bytes on 64-bit architectures). Having a larger than optimal table does not make performance worse.

Note that if the number of sources is small, only a small degree of efficiency improvement results from optimally sizing the hash table.
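
As an illustration, a receiver expecting up to roughly 100 topics on a Transport Session might size the table to the next power of two, 128. The sketch below assumes ctx is an existing context and app_rcv_callback is the application's receiver callback; error handling is omitted.

lbm_rcv_topic_attr_t *rattr;
lbm_topic_t *topic;
lbm_rcv_t *rcv;

lbm_rcv_topic_attr_create(&rattr);
/* Round the expected source count up to a power of two so the
 * demultiplexer hash has no collisions. */
lbm_rcv_topic_attr_str_setopt(rattr, "transport_demux_tablesz", "128");
lbm_rcv_topic_lookup(&topic, ctx, "example.topic", rattr);
lbm_rcv_create(&rcv, ctx, topic, app_rcv_callback, NULL, NULL);
lbm_rcv_topic_attr_delete(rattr);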


Smart Sources  <-

The normal lbm_src_send() function (and its Java and .NET equivalents) is very flexible and supports the full range of UM's rich feature set. To provide this level of capability, it is necessary to make use of dynamic (malloc/free) memory, and critical section locking (mutex) in the send path. While modern memory managers and thread locks are very efficient, they do introduce some degree of variability of execution time, leading to latency outliers (jitter) potentially in the millisecond range.

For applications which require even higher speed and very consistent timing, and are able to run within certain constraints, UM has an alternate send feature called Smart Source. This is a highly-optimized send path with no dynamic memory operations or locking; all allocations are done at source creation time, and lockless algorithms are used throughout. To achieve these improvements, Smart Source imposes a number of restrictions (see Smart Sources Restrictions).

The Smart Source feature provides the greatest benefit when used in conjunction with a kernel bypass network driver.

Note
the Smart Source feature is not the same thing as the Zero-Copy Send API feature; see Comparison of Zero Copy and Smart Sources.

One design feature that is central to Smart Sources is the pre-allocation of a fixed number of carefully-sized buffers during source creation. This allows deterministic algorithms to be used for the management of message buffers throughout the send process. To gain the greatest benefit from Smart Sources, the application builds its outgoing messages directly in one of the pre-allocated buffers and submits the buffer to be sent.

To use Smart Sources, a user application typically performs the following steps:

  1. Create a context with lbm_context_create(), as normal.
  2. Create the topic object and the Smart Source with lbm_src_topic_alloc() and lbm_ssrc_create(), respectively. Use Smart Sources Configuration to pre-allocate the desired number of buffers.
  3. Get the desired number of messages buffers with lbm_ssrc_buff_get() and initialize them if desired. The application typically constructs outgoing messages directly in these buffers for transmission.
  4. Send messages with lbm_ssrc_send_ex(). The buffers obtained in the previous step must be used.
  5. While most applications manage the message buffers internally, it is also possible to give the buffers back to UM with lbm_ssrc_buff_put(), and then get them again for subsequent sends. Getting and putting message buffers can simplify application design at the expense of extra overhead.
  6. To clean up, delete the Smart Source with lbm_ssrc_delete(). It is not necessary to "put" the message buffers back to UM; they will be freed automatically when the Smart Source is deleted.

For details, see the example applications Example lbmssrc.c or Example lbmssrc.java.
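
The following sketch outlines those steps in C. Error handling and attribute configuration are omitted; ctx is an existing context, the lbm_ssrc_create() argument order is assumed to parallel lbm_src_create(), and the topic name is illustrative. The lbmssrc.c example is the authoritative reference.

lbm_topic_t *topic;
lbm_ssrc_t *ssrc;
lbm_ssrc_send_ex_info_t info;
char *msg_buf = NULL;

/* Step 2: allocate the topic and create the Smart Source. */
lbm_src_topic_alloc(&topic, ctx, "example.topic", NULL);
lbm_ssrc_create(&ssrc, ctx, topic, NULL, NULL, NULL);

/* Step 3: get one of the pre-allocated message buffers. */
lbm_ssrc_buff_get(ssrc, &msg_buf, 0);

/* Step 4: build the message directly in the buffer and send it. */
memset(&info, 0, sizeof(info));
memcpy(msg_buf, "hello", 5);
lbm_ssrc_send_ex(ssrc, msg_buf, 5, 0, &info);

/* Step 5 (optional): give the buffer back to UM; get another for the next send. */
lbm_ssrc_buff_put(ssrc, msg_buf);

/* Step 6: clean up. Remaining buffers are freed with the Smart Source. */
lbm_ssrc_delete(ssrc);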

Warning
To avoid the overhead of locking, the Smart Source API functions are not thread-safe. Applications must be written to avoid concurrent calls. In particular, the application is restricted to sending messages on a given Transport Session with one thread. If Smart Source Defensive Checks are enabled, the first call to send a message on a newly-created Transport Session captures the ID of the calling thread. Subsequently, only that thread is allowed to call send for Smart Sources on that Transport Session. For applications which have multiple sending threads, Smart Source topics must be mapped to Transport Sessions carefully such that all of the topics on a given Transport Session are managed by the same sending thread.
Note
There are no special requirements on the receive side when using Smart Sources. Normal receiving code is used.


Smart Source Message Buffers  <-

When a Smart Source is created, UM pre-allocates a set of user buffers according to the configuration options smart_src_max_message_length (source) and smart_src_user_buffer_count (source).

As of UM version 6.12, Smart Sources support UM fragmentation. That is, messages larger than the transport's Datagram Max Sizes can be sent, and the Smart Source will split them into multiple datagrams.

For example, an application can configure smart_src_max_message_length (source) to be 2000, while the datagram max size is set to 1500 (network MTU size). During operation, the application might send a 500-byte message. This will not require any fragmentation; the message is sent in a single network packet. However, when the application sends a 2000-byte message, the Smart Source will split it into two datagrams. This avoids IP fragmentation. The precise sizes of those datagrams will depend on the space reserved for headers, and is subject to change with different versions of UM.

Another feature available as of UM version 6.12 is the user-specified buffer. This allows an application to send messages larger than the configured smart_src_max_message_length (source). Instead of building the message in a pre-allocated Smart Source buffer, the application must allocate and manage its own user-supplied buffer. To use this feature, the application supplies both a pre-allocated buffer and a user-supplied buffer. The Smart Source will use the supplied pre-allocated buffer as a "work area" for building the datagram with proper headers, and use the user-supplied buffer for message content.

For example, to use the buffer "ubuffer", set the LBM_SSRC_SEND_EX_FLAG_USER_SUPPLIED_BUFFER flag and the usr_supplied_buffer field in the lbm_ssrc_send_ex_info_t passed to the lbm_ssrc_send_ex() API function, as shown below:

lbm_ssrc_send_ex_info_t info;  /* Extended send options. */
char *ubuffer = malloc(65536); /* Large user-supplied buffer. */
char *ss_buffer = NULL;        /* Smart Source pre-allocated buffer. */
info.flags = 0;
...
lbm_ssrc_buff_get(ssrc, &ss_buffer, 0); /* Get Smart Source pre-alloc buff. */
...
/* Application puts message data into ubuffer. */
info.flags |= LBM_SSRC_SEND_EX_FLAG_USER_SUPPLIED_BUFFER;
info.usr_supplied_buffer = ubuffer;
lbm_ssrc_send_ex(ssrc, ss_buffer, message_len, 0, &info);

Note that the Smart Source pre-allocated buffer ss_buffer also has to be passed in.

Also note that sending messages with the user-supplied message buffer is slightly less CPU efficient than using the pre-allocated buffers. But making pre-allocated buffers larger to accommodate occasional large messages can be very wasteful of memory, depending on the counts of user buffers, transmission window buffers, and retention buffers.

UM Fragment Sizes

A traditional source will split application messages into "N" fragments when those messages (plus worst-case header) are greater than the Datagram Max Sizes. The size of the first "N-1" fragments will be (approximately) the datagram max size.

With Smart Sources, fragmentation is done somewhat differently. Consider as an example a configuration with a datagram max size of 8192 and a Smart Source max message length of 2000. No UM message fragmentation will happen when the application uses the Smart Source pre-allocated buffers to build outgoing messages. However, if a user-supplied buffer is used, the user can send an arbitrarily large application message, and the Smart Source will split the message into "N" fragments. But those fragments will be limited in size to the Smart Source max message length of 2000 bytes of application data (plus additional bytes for headers).

This can lead to unexpected inefficiencies. Continuing the above example, suppose the application sends a 6000-byte message. The Smart Source will split it into three 2000-byte datagrams. The underlying IP stack will perform IP fragmentation and send each datagram as two packets of 1500 and 500 bytes respectively, for a total of 6 packets. In contrast, if the Smart Source max message length were set to 1500, the message would be split into 4 fragments of 1500 bytes each, and each fragment would fit in a single packet, for a total of 4 packets. (The calculations above were simplified for clarity and are not exact, because they do not take headers into consideration.)

When a kernel bypass network driver is being used, users will sometimes set the datagram max size to approximately an MTU. It can then easily happen that the Smart Source pre-allocated buffers are larger than the datagram max size, in which case the Smart Source will behave more like a traditional source, splitting the application message into fragments of (approximately) the datagram max size. See Datagram Max Size and Network MTU.


Smart Sources and Memory Management  <-

As of UM 6.11, there are new C APIs that give the application greater control over the allocation of memory when Smart Sources are being created. Since creation of a Smart Source pre-allocates buffers used for application message data as well as internal retransmission buffers, an application can override the stock malloc/free to ensure, for example, that memory is local to the CPU core that will be sending messages.

When the application is ready to create the Smart Source, it should set up the configuration option mem_mgt_callbacks (source), which uses the lbm_mem_mgt_callbacks_t structure to specify application callback functions.


Smart Sources Configuration  <-

The following configuration options are used to control the creation and operation of Smart Sources:

The option smart_src_max_message_length (source) is used to size the transmission window buffers. This means that the first Smart Source created on a Transport Session defines the maximum possible size of user messages for all Smart Sources on that Transport Session. It is not legal to create a subsequent Smart Source on the same Transport Session with a larger smart_src_max_message_length (source), although smaller values are permissible.


Smart Source Defensive Checks  <-

Ultra Messaging generally includes defensive checks in API functions to verify validity of input parameters. In support of faster operation, deep defensive checks for Smart Sources are optional, and are disabled by default. Users should enable them during application development, and can leave them disabled for production.

To enable deep Smart Source defensive checks, set the environment variable LBM_SMART_SOURCE_CHECK to the numeric sum of desired values. Hexadecimal values may be supplied with the "0x" prefix. Each value enables a class of defensive checking:

Numeric Value   Deep Check
1               Send argument checking
2               Thread checking
4               User buffer pointer checking
8               User buffer structure checking
16 (0x10)       User message length checking
32 (0x20)       Application header checking, including Spectrum and Message Properties
64 (0x40)       Null check for user-supplied buffer (see Smart Source Message Buffers)

To enable all checking, set the environment variable LBM_SMART_SOURCE_CHECK to "0xffffffff".


Smart Sources Restrictions  <-

  • Linux and Windows 64-bit Only - Smart Sources are supported only on the 64-bit Linux and 64-bit Windows platforms, with the C and Java APIs.

  • LBT-RM And LBT-RU Sources Only - Smart Sources can only be created with the LBT-RM and LBT-RU transport types. Smart Sources are not compatible with the UM features Multicast Immediate Messaging, Unicast Immediate Messaging, or sending responses with Request/Response.

  • Persistence - As of UM 6.11, Smart Sources support Persistence, but with some restrictions. See Smart Sources and Persistence for details.

  • Spectrum - As of UM 6.11, Smart Sources support Spectrum, but with some API changes. See Smart Sources and Spectrum for details.

  • Single-threaded Only - It is the application's responsibility to serialize calls to Smart Source APIs for a given Transport Session. Concurrent sends to different Transport Sessions are permitted.

  • No Application Headers - Application Headers are not compatible with Smart Sources.

  • Limited Message Properties - Message Properties may be included, but their use has restrictions. See Smart Source Message Properties Usage.

  • No Queuing - Queuing is not currently supported, although support for ULB is a possibility in the future.

  • No Send Request for Java - The Java API does not support sending UM Requests. (As of UM version 6.14, the C API does: lbm_ssrc_send_request_ex()).

  • No Data Rate Limit - Smart Source data messages are not rate limited, although retransmissions are rate limited. Care must be taken in designing and provisioning systems to prevent overloading network and host equipment, and overrunning receivers.

  • No Hot Failover - Smart Sources are not compatible with Hot Failover (HF).

  • No Batching - Smart Sources are not compatible with Implicit Batching or Explicit Batching.
Note
It is not permitted to mix Smart Source API calls with standard source API calls for a given Transport Session.


Zero-Copy Send API  <-

This section introduces the use of the zero-copy send API for LBT-RM.

Note
the Zero-Copy Send API feature is not the same thing as the Smart Sources feature; see Comparison of Zero Copy and Smart Sources.

The zero-copy send API changes how the lbm_src_send() function is used so that the UM library does not copy the user's message data before handing the datagram to the socket layer. This reduces CPU overhead and provides a minor reduction in latency. The effect is more pronounced for larger user messages, within the restrictions outlined below.

Application code using the zero-copy send API must call lbm_src_alloc_msg_buff() to request a message buffer into which it will build its outgoing message. That function returns a message buffer pointer and also a separate buffer handle. When the application is ready to send the message, it must call lbm_src_send(), passing the buffer handle as the message (not the message buffer) and specify the LBM_MSG_BUFF_ALLOC send flag.

Once the message is sent, UM will process the buffer asynchronously. Therefore, the application must not make any further reference to either the buffer or the handle.
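
A minimal sketch of this sequence follows. It assumes src is an existing LBT-RM source; the parameter list of lbm_src_alloc_msg_buff() shown here (source, message-buffer pointer, buffer-handle pointer) is inferred from the description above, so confirm it against the C API reference.

char *msg_buf = NULL;     /* Buffer in which to build the message. */
void *buf_handle = NULL;  /* Opaque handle returned along with the buffer. */

/* Request a zero-copy message buffer from UM. */
lbm_src_alloc_msg_buff(src, &msg_buf, &buf_handle);

/* Build the outgoing message directly in the buffer. */
memcpy(msg_buf, "hello", 5);

/* Send by passing the HANDLE (not the buffer) and the LBM_MSG_BUFF_ALLOC flag.
 * After this call, neither msg_buf nor buf_handle may be referenced again. */
lbm_src_send(src, (const char *)buf_handle, 5, LBM_MSG_BUFF_ALLOC | LBM_SRC_NONBLOCK);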


Zero-Copy Send Compatibility  <-

The zero-copy send API is compatible with the following UM features:

  • C language, Streaming, source-based publishing applications using LBT-RM.
  • Messages sent with the zero-copy API can be received by any UM product or daemon. No special restrictions apply to receivers of messages sent with the zero-copy send API.
  • Compatible with implicit batching and message flushing.
  • Compatible with non-blocking sends and wakeup source event handling.
  • Compatible with hardware timestamps (see section High-resolution Timestamps ).
  • Compatible with UD Acceleration.


Zero-Copy Restrictions  <-

Due to the specialized nature of this feature, there are several restrictions in its use:


Comparison of Zero Copy and Smart Sources  <-

There are two UM features that are intended to reduce latency and jitter when sending messages: the Zero-Copy Send API and Smart Sources.

These two features use different approaches to latency and jitter reduction, and are not compatible with each other. There are trade-offs, explained below, and users seeking latency and/or jitter reduction will sometimes need to try both and empirically measure which is better for their use case.

The zero-copy send API removes a copy of the user's data buffer, as compared to a normal send. For small messages of a few hundred bytes, a malloc and a data copy represent a very small amount of time, so unless your messages are large, the absolute latency reduction is minimal.

The Smart Source has the advantage of eliminating all mallocs and frees from the send path. In addition, all thread locking is eliminated. This essentially removes all sources of jitter from the UM send path. Also, the Smart Source feature supports UM fragmentation, which zero-copy sends do not. However, because of the approach taken, sending to a Smart Source is somewhat more restrictive than sending with the zero-copy API.

In general, Informatica recommends Smart Sources to achieve the maximum reduction in jitter. However, the zero-copy send API is less restrictive; for example, it supports the use of batching to combine multiple messages into a single network datagram. Batching can be essential to achieve high throughputs. Some application designers may determine that the throughput advantages of zero-copy with batching outweigh the jitter advantages of Smart Sources.

See the sections Zero-Copy Send API and Smart Sources for details of their restrictions.


XSP Latency Reduction  <-

A common source of latency outliers is when Topic Resolution packets are received at the same time that user data messages are received. The UM context thread might process those Topic Resolution packets before processing the user data messages.

By using the XSP feature, user data reception can be moved to a different thread than topic resolution reception. See Transport Services Provider (XSP) for details, paying careful attention to XSP Threading Considerations.


Receive-Side Batching  <-

The receive-side batching feature can improve throughput of subscribers that use UM event queues and/or are written in Java.

See also source-side Message Batching.

UM event queues introduce a small overhead which adds latency and reduces maximum sustainable throughput. UM's Java API also introduces a small overhead to deliver received messages from the UM library. Both of these overheads can usually be mitigated with the receive-side batching feature.

This feature is enabled via the configuration option delivery_control_message_batching (context).

The feature works by collecting receiver events into bundles. For an event queue, the bundle is sent to the dispatch thread as a single queue item. For Java, the bundle is passed to the Java language as a single object. In both cases, the overhead is amortized across the multiple messages in the bundle.

This feature is most effective when used along with the Receive Multiple Datagrams feature (Linux only), which will increase the bundle size automatically if the receiver falls behind. A publisher using Message Batching also increases the bundle size. Larger bundles mean greater efficiency gains.
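
For example, the option might be enabled on the context attribute before context creation (a sketch; the value "1" to enable the option is an assumption, so confirm against the UM Configuration Guide):

lbm_context_attr_t *cattr;
lbm_context_t *ctx;

lbm_context_attr_create(&cattr);
/* Bundle received messages for event queue dispatch and Java delivery. */
lbm_context_attr_str_setopt(cattr, "delivery_control_message_batching", "1");
lbm_context_create(&ctx, cattr, NULL, NULL);
lbm_context_attr_delete(cattr);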

Warning
Do not use receive-side batching for C or .NET programs that are not using event queues. The feature does not increase efficiency and can actually reduce it.
This feature is not compatible with XSP. Users of XSP are typically trying to maximize the performance of UM, and probably should not be using event queues. For alternatives, please contact Informatica Support.
Note
If you enable receive-side batching and you use an event queue that is in polling mode, using lbm_event_dispatch(evq, LBM_EVENT_QUEUE_POLL), then rather than dispatching exactly one event per call to lbm_event_dispatch, you may get multiple events dispatched with a single call.


Core Pinning  <-

The Unix and Windows operating systems attempt to balance CPU utilization across all available CPU cores. They often do this without regard to the architectural design of the system hardware, which can introduce significant inefficiencies. For example, if a thread's execution migrates from one NUMA node to another, the code will frequently need to access memory located in the other NUMA zone, which happens over a slower memory interconnect.

Fortunately, Unix and Windows support pinning processes and threads to specific CPU cores. It is the user's responsibility to understand the host architecture sufficiently to know which cores are grouped into NUMA zones. Pinning a group of related threads to cores within the same NUMA zone is important to maintain high performance.

However, even letting the operating system migrate a thread from one core to another within a single NUMA zone has the side effect of invalidating the cache, which introduces latency. You get the best performance when each thread is pinned to its own core, with no other threads contending for that core. This approach obviously severely limits the number of threads that can run on a host.

UM does not have a general feature that pins threads to cores for applications. It is the user's responsibility to pin (set affinity) using the appropriate operating system APIs.

The Persistent Store allows the user to assign individual threads to specific cores. See Store Thread Affinity for details.

For the DRO, it is not possible to identify specific threads and assign them to individual cores. But the user can use the operating system's user interface to assign the entire DRO process to a group of cores known to be in the same NUMA zone.


Memory Latency Reduction  <-

UM makes use of dynamic memory allocation/deallocation using malloc() and free(). The default memory allocator included with Linux and Windows can sometimes introduce latency outliers of multiple milliseconds. It is rare, but we have seen outliers as long as 10 milliseconds.

There are higher-performing allocators available, many of them open source; Hoard is one example among many.

A good commercial product is MicroQuill's SmartHeap. In fact, the Persistent Store is built and ships with SmartHeap. Note that licensing Ultra Messaging does not grant a license to the customer for general use of SmartHeap. Users who want to use SmartHeap in applications should contact MicroQuill directly.

None of these products can guarantee that there will never be millisecond-long latencies, but they can greatly reduce the frequency.