Packages:
  com.latencybusters.auxapi
  com.latencybusters.lbm
  com.latencybusters.lbm.sdm
  com.latencybusters.pdm
Provides classes and interfaces for implementing a UM source and/or receiver application.
For the most part, the best practices for the C API also apply to the Java API. The following topics discuss where best practices for the Java API differ from those for the C API.
The UM Java API uses the Java Native Interface (JNI) to bridge between Java code and the native UM library (written in C). This interface provides for two different types of calls: "downcalls" from Java into C, and "upcalls" from C into Java.
While still more expensive (in terms of CPU time) than a call from one Java method to another Java method, downcalls are relatively inexpensive. Upcalls, on the other hand, are significantly more expensive: our experience (as well as anecdotal evidence from the Internet) indicates that upcalls cost 10 to 20 times as much CPU time as downcalls. In terms of UM, this added cost translates to higher latency and lower throughput.
In addition, an upcall must attach to a Java thread object in the JVM in order to obtain an environment in which to run. Attaching to a new Java thread object is significantly more expensive than re-attaching to an extant Java thread object.
This discussion applies only to receivers. Sending in UM requires mostly downcalls. In fact, performance measurements show that the throughput from a Java sender is generally within a few percentage points of the throughput for a C sender.
Receivers, on the other hand, receive data via callbacks from the UM library into the application code. Callbacks directly equate to JNI upcalls, which (as noted above) are significantly more expensive than JNI downcalls.
One significant problem with Java performance is garbage collection. While garbage collection makes the programmer's life easier, having the application periodically stop doing useful, application-specific work to perform housekeeping significantly degrades overall performance. This is not to imply that garbage collection is not useful work: in the context of Java, it certainly is. But it does nothing to further the goal of the application itself, namely to receive and process data.
Thus, Java performance is inherently unpredictable and can vary significantly from one instant to the next. As an example, consider the lbmrcv example program supplied with UM. Running the C version will show a fairly steady data rate for each sample printed. Running the Java version will show wildly varying data rates for each sample printed. This is due in large part to the periodic interruption of the application to do garbage collection. See Zero Object Delivery (ZOD) below for a UM Java feature that reduces the need for garbage collection.
UM's Zero Object Delivery (ZOD) feature for Java allows receivers to deliver messages, and sources/receivers to deliver events, to an application with no per-message object creation. This lets you write Java sending and receiving applications that require little to no garbage collection at runtime, resulting in lower and more consistent message latencies and hence better performance.
To benefit from this feature, you must call .dispose() on a message to mark it as available for reuse. To access message data when using ZOD, use the .dataBuffer() and .dataLength() methods in the LBMMessage class. The .dataBuffer() method returns a reference to a thread-local direct ByteBuffer containing the message data. The ByteBuffer's capacity may be larger than the message data length, so call .dataLength() for the actual data length and discard/ignore the excess data in the ByteBuffer.
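For example, a ZOD-style receiver callback might look like the following sketch (the onReceive() signature, msg.type(), and the LBM.MSG_DATA constant are assumed from the standard LBM Java receiver interface; processInPlace() is a hypothetical application method):

import com.latencybusters.lbm.*;
import java.nio.ByteBuffer;

class ZodReceiverCallback implements LBMReceiverCallback
{
    public int onReceive(Object cbArg, LBMMessage msg)
    {
        if (msg.type() == LBM.MSG_DATA)
        {
            ByteBuffer buf = msg.dataBuffer();  // thread-local direct ByteBuffer
            long len = msg.dataLength();        // actual message data length
            processInPlace(buf, len);           // hypothetical application method; reads data in place
        }
        msg.dispose();                          // mark the message as available for reuse
        return 0;
    }

    private void processInPlace(ByteBuffer buf, long len) { /* application-specific */ }
}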
See the lbmrcv.java and lbmrcv.cs sample applications for examples using these methods.
Note that calling LBMMessage.data() is another valid way to access message data, but this method does create a new byte[] array object. The JVM creates the byte[] array returned by LBMMessage.data() once, on demand, the first time you call .data(), and returns a reference to the same byte[] array on subsequent calls to .data().
This method is a useful way to keep message data for processing outside of a receiver's callback. The byte[] array object returned by LBMMessage.data() is persistent; that is, the data it contains and any references to the array remain valid after the receiver's callback returns and until garbage collection. If you need only the message data for further processing, save the return from LBMMessage.data() within the receiver's callback.
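For example, a minimal sketch (the PayloadSaver class and its pendingWork queue are hypothetical application structures, not part of the UM API):

import com.latencybusters.lbm.LBMMessage;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class PayloadSaver
{
    // Hypothetical application queue holding saved payloads for later processing.
    private final Queue<byte[]> pendingWork = new ConcurrentLinkedQueue<byte[]>();

    // Called from within the receiver callback: save only the payload bytes.
    // The byte[] returned by LBMMessage.data() remains valid after the
    // receiver callback returns, so it can be processed later.
    void savePayload(LBMMessage msg)
    {
        pendingWork.add(msg.data());
    }
}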
If instead you need to promote an entire LBMMessage to a full-fledged, independent object for use outside of a receiver's callback, use the LBMMessage.promote() method. This can be beneficial for an application receiving a mix of message types, some requiring additional processing and others not. Receiver callback code can look similar to the following:
if (msg.dataBuffer().getInt() == number_indicating_lots_of_work)
{
    /* Promote this message to an object for handoff to a worker thread,
     * so it remains valid after the receiver callback returns. */
    msg.promote();
    workerThread.msgQueue.enqueue(msg);
}
else
{
    /* Use ZOD and just read out of msg.dataBuffer(),
     * entirely within the receiver callback. */
    ...
}
LBMMessages promoted to full objects also return their own independent ByteBuffer objects from a call to LBMMessage.dataBuffer(). This means that the ByteBuffer returned from a promoted LBMMessage's .dataBuffer() method is persistent in the same way as the byte[] array object returned from a call to .data() on any LBMMessage object.
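For example, the worker-thread side of the handoff shown above might look like the following sketch (the BlockingQueue handoff and doLotsOfWork() are hypothetical application details, not part of the UM API; calling .dispose() when finished follows the guidance above):

import com.latencybusters.lbm.LBMMessage;
import java.nio.ByteBuffer;
import java.util.concurrent.BlockingQueue;

class MessageWorker implements Runnable
{
    // Hypothetical queue of promoted messages handed off by the receiver callback.
    private final BlockingQueue<LBMMessage> msgQueue;

    MessageWorker(BlockingQueue<LBMMessage> msgQueue) { this.msgQueue = msgQueue; }

    public void run()
    {
        try
        {
            while (true)
            {
                LBMMessage msg = msgQueue.take();     // promoted message from the callback
                ByteBuffer buf = msg.dataBuffer();    // independent ByteBuffer; still valid here
                doLotsOfWork(buf, msg.dataLength());  // hypothetical application processing
                msg.dispose();                        // done with the message
            }
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
        }
    }

    private void doLotsOfWork(ByteBuffer buf, long len) { /* application-specific */ }
}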
See the lbmresp.java sample application for another example of using LBMMessage.promote() to keep a message.
Message size has a significant effect on performance. Smaller messages mean more upcalls are required to receive and process a given amount of data. Larger messages result in fewer upcalls, resulting in better performance. Our tests have shown that smaller messages (64 bytes or smaller) yield approximately 25% of the performance of an equivalent C application, while larger messages (512 bytes and up) can yield 67% of the performance of an equivalent C application.
Granted, the size of the message is beyond the control of the receiving application. But the sender can control the message size. If you can modify the sender, consider using larger (500 bytes or more) messages. If this is not possible, consider blocking multiple "logical" messages together into a single "physical" message. The receiver would then be responsible for deblocking into the constituent messages. Note that this is not the same as UM batching.
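As an illustration, one simple blocking scheme prefixes each logical message with its length. The sketch below is a hypothetical application-level convention, not part of the UM API, and (as noted above) not the same as UM batching:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class MessageBlocker
{
    // Sender side: pack several logical messages into one physical message,
    // each prefixed with a 4-byte length.
    static byte[] block(List<byte[]> logicalMessages)
    {
        int size = 0;
        for (byte[] m : logicalMessages)
            size += 4 + m.length;              // 4-byte length prefix plus body
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (byte[] m : logicalMessages)
        {
            buf.putInt(m.length);
            buf.put(m);
        }
        return buf.array();
    }

    // Receiver side: deblock the physical message back into logical messages.
    static List<byte[]> deblock(ByteBuffer physical, long dataLength)
    {
        List<byte[]> logical = new ArrayList<byte[]>();
        int end = physical.position() + (int) dataLength;
        while (physical.position() < end)
        {
            byte[] m = new byte[physical.getInt()];  // read the length prefix
            physical.get(m);                         // read the logical message body
            logical.add(m);
        }
        return logical;
    }
}

On the receiving side, such a deblock() helper could be called within the callback as deblock(msg.dataBuffer(), msg.dataLength()).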
Larger messages also have the advantage of making better use of the network bandwidth. For example, consider a TCP packet. A minimum of 54 bytes of overhead (14 bytes for the Ethernet frame, 20 bytes for the IP header, and 20 bytes for the TCP header) is required to send a one-byte message. A 512-byte message requires those same 54 bytes of overhead. On a 100 Mbps Ethernet network, at most 227,272 one-byte messages can be sent each second, resulting in 227,272 bytes of application data. With a 512-byte message, at most 22,084 messages can be sent each second, resulting in 11,307,008 bytes of application data.
Our experience shows that Java applications are much more sensitive to CPU speed than C applications. In other words, increasing the CPU speed will have more benefit for a Java application than for a C application.
There is some improvement in overall latency between the 1.4.2 and 1.5.0 JVMs. In at least one case (Linux 64-bit x86), using the IBM JVM resulted in double the throughput of the Sun JVM. However, we have heard that the IBM JVM is much slower than Sun's in straightforward computation. It is difficult to predict the overall performance difference between the IBM and Sun JVMs for a particular application without experimentation.
Surprisingly, no difference was seen between the 2.4 and 2.6 Linux kernels. In addition, later 2.6 kernels allow the timer frequency to be set to 100 (default), 250, or 1000 Hz. No measurable difference was seen by changing the timer frequency.
Multiple CPUs have a negligible effect in embedded and sequential modes. When using an event queue, multiple CPUs yield a modest performance increase of approximately 15% over a single CPU.