How do I tune the JVM performance for Ice applications?

Ice produces many short-lived objects during request processing, so giving some thought to how the Java run-time garbage collection works is worthwhile. There are many documents available on the internet that discuss performance tuning as well as memory and garbage collection of the Java virtual machine (JVM). We'll summarize some of the relevant issues here.

There are basically two ways you can alter the JVM's behavior to improve performance. Firstly, you can influence memory use by changing how memory is allocated and organized by the JVM. Secondly, you can influence how garbage collection is performed. (A third mechanism, just-in-time (JIT) compilation, increases performance by compiling the byte-code of frequently-used methods to native code; however, JIT is enabled by default and is not tunable, so we ignore it for this discussion.) This discussion focuses on the Sun Java HotSpot JVM; other JVMs may provide similar options or additional tuning features.

First, let's consider how objects are allocated and destroyed. The HotSpot JVM implements a multi-generational garbage collector. New objects are initially allocated in a young generation space. After a while, an object that survives collections in the young generation space is moved to the tenured generation space. When a collection occurs on the young generation space, it is called a minor collection. If there isn't sufficient free memory in the tenured generation space, a major collection occurs. Major collections are relatively expensive as they involve all live objects.

The young generation space is optimized for objects that have short life times. If your application creates lots of objects that live for a relatively short time, you can decrease the frequency of collections in the young generation space by increasing the amount of memory allocated to it. Unfortunately, this can also increase the time it takes for the young generation collection to complete. Optimal performance will involve getting the young generation to the "right" size.

A few additional notes on the young generation space:

  • If you are setting an upper bound for the Java heap (see the -Xmx option), you need to be careful not to set the young generation space to more than half of the upper bound. This is especially important when using the serial collector (the default garbage collector on single-CPU machines). With the serial collector, the JVM reserves enough memory in the tenured generation space to ensure that a minor collection will succeed when the young generation is full of live objects. However, if there is not enough memory available in the tenured generation space, a full collection is triggered instead.
  • There are other options that affect the behavior of collections in the young generation space, such as configuring the survivor space ratios. We won’t discuss these here. However, if you want to change these options, you need to consider how Ice works before doing so.
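The half-of-the-heap guideline from the first note is easy to check mechanically before you launch a server. Here is a minimal sketch; the class and method names (YoungGenCheck, parseSize, youngGenFits) are hypothetical helpers of ours, not part of the JDK or Ice, and the parser only handles a number with an optional k, m, or g suffix.

```java
// Hypothetical helper (not part of the JDK or Ice): checks the guideline that
// the young generation should not exceed half of the maximum heap size.
final class YoungGenCheck {
    // Parses a size such as "250m", "64k", "1g", or a plain byte count.
    static long parseSize(String s) {
        String t = s.trim().toLowerCase();
        if (t.isEmpty()) {
            throw new IllegalArgumentException("empty size");
        }
        char unit = t.charAt(t.length() - 1);
        long multiplier = 1;
        if (unit == 'k') {
            multiplier = 1024L;
        } else if (unit == 'm') {
            multiplier = 1024L * 1024L;
        } else if (unit == 'g') {
            multiplier = 1024L * 1024L * 1024L;
        }
        // Strip the suffix only if one was recognized.
        String digits = (multiplier == 1) ? t : t.substring(0, t.length() - 1);
        return Long.parseLong(digits) * multiplier;
    }

    // Returns true if the young generation is at most half of the maximum heap.
    static boolean youngGenFits(String xmx, String xmn) {
        return parseSize(xmn) * 2 <= parseSize(xmx);
    }

    public static void main(String[] args) {
        System.out.println(youngGenFits("250m", "100m")); // 100m is below 125m
        System.out.println(youngGenFits("250m", "150m")); // 150m is above 125m
    }
}
```

The arguments correspond to the values you would pass to -Xmx and -Xmn, respectively.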

By default, the tenured generation is sized relative to the maximum Java heap size and the size of the young generation space. If you have somehow configured the JVM such that there is insufficient tenured space, you will know because the application will fail with java.lang.OutOfMemoryError.

Apart from the young and tenured spaces, there is also a third space, known as the permanent generation space. The permanent generation space is intended for objects that live for the lifespan of the application. If your application has a large number of classes, you may want to turn on GC logging for a period of time and see if the collector collects anything in the permanent generation. If it does, you should consider increasing the size of the permanent generation space. Some useful GC logging options are:

-verbose:gc

Enable verbose GC logging.

-XX:+PrintGCDetails

Provide detailed statistics about collections.

-XX:+PrintTenuringDistribution

Give details about young generation survivor spaces and how many objects are promoted to the tenured space.
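As a complement to the GC log, you can also look at the occupancy of each memory space from inside a running application with the standard java.lang.management API (available since J2SE 5.0). The following is only a sketch: the class name MemoryPoolReport is ours, and the pool names that get printed depend on the JVM version and the collector in use, so the code does not rely on any particular name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the JVM's memory pools with their current usage.
final class MemoryPoolReport {
    static List<String> describePools() {
        List<String> lines = new ArrayList<String>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            long max = usage.getMax(); // -1 means the maximum is undefined
            lines.add(pool.getName() + " (" + pool.getType() + "): used="
                    + (usage.getUsed() / 1024) + "K, max="
                    + (max < 0 ? "undefined" : (max / 1024) + "K"));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describePools()) {
            System.out.println(line);
        }
    }
}
```

Calling describePools() periodically under a typical load gives a rough picture of which spaces fill up, without having to parse log output.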

Here are some of the options for controlling how the JVM allocates and organizes memory (please see the Java documentation for details):

-Xms

Configure the initial Java heap size.

-Xmx

Configure the maximum Java heap size.

-Xmn

Configure the size of the young generation space (sets both the initial and the maximum size).

-XX:+AggressiveHeap

Instructs the VM to analyze the current operating environment and attempt to optimize settings for memory-intensive applications. This option also enables implicit garbage collection and adaptive memory sizing. It is intended for machines with large amounts of memory and multiple CPUs.

-XX:NewRatio

Configure the ratio of the tenured generation size to the young generation size.

-XX:NewSize

Configure the initial size of the young generation space.

-XX:MaxNewSize

Configure the maximum size of the young generation space.
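When experimenting with these options, it helps to confirm that the JVM actually picked them up. The sketch below uses only standard JDK calls (Runtime.maxMemory and RuntimeMXBean.getInputArguments); the class name HeapSettings and its methods are our own illustration. Note that the value reported by maxMemory can differ somewhat from what you passed with -Xmx.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports the heap limit and the -X/-XX options in effect.
final class HeapSettings {
    // Returns the startup options that begin with "-X" (this includes "-XX:").
    static List<String> memoryOptions() {
        List<String> result = new ArrayList<String>();
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            if (arg.startsWith("-X")) {
                result.add(arg);
            }
        }
        return result;
    }

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap (MB): " + maxHeapMegabytes());
        System.out.println("JVM options:   " + memoryOptions());
    }
}
```

For example, you could start it with java -Xms250m -Xmx250m -Xmn100m HeapSettings and compare the output against the options you intended to set.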

So how is this relevant to Ice? Besides application-specific issues, there are some details of how Ice is implemented that affect how the JVM will behave. For example, the size of the young generation is relevant to Ice because processing a request often involves creation of transient objects. Complex types, sequences of types, and rapid-fire requests can fill up the young generation space quickly. If a request is very complex, it may result in the young generation space filling before a single request is completed. In turn, this causes some of the transient objects to spill over into the tenured space. Eventually, the tenured space fills up and you end up with an expensive major collection pass. Profiling typical load scenarios will help you determine a good size for the Java heap and young generation size.

Also of relevance is Ice's per-connection buffer caching feature. Marshaling buffers live for the duration of a connection between a client and server communicator. Because these buffers can be large, they can quickly fill the young generation space and cause a collection. On the other hand, with buffer caching, the buffer objects are quickly moved to the tenured space. This is beneficial in that it leaves more room in the young generation space for transient objects used during marshaling and unmarshaling. However, if you create and close many connections, such as in a heavily-loaded server, the tenured space may fill more quickly, causing an expensive major collection pass. Ice 3.2 introduced a new configuration property, Ice.CacheMessageBuffers, that permits you to disable the per-connection buffering feature, thereby allowing most of the transient request data to remain completely in the young generation space. Naturally, your Ice for Java applications (especially servers) will be happier with loads of memory.
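As an illustration, disabling the buffer cache might look like the following configuration fragment. This is a sketch under assumptions: we assume the properties are loaded from a configuration file (for example via --Ice.Config) and that a value of 0 turns caching off; please check the property reference for your Ice version before relying on it.

```
# Assumed setting: 0 disables per-connection message buffer caching.
Ice.CacheMessageBuffers=0
```

Whether this helps depends on your connection pattern, so it is worth measuring with and without the setting.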

By the way, if you have read somewhere that calling System.gc() is a bad idea, here is why: System.gc() forces a major collection, so all of your careful tuning goes out the window if an application calls System.gc(). Fortunately, if an application does this, you can run it with -XX:+DisableExplicitGC to make the call a no-op.

Another area that affects performance is what the garbage collector does when it needs to reclaim memory. The following collectors are available in J2SE 5.0:

  • The default (serial) collector
  • Throughput collector (-XX:+UseParallelGC)
  • Concurrent low pause collector (-XX:+UseConcMarkSweepGC)
  • Incremental collector (-XX:+UseTrainGC)

The default collector is a serial collector that pauses the application during minor and major collections. If the host machine has a single CPU, the serial collector will likely be as fast as or faster than the other collectors.

Pausing the entire application during garbage collection wastes one or more CPUs for the duration of a collector run so, on hosts with multiple CPUs, the throughput and concurrent low-pause collectors are worth looking at. The throughput collector uses the same collection mechanism as the serial collector for major collections, but implements a parallel minor collector for the young generation space. On the other hand, the concurrent low-pause collector attempts to perform most of the work of a major collection without interrupting your application. (Your application may still pause briefly during collection, but not as long as with the serial collector.) As with the throughput collector, the concurrent low-pause collector collects objects in the young generation space in parallel. Finally, the incremental collector tries to perform part of the work of a major collection each time it does a minor collection to amortize the cost of a major collection. However, the incremental collector is deprecated and will eventually be removed.
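To see which collectors a JVM is actually running with, and how much work they have done so far, you can query the standard GarbageCollectorMXBean interface. This is a minimal sketch; the class name CollectorReport is ours, and the collector names it prints are JVM-specific, so treat them as informational only.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the active collectors with their totals so far.
final class CollectorReport {
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are -1 if the collector does not report them.
            lines.add(gc.getName() + ": collections=" + gc.getCollectionCount()
                    + ", time=" + gc.getCollectionTime() + "ms");
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println("CPUs visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

Running it once with the default settings and once with, say, -XX:+UseParallelGC shows whether the option took effect, and the CPU count indicates whether a parallel collector has anything to work with.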

Which collector works best for your application depends on the load of the application and the host system. However, unless you have a system with multiple CPUs, tuning the memory configuration is more likely to yield performance gains than changing the garbage collector. Let's look at the Ice throughput demo as an example.

Running on a Dell Dimension 8250 (Intel P4 with HT enabled, 1GB of RAM, with CentOS 4.4 Linux), we get the following results:

Configuration                   Throughput in Mbps
byte sequence send              785
byte sequence echo              795
string sequence send            41
variable-length struct send     59
fixed-length struct send        103

If we fix the heap size and increase the size of the young generation using the options -Xms250m -Xmx250m -Xmn100m, we get:

Configuration                   Throughput in Mbps
byte sequence send              870
byte sequence echo              885
string sequence send            94
variable-length struct send     104
fixed-length struct send        126

So by just tweaking the memory configuration a little bit, we can get a fairly sizable increase in throughput.
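To put numbers on "fairly sizable", the sketch below computes the relative gain for each test from the two sets of results above. The class name ThroughputGain is ours; the figures are copied from the measurements. The byte sequence tests improve by roughly 11%, while the string sequence test more than doubles.

```java
import java.util.Locale;

// Hypothetical helper: relative throughput gain between two measurements.
final class ThroughputGain {
    // Percentage increase from 'before' to 'after'.
    static double percentGain(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        String[] tests = {
            "byte sequence send", "byte sequence echo", "string sequence send",
            "variable-length struct send", "fixed-length struct send"
        };
        int[] defaults = {785, 795, 41, 59, 103}; // default JVM settings
        int[] tuned = {870, 885, 94, 104, 126};   // -Xms250m -Xmx250m -Xmn100m
        for (int i = 0; i < tests.length; i++) {
            System.out.printf(Locale.ROOT, "%-28s %4d -> %4d Mbps (%+.1f%%)%n",
                    tests[i], defaults[i], tuned[i],
                    percentGain(defaults[i], tuned[i]));
        }
    }
}
```

The tests that create the most transient objects (strings and variable-length structs) benefit the most, which is consistent with the young generation being the bottleneck.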

This machine only has a single CPU (hyper-threading notwithstanding), so let's see what happens if we throw a parallel collector (-XX:+UseParallelGC) into the mix.

Configuration                   Throughput in Mbps
byte sequence send              840
byte sequence echo              860
string sequence send            90
variable-length struct send     100
fixed-length struct send        125

The parallel garbage collection doesn't appear to help here. However, on a machine with several CPUs, you should see some benefits by enabling parallel collection.
