One last word on the effect of memory-mapped IO on the VIRT, RES and SHR columns in the output of the top utility.

With mmap enabled, VIRT can be big, as much as the sum of the sizes of all index and data files plus the sizes of shared libraries. RES is the sum of 1) the Java heap, 2) the native heap (JVM-internal data, Java stacks for all threads, direct buffers, JNA-allocated memory, PermGen), 3) the native stack, 4) minor things like static data (C, not Java) and 5) with mmap enabled, the RAM pages into which the OS has currently loaded the contents of memory-mapped files. The 5th component of RES is managed by the OS and can fluctuate wildly. The SHR column shows the RAM pages into which the OS has currently loaded the contents of memory-mapped files but to which the process has not written. Because Cassandra doesn't use memory-mapped IO for writing (correct me if I am wrong), SHR and the 5th component of RES are identical. Without mmap, the 5th component is negligible.

To determine whether there is a native leak, one needs to look at m = RES - SHR - JavaHeap - PermGen - (JavaStack * #-of-threads). If m keeps growing, there is a native leak.
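
To make the arithmetic concrete, here is a throwaway sketch (the class name and the sample numbers are made up; RES and SHR are read off top, the rest comes from the JVM flags and a thread count):

// Throwaway sketch of the arithmetic above -- not Cassandra code. RES and SHR
// come from top; heap, PermGen and stack sizes come from the JVM flags
// (-Xmx, -XX:MaxPermSize, -Xss); the thread count from a thread dump.
public final class NativeLeakEstimate
{
    /** All sizes in bytes: m = RES - SHR - JavaHeap - PermGen - (JavaStack * #-of-threads). */
    static long estimate(long res, long shr, long javaHeap, long permGen,
                         long javaStackSize, int threadCount)
    {
        return res - shr - javaHeap - permGen - javaStackSize * (long) threadCount;
    }

    public static void main(String[] args)
    {
        // Made-up example: 2.1G resident, 600M of mmapped pages in SHR, 1G heap,
        // 128M PermGen, 200 threads with 256K stacks each.
        long m = estimate(2150L << 20, 600L << 20, 1024L << 20, 128L << 20,
                          256L << 10, 200);
        System.out.println("native (non-mmap) estimate: " + (m >> 20) + " MB");
    }
}
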
On Sun, May 15, 2011 at 11:52 AM, Hannes Schmidt <han...@eyealike.com> wrote:
> As promised: https://issues.apache.org/jira/browse/CASSANDRA-2654
>
> On Sun, May 15, 2011 at 7:09 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> Great debugging work!
>>
>> That workaround sounds like the best alternative to me too.
>>
>> On Sat, May 14, 2011 at 9:46 PM, Hannes Schmidt <han...@eyealike.com> wrote:
>>> Ok, so I think I found one major cause contributing to the increasing resident size of the Cassandra process. Looking at the OpenJDK sources was of great help in understanding the problem, but my findings also apply to the Sun/Oracle JDK because the affected code is shared by both.
>>>
>>> Each IncomingTcpConnection (ITC) thread handles a socket to another node. That socket is the one returned from ServerSocket.accept() and as such it is implemented on top of an NIO socket channel (sun.nio.ch.SocketAdaptor), which in turn makes use of direct byte buffers. It obtains these buffers from sun.nio.ch.Util, which caches the 3 most recently used buffers per thread. If a cached buffer isn't large enough for a message, a new one that is large enough replaces it. The size of the buffer is determined by the amount of data that the application requests to be read. ITC uses the readFully() method of DataInputStream (DIS) to read data into a byte array allocated to hold the entire message:
>>>
>>> int size = socketStream.readInt();
>>> byte[] buffer = new byte[size];
>>> socketStream.readFully(buffer);
>>>
>>> Whatever value 'size' ends up having will also be the size of the direct buffer allocated by the socket channel code.
>>>
>>> Our application uses range queries whose result sets are around 40 megabytes in size. If a range isn't hosted on the node the application client is connected to, the range's result set will be fetched from another node. When that other node has prepared the result, it sends it back (asynchronously, this took me a while to grasp) and it ends up in the direct byte buffer that sun.nio.ch.Util caches for the ITC thread on the original node.
>>>
>>> The thing is that the buffer holds the entire message, all 40 megs of it. ITC is rather long-lived and so the buffers simply stick around. Our range queries cover the entire ring (we do a lot of "map reduce") and so each node ends up with as many 40M buffers as we have nodes in the ring, 10 in our case. That's 400M of native heap space wasted on each node.
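
As an aside, on Java 7 and newer (possibly newer than what we have deployed) the JVM exposes its direct and mapped buffer pools via BufferPoolMXBean, which is a convenient way to confirm this kind of off-heap growth; the "mapped" pool also covers the memory-mapped data files. A minimal sketch, class name made up:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

// Prints the JVM's own accounting of its direct (off-heap) and mapped buffers.
// Run it from inside the JVM you want to inspect (e.g. a periodic debug task),
// or read the same MXBeans remotely over JMX.
public final class DirectBufferReport
{
    public static void main(String[] args)
    {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools)
        {
            System.out.println(pool.getName() + ": count=" + pool.getCount()
                    + " used=" + pool.getMemoryUsed()
                    + " capacity=" + pool.getTotalCapacity());
        }
    }
}
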
>>> Each ITC thread holds onto the historically largest direct buffer, possibly for a long time. This could be alleviated by periodically closing the connection, thereby releasing a potentially large buffer; the replacement thread starts with a clean slate. If all queries have large result sets, this won't help. Another alternative is to read the message incrementally rather than buffering it in its entirety in a byte array as ITC currently does. A third and possibly the simplest solution would be to read the message into the buffer in chunks of, say, 1M. DIS offers readFully(data, offset, length) for that. I have tried this solution and it fixes the problem for us. I'll open an issue and submit my patch. We have observed the issue with 0.6.12, but from looking at ITC in trunk it seems to be affected too.
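
For anyone curious what the chunked variant looks like, here is an illustrative sketch (the 1M chunk size and the readMessage() wrapper are mine; the actual change is the patch on CASSANDRA-2654):

import java.io.DataInputStream;
import java.io.IOException;

// Illustration of the chunked read only -- the real fix is the patch attached
// to CASSANDRA-2654. Reading in 1M pieces keeps the temporary direct buffer
// underneath the socket channel at 1M instead of the full message size.
public final class ChunkedRead
{
    private static final int CHUNK = 1 << 20; // 1M

    static byte[] readMessage(DataInputStream socketStream) throws IOException
    {
        int size = socketStream.readInt();
        byte[] buffer = new byte[size];
        for (int offset = 0; offset < size; offset += CHUNK)
        {
            socketStream.readFully(buffer, offset, Math.min(CHUNK, size - offset));
        }
        return buffer;
    }
}
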
>>> It gets worse though: even after the ITC thread dies, the cached buffers stick around because they are held via SoftReferences. SRs are released only as a last resort, to prevent an OutOfMemoryError. Using SRs for caching direct buffers is silly because direct buffers have a negligible impact on the Java heap but may have a dramatic impact on the native heap. I am not the only one who thinks so [1]. In other words, sun.nio.ch.Util's buffer caching is severely broken. I tried to find a way to explicitly release soft references but haven't found anything other than allocating an oversized array to force an OutOfMemoryError. The only thing we can do is to keep the buffer sizes small in order to reduce the impact of the leak. My patch takes care of that.
>>>
>>> I will post a link to the JIRA issue with the patch shortly.
>>>
>>> [1] http://bugs.sun.com/view_bug.do?bug_id=6210541
>>>
>>> On Wed, May 4, 2011 at 11:50 AM, Hannes Schmidt <han...@eyealike.com> wrote:
>>>> Hi,
>>>>
>>>> We are using Cassandra 0.6.12 in a cluster of 9 nodes. Each node is 64-bit, has 4 cores and 4G of RAM and runs Ubuntu Lucid with the stock 2.6.32-31-generic kernel. We use the Sun/Oracle JDK.
>>>>
>>>> Here's the problem: the Cassandra process starts up with 1.1G resident memory (according to top) but slowly grows to 2.1G, at a rate that seems proportional to the write load. No writes, no growth. The node is running other memory-sensitive applications (a second JVM for our in-house webapp and a short-lived C++ program), so we need to ensure that each process stays within certain bounds as far as memory requirements go. The nodes OOM and crash when the Cassandra process is at 2.1G, so I can't say whether the growth is bounded or not.
>>>>
>>>> Looking at /proc/$pid/smaps for the Cassandra process, it seems to me that it is the native heap of the Cassandra JVM that is leaking. I attached a readable version of the smaps file generated by [1].
>>>>
>>>> Some more data: Cassandra runs with default command line arguments, which means it gets a 1G heap. The JNA jar is present and Cassandra logs that the memory locking was successful. In storage-conf.xml, DiskAccessMode is mmap_index_only. Other than that and some increased timeouts we left the defaults. Swap is completely disabled. I don't think this is related but I am mentioning it anyways: overcommit [2] is always on (vm.overcommit_memory=1). Without that we get OOMs when our application JVM is fork()'ing and exec()'ing our C++ program even though there is enough free RAM to satisfy the demands of the C++ program. We think this is caused by a flawed kernel heuristic that assumes that the forked process (our C++ app) is as big as the forking one (the 2nd JVM). Anyways, the Cassandra process leaks with both vm.overcommit_memory=0 (the default) and 1.
>>>>
>>>> Whether it is the native heap that leaks or something else, I think that 1.1G of additional RAM for 1G of Java heap can't be normal. I'd be grateful for any insights or pointers on what to try next.
>>>>
>>>> [1] http://bmaurer.blogspot.com/2006/03/memory-usage-with-smaps.html
>>>> [2] http://www.win.tue.nl/~aeb/linux/lk/lk-9.html#ss9.6
>>>>
>>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>