I'll work on doing more tests around this. In 0.5 we used a different data structure that required polling. But this does seem problematic.
-Chris

On Apr 26, 2010, at 7:04 PM, Eric Yu wrote:

> I have the same problem here, and I analyzed the hprof file with mat; as
> you said, LinkedBlockingQueue used 2.6GB.
> I think the thread pool of Cassandra should limit the queue size.
>
> cassandra 0.6.1
>
> java version
> $ java -version
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
>
> iostat
> $ iostat -x -l 1
> Device:  rrqm/s   wrqm/s   r/s     w/s    rkB/s     wkB/s    avgrq-sz  avgqu-sz  await  svctm  %util
> sda      81.00    8175.00  224.00  17.00  23984.00  2728.00  221.68    1.01      1.86   0.76   18.20
>
> tpstats, of course, this node is still alive
> $ ./nodetool -host localhost tpstats
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0           1281
> STREAM-STAGE                      0         0              0
> RESPONSE-STAGE                    0         0      473617241
> ROW-READ-STAGE                    0         0              0
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         0         0      718355184
> GMFD                              0         0         132509
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0              0
> ROW-MUTATION-STAGE                0         0      293735704
> MESSAGE-STREAMING-POOL            0         0              6
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0           1870
> FLUSH-WRITER-POOL                 0         0           1870
> AE-SERVICE-STAGE                  0         0              5
> HINTED-HANDOFF-POOL               0         0             21
>
>
> On Tue, Apr 27, 2010 at 3:32 AM, Chris Goffinet <goffi...@digg.com> wrote:
>
> Upgrade to b20 of Sun's version of the JVM. This OOM might be related to
> LinkedBlockingQueue issues that were fixed.
>
> -Chris
>
>
> 2010/4/26 Roland Hänel <rol...@haenel.me>
>
> Cassandra version 0.6.1
> OpenJDK Server VM (build 14.0-b16, mixed mode)
> Import speed is about 10MB/s for the full cluster; if a compaction is going
> on, the individual node is I/O limited.
> tpstats: caught me, didn't know this. I will set up a test and try to catch a
> node during the critical time.
>
> Thanks,
> Roland
>
>
> 2010/4/26 Chris Goffinet <goffi...@digg.com>
>
> Which version of Cassandra?
> Which version of the Java JVM are you using?
> What do your I/O stats look like when bulk importing?
> When you run `nodeprobe -host XXXX tpstats`, is any thread pool backing up
> during the import?
>
> -Chris
>
>
> 2010/4/26 Roland Hänel <rol...@haenel.me>
>
> I have a cluster of 5 machines building a Cassandra datastore, and I load
> bulk data into it using the Java Thrift API. The first ~250GB runs fine;
> then one of the nodes starts to throw OutOfMemory exceptions. I'm not using
> any row or index caches, and since I only have 5 CFs and some 2.5 GB of RAM
> allocated to the JVM (-Xmx2500M), in theory, that shouldn't happen. All inserts
> are done with consistency level ALL.
>
> I hope with this I have avoided all the 'usual dummy errors' that lead to
> OOMs. I have begun to troubleshoot the issue with JMX; however, it's
> difficult to catch the JVM at the right moment because it runs well for
> several hours before this thing happens.
>
> One thing comes to my mind; maybe one of the experts could confirm or reject
> this idea for me: is it possible that when one machine slows down a little
> bit (for example because a big compaction is going on), the memtables don't
> get flushed to disk as fast as they are building up under the continuing bulk
> import? That would result in a downward spiral: the system gets slower and
> slower on disk I/O, but since more and more data arrives over Thrift, finally
> OOM.
>
> I'm using the "periodic" commit log sync; maybe this too could create a
> situation where the commit log writer is too slow to catch up with the data
> intake, resulting in ever-growing memory usage?
> Maybe these thoughts are just bullshit. Let me know if so... ;-)
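
For reference, Eric's suggestion to limit the queue size would look roughly like the sketch below. This is not Cassandra's actual executor code; the class name, thread count, and queue capacity are made up for illustration. The point is that a bounded ArrayBlockingQueue plus CallerRunsPolicy pushes back on producers once the queue fills up, instead of letting an unbounded LinkedBlockingQueue grow until it eats the heap (the 2.6GB Eric saw in the heap dump).

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedStageExecutor {
    // Sketch of a stage executor with a bounded work queue (names/sizes hypothetical).
    public static ThreadPoolExecutor newBoundedStage(int threads, int maxQueuedTasks) {
        return new ThreadPoolExecutor(
                threads, threads,
                60L, TimeUnit.SECONDS,
                // Bounded queue: pending work can never exceed maxQueuedTasks.
                new ArrayBlockingQueue<Runnable>(maxQueuedTasks),
                // When the queue is full, the submitting thread runs the task
                // itself, which throttles producers instead of growing the heap.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}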
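
On Roland's point that it is hard to catch the JVM at the right moment: instead of watching jconsole by hand, a small JMX poller can log the pending counts over the whole import and show whether a stage backs up before the OOM. A minimal sketch using the standard JMX remote API; the port and the MBean ObjectName below are assumptions, so verify them against your node with jconsole first.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingTasksWatcher {
    public static void main(String[] args) throws Exception {
        // Adjust host/port to your node's JMX settings.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();

        // Assumed MBean name for the write stage -- check the exact
        // ObjectName exposed by your build before relying on it.
        ObjectName stage = new ObjectName(
                "org.apache.cassandra.concurrent:type=ROW-MUTATION-STAGE");

        while (true) {
            Object pending = mbs.getAttribute(stage, "PendingTasks");
            System.out.println(System.currentTimeMillis() + " pending=" + pending);
            Thread.sleep(5000); // sample every 5 seconds for the whole import run
        }
    }
}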