OK, I'm spending some time slogging through the cassandra-user archives; it seems lots of folks have hit this problem. I'm starting with a JVM upgrade, then skimming through JIRA looking for patches.

Ian
On Fri, May 21, 2010 at 12:09 PM, Ian Soboroff <isobor...@gmail.com> wrote:
> So at the moment, I'm not running my loader, and I'm looking at one node which is slow to respond to nodetool requests. At this point, it has a pile of hinted handoffs pending which don't seem to be draining out. The system.log shows that it's GCing pretty much constantly.
>
> Ian
>
> $ /usr/local/src/cassandra/bin/nodetool --host node7 tpstats
> Pool Name                  Active   Pending   Completed
> FILEUTILS-DELETE-POOL           0         0         178
> STREAM-STAGE                    0         0           0
> RESPONSE-STAGE                  0         0       21852
> ROW-READ-STAGE                  0         0           0
> LB-OPERATIONS                   0         0           0
> MESSAGE-DESERIALIZER-POOL       0         0     1648536
> GMFD                            0         0      125430
> LB-TARGET                       0         0           0
> CONSISTENCY-MANAGER             0         0           0
> ROW-MUTATION-STAGE              2         2     1886537
> MESSAGE-STREAMING-POOL          0         0           0
> LOAD-BALANCER-STAGE             0         0           0
> FLUSH-SORTER-POOL               0         0           0
> MEMTABLE-POST-FLUSHER           0         0         206
> FLUSH-WRITER-POOL               0         0         206
> AE-SERVICE-STAGE                0         0           0
> HINTED-HANDOFF-POOL             1       158          23
>
> On Fri, May 21, 2010 at 10:37 AM, Ian Soboroff <isobor...@gmail.com> wrote:
>> That's on the to-do list for today. Is there a tool to aggregate all the JMX stats from all nodes? I mean, something a little more complete than Nagios.
>>
>> Ian
>>
>> On Fri, May 21, 2010 at 10:23 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> You should check the JMX stages I posted about.
>>>
>>> On Fri, May 21, 2010 at 7:05 AM, Ian Soboroff <isobor...@gmail.com> wrote:
>>> > Just an update. I rolled the memtable size back to 128MB. I am still seeing that the daemon runs for a while with reasonable heap usage, but then the heap climbs up to the max (6GB in this case, which should be plenty) and it starts GCing, without much getting cleared. The client catches lots of exceptions; I wait 30 seconds and try again, with a new client if necessary, but it doesn't clear up.
>>> >
>>> > Could this be related to the memory-leak problems I've skimmed past on the list here?
>>> >
>>> > It can't be that I'm creating rows a bit at a time... once I stick a web page into two CFs, it's over and done with for this application. I'm just trying to get stuff loaded.
>>> >
>>> > Is there a limit to how much on-disk data a Cassandra daemon can manage? Is there runtime overhead associated with data on disk?
>>> >
>>> > Ian
>>> >
>>> > On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <isobor...@gmail.com> wrote:
>>> >> Excellent leads, thanks. cassandra.in.sh has a heap of 6GB, but I didn't realize that I was trying to float so many memtables. I'll poke tomorrow and report whether it gets fixed.
>>> >>
>>> >> Ian
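[Editor's note] As a rough answer to the JMX-aggregation question above: a minimal sketch that just shells out to nodetool tpstats for each node and prefixes every line with the host name, so pending counts for all stages can be compared in one place. The nodetool path matches the one shown in the thread; the host list is a placeholder, not Ian's actual cluster.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// Roll up "nodetool --host <node> tpstats" across a list of nodes.
// NODETOOL and NODES are assumptions for illustration.
public class ClusterTpstats {
    private static final String NODETOOL = "/usr/local/src/cassandra/bin/nodetool";
    private static final String[] NODES = { "node1", "node2", "node3", "node7" };

    public static void main(String[] args) throws Exception {
        for (String node : NODES) {
            Process p = new ProcessBuilder(NODETOOL, "--host", node, "tpstats")
                    .redirectErrorStream(true)
                    .start();
            BufferedReader in = new BufferedReader(new InputStreamReader(p.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                // Prefix each tpstats line with the node it came from.
                System.out.println(node + "\t" + line);
            }
            p.waitFor();
        }
    }
}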
>>> >> Ian >>> >> >>> >> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jbel...@gmail.com> >>> >> wrote: >>> >>> >>> >>> Some possibilities: >>> >>> >>> >>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too >>> >>> small) >>> >>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show >>> >>> large pending ops -- large = 100s) >>> >>> You're creating large rows a bit at a time and Cassandra OOMs when it >>> >>> tries to compact (the oom should usually be in the compaction thread) >>> >>> You have your 5 disks each with a separate data directory, which will >>> >>> allow up to 12 total memtables in-flight internally, and 12*256 is >>> too >>> >>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will >>> >>> show large pending ops -- large = more than 2 or 3) >>> >>> >>> >>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <isobor...@gmail.com> >>> >>> wrote: >>> >>> > I hope this isn't too much of a newbie question. I am using >>> Cassandra >>> >>> > 0.6.1 >>> >>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and >>> 5 >>> >>> > data >>> >>> > drives. The nodes are running HDFS to serve files within the >>> cluster, >>> >>> > but >>> >>> > at the moment the rest of Hadoop is shut down. I'm trying to load >>> a >>> >>> > large >>> >>> > set of web pages (the ClueWeb collection, but more is coming) and >>> my >>> >>> > Cassandra daemons keep dying. >>> >>> > >>> >>> > I'm loading the pages into a simple column family that lets me >>> fetch >>> >>> > out >>> >>> > pages by an internal ID or by URL. The biggest thing in the row is >>> the >>> >>> > page >>> >>> > content, maybe 15-20k per page of raw HTML. There aren't a lot of >>> >>> > columns. >>> >>> > I tried Thrift, Hector, and the BMT interface, and at the moment >>> I'm >>> >>> > doing >>> >>> > batch mutations over Thrift, about 2500 pages per batch, because >>> that >>> >>> > was >>> >>> > fastest for me in testing. >>> >>> > >>> >>> > At this point, each Cassandra node has between 500GB and 1.5TB >>> >>> > according to >>> >>> > nodetool ring. Let's say I start the daemons up, and they all go >>> live >>> >>> > after >>> >>> > a couple minutes of scanning the tables. I then start my importer, >>> >>> > which is >>> >>> > a single Java process reading Clueweb bundles over HDFS, cutting >>> them >>> >>> > up, >>> >>> > and sending the mutations to Cassandra. I only talk to one node at >>> a >>> >>> > time, >>> >>> > switching to a new node when I get an exception. As the job runs >>> over >>> >>> > a few >>> >>> > hours, the Cassandra daemons eventually fall over, either with no >>> error >>> >>> > in >>> >>> > the log or reporting that they are out of heap. >>> >>> > >>> >>> > Each daemon is getting 6GB of RAM and has scads of disk space to >>> play >>> >>> > with. >>> >>> > I've set the storage-conf.xml to take 256MB in a memtable before >>> >>> > flushing >>> >>> > (like the BMT case), and to do batch commit log flushes, and to not >>> >>> > have any >>> >>> > caching in the CFs. I'm sure I must be tuning something wrong. I >>> >>> > would >>> >>> > eventually like this Cassandra setup to serve a light request load >>> but >>> >>> > over >>> >>> > say 50-100 TB of data. I'd appreciate any help or advice you can >>> >>> > offer. 
>>> >>>
>>> >>> --
>>> >>> Jonathan Ellis
>>> >>> Project Chair, Apache Cassandra
>>> >>> co-founder of Riptano, the source for professional Cassandra support
>>> >>> http://riptano.com
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
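[Editor's note] Earlier in the thread Ian describes his loader catching exceptions, waiting 30 seconds, and retrying with a new client, switching to another node when one misbehaves. A minimal sketch of that pattern, under the same assumptions as above (host names and the Batch callback are placeholders, not code from the actual importer).

import java.util.concurrent.TimeUnit;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

// Retry a batch forever: on any failure, rotate to the next node,
// back off 30 seconds, and reconnect with a fresh client.
public class RetryingSender {
    interface Batch { void send(Cassandra.Client client) throws Exception; }

    private final String[] nodes = { "node1", "node2", "node3", "node7" };
    private int current = 0;

    public void sendWithRetry(Batch batch) throws InterruptedException {
        while (true) {
            TTransport transport = null;
            try {
                transport = new TSocket(nodes[current], 9160);
                transport.open();
                batch.send(new Cassandra.Client(new TBinaryProtocol(transport)));
                return;                                    // success
            } catch (Exception e) {
                current = (current + 1) % nodes.length;    // switch to the next node
                TimeUnit.SECONDS.sleep(30);                // back off before retrying
            } finally {
                if (transport != null) transport.close();
            }
        }
    }
}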