Just an update.  I rolled the memtable size back to 128MB.  I am still
seeing that the daemon runs for a while with reasonable heap usage, but then
the heap climbs to the max (6GB in this case, which should be plenty) and it
starts GCing without much getting cleared.  The client catches lots of
exceptions; when it does, I wait 30 seconds and retry, creating a new client
if necessary, but the situation never clears up.
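For what it's worth, the retry loop on my side looks roughly like the sketch
below.  This is a simplified stand-in, not my actual importer code: the real
operation is a Thrift batch_mutate call, and the real backoff is 30 seconds
rather than the short delay used here for illustration.

```java
import java.util.concurrent.Callable;

public class RetryLoop {
    // Retry `op` up to maxAttempts times, sleeping backoffMillis between
    // failures; rethrows the last exception if every attempt fails.
    static <T> T withRetry(Callable<T> op, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;  // in the real loop, this is where I'd also
                           // tear down and recreate the Thrift client
                if (attempt < maxAttempts) Thread.sleep(backoffMillis);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Stand-in operation: fails twice, then succeeds, mimicking a
        // node that eventually recovers.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("timeout");
            return "ok";
        }, 5, 10);  // short backoff for the demo; the real loop waits 30s
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The trouble is that when the heap is full of uncollectable data, no amount
of client-side retrying helps, which is why I suspect the server side.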

Could this be related to the memory-leak problems I've skimmed past on this
list?

It can't be that I'm creating rows a bit at a time... once I stick a web
page into two CFs, it's over and done with for this application.  I'm just
trying to get stuff loaded.

Is there a limit to how much on-disk data a Cassandra daemon can manage?  Is
there runtime overhead associated with stuff on disk?

Ian

On Thu, May 20, 2010 at 9:31 PM, Ian Soboroff <isobor...@gmail.com> wrote:

> Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I didn't
> realize that I was trying to float so many memtables.  I'll poke tomorrow
> and report if it gets fixed.
> Ian
>
>
> On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Some possibilities:
>>
>> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
>> small)
>> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
>> large pending ops -- large = 100s)
>> You're creating large rows a bit at a time and Cassandra OOMs when it
>> tries to compact (the oom should usually be in the compaction thread)
>> You have your 5 disks each with a separate data directory, which will
>> allow up to 12 total memtables in-flight internally, and 12*256 is too
>> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
>> show large pending ops -- large = more than 2 or 3)
>>
>> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <isobor...@gmail.com>
>> wrote:
>> > I hope this isn't too much of a newbie question.  I am using Cassandra
>> 0.6.1
>> > on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5
>> data
>> > drives.  The nodes are running HDFS to serve files within the cluster,
>> but
>> > at the moment the rest of Hadoop is shut down.  I'm trying to load a
>> large
>> > set of web pages (the ClueWeb collection, but more is coming) and my
>> > Cassandra daemons keep dying.
>> >
>> > I'm loading the pages into a simple column family that lets me fetch out
>> > pages by an internal ID or by URL.  The biggest thing in the row is the
>> page
>> > content, maybe 15-20k per page of raw HTML.  There aren't a lot of
>> columns.
>> > I tried Thrift, Hector, and the BMT interface, and at the moment I'm
>> doing
>> > batch mutations over Thrift, about 2500 pages per batch, because that
>> was
>> > fastest for me in testing.
>> >
>> > At this point, each Cassandra node has between 500GB and 1.5TB according
>> to
>> > nodetool ring.  Let's say I start the daemons up, and they all go live
>> after
>> > a couple minutes of scanning the tables.  I then start my importer,
>> which is
>> > a single Java process reading Clueweb bundles over HDFS, cutting them
>> up,
>> > and sending the mutations to Cassandra.  I only talk to one node at a
>> time,
>> > switching to a new node when I get an exception.  As the job runs over a
>> few
>> > hours, the Cassandra daemons eventually fall over, either with no error
>> in
>> > the log or reporting that they are out of heap.
>> >
>> > Each daemon is getting 6GB of RAM and has scads of disk space to play
>> with.
>> > I've set the storage-conf.xml to take 256MB in a memtable before
>> flushing
>> > (like the BMT case), and to do batch commit log flushes, and to not have
>> any
>> > caching in the CFs.  I'm sure I must be tuning something wrong.  I would
>> > eventually like this Cassandra setup to serve a light request load but
>> over
>> > say 50-100 TB of data.  I'd appreciate any help or advice you can offer.
>> >
>> > Thanks,
>> > Ian
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>