Excellent leads, thanks.  cassandra.in.sh has a heap of 6GB, but I didn't
realize that I was trying to float so many memtables.  I'll poke at it
tomorrow and report back on whether that fixes it.
Ian

On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> Some possibilities:
>
> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too
> small)
> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show
> large pending ops -- large = 100s)
> You're creating large rows a bit at a time and Cassandra OOMs when it
> tries to compact (the OOM should usually be in the compaction thread)
> You have your 5 disks each with a separate data directory, which will
> allow up to 12 total memtables in-flight internally, and 12*256 is too
> much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will
> show large pending ops -- large = more than 2 or 3)
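
For anyone finding this thread later, the checks above look roughly like
this on a 0.6.x install; the nodetool flags and the stock JVM_OPTS values
are from memory, so treat them as assumptions and verify against your own
copy:

    # watch the Pending column for ROW-MUTATION-STAGE and FLUSH-WRITER-STAGE
    bin/nodetool -host <cassandra-node> tpstats

    # heap size is set through JVM_OPTS in cassandra.in.sh; raising the
    # stock -Xmx (around 1G) to something like
    #     -Xms6G -Xmx6G
    # is what gives each daemon a 6GB heap

On the arithmetic in the last point: five data directories allowing up to
12 in-flight memtables of 256MB each is roughly 3GB of raw memtable data,
before the JVM object overhead that sits on top of it, so it can easily
crowd out a 6GB heap.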
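
On the CL.ZERO point, a Thrift loader gets a per-batch acknowledgement by
passing ConsistencyLevel.ONE to batch_mutate.  The sketch below is a
minimal illustration written against the 0.6-era Thrift interface as I
remember it; the host, Keyspace1, the Pages column family, and the row key
are all placeholders, so check the generated classes against
interface/cassandra.thrift in your build:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class PageLoaderSketch {
        public static void main(String[] args) throws Exception {
            // plain (unframed) Thrift socket, the 0.6 default
            TTransport transport = new TSocket("cass-node-01", 9160); // placeholder host
            Cassandra.Client client =
                    new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            // one column holding the raw HTML for one page
            long timestamp = System.currentTimeMillis() * 1000;
            Column content = new Column("content".getBytes("UTF-8"),
                                        "<html>...</html>".getBytes("UTF-8"),
                                        timestamp);
            ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
            cosc.setColumn(content);
            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(cosc);

            // batch_mutate takes row key -> (column family -> mutations)
            Map<String, List<Mutation>> byColumnFamily =
                    new HashMap<String, List<Mutation>>();
            byColumnFamily.put("Pages", Collections.singletonList(mutation)); // placeholder CF
            Map<String, Map<String, List<Mutation>>> batch =
                    new HashMap<String, Map<String, List<Mutation>>>();
            batch.put("clueweb-0000001", byColumnFamily); // placeholder row key

            // CL.ONE means the write is acknowledged before the call returns,
            // which throttles a single loader instead of letting CL.ZERO pile
            // up pending mutations on the server
            client.batch_mutate("Keyspace1", batch, ConsistencyLevel.ONE); // placeholder keyspace

            transport.close();
        }
    }

Acknowledged writes slow a single loader down a little, but they keep
ROW-MUTATION-STAGE from accumulating an unbounded backlog on the server.
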
>
> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <isobor...@gmail.com> wrote:
> > I hope this isn't too much of a newbie question.  I am using Cassandra
> > 0.6.1 on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM
> > and 5 data drives.  The nodes are running HDFS to serve files within
> > the cluster, but at the moment the rest of Hadoop is shut down.  I'm
> > trying to load a large set of web pages (the ClueWeb collection, but
> > more is coming) and my Cassandra daemons keep dying.
> >
> > I'm loading the pages into a simple column family that lets me fetch
> > out pages by an internal ID or by URL.  The biggest thing in the row
> > is the page content, maybe 15-20k per page of raw HTML.  There aren't
> > a lot of columns.  I tried Thrift, Hector, and the BMT interface, and
> > at the moment I'm doing batch mutations over Thrift, about 2500 pages
> > per batch, because that was fastest for me in testing.
> >
> > At this point, each Cassandra node has between 500GB and 1.5TB
> > according to nodetool ring.  Let's say I start the daemons up, and
> > they all go live after a couple minutes of scanning the tables.  I
> > then start my importer, which is a single Java process reading ClueWeb
> > bundles over HDFS, cutting them up, and sending the mutations to
> > Cassandra.  I only talk to one node at a time, switching to a new node
> > when I get an exception.  As the job runs over a few hours, the
> > Cassandra daemons eventually fall over, either with no error in the
> > log or reporting that they are out of heap.
> >
> > Each daemon is getting 6GB of RAM and has scads of disk space to play
> > with.  I've set the storage-conf.xml to take 256MB in a memtable
> > before flushing (like the BMT case), and to do batch commit log
> > flushes, and to not have any caching in the CFs.  I'm sure I must be
> > tuning something wrong.  I would eventually like this Cassandra setup
> > to serve a light request load but over say 50-100 TB of data.  I'd
> > appreciate any help or advice you can offer.
> >
> > Thanks,
> > Ian
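
For reference, the knobs Ian describes in that last paragraph live in
storage-conf.xml.  The excerpt below is a rough sketch, with element names
recalled from 0.6-era configs and placeholder paths and CF names, so
compare it against the sample file shipped with the release:

    <!-- commit log is fsynced in batches rather than periodically -->
    <CommitLogSync>batch</CommitLogSync>
    <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>

    <!-- flush a memtable once it holds roughly this much data -->
    <MemtableThroughputInMB>256</MemtableThroughputInMB>

    <!-- one entry per data drive -->
    <DataFileDirectories>
      <DataFileDirectory>/data1/cassandra/data</DataFileDirectory>
      <DataFileDirectory>/data2/cassandra/data</DataFileDirectory>
    </DataFileDirectories>

    <!-- key and row caches disabled on the page CF -->
    <ColumnFamily Name="Pages" CompareWith="UTF8Type"
                  KeysCached="0" RowsCached="0"/>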
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
