Excellent leads, thanks. cassandra.in.sh has the heap at 6GB, but I didn't realize I was trying to keep so many memtables in flight at once. I'll poke at it tomorrow and report back if that fixes it.

Ian
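P.S. For the archives: the heap lives in the JVM_OPTS block of cassandra.in.sh. The 6G is my only change from the stock 0.6 file; the GC flags shown here are from memory of the default file and trimmed, so treat this as a rough excerpt rather than the exact contents:

    # cassandra.in.sh excerpt -- only change from stock is the 1G -> 6G heap
    JVM_OPTS=" \
            -ea \
            -Xms6G \
            -Xmx6G \
            -XX:+UseParNewGC \
            -XX:+UseConcMarkSweepGC \
            -XX:+CMSParallelRemarkEnabled"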
On Thu, May 20, 2010 at 10:40 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> Some possibilities:
>
> You didn't adjust Cassandra heap size in cassandra.in.sh (1GB is too small)
>
> You're inserting at CL.ZERO (ROW-MUTATION-STAGE in tpstats will show large pending ops -- large = 100s)
>
> You're creating large rows a bit at a time and Cassandra OOMs when it tries to compact (the OOM should usually be in the compaction thread)
>
> You have your 5 disks each with a separate data directory, which will allow up to 12 total memtables in-flight internally, and 12*256 is too much for the heap size you have (FLUSH-WRITER-STAGE in tpstats will show large pending ops -- large = more than 2 or 3)
>
> On Tue, May 18, 2010 at 6:24 AM, Ian Soboroff <isobor...@gmail.com> wrote:
> > I hope this isn't too much of a newbie question. I am using Cassandra 0.6.1 on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5 data drives. The nodes are running HDFS to serve files within the cluster, but at the moment the rest of Hadoop is shut down. I'm trying to load a large set of web pages (the ClueWeb collection, but more is coming) and my Cassandra daemons keep dying.
> >
> > I'm loading the pages into a simple column family that lets me fetch out pages by an internal ID or by URL. The biggest thing in the row is the page content, maybe 15-20k per page of raw HTML. There aren't a lot of columns. I tried Thrift, Hector, and the BMT interface, and at the moment I'm doing batch mutations over Thrift, about 2500 pages per batch, because that was fastest for me in testing.
> >
> > At this point, each Cassandra node has between 500GB and 1.5TB according to nodetool ring. Let's say I start the daemons up, and they all go live after a couple of minutes of scanning the tables. I then start my importer, which is a single Java process reading ClueWeb bundles over HDFS, cutting them up, and sending the mutations to Cassandra. I only talk to one node at a time, switching to a new node when I get an exception. As the job runs over a few hours, the Cassandra daemons eventually fall over, either with no error in the log or reporting that they are out of heap.
> >
> > Each daemon is getting 6GB of RAM and has scads of disk space to play with. I've set storage-conf.xml to take 256MB in a memtable before flushing (like the BMT case), to do batch commit log flushes, and to not have any caching in the CFs. I'm sure I must be tuning something wrong. I would eventually like this Cassandra setup to serve a light request load, but over, say, 50-100 TB of data. I'd appreciate any help or advice you can offer.
> >
> > Thanks,
> > Ian
> >
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
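[Adding below the quote for anyone who finds this thread later: the piece of storage-conf.xml that Jonathan's last point is about. The 256MB threshold, the batch commit log sync, and the five data directories are what I'm running; the paths, the sync window value, and the exact element names are from memory of the 0.6 sample file, so double-check them against your own copy. At 12 memtables in flight, 12 x 256MB is roughly 3GB of the 6GB heap before compaction or anything else gets a turn.]

    <!-- storage-conf.xml excerpt; placeholder paths, 0.6-era element names -->
    <DataFileDirectories>
        <DataFileDirectory>/data1/cassandra/data</DataFileDirectory>
        <DataFileDirectory>/data2/cassandra/data</DataFileDirectory>
        <DataFileDirectory>/data3/cassandra/data</DataFileDirectory>
        <DataFileDirectory>/data4/cassandra/data</DataFileDirectory>
        <DataFileDirectory>/data5/cassandra/data</DataFileDirectory>
    </DataFileDirectories>

    <!-- global memtable and commit log settings -->
    <MemtableThroughputInMB>256</MemtableThroughputInMB>
    <CommitLogSync>batch</CommitLogSync>
    <CommitLogSyncBatchWindowInMS>1</CommitLogSyncBatchWindowInMS>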