Yes. For your other question, I'm not sure, but it makes sense that the Cassandra memory usage would be separate from the Pig memory usage - so Pig may well be doing the spill to disk.
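To illustrate the earlier question about imposing a schema at load time, here is a rough sketch of what an explicit schema might look like (the keyspace/column family names are placeholders, and the (name, value) tuple shape is an assumption about what CassandraStorage emits for this version):

-- declare the columns explicitly as a bag of (name, value) tuples
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage()
       AS (key: chararray, columns: bag { column: tuple (name, value) });
-- count the tuples in each row's bag
counts = FOREACH rows GENERATE key, COUNT(columns);
dump counts;

With the bag declared in the schema, Pig's normal spillable-bag machinery should apply to the columns; whether that helps here depends on whether the memory pressure is really on the Pig side at all.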
On Mar 25, 2011, at 6:21 PM, Jeffrey Wang wrote:

> Just to be clear, it's also the case that if I have a Hadoop TaskTracker
> running on each node that Cassandra is running on, a map/reduce job will
> automatically handle data locality, right? I.e. each mapper will only read
> splits which live on the same box.
>
> -Jeffrey
>
> -----Original Message-----
> From: Jeffrey Wang [mailto:jw...@palantir.com]
> Sent: Friday, March 25, 2011 11:42 AM
> To: user@cassandra.apache.org
> Subject: RE: pig counting question
>
> I don't think it's Pig running out of memory, but rather Cassandra itself
> (the data doesn't even make it to Pig). get_range_slices() is called with a
> row batch size of 4096, the default, and it's fetching all of the columns in
> each row. If I have 10K columns in each row, that's a huge request, and
> Cassandra runs into memory pressure trying to serve it.
>
> That's my understanding of it; if there's something I'm missing, please let
> me know.
>
> -Jeffrey
>
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Friday, March 25, 2011 11:06 AM
> To: user@cassandra.apache.org
> Subject: Re: pig counting question
>
> One thing I wonder though - if your columns are the thing that's increasing
> your heap size and eating up a lot of memory, and you're reading the data
> structure out as a bag of columns, why isn't Pig spilling to disk instead of
> growing in memory? The Pig model is that you can have huge bags that don't
> kill you on memory; they're just slower because they spill to disk. What
> is the schema that you impose when you load the data?
>
> On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:
>
>> It looks like this functionality is not in the 0.7.3 version of
>> CassandraStorage. I tried to add the constructor which takes the limit to
>> the class, but I ran into some Pig parsing errors, so I had to make the
>> parameter a string. How did you get around this for the version of
>> CassandraStorage in trunk? I'm running Pig 0.8.0.
>>
>> Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra
>> starts eating up huge amounts of memory, maxing out my 16GB heap size. I
>> suspect this is because of the get_range_slices() call from
>> ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
>>
>> -Jeffrey
>>
>> -----Original Message-----
>> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
>> Sent: Thursday, March 24, 2011 11:34 AM
>> To: user@cassandra.apache.org
>> Subject: Re: pig counting question
>>
>> The limit defaults to 1024, but you can set it when you use CassandraStorage
>> in Pig, like so:
>>
>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
>>
>> or whatever value you wish.
>>
>> Give that a try and see if it gives you more of what you're looking for.
>>
>> On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
>>
>>> Hey all,
>>>
>>> I'm trying to run a very simple Pig script against my Cassandra cluster (5
>>> nodes, 0.7.3). I've gotten it all set up and working, but the script is
>>> giving me some strange results. Here is my script:
>>>
>>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
>>> rowct = FOREACH rows GENERATE $0, COUNT($1);
>>> dump rowct;
>>>
>>> If I understand Pig correctly, this should output (row name, column count)
>>> tuples, but I'm always seeing 1024 for the column count even though the
>>> rows have a highly variable number of columns. Am I missing something? Thanks.
>>>
>>> -Jeffrey
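Pulling the thread together, a sketch of the original script with the slice limit raised past the widest row, so COUNT reflects the real column counts rather than the 1024 default (16384 is an arbitrary example value; whether the argument must be quoted as a string depends on the CassandraStorage version, per Jeffrey's parsing workaround above):

-- raise the per-row column limit above the widest row; string form per the workaround
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage('16384');
-- count the columns actually returned for each row key
rowct = FOREACH rows GENERATE $0, COUNT($1);
dump rowct;

The trade-off, as described earlier in the thread, is that a very large limit multiplied by the get_range_slices() row batch size pushes the memory pressure onto Cassandra itself, so the limit should only be raised as far as the data actually requires.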