Yes. For your other question, I'm not sure, but it makes sense that the Cassandra memory usage would be separate from the Pig memory usage - so Pig may well be doing the spill to disk.
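To illustrate the earlier question about imposing a schema at load time, here is a rough sketch of what an explicit schema might look like (the keyspace/column family names are placeholders, and the (name, value) tuple shape is an assumption about what CassandraStorage emits for this version):

-- declare the columns explicitly as a bag of (name, value) tuples
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage()
       AS (key: chararray, columns: bag { column: tuple (name, value) });
-- count the tuples in each row's bag
counts = FOREACH rows GENERATE key, COUNT(columns);
dump counts;

With the bag declared in the schema, Pig's normal spillable-bag machinery should apply to the columns; whether that helps here depends on whether the memory pressure is really on the Pig side at all.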
On Mar 25, 2011, at 6:21 PM, Jeffrey Wang wrote:

> Just to be clear, it's also the case that if I have a Hadoop TaskTracker
> running on each node that Cassandra is running on, a map/reduce job will
> automatically handle data locality, right? I.e. each mapper will only read
> splits which live on the same box.
>
> -Jeffrey
>
> -----Original Message-----
> From: Jeffrey Wang [mailto:jw...@palantir.com]
> Sent: Friday, March 25, 2011 11:42 AM
> To: user@cassandra.apache.org
> Subject: RE: pig counting question
>
> I don't think it's Pig running out of memory, but rather Cassandra itself
> (the data doesn't even make it to Pig). get_range_slices() is called with a
> row batch size of 4096, the default, and it's fetching all of the columns in
> each row. If I have 10K columns in each row, that's a huge request, and
> Cassandra runs into memory pressure trying to serve it.
>
> That's my understanding of it; if there's something I'm missing, please let
> me know.
>
> -Jeffrey
>
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
> Sent: Friday, March 25, 2011 11:06 AM
> To: user@cassandra.apache.org
> Subject: Re: pig counting question
>
> One thing I wonder though - if your columns are the thing that's increasing
> your heap size and eating up a lot of memory, and you're reading the data
> structure out as a bag of columns, why isn't Pig spilling to disk instead of
> growing in memory? The Pig model is that you can have huge bags that don't
> kill you on memory; they're just slower because they spill to disk. What
> is the schema that you impose when you load the data?
>
> On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:
>
>> It looks like this functionality is not in the 0.7.3 version of
>> CassandraStorage. I tried to add the constructor which takes the limit to
>> the class, but I ran into some Pig parsing errors, so I had to make the
>> parameter a string. How did you get around this for the version of
>> CassandraStorage in trunk? I'm running Pig 0.8.0.
>>
>> Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra
>> starts eating up huge amounts of memory, maxing out my 16GB heap size. I
>> suspect this is because of the get_range_slices() call from
>> ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
>>
>> -Jeffrey
>>
>> -----Original Message-----
>> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com]
>> Sent: Thursday, March 24, 2011 11:34 AM
>> To: user@cassandra.apache.org
>> Subject: Re: pig counting question
>>
>> The limit defaults to 1024, but you can set it when you use CassandraStorage
>> in Pig, like so:
>>
>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
>>
>> or whatever value you wish.
>>
>> Give that a try and see if it gives you more of what you're looking for.
>>
>> On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
>>
>>> Hey all,
>>>
>>> I'm trying to run a very simple Pig script against my Cassandra cluster (5
>>> nodes, 0.7.3). I've gotten it all set up and working, but the script is
>>> giving me some strange results. Here is my script:
>>>
>>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
>>> rowct = FOREACH rows GENERATE $0, COUNT($1);
>>> dump rowct;
>>>
>>> If I understand Pig correctly, this should output (row name, column count)
>>> tuples, but I'm always seeing 1024 for the column count even though the
>>> rows have a highly variable number of columns. Am I missing something? Thanks.
>>>
>>> -Jeffrey
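Pulling the thread together, a sketch of the original script with the slice limit raised past the widest row, so COUNT reflects the real column counts rather than the 1024 default (16384 is an arbitrary example value; whether the argument must be quoted as a string depends on the CassandraStorage version, per Jeffrey's parsing workaround above):

-- raise the per-row column limit above the widest row; string form per the workaround
rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage('16384');
-- count the columns actually returned for each row key
rowct = FOREACH rows GENERATE $0, COUNT($1);
dump rowct;

The trade-off, as described earlier in the thread, is that a very large limit multiplied by the get_range_slices() row batch size pushes the memory pressure onto Cassandra itself, so the limit should only be raised as far as the data actually requires.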