Just to be clear, it's also the case that if I have a Hadoop TaskTracker 
running on each node that Cassandra is running on, a map/reduce job will 
automatically handle data locality, right? I.e. each mapper will only read 
splits which live on the same box.
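
My (possibly wrong) mental model is that ColumnFamilyInputFormat builds its 
splits from the ring, so each split's locations are the nodes that own that 
token range. Something roughly like this (a sketch against the 0.7 Thrift API, 
not the real input format code; host and keyspace names are made up):

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.TokenRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class RingLocality
{
    public static void main(String[] args) throws Exception
    {
        TTransport transport = new TFramedTransport(new TSocket("cass-node-1", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));

        // Each TokenRange lists the replica endpoints for that slice of the ring.
        // Handing those endpoints to Hadoop as the split locations is what lets
        // the JobTracker prefer a TaskTracker running on one of those nodes.
        for (TokenRange tr : client.describe_ring("Keyspace"))
            System.out.println(tr.start_token + " .. " + tr.end_token + " -> " + tr.endpoints);

        transport.close();
    }
}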

-Jeffrey

-----Original Message-----
From: Jeffrey Wang [mailto:jw...@palantir.com] 
Sent: Friday, March 25, 2011 11:42 AM
To: user@cassandra.apache.org
Subject: RE: pig counting question

I don't think it's Pig running out of memory, but rather Cassandra itself (the 
data doesn't even make it to Pig). get_range_slices() is called with a row 
batch size of 4096, the default, and it's fetching all of the columns in each 
row. If I have 10K columns in each row, that's a huge request, and Cassandra 
runs into memory pressure trying to serve it.
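
To put a rough number on it: 4096 rows times ~10K columns is on the order of 40 
million columns materialized for a single call. This is the shape of the request 
I mean (a sketch against the 0.7 Thrift API, not the actual 
ColumnFamilyRecordReader code; host, keyspace, and column family names are 
placeholders):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class BigRangeSlice
{
    public static void main(String[] args) throws Exception
    {
        TTransport transport = new TFramedTransport(new TSocket("cass-node-1", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("Keyspace");

        // Effectively unbounded slice: every column of every row in the batch.
        SliceRange allColumns = new SliceRange(ByteBuffer.wrap(new byte[0]),
                                               ByteBuffer.wrap(new byte[0]),
                                               false, Integer.MAX_VALUE);
        SlicePredicate predicate = new SlicePredicate().setSlice_range(allColumns);

        // 4096 rows per batch, over the whole ring (empty start/end keys).
        KeyRange keys = new KeyRange(4096)
            .setStart_key(ByteBuffer.wrap(new byte[0]))
            .setEnd_key(ByteBuffer.wrap(new byte[0]));

        // With ~10K columns per row this is tens of millions of columns
        // materialized server-side for one response, which is where I think
        // the heap is going.
        List<KeySlice> batch = client.get_range_slices(new ColumnParent("ColumnFamily"),
                                                       predicate, keys, ConsistencyLevel.ONE);
        System.out.println("rows in this batch: " + batch.size());
        transport.close();
    }
}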

That's my understanding of it; if there's something I'm missing, please let me 
know.

-Jeffrey

-----Original Message-----
From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
Sent: Friday, March 25, 2011 11:06 AM
To: user@cassandra.apache.org
Subject: Re: pig counting question

One thing I wonder though - if the columns are what's increasing your heap size 
and eating up a lot of memory, and you're reading the data structure out as a 
bag of columns, why isn't Pig spilling to disk instead of growing in memory? 
The Pig model is that you can have huge bags that don't kill you on memory; 
they're just slower because they spill to disk. What schema do you impose when 
you load the data?

On Mar 24, 2011, at 3:57 PM, Jeffrey Wang wrote:

> It looks like this functionality is not in the 0.7.3 version of 
> CassandraStorage. I tried adding a constructor that takes the limit to the 
> class, but I ran into some Pig parsing errors, so I had to make the parameter 
> a string. How did you get around this for the version of CassandraStorage in 
> trunk? I'm running Pig 0.8.0.
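> 
> For reference, the workaround looks basically like this (a rough sketch of my 
> local patch, not what's in trunk or 0.7.3):
> 
> // Pig seems to want UDF constructor arguments as strings, so accept a string
> // and parse it; the int constructor is the one I actually wanted.
> public CassandraStorage(String limit)
> {
>     this(Integer.parseInt(limit));
> }
> 
> public CassandraStorage(int limit)
> {
>     this.limit = limit;
> }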
> 
> Also, when I bump the limit up very high (e.g. 1M columns), my Cassandra 
> starts eating up huge amounts of memory, maxing out my 16GB heap size. I 
> suspect this is because of the get_range_slices() call from 
> ColumnFamilyRecordReader. Are there plans to make this streaming/paged?
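> 
> (By "paged" I mean something along these lines - fetch a bounded number of 
> columns per call and restart just past the last column seen, rather than 
> pulling every column of every row in the batch at once. A sketch only, not 
> actual record reader code; it assumes a connected Cassandra.Client named 
> client, a row key rowKey, and a hypothetical handleColumn() that feeds the 
> column downstream.)
> 
> ByteBuffer start = ByteBuffer.wrap(new byte[0]);    // beginning of the row
> while (true)
> {
>     SliceRange page = new SliceRange(start, ByteBuffer.wrap(new byte[0]),
>                                      false, 1000);  // at most 1000 columns per call
>     List<ColumnOrSuperColumn> cols = client.get_slice(
>         rowKey, new ColumnParent("ColumnFamily"),
>         new SlicePredicate().setSlice_range(page), ConsistencyLevel.ONE);
>     if (cols.isEmpty())
>         break;
>     for (ColumnOrSuperColumn cosc : cols)
>         handleColumn(cosc.column);   // hypothetical: hand the column downstream
>     if (cols.size() < 1000)
>         break;                       // short page means we've hit the end of the row
>     // Next page starts at the last column returned; it comes back again since
>     // the start of a slice is inclusive, so real code would skip that duplicate.
>     start = cols.get(cols.size() - 1).column.name;
> }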
> 
> -Jeffrey
> 
> -----Original Message-----
> From: Jeremy Hanna [mailto:jeremy.hanna1...@gmail.com] 
> Sent: Thursday, March 24, 2011 11:34 AM
> To: user@cassandra.apache.org
> Subject: Re: pig counting question
> 
> The limit defaults to 1024, but you can set it when you use CassandraStorage 
> in Pig, like so:
> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage(4096);
> or whatever value you wish.
> 
> Give that a try and see if it gives you more of what you're looking for.
> 
> On Mar 24, 2011, at 1:16 PM, Jeffrey Wang wrote:
> 
>> Hey all,
>> 
>> I'm trying to run a very simple Pig script against my Cassandra cluster (5 
>> nodes, 0.7.3). I've gotten it all set up and working, but the script is 
>> giving me some strange results. Here is my script:
>> 
>> rows = LOAD 'cassandra://Keyspace/ColumnFamily' USING CassandraStorage();
>> rowct = FOREACH rows GENERATE $0, COUNT($1);
>> dump rowct;
>> 
>> If I understand Pig correctly, this should output (row name, column count) 
>> tuples, but I'm always seeing 1024 for the column count even though the rows 
>> have highly variable numbers of columns. Am I missing something? Thanks.
>> 
>> -Jeffrey
>> 
> 
