Re: prep for cassandra storage from pig

Jeremy Hanna Wed, 15 Jun 2011 12:26:42 -0700

Yeah, for completely dynamic column names, FromCassandraBag/ToCassandraBag 
don't handle that.  They do handle prefixed names though: link* will get a 
bag of all the columns that start with link.  But it sounds like you are 
doing what I would have to do if I got into a nested data conundrum.  Like I 
said, others may have better advice for getting the data the way you want it.
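
For reference, the target layout from the quoted message, one key with a flat
bag of (name, value) tuples, can be modeled in plain Python.  This is only a
sketch of the data shape, not Pig or pygmalion code; the helper name
to_key_and_bag and the counts are made up for illustration.

```python
from collections import defaultdict


def to_key_and_bag(rows):
    """Regroup (key, (name, value)) rows into (key, [(name, value), ...]).

    Hypothetical helper: it models the (key, {tuple}) layout only and
    does not talk to Cassandra or Pig.
    """
    bags = defaultdict(list)
    for key, column in rows:
        # Append the (name, value) pair itself, without wrapping it again,
        # so the bag holds flat tuples rather than nested ones.
        bags[key].append(column)
    return sorted(bags.items())


# Placeholder rows in the (target, (day, count)) form; counts are invented.
rows = [
    ("targetA", ("day1", 3)),
    ("targetA", ("day2", 5)),
    ("targetB", ("day1", 2)),
]

shaped = to_key_and_bag(rows)
for key, bag in shaped:
    print(key, bag)
```

Each key ends up with a flat list of (name, value) pairs, which mirrors the
(key, {tuple}) layout asked about below, rather than the doubly nested
{((day, count))} form.  The column names (the days) are read from the data,
not fixed in the script.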


On Jun 15, 2011, at 2:08 PM, William Oberman wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to
> want the column names to be fixed at "compile time" (the script).
> 
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna 
> <jeremy.hanna1...@gmail.com> wrote:
> 
>> Hi Will,
>> 
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandra understands.
>> 
>> Others may know better how to massage the data into that form using just
>> pig, but if all else fails, you could write a udf to do that.
>> 
>> Jeremy
>> 
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>> 
>>> I think I'm stuck on typing issues trying to store data in cassandra.  To
>>> verify, cassandra wants (key, {tuples})
>>> 
>>> My pig script is fairly brief:
>>> raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
>>>     (key:chararray, columns:bag {column:tuple (name, value)});
>>> --columns == timeUUID -> JSON
>>> rows = FOREACH raw GENERATE key, FLATTEN(columns);
>>> alias_target_day = FOREACH rows {
>>>    --I wrote a specialized parser that does exactly what I need
>>>    observation_map = com.civicscience.pig.ParseObservation($2);
>>>    GENERATE $0 as alias, observation_map#'_fqt' as target,
>>>        observation_map#'_day' as day;
>>> };
>>> grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>>> X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
>>>     COUNT($1)) as day_count;
>>> 
>>> This gets me:
>>> (targetA, (day1, count))
>>> (targetA, (day2, count))
>>> (targetB, (day1, count))
>>> ....
>>> 
>>> But, cassandra wants the 2nd item to be a bag.  So, I tried:
>>> X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
>>>     COUNT($1))) as day_count;
>>> 
>>> But this results in:
>>> (targetA, {((day1, count))})
>>> (targetA, {((day2, count))})
>>> (targetB, {((day1, count))})
>>> It's hard to see, but the 2nd item now has a nested tuple as the first
>>> value, which is still bad.
>>> 
>>> How do I get (key, {tuple})???  I wasn't sure where to post this (pig or
>>> cassandra), so I'm posting to the pig list too.
>>> 
>>> will
>> 
>> 
> 
> 
> -- 
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
