I'll reply-all to keep the thread consistent (sorry!). Rather than stay stuck, I wrote a custom function: TupleToBagOfTuple. I'm still curious whether I could have avoided it with proper Pig scripting, though.

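
For what it's worth, one pure-Pig route that might avoid the UDF is to keep day and count as flat fields, group a second time by target, and project both fields out of the grouped bag. This is an untested sketch, not something I have run against CassandraStorage. It starts from the `grouping` relation in my script quoted below; the aliases `counts`, `by_target`, and `day_counts` are made up for illustration. It also changes the output shape: one row per target holding all of its (day, count) tuples in a single bag, rather than one row per (target, day).

```pig
-- Untested sketch: assumes the `grouping` relation from the quoted script.
-- Keep day and count as flat fields instead of packing them with TOTUPLE.
counts = FOREACH grouping GENERATE group.$0 AS target, group.$1 AS day, COUNT($1) AS day_count;

-- Regroup by target so each target's rows are collected into one bag.
by_target = GROUP counts BY target;

-- Project two fields out of the grouped bag; intended shape is
-- (target, {(day, day_count), (day, day_count), ...})
X = FOREACH by_target GENERATE group AS target, counts.(day, day_count) AS day_counts;
```

If that projection behaves as I expect, the second item is a bag of plain (name, value) tuples with no extra nesting, which is the (key, {tuples}) form I was after.
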
On Wed, Jun 15, 2011 at 3:08 PM, William Oberman <ober...@civicscience.com> wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to
> want the column names to be fixed at "compile time" (the script).
>
>
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
>
>> Hi Will,
>>
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandra understands.
>>
>> Others may know better how to massage the data into that form using just
>> pig, but if all else fails, you could write a udf to do that.
>>
>> Jeremy
>>
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>>
>> > I think I'm stuck on typing issues trying to store data in cassandra. To verify, cassandra wants (key, {tuples})
>> >
>> > My pig script is fairly brief:
>> > raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
>> > --columns == timeUUID -> JSON
>> > rows = FOREACH raw GENERATE key, FLATTEN(columns);
>> > alias_target_day = FOREACH rows {
>> >     --I wrote a specialized parser that does exactly what I need
>> >     observation_map = com.civicscience.pig.ParseObservation($2);
>> >     GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day;
>> > };
>> > grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>> > X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count;
>> >
>> > This gets me:
>> > (targetA, (day1, count))
>> > (targetA, (day2, count))
>> > (targetB, (day1, count))
>> > ....
>> >
>> > But, cassandra wants the 2nd item to be a bag. So, I tried:
>> > X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
>> >
>> > But this results in:
>> > (targetA, {((day1, count))})
>> > (targetA, {((day2, count))})
>> > (targetB, {((day1, count))})
>> > It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad.
>> >
>> > How do I get (key, {tuple})??? I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too.
>> >
>> > will
>>
>
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com
>

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com