Yeah, FromCassandraBag and ToCassandraBag don't handle completely dynamic column names. They do handle prefixed names, though: link* will get you a bag of all the columns that start with link. It sounds like you are already doing what I would have to do if I ran into a nested-data conundrum. Like I said, others may have better advice for getting the data into the shape you want.
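To make the prefix behaviour concrete, here is a rough Python model of the selection rule. It is only a sketch of the idea, not pygmalion's actual code: select_columns, the spec list, and the sample column names are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple, Union

Column = Tuple[str, str]  # (name, value), like one tuple in the columns bag


def select_columns(
    columns: List[Column], spec: List[str]
) -> Dict[str, Union[Optional[str], List[Column]]]:
    """Hypothetical model: pick columns by fixed name or by 'prefix*'."""
    result: Dict[str, Union[Optional[str], List[Column]]] = {}
    for name in spec:
        if name.endswith("*"):
            # Prefix spec: collect every column whose name starts with the prefix.
            prefix = name[:-1]
            result[name] = [(n, v) for n, v in columns if n.startswith(prefix)]
        else:
            # Fixed spec: take the value of the exactly matching column, if any.
            result[name] = next((v for n, v in columns if n == name), None)
    return result


# One row with two prefixed columns and one date-named (dynamic) column.
row = [("link_a", "a"), ("link_b", "b"), ("2011-06-15", "42")]
selected = select_columns(row, ["link*", "title"])
```

A date-named column such as 2011-06-15 matches neither a fixed name nor the link prefix, so a spec written ahead of time in the script never picks it up. That is the case that has to be handled separately.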
On Jun 15, 2011, at 2:08 PM, William Oberman wrote:

> My problem is the column names are dynamic (a date), and pygmalion seems to
> want the column names to be fixed at "compile time" (the script).
>
> On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna
> <jeremy.hanna1...@gmail.com> wrote:
>
>> Hi Will,
>>
>> That's partly why I like to use FromCassandraBag and ToCassandraBag from
>> pygmalion - it does the work for you to get it back into a form that
>> cassandra understands.
>>
>> Others may know better how to massage the data into that form using just
>> pig, but if all else fails, you could write a udf to do that.
>>
>> Jeremy
>>
>> On Jun 15, 2011, at 1:17 PM, William Oberman wrote:
>>
>>> I think I'm stuck on typing issues trying to store data in cassandra. To
>>> verify, cassandra wants (key, {tuples})
>>>
>>> My pig script is fairly brief:
>>> raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
>>> --colums == timeUUID -> JSON
>>> rows = FOREACH raw GENERATE key, FLATTEN(columns);
>>> alias_target_day = FOREACH rows {
>>>     --I wrote a specialized parser that does exactly what I need
>>>     observation_map = com.civicscience.pig.ParseObservation($2);
>>>     GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day;
>>> };
>>> grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
>>> X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count;
>>>
>>> This gets me:
>>> (targetA, (day1, count))
>>> (targetA, (day2, count))
>>> (targetB, (day1, count))
>>> ....
>>>
>>> But, cassandra wants the 2nd item to be a bag. So, I tried:
>>> X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
>>>
>>> But this results in:
>>> (targetA, {((day1, count))})
>>> (targetA, {((day2, count))})
>>> (targetB, {((day1, count))})
>>> It's hard to see, but the 2nd item now has a nested tuple as the first
>>> value, which is still bad.
>>>
>>> How to I get (key, {tuple})??? I wasn't sure where to post this (pig or
>>> cassandra), so I'm posting to the pig list too.
>>>
>>> will
>
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E) ober...@civicscience.com