I've dug deeper into this, since the workaround below got my script running but still left me at sea when dealing with the actual data. It looks like there may be a mismatch between the schema reported by CassandraStorage.java and the data that's actually returned. Here's an example:

rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
DESCRIBE rows;
rows: {key: chararray,columns: {(name: chararray,value: bytearray,photo_owner: chararray,value_photo_owner: bytearray,pid: chararray,value_pid: bytearray,matched_string: chararray,value_matched_string: bytearray,src_big: chararray,value_src_big: bytearray,time: chararray,value_time: bytearray,vote_type: chararray,value_vote_type: bytearray,voter: chararray,value_voter: bytearray)}}
DUMP rows;
(691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})

getSchema() is reporting the columns as an inner bag of tuples, each of which contains 16 values. In fact, getNext() seems to return an inner bag containing 7 tuples, each of which contains two values.

I'll file a JIRA and do my best to create a patch to do the right thing, but I wanted to sanity check what I'm seeing here since I'm a Pig newbie. Am I missing something?

It appears that things got out of sync with this change:
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083

While I'm in there, is there a reason for using an inner bag to hold the columns? Do we ever have more than one set of the same columns for a given key in Cassandra? I'm thinking of tweaking things to look like this for my example, since it would make my processing code easier:

rows: {cassandra_key: chararray, photo_owner: chararray, pid: chararray, matched_string: chararray, src_big: chararray, time: chararray,vote_type: chararray,voter: chararray}

The main downside I can see is the possible clash between the cassandra_key value and a column with the same name.

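To make the flat layout I'm proposing a bit more concrete, here's a rough sketch in plain Java. It deliberately uses ordinary collections rather than Pig's Tuple/DataBag classes, and the class and method names (FlattenColumns, flatten) are made up for illustration; none of this is in CassandraStorage today. It also treats every value as a String, which is a simplification. The idea is just to show the (name, value) pairs from getNext() being folded into one record, with the cassandra_key clash detected up front:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: plain collections stand in for Pig tuples and bags.
class FlattenColumns {
    // Field name for the row key, taken from the proposed schema above.
    static final String KEY_FIELD = "cassandra_key";

    // Folds a row key plus its (name, value) pairs into a single flat record.
    // Rejects any column whose name repeats or collides with KEY_FIELD.
    static Map<String, String> flatten(String rowKey, List<String[]> columns) {
        Map<String, String> flat = new LinkedHashMap<>(); // keeps column order
        flat.put(KEY_FIELD, rowKey);
        for (String[] pair : columns) {
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected a (name, value) pair");
            }
            if (flat.containsKey(pair[0])) {
                throw new IllegalArgumentException("Duplicate field alias: " + pair[0]);
            }
            flat.put(pair[0], pair[1]);
        }
        return flat;
    }

    public static void main(String[] args) {
        // A few of the columns from the DUMP output above.
        List<String[]> columns = Arrays.asList(
            new String[] {"photo_owner", "1596090180"},
            new String[] {"vote_type", "album_dislike"},
            new String[] {"voter", "691831038"});
        System.out.println(flatten("691831038_1317937188.48955", columns));
    }
}
```

Throwing on a clash is only one option; renaming the offending column would also work, but failing loudly seemed safer for a first pass.
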
cheers,
Pete

On Tue, Oct 11, 2011 at 11:59 PM, Pete Warden <p...@jetpac.com> wrote:

> For posterity, I ended up hacking around this by renaming the repeated
> 'value' alias in CassandraStorage and rebuilding it. Here's the patch:
>
> --- src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java.original 2011-10-11 23:42:19.000000000 -0700
> +++ src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java 2011-10-11 23:44:26.000000000 -0700
> @@ -357,7 +357,7 @@
>              validator = validators.get(cdef.getName());
>              if (validator == null)
>                  validator = marshallers.get(1);
> -            valSchema.setName("value");
> +            valSchema.setName("value_"+new String(cdef.getName()));
>              valSchema.setType(getPigType(validator));
>              tupleFields.add(valSchema);
>          }
>
> I'm not suggesting this is a correct fix, but it does allow me to move
> forward. Another suggestion was to try Pig 0.8.1 instead, but I ran into
> https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AWhatshallIdoifIsaw%22FailedtocreateDataStorage%22%3F
>
> On Tue, Oct 11, 2011 at 10:34 PM, Pete Warden <p...@jetpac.com> wrote:
>
>> Thanks for all your help Brandon and Jeremy, that got me to the point
>> where I could load data.
>>
>> I'm now hitting a new issue that seems like it could possibly be related.
>> When I try to access the data like this:
>>
>> grunt> rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING CassandraStorage();
>> grunt> parts = FOREACH rows GENERATE key, FromCassandraBag('time_last_ranked', columns);
>>
>> I see the following error:
>>
>> 2011-10-11 22:23:43,877 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108:
>> <line 4, column 71> Duplicate schema alias: value in "columns"
>>
>> At first I thought it might be related to the Pygmalion helper functions,
>> so I tried to strip it back to basics using this second line instead:
>>
>> parts = FOREACH rows GENERATE key,$1;
>>
>> and I still get an identical error.
>>
>> Any further thoughts on how I can dig into this?
>>
>> Thanks again,
>> Pete
>>
>> On Tue, Oct 11, 2011 at 3:37 PM, Brandon Williams <dri...@gmail.com> wrote:
>>
>>> On Tue, Oct 11, 2011 at 4:24 PM, Pete Warden <p...@petewarden.com> wrote:
>>> > I'm trying to run the most basic example for pig_cassandra, counting the
>>> > number of rows in a column family, and I'm hitting the following error:
>>> > 2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> > ERROR 1031: Incompatable field schema: left is
>>> > "columns:bag{:tuple(name:bytearray,value:bytearray)}", right is
>>> > "columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"
>>>
>>> After https://issues.apache.org/jira/browse/CASSANDRA-2777 you need to
>>> remove the 'AS' and everything after it; your schema definition
>>> conflicts with what was inferred.
>>>
>>> -Brandon
>>>
>>
>>
>