I've dug deeper into this, since the workaround below got my script running
but still left me at sea when dealing with the actual data. There appears to
be a mismatch between the schema reported by CassandraStorage.java and the
data it actually returns. Here's an example:

rows = LOAD 'cassandra://Frap/PhotoVotes' USING CassandraStorage();
DESCRIBE rows;
rows: {key: chararray,columns: {(name: chararray,value: bytearray,
photo_owner: chararray,value_photo_owner: bytearray,
pid: chararray,value_pid: bytearray,
matched_string: chararray,value_matched_string: bytearray,
src_big: chararray,value_src_big: bytearray,
time: chararray,value_time: bytearray,
vote_type: chararray,value_vote_type: bytearray,
voter: chararray,value_voter: bytearray)}}
DUMP rows;
(691831038_1317937188.48955,{(photo_owner,1596090180),(pid,6855155124568798560),(matched_string,),(src_big,),(time,Thu Oct 06 14:39:48 -0700 2011),(vote_type,album_dislike),(voter,691831038)})

getSchema() is reporting the columns as an inner bag of tuples, each of
which contains 16 values. In fact, getNext() seems to return an inner bag
containing 7 tuples, each of which contains two values.
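
To make the mismatch concrete, here is a minimal Java sketch that models the two shapes with plain collections instead of Pig's own tuple and bag types. The class name SchemaShapeCheck and the helper mismatchedTuples are hypothetical and are not part of CassandraStorage; the field names and values are copied from the DESCRIBE and DUMP output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of CassandraStorage.
class SchemaShapeCheck {
    // Field names of the single inner tuple reported by DESCRIBE.
    static final List<String> DECLARED_FIELDS = Arrays.asList(
        "name", "value",
        "photo_owner", "value_photo_owner",
        "pid", "value_pid",
        "matched_string", "value_matched_string",
        "src_big", "value_src_big",
        "time", "value_time",
        "vote_type", "value_vote_type",
        "voter", "value_voter");

    // The inner bag printed by DUMP: one (name, value) tuple per column.
    static final List<List<String>> DUMPED_BAG = Arrays.asList(
        Arrays.asList("photo_owner", "1596090180"),
        Arrays.asList("pid", "6855155124568798560"),
        Arrays.asList("matched_string", ""),
        Arrays.asList("src_big", ""),
        Arrays.asList("time", "Thu Oct 06 14:39:48 -0700 2011"),
        Arrays.asList("vote_type", "album_dislike"),
        Arrays.asList("voter", "691831038"));

    // Returns the index of every tuple whose field count differs from the
    // arity the schema declares.
    static List<Integer> mismatchedTuples(int declaredArity, List<List<String>> bag) {
        List<Integer> mismatched = new ArrayList<>();
        for (int i = 0; i < bag.size(); i++) {
            if (bag.get(i).size() != declaredArity) {
                mismatched.add(i);
            }
        }
        return mismatched;
    }
}
```

Calling mismatchedTuples(DECLARED_FIELDS.size(), DUMPED_BAG) flags every tuple in the bag, while passing 2 as the arity flags none, which is consistent with getNext() returning (name, value) pairs.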

I'll file a JIRA and do my best to create a patch to do the right thing, but
I wanted to sanity check what I'm seeing here since I'm a Pig newbie. Am I
missing something? It appears that things got out of sync with this change:
http://svn.apache.org/viewvc/cassandra/branches/cassandra-0.8/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java?r1=1177083&r2=1177082&pathrev=1177083

While I'm in there, is there a reason for using an inner bag to hold the
columns? Do we ever have more than one set of the same columns for a given
key in Cassandra? I'm thinking of tweaking things to look like this for my
example, since it would make my processing code easier:
rows: {cassandra_key: chararray, photo_owner: chararray, pid: chararray,
matched_string: chararray, src_big: chararray, time: chararray,
vote_type: chararray, voter: chararray}
The main downside I can see is the possible clash between the cassandra_key
value and a column with the same name.
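
As a rough illustration of that flattened layout, the following sketch turns a key plus its (name, value) column pairs into one ordered record, and rejects a column named cassandra_key, which is the clash mentioned above. It uses plain Java maps rather than Pig types, and FlattenRowSketch, KEY_FIELD, and flatten are hypothetical names, not existing CassandraStorage code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; not existing CassandraStorage code.
class FlattenRowSketch {
    // Assumed name for the field that carries the row key.
    static final String KEY_FIELD = "cassandra_key";

    // Builds one flat record: the key first, then each column in the order given.
    // Each element of columns is a two-element array: {name, value}.
    static Map<String, String> flatten(String key, List<String[]> columns) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put(KEY_FIELD, key);
        for (String[] column : columns) {
            String name = column[0];
            String value = column[1];
            // Rejects a column called cassandra_key as well as a repeated column name.
            if (row.containsKey(name)) {
                throw new IllegalArgumentException("Field name clash: " + name);
            }
            row.put(name, value);
        }
        return row;
    }
}
```

Failing loudly on a clash is only one option; prefixing the column name would be another, at the cost of a less predictable schema.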

cheers,
           Pete

On Tue, Oct 11, 2011 at 11:59 PM, Pete Warden <p...@jetpac.com> wrote:

> For posterity, I ended up hacking around this by renaming the repeated
> 'value' alias in CassandraStorage and rebuilding it. Here's the patch:
>
> --- src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java.original 2011-10-11 23:42:19.000000000 -0700
> +++ src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java 2011-10-11 23:44:26.000000000 -0700
> @@ -357,7 +357,7 @@
>              validator = validators.get(cdef.getName());
>              if (validator == null)
>                  validator = marshallers.get(1);
> -            valSchema.setName("value");
> +            valSchema.setName("value_"+new String(cdef.getName()));
>              valSchema.setType(getPigType(validator));
>              tupleFields.add(valSchema);
>          }
>
> I'm not suggesting this is a correct fix, but it does allow me to move
> forward. Another suggestion was to try Pig 0.8.1 instead, but I ran into
> https://cwiki.apache.org/confluence/display/PIG/FAQ#FAQ-Q%3AWhatshallIdoifIsaw%22FailedtocreateDataStorage%22%3F
>
> On Tue, Oct 11, 2011 at 10:34 PM, Pete Warden <p...@jetpac.com> wrote:
>
>> Thanks for all your help Brandon and Jeremy, that got me to the point
>> where I could load data.
>>
>> I'm now hitting a new issue that seems like it could possibly be related.
>> When I try to access the data like this:
>>
>> grunt> rows = LOAD 'cassandra://Frap/FriendsAlreadyRanked' USING CassandraStorage();
>> grunt> parts = FOREACH rows GENERATE key, FromCassandraBag('time_last_ranked', columns);
>>
>> I see the following error:
>>
>> 2011-10-11 22:23:43,877 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1108:
>> <line 4, column 71> Duplicate schema alias: value in "columns"
>>
>> At first I thought it might be related to the Pygmalion helper functions,
>> so I tried to strip it back to basics using this second line instead:
>>
>> parts = FOREACH rows GENERATE key,$1;
>>
>> and I still get an identical error.
>>
>> Any further thoughts on how I can dig into this?
>>
>> Thanks again,
>>                     Pete
>>
>> On Tue, Oct 11, 2011 at 3:37 PM, Brandon Williams <dri...@gmail.com> wrote:
>>
>>> On Tue, Oct 11, 2011 at 4:24 PM, Pete Warden <p...@petewarden.com> wrote:
>>> > I'm trying to run the most basic example for pig_cassandra, counting the
>>> > number of rows in a column family, and I'm hitting the following error:
>>> > 2011-10-11 14:13:32,321 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>> > ERROR 1031: Incompatable field schema: left is
>>> > "columns:bag{:tuple(name:bytearray,value:bytearray)}", right is
>>> > "columns:bag{:tuple(name:chararray,value:bytearray,time_last_ranked:chararray,value:bytearray)}"
>>>
>>> After https://issues.apache.org/jira/browse/CASSANDRA-2777 you need to
>>> remove the 'AS' and everything after it; your schema definition
>>> conflicts with what was inferred.
>>>
>>> -Brandon
>>>
>>
>>
>
