Re: CqlStorage creates wrong schema for Pig

Chad Johnston Tue, 03 Sep 2013 09:04:54 -0700

You're trying to use FromCqlColumn on a tuple that has been flattened. The
schema still thinks it's {title: chararray}, but the flattened tuple is now
two values. I don't know how to retrieve the data values in this case.


Your code will work correctly if you do this:
values3 = FOREACH rows GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;

(Use FromCqlColumn on the original data, not the flattened data.)
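
As an illustration only, here is a small Python toy model of the mismatch (it is not Pig and not the UDF's real source). Plain Python tuples stand in for Pig tuples, and the names from_cql_column and flatten are hypothetical stand-ins for the UDF and for FLATTEN. It assumes, as the dump output in this thread suggests, that each loaded field is really a (name, value) pair.

```python
# Toy model only: plain Python tuples stand in for Pig tuples.

def from_cql_column(field):
    """Hypothetical stand-in for the UDF: take a (name, value) pair, return the value."""
    if not isinstance(field, tuple):
        # Stand-in for the ClassCastException seen in the quoted stack trace.
        raise TypeError(f"{type(field).__name__} cannot be cast to tuple")
    _name, value = field
    return value


def flatten(field):
    """Hypothetical stand-in for FLATTEN: one (name, value) pair becomes two scalars."""
    return list(field)


# One loaded row, shaped like the dump output: every field is a (name, value) pair.
row = {"id": ("id", "5"), "age": ("age", 30), "title": ("title", "QA")}

# Applying the function to the original field works.
print(from_cql_column(row["title"]))

# After flattening, the first "title" field is just the string 'title'.
flattened = flatten(row["title"])
try:
    from_cql_column(flattened[0])
except TypeError as err:
    print("fails:", err)
```

In this model the first call returns the value, while the call on the flattened data fails because it receives a bare string instead of a pair.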

Chad


On Mon, Sep 2, 2013 at 8:45 AM, Miguel Angel Martin junquera <
mianmarjun.mailingl...@gmail.com> wrote:

> Hi
>
>
> 1.
>
> Maybe like this?
>
> -- Register the UDF
> REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar;
>
> -- FromCqlColumn will convert chararray, int, long, float, double
> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
>
> -- Load data as normal
> data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();
>
> -- Use the UDF
> data = FOREACH data_raw GENERATE
>     FromCqlColumn(isbn) AS ISBN,
>     FromCqlColumn(bookauthor) AS BookAuthor,
>     FromCqlColumn(booktitle) AS BookTitle,
>     FromCqlColumn(publisher) AS Publisher,
>     FromCqlColumn(yearofpublication) AS YearOfPublication;
>
>
>
>
>
> and 2.:
>
> With the following data, using Cassandra 1.2.8, Pig 0.11.11 and CQL3:
>
> CREATE KEYSPACE keyspace1
>   WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
>   AND durable_writes = true;
>
> use keyspace1;
>
>   CREATE TABLE test (
>     id text PRIMARY KEY,
>     title text,
>     age int
>   )  WITH COMPACT STORAGE;
>
>   insert into test (id, title, age) values('1', 'child', 21);
>   insert into test (id, title, age) values('2', 'support', 21);
>   insert into test (id, title, age) values('3', 'manager', 31);
>   insert into test (id, title, age) values('4', 'QA', 41);
>   insert into test (id, title, age) values('5', 'QA', 30);
>   insert into test (id, title, age) values('6', 'QA', 30);
>
>
>
>
>
> and script:
>
> register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
> DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
> dump rows;
> ILLUSTRATE rows;
> describe rows;
> A = FOREACH rows GENERATE FLATTEN(title);
> dump A;
> values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
> dump values3;
> describe values3;
>
>
> --
>
>
>
> I have this error:
>
>
>
>
> ....
>
> -------------------------------------------------------------
> | rows     | id:chararray   | age:int   | title:chararray   |
> -------------------------------------------------------------
> |          | (id, 5)        | (age, 30) | (title, QA)       |
> -------------------------------------------------------------
>
> rows: {id: chararray,age: int,title: chararray}
>
>
> ...
>
> (title,QA)
> (title,QA)
> ..
> 2013-09-02 16:40:52,454 [Thread-11] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
> java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
>     at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
>     at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-09-02 16:40:52,832 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003
>
>
>
> 8-|
>
> Regards
>
> ...
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/9/2 Miguel Angel Martin junquera <mianmarjun.mailingl...@gmail.com>
>
>> hi all:
>>
>> More info :
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-5941
>>
>>
>>
>> I tried this (building Cassandra 1.2.9), but it does not work for me:
>>
>> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
>> cd cassandra
>> git checkout cassandra-1.2
>> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
>> ant
>>
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/9/2 Miguel Angel Martin junquera <mianmarjun.mailingl...@gmail.com>
>>
>>> Good job!
>>>
>>> I had been testing with a UDF that handles only the string schema type;
>>> this is better and more elaborate work.
>>>
>>> Regards
>>>
>>>
>>> Miguel Angel Martín Junquera
>>> Analyst Engineer.
>>> miguelangel.mar...@brainsins.com
>>>
>>>
>>>
>>> 2013/8/31 Chad Johnston <cjohns...@megatome.com>
>>>
>>>> I threw together a quick UDF to work around this issue. It just
>>>> extracts the value portion of the tuple while taking advantage of the
>>>> CqlStorage generated schema to keep the type correct.
>>>>
>>>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>>>>
>>>> I'll see if I can find more useful information and open a defect, since
>>>> that's what this seems to be.
>>>>
>>>> Chad
>>>>
>>>>
>>>> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
>>>> mianmarjun.mailingl...@gmail.com> wrote:
>>>>
>>>>> I tried this:
>>>>>
>>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>>>> dump rows;
>>>>> ILLUSTRATE rows;
>>>>> describe rows;
>>>>>
>>>>> values2 = FOREACH rows GENERATE TOTUPLE(id) as (mycolumn:tuple(name,value));
>>>>> dump values2;
>>>>> describe values2;
>>>>>
>>>>> But I get these results:
>>>>>
>>>>>
>>>>>
>>>>> -------------------------------------------------------------
>>>>> | rows     | id:chararray   | age:int   | title:chararray   |
>>>>> -------------------------------------------------------------
>>>>> |          | (id, 6)        | (age, 30) | (title, QA)       |
>>>>> -------------------------------------------------------------
>>>>>
>>>>> rows: {id: chararray,age: int,title: chararray}
>>>>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt
>>>>> - ERROR 1031: Incompatable field schema: left is
>>>>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>>>>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> or
>>>>>
>>>>>
>>>>>
>>>>> ....
>>>>>
>>>>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>>>>> dump values2;
>>>>> describe values2;
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> and  the results are:
>>>>>
>>>>>
>>>>> ...
>>>>> (((id,6)))
>>>>> (((id,5)))
>>>>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>>>>
>>>>>
>>>>>
>>>>> Aggg!!!!!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Miguel Angel Martín Junquera
>>>>> Analyst Engineer.
>>>>> miguelangel.mar...@brainsins.com
>>>>>
>>>>>
>>>>>
>>>>> 2013/8/26 Miguel Angel Martin junquera <
>>>>> mianmarjun.mailingl...@gmail.com>
>>>>>
>>>>>> Hi Chad,
>>>>>>
>>>>>> I have this issue as well.
>>>>>>
>>>>>> I sent a mail to the pig-user list, but I still cannot resolve this, and I
>>>>>> cannot access the column values.
>>>>>> In that mail I describe some things I tried without success, along with
>>>>>> more information about this issue.
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>>>>>
>>>>>>
>>>>>>
>>>>>> I hope someone replies with a comment, idea, or solution for this
>>>>>> issue or bug.
>>>>>>
>>>>>>
>>>>>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I
>>>>>> have not configured an environment to debug and trace this issue.
>>>>>>
>>>>>> I only found some comments like the following, which I do not fully understand.
>>>>>>
>>>>>>
>>>>>> /**
>>>>>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>>>>>  *
>>>>>>  * A row from a standard CF will be returned as nested tuples:
>>>>>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>>>>>  */
>>>>>>
>>>>>>
>>>>>> If you find an idea or a solution, please post it.
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2013/8/23 Chad Johnston <cjohns...@megatome.com>
>>>>>>
>>>>>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>>>>>
>>>>>>> I'm loading some simple data from Cassandra into Pig using
>>>>>>> CqlStorage. The CqlStorage loader defines a Pig schema based on the
>>>>>>> Cassandra schema, but it seems to be wrong.
>>>>>>>
>>>>>>> If I do:
>>>>>>>
>>>>>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>>>>>> DESCRIBE data;
>>>>>>>
>>>>>>> I get this:
>>>>>>>
>>>>>>> data: {isbn: chararray,bookauthor: chararray,booktitle:
>>>>>>> chararray,publisher: chararray,yearofpublication: int}
>>>>>>>
>>>>>>> However, if I DUMP data, I get results like these:
>>>>>>>
>>>>>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in
>>>>>>> the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>>>>>
>>>>>>> Clearly the results from Cassandra are key/value pairs, as would be
>>>>>>> expected. I don't know why the schema generated by CqlStorage() would
>>>>>>> be so different.
>>>>>>>
>>>>>>> This is really causing me problems trying to access the column
>>>>>>> values. I tried a naive approach of FLATTENing each tuple, then trying
>>>>>>> to access the values that way:
>>>>>>>
>>>>>>> flattened = FOREACH data GENERATE
>>>>>>>   FLATTEN(isbn),
>>>>>>>   FLATTEN(booktitle),
>>>>>>>   ...
>>>>>>> values = FOREACH flattened GENERATE
>>>>>>>   $1 AS ISBN,
>>>>>>>   $3 AS BookTitle,
>>>>>>>   ...
>>>>>>>
>>>>>>> As soon as I try to access field $5, Pig complains about the index
>>>>>>> being out of bounds.
>>>>>>>
>>>>>>> Is there a way to solve the schema/reality mismatch? Am I doing
>>>>>>> something wrong, or have I stumbled across a defect?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Chad
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
