Re: CqlStorage creates wrong schema for Pig

Miguel Angel Martin junquera Mon, 02 Sep 2013 07:45:58 -0700

Hi,

1. Maybe something like this?

-- Register the UDF
REGISTER /path/to/cqlstorageudf-1.0-SNAPSHOT.jar;

-- FromCqlColumn will convert chararray, int, long, float, double
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();

-- Load data as normal
data_raw = LOAD 'cql://bookcrossing/books' USING CqlStorage();

-- Use the UDF
data = FOREACH data_raw GENERATE
    FromCqlColumn(isbn) AS ISBN,
    FromCqlColumn(bookauthor) AS BookAuthor,
    FromCqlColumn(booktitle) AS BookTitle,
    FromCqlColumn(publisher) AS Publisher,
    FromCqlColumn(yearofpublication) AS YearOfPublication;





2. With this data in Cassandra 1.2.8 (CQL3) and Pig 0.11.1:

CREATE KEYSPACE keyspace1
  WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }
  AND durable_writes = true;

USE keyspace1;

CREATE TABLE test (
  id text PRIMARY KEY,
  title text,
  age int
) WITH COMPACT STORAGE;

INSERT INTO test (id, title, age) VALUES ('1', 'child', 21);
INSERT INTO test (id, title, age) VALUES ('2', 'support', 21);
INSERT INTO test (id, title, age) VALUES ('3', 'manager', 31);
INSERT INTO test (id, title, age) VALUES ('4', 'QA', 41);
INSERT INTO test (id, title, age) VALUES ('5', 'QA', 30);
INSERT INTO test (id, title, age) VALUES ('6', 'QA', 30);





and this script:

register './libs/cqlstorageudf-1.0-SNAPSHOT.jar';
DEFINE FromCqlColumn com.megatome.pig.piggybank.tuple.FromCqlColumn();
rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
dump rows;
ILLUSTRATE rows;
describe rows;
A = FOREACH rows GENERATE FLATTEN(title);
dump A;
values3 = FOREACH A GENERATE FromCqlColumn(title) AS title;
dump values3;
describe values3;


I get this error:




....

-------------------------------------------------------------
| rows     | id:chararray   | age:int   | title:chararray   |
-------------------------------------------------------------
|          | (id, 5)        | (age, 30) | (title, QA)       |
-------------------------------------------------------------

rows: {id: chararray,age: int,title: chararray}


...

(title,QA)
(title,QA)
..
2013-09-02 16:40:52,454 [Thread-11] WARN  org.apache.hadoop.mapred.LocalJobRunner - job_local_0003
java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
    at com.megatome.pig.piggybank.tuple.ColumnBase.exec(ColumnBase.java:32)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:337)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:434)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:340)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
2013-09-02 16:40:52,832 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0003
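
One possible reading of the trace, offered as a sketch and not a confirmed diagnosis: the dump of A shows that FLATTEN(title) has already split the (title, QA) pair into two plain fields. FromCqlColumn(title) in values3 would then receive the string "title" instead of a tuple, which matches the ClassCastException. The Python below only models the data shapes. from_cql_column and flatten are hypothetical stand-ins, not the UDF's or Pig's actual source.

```python
# Model of one row as CqlStorage returns it here: every column is a
# (name, value) pair, although the declared schema says chararray/int.
row = (("id", "5"), ("age", 30), ("title", "QA"))


def from_cql_column(field):
    """Hypothetical stand-in for the UDF: return the value of a (name, value) pair."""
    if not isinstance(field, tuple):
        # Mirrors the failed cast to Tuple reported in the stack trace.
        raise TypeError(f"{type(field).__name__} is not a (name, value) tuple")
    _name, value = field
    return value


def flatten(field):
    """Model of FLATTEN on a tuple field: its items become top-level fields."""
    return tuple(field)


# Applying the stand-in directly to the loaded columns works.
values = [from_cql_column(col) for col in row]

# Flattening first leaves plain strings, so the stand-in rejects them.
flattened = flatten(row[2])  # ("title", "QA")
try:
    from_cql_column(flattened[0])
    failed = False
except TypeError:
    failed = True

print(values, flattened, failed)
```

If this reading is right, calling FromCqlColumn(title) directly on rows, as in the example under point 1, and skipping the FLATTEN step may avoid the exception.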



8-|

Regards

...


Miguel Angel Martín Junquera
Analyst Engineer.
miguelangel.mar...@brainsins.com



2013/9/2 Miguel Angel Martin junquera <mianmarjun.mailingl...@gmail.com>

> Hi all,
>
> More info:
>
> https://issues.apache.org/jira/browse/CASSANDRA-5941
>
> I tried this (building Cassandra 1.2.9 from source), but it does not work for me:
>
> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
> cd cassandra
> git checkout cassandra-1.2
> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
> ant
>
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/9/2 Miguel Angel Martin junquera <mianmarjun.mailingl...@gmail.com>
>
>> Nice job!!!
>>
>> I had been testing with a UDF that only handles the string schema type;
>> this is better and more elaborate work.
>>
>> Regards
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/8/31 Chad Johnston <cjohns...@megatome.com>
>>
>>> I threw together a quick UDF to work around this issue. It just extracts
>>> the value portion of the tuple while taking advantage of the CqlStorage
>>> generated schema to keep the type correct.
>>>
>>> You can get it here: https://github.com/iamthechad/cqlstorage-udf
>>>
>>> I'll see if I can find more useful information and open a defect, since
>>> that's what this seems to be.
>>>
>>> Chad
>>>
>>>
>>> On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
>>> mianmarjun.mailingl...@gmail.com> wrote:
>>>
>>>> I tried this:
>>>>
>>>> rows = LOAD 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING CqlStorage();
>>>> dump rows;
>>>> ILLUSTRATE rows;
>>>> describe rows;
>>>>
>>>> values2 = FOREACH rows GENERATE TOTUPLE(id) as (mycolumn:tuple(name,value));
>>>> dump values2;
>>>> describe values2;
>>>>
>>>> But I get these results:
>>>>
>>>>
>>>>
>>>> -------------------------------------------------------------
>>>> | rows     | id:chararray   | age:int   | title:chararray   |
>>>> -------------------------------------------------------------
>>>> |          | (id, 6)        | (age, 30) | (title, QA)       |
>>>> -------------------------------------------------------------
>>>>
>>>> rows: {id: chararray,age: int,title: chararray}
>>>> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
>>>> ERROR 1031: Incompatable field schema: left is
>>>> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
>>>> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> or
>>>>
>>>> ....
>>>>
>>>> values2 = FOREACH rows GENERATE TOTUPLE(id);
>>>> dump values2;
>>>> describe values2;
>>>>
>>>> and the results are:
>>>>
>>>>
>>>> ...
>>>> (((id,6)))
>>>> (((id,5)))
>>>> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>>>>
>>>>
>>>>
>>>> Aggg!!!!!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Miguel Angel Martín Junquera
>>>> Analyst Engineer.
>>>> miguelangel.mar...@brainsins.com
>>>>
>>>>
>>>>
>>>> 2013/8/26 Miguel Angel Martin junquera <
>>>> mianmarjun.mailingl...@gmail.com>
>>>>
>>>>> Hi Chad,
>>>>>
>>>>> I have this issue too.
>>>>>
>>>>> I sent a mail to the pig-user list, but I still cannot resolve this, and I
>>>>> cannot access the column values.
>>>>> In that mail I describe some things I tried without success, along with
>>>>> information about this issue.
>>>>>
>>>>>
>>>>>
>>>>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>>>>
>>>>>
>>>>>
>>>>> I hope someone replies with a comment, idea, or solution for this
>>>>> issue or bug.
>>>>>
>>>>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 source, but I
>>>>> have not configured an environment to debug and trace this issue.
>>>>>
>>>>> I only found comments like this one, which I do not fully understand:
>>>>>
>>>>> /**
>>>>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>>>>  *
>>>>>  * A row from a standard CF will be returned as nested tuples:
>>>>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>>>>  */
>>>>>
>>>>> If you find an idea or solution, please post it.
>>>>>
>>>>> Thanks
>>>>>
>>>>> 2013/8/23 Chad Johnston <cjohns...@megatome.com>
>>>>>
>>>>>> (I'm using Cassandra 1.2.8 and Pig 0.11.1)
>>>>>>
>>>>>> I'm loading some simple data from Cassandra into Pig using
>>>>>> CqlStorage. The CqlStorage loader defines a Pig schema based on the
>>>>>> Cassandra schema, but it seems to be wrong.
>>>>>>
>>>>>> If I do:
>>>>>>
>>>>>> data = LOAD 'cql://bookdata/books' USING CqlStorage();
>>>>>> DESCRIBE data;
>>>>>>
>>>>>> I get this:
>>>>>>
>>>>>> data: {isbn: chararray,bookauthor: chararray,booktitle:
>>>>>> chararray,publisher: chararray,yearofpublication: int}
>>>>>>
>>>>>> However, if I DUMP data, I get results like these:
>>>>>>
>>>>>> ((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in
>>>>>> the Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))
>>>>>>
>>>>>> Clearly the results from Cassandra are key/value pairs, as would be
>>>>>> expected. I don't know why the schema generated by CqlStorage() would be
>>>>>> so different.
>>>>>>
>>>>>> This is really causing me problems trying to access the column
>>>>>> values. I tried a naive approach of FLATTENing each tuple, then trying to
>>>>>> access the values that way:
>>>>>>
>>>>>> flattened = FOREACH data GENERATE
>>>>>>   FLATTEN(isbn),
>>>>>>   FLATTEN(booktitle),
>>>>>>   ...
>>>>>> values = FOREACH flattened GENERATE
>>>>>>   $1 AS ISBN,
>>>>>>   $3 AS BookTitle,
>>>>>>   ...
>>>>>>
>>>>>> As soon as I try to access field $5, Pig complains about the index
>>>>>> being out of bounds.
>>>>>>
>>>>>> Is there a way to solve the schema/reality mismatch? Am I doing
>>>>>> something wrong, or have I stumbled across a defect?
>>>>>>
>>>>>> Thanks,
>>>>>> Chad
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
