CqlStorage creates wrong schema for Pig

2013-08-23 Thread Chad Johnston
(I'm using Cassandra 1.2.8 and Pig 0.11.1)

I'm loading some simple data from Cassandra into Pig using CqlStorage. The
CqlStorage loader defines a Pig schema based on the Cassandra schema, but
it seems to be wrong.

If I do:

data = LOAD 'cql://bookdata/books' USING CqlStorage();
DESCRIBE data;

I get this:

data: {isbn: chararray,bookauthor: chararray,booktitle:
chararray,publisher: chararray,yearofpublication: int}

However, if I DUMP data, I get results like these:

((isbn,0425093387),(bookauthor,Georgette Heyer),(booktitle,Death in the
Stocks),(publisher,Berkley Pub Group),(yearofpublication,1986))

Clearly the results from Cassandra are key/value pairs, as would be
expected. I don't know why the schema generated by CqlStorage() would be so
different.

This is really causing me problems trying to access the column values. I
tried a naive approach of FLATTENing each tuple, then trying to access the
values that way:

flattened = FOREACH data GENERATE
  FLATTEN(isbn),
  FLATTEN(booktitle),
  ...
values = FOREACH flattened GENERATE
  $1 AS ISBN,
  $3 AS BookTitle,
  ...

As soon as I try to access field $5, Pig complains about the index being
out of bounds.

Is there a way to solve the schema/reality mismatch? Am I doing something
wrong, or have I stumbled across a defect?

Thanks,
Chad


Re: CqlStorage creates wrong schema for Pig

2013-08-30 Thread Chad Johnston
I threw together a quick UDF to work around this issue. It extracts the
value portion of each (name, value) tuple and uses the CqlStorage-generated
schema to keep the types correct.

You can get it here: https://github.com/iamthechad/cqlstorage-udf
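
For anyone who only wants the idea, here is a minimal Python sketch of what
"extracting the value portion" amounts to. This is not the code from the
repository above (that is a Pig UDF); the row is modeled on the DUMP output
earlier in this thread, and the helper name extract_values is made up for
illustration.

```python
# Illustrative model only: CqlStorage rows arrive as (name, value) pairs,
# while the declared schema promises bare values.
row = (
    ("isbn", "0425093387"),
    ("bookauthor", "Georgette Heyer"),
    ("booktitle", "Death in the Stocks"),
    ("publisher", "Berkley Pub Group"),
    ("yearofpublication", 1986),
)


def extract_values(pairs):
    """Return only the value half of each (name, value) pair (hypothetical helper)."""
    return tuple(value for _name, value in pairs)


values = extract_values(row)
print(values)
```

After this step the row has the shape that the DESCRIBE output claims it
has: one bare value per column, in schema order.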

I'll see if I can find more useful information and open a defect, since
that's what this seems to be.

Chad


On Fri, Aug 30, 2013 at 2:02 AM, Miguel Angel Martin junquera <
mianmarjun.mailingl...@gmail.com> wrote:

> I tried this:
>
> rows = LOAD
> 'cql://keyspace1/test?page_size=1&split_size=4&where_clause=age%3D30' USING
> CqlStorage();
> dump rows;
> ILLUSTRATE rows;
> describe rows;
>
> values2 = FOREACH rows GENERATE TOTUPLE(id) AS
> (mycolumn:tuple(name,value));
> dump values2;
> describe values2;
>
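> But I get these results:
>
> -----------------------------------------------------
> | rows | id:chararray | age:int   | title:chararray |
> -----------------------------------------------------
> |      | (id, 6)      | (age, 30) | (title, QA)     |
> -----------------------------------------------------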
>
> rows: {id: chararray,age: int,title: chararray}
> 2013-08-30 09:54:37,831 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1031: Incompatable field schema: left is
> "tuple_0:tuple(mycolumn:tuple(name:bytearray,value:bytearray))", right is
> "org.apache.pig.builtin.totuple_id_1:tuple(id:chararray)"
>
> or
>
> values2 = FOREACH rows GENERATE TOTUPLE(id);
> dump values2;
> describe values2;
>
> and the results are:
>
> ...
> (((id,6)))
> (((id,5)))
> values2: {org.apache.pig.builtin.totuple_id_8: (id: chararray)}
>
> Aggg!
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/8/26 Miguel Angel Martin junquera 
>
>> Hi Chad,
>>
>> I have the same issue.
>>
>> I sent a mail to the Pig user list, but I still cannot resolve this, and I
>> cannot access the column values. In that mail I describe some things I
>> tried without success, along with more information about the issue:
>>
>> http://mail-archives.apache.org/mod_mbox/pig-user/201308.mbox/%3ccajeg_hq9s2po3_xytzx5xki4j1mao8q26jydg2wndy_kyiv...@mail.gmail.com%3E
>>
>> I hope someone replies with a comment, idea, or solution for this issue
>> or bug.
>>
>> I have reviewed the CqlStorage class in the Cassandra 1.2.8 code, but I do
>> not have an environment configured to debug and trace this issue.
>>
>> I only found comments like the following, which I do not fully understand:
>>
>> /**
>>  * A LoadStoreFunc for retrieving data from and storing data to Cassandra
>>  *
>>  * A row from a standard CF will be returned as nested tuples:
>>  * (((key1, value1), (key2, value2)), ((name1, val1), (name2, val2))).
>>  */
>>
>> If you find an idea or a solution, please post it.
>>
>> Thanks
>>


Re: CqlStorage creates wrong schema for Pig

2013-09-03 Thread Chad Johnston
ask.runNewMapper(MapTask.java:764)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>  at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
> 2013-09-02 16:40:52,832 [main] INFO
>  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
> - HadoopJobId: job_local_0003
>
>
>
> 8-|
>
> Regards
>
> ...
>
>
> Miguel Angel Martín Junquera
> Analyst Engineer.
> miguelangel.mar...@brainsins.com
>
>
>
> 2013/9/2 Miguel Angel Martin junquera 
>
>> Hi all,
>>
>> More info:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-5941
>>
>> I tried this (generating Cassandra 1.2.9), but it does not work for me:
>>
>> git clone http://git-wip-us.apache.org/repos/asf/cassandra.git
>> cd cassandra
>> git checkout cassandra-1.2
>> patch -p1 < 5867-bug-fix-filter-push-down-1.2-branch.txt
>> ant
>>
>>
>>
>> Miguel Angel Martín Junquera
>> Analyst Engineer.
>> miguelangel.mar...@brainsins.com
>>
>>
>>
>> 2013/9/2 Miguel Angel Martin junquera 
>>
>>> Good job!
>>>
>>> I was testing with a UDF that handles only the string schema type; this
>>> is better and more elaborate work.
>>>
>>> Regards
>>>
>>>
>>> Miguel Angel Martín Junquera
>>> Analyst Engineer.
>>> miguelangel.mar...@brainsins.com
>>>
>>>
>>>

JIRA 5867 Fix causes Pig troubles

2013-09-20 Thread Chad Johnston
I've checked out and built the 1.2.10-tentative branch, and I've noticed
that all of my CQL prepared statements are now broken.

Looking into the code, it looks like the "@" -> "=" and "#" -> "?"
translations were removed. I tried to replace these in one of my scripts
with "=" and "?", but there's other code that splits the query string on
"=", causing the prepared statement to be malformed.

If I look at the comments on
https://issues.apache.org/jira/browse/CASSANDRA-5867, where this change was
made, I see a single mention of URL encoding the CQL query. Is this the
expectation going forward? Was there a reason that the "#" and "@" mappings
were removed?

Thanks,
Chad
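
To make the failure mode concrete, here is a rough Python model of a parser
that splits each URL parameter on "=". It is only an assumption about the
general shape of the problem, not the actual Cassandra parsing code, and
naive_parse is a made-up name. It shows why a literal "=" inside the CQL
gets the statement cut short, while a URL-encoded query survives.

```python
from urllib.parse import quote_plus, unquote_plus

cql = "update table set some_value = ?"


def naive_parse(query_string):
    """Hypothetical parser: split parameters on '&', then each one on '='."""
    params = {}
    for pair in query_string.split("&"):
        parts = pair.split("=")
        # Only the first two pieces are used, so anything after a second
        # '=' is silently dropped.
        params[parts[0]] = unquote_plus(parts[1])
    return params


raw = naive_parse("output_query=" + cql)
encoded = naive_parse("output_query=" + quote_plus(cql))

print(repr(raw["output_query"]))      # truncated at the literal '='
print(repr(encoded["output_query"]))  # full statement recovered
```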


FYI - Pig CQL queries in Cassandra 1.2.10

2013-09-23 Thread Chad Johnston
I don't see this formally documented anywhere, so I thought I'd give a
heads-up to folks using Pig with Cassandra.

In pre-1.2.10 versions, storing data into Cassandra required a query like
this:
STORE data INTO 'cql://keyspace/table?output_query=update table set
some_value @ #' USING CqlStorage();

In 1.2.10, the expansion of the "@" and "#" characters has been removed,
and the query is now expected to be URL encoded. The above query would
become:
STORE data INTO
'cql://keyspace/table?output_query=update+table+set+some_value+%3D+%3F'
USING CqlStorage();
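
If you would rather not encode the query by hand, something like the
following Python sketch produces the encoded form. The keyspace and table
names are the placeholders from the example above; this is just one
convenient way to do the encoding, not anything Cassandra or Pig requires.

```python
from urllib.parse import quote_plus

# The prepared statement as you would normally write it in CQL.
cql = "update table set some_value = ?"

# quote_plus turns spaces into '+', '=' into %3D and '?' into %3F.
encoded = quote_plus(cql)
location = "cql://keyspace/table?output_query=" + encoded

print(location)
```

The printed string can be pasted into the STORE ... INTO clause as the
location.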

I've also found that you can't store Pig ints into Cassandra bigints any
more; you have to cast them to long in your Pig script.

I've opened JIRAs for these issues:
https://issues.apache.org/jira/browse/CASSANDRA-6073
https://issues.apache.org/jira/browse/CASSANDRA-6083

Thanks,
Chad


Re: Mystery PIG issue with 1.2.10

2013-09-25 Thread Chad Johnston
As an FYI, creating the table without the "WITH COMPACT STORAGE" and using
CqlStorage works just fine in 1.2.10.

I know that CqlStorage and AbstractCassandraStorage got changed for 1.2.10
- maybe there's a regression with the existing CassandraStorage?

Chad


On Wed, Sep 25, 2013 at 1:51 AM, Janne Jalkanen wrote:

> Heya!
>
> I am seeing something rather strange in the way Cass 1.2 + Pig seem to
> handle integer values.
>
> Setup: Cassandra 1.2.10, OSX 10.8, JDK 1.7u40, Pig 0.11.1.  Single node
> for testing this.
>
> First a table:
>
> > CREATE TABLE testc (
>   key text PRIMARY KEY,
>   ivalue int,
>   svalue text,
>   value bigint
> ) WITH COMPACT STORAGE;
>
> > insert into testc (key,ivalue,svalue,value) values ('foo',10,'bar',65);
> > select * from testc;
>
>  key | ivalue | svalue | value
> -----+--------+--------+-------
>  foo |     10 |    bar |    65
>
> For my Pig setup, I then use libraries from different C* versions to
> actually talk to my database (which stays on 1.2.10 all the time).
>
> Cassandra 1.0.12 (using cassandra_storage.jar):
>
> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
> > dump testc
> (foo,(svalue,bar),(ivalue,10),(value,65),{})
>
> Cassandra 1.1.10:
>
> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
> > dump testc
> (foo,(svalue,bar),(ivalue,10),(value,65),{})
>
> Cassandra 1.2.10:
>
> > testc = LOAD 'cassandra://keyspace/testc' USING CassandraStorage();
> > dump testc
> (foo,{(ivalue,
> ),(svalue,bar),(value,A)})
>
>
> To me it appears that ints and bigints are interpreted as ascii values in
> cass 1.2.10.  Did something change for CassandraStorage, is there a
> regression, or am I doing something wrong?  Quick perusal of the JIRA
> didn't reveal anything that I could directly pin on this.
>
> Note that using compact storage does not seem to affect the issue, though
> it obviously changes the resulting pig format.
>
> In addition, trying to use Pygmalion
>
> > tf = foreach testc generate key,
> flatten(FromCassandraBag('ivalue,svalue,value',columns)) as
> (ivalue:int,svalue:chararray,lvalue:long);
> > dump tf
>
> (foo,
> ,bar,A)
>
> So no help there. Explicitly casting the values to (long) or (int) just
> results in a ClassCastException.
>
> /Janne
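
The "interpreted as ascii" theory above is easy to check outside of Pig.
This Python sketch assumes only that Cassandra stores an int as 4 big-endian
bytes and a bigint as 8, and then misreads those bytes as text. It says
nothing about where in CassandraStorage the misreading happens.

```python
import struct

# Big-endian encodings of the two numeric columns from the test row.
ivalue_bytes = struct.pack(">i", 10)  # int: 4 bytes
value_bytes = struct.pack(">q", 65)   # bigint: 8 bytes

# Misread the raw bytes as ASCII text, dropping the leading NUL padding.
ivalue_as_text = ivalue_bytes.decode("ascii").lstrip("\x00")
value_as_text = value_bytes.decode("ascii").lstrip("\x00")

print(repr(ivalue_as_text))
print(repr(value_as_text))
```

Byte 10 is a line feed and byte 65 is "A", which lines up with the broken
line after "(ivalue," and the "(value,A)" in the 1.2.10 dump.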


Re: Mystery PIG issue with 1.2.10

2013-09-26 Thread Chad Johnston
The OP was using a Thrift table and CassandraStorage. I verified that the
problem does not exist with a CQL3 table and CqlStorage.

Chad


On Thu, Sep 26, 2013 at 7:05 PM, Aaron Morton wrote:

>
>> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear
>> answers appeared, I'm going to assume that this is a regression and file a
>> JIRA ticket on this.
>>
> Could you explain that a little more?
>
> You tried using the CqlStorage read with a CQL 3 table and it did not
> work?
>
> Cheers
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 27/09/2013, at 5:04 AM, Robert Coli  wrote:
>
> On Thu, Sep 26, 2013 at 1:00 AM, Janne Jalkanen 
> wrote:
>
>>
>> Unfortunately no, as I have a dozen legacy columnfamilies… Since no clear
>> answers appeared, I'm going to assume that this is a regression and file a
>> JIRA ticket on this.
>>
>
> Could you let the list know the ticket number, when you do? :)
>
> =Rob
>
>
>