About Cassandra-Hadoop(Pig) Integration issue

pradeep kumar Tue, 10 Dec 2013 17:49:52 -0800

Hello Cassandra users,

For one of our new big data BI projects, we are using Apache Cassandra
1.2.10 as our primary data store, with Hadoop for analytics.
For prototyping purposes we have one node each for Apache Cassandra and Hadoop.
Pig is our choice for processing data from and to C*.


Before coming to the actual problem: for one of the CFs, the primary key is
a composite one, e.g. ((A,B,C),D). Based on the documentation, the
combination (A,B,C) is our partition key.
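
To make the key layout concrete, here is a small toy model in Python (not Cassandra code). It assumes a table declared with PRIMARY KEY ((a, b, c), d); the row values and the helper name `partition_key` are made up for illustration. Rows that share (a, b, c) land in the same partition, and d orders the rows inside it.

```python
from collections import defaultdict

# Hypothetical rows for a table with PRIMARY KEY ((a, b, c), d).
rows = [
    {"a": "x", "b": "y", "c": "z", "d": 2, "value": "second"},
    {"a": "x", "b": "y", "c": "z", "d": 1, "value": "first"},
    {"a": "p", "b": "q", "c": "r", "d": 1, "value": "other"},
]


def partition_key(row):
    """Return the composite partition key (a, b, c) of a row."""
    return (row["a"], row["b"], row["c"])


# Group rows by partition key, as the inner parentheses of the
# primary key definition imply.
partitions = defaultdict(list)
for row in rows:
    partitions[partition_key(row)].append(row)

# Within each partition, rows are ordered by the clustering column d.
for key in partitions:
    partitions[key].sort(key=lambda r: r["d"])

for key, members in partitions.items():
    print(key, [(r["d"], r["value"]) for r in members])
```

All three of a, b and c are needed to identify a partition in this model, which is why a filter on only some of them cannot be answered by reading a single partition.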

We are facing a couple of issues with Pig talking to C*. The most important are:

*Problem 1:*

We are not able to apply filters on the columns that are part of the
partition key. This seems to be a very basic requirement for C*-Pig
integration.

To give you a better idea of the problem, here is the JIRA issue on the same topic:
https://issues.apache.org/jira/browse/CASSANDRA-6151

Unfortunately, the issue has been resolved with the resolution "Won't Fix".
Not sure why.

Because of this we are dead in the water, with no clue how to proceed.

Someone from the community suggested filtering in Pig instead. But that does
not seem ideal, as we would load the whole data set from Cassandra and only
then apply filters before processing. Isn't that so?

As per my understanding, if we do not filter by the partition key columns
in the Cassandra query, then there will be a scan for data across the cluster.
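
The cost difference I am worried about can be sketched with a toy in-memory model in Python. This is only an illustration under assumptions, not how Cassandra or Pig is implemented; the function names `build_store`, `read_partition` and `scan_then_filter` and the sample data are hypothetical.

```python
def build_store(rows):
    """Map each partition key (a, b, c) to its list of (d, value) cells."""
    store = {}
    for a, b, c, d, value in rows:
        store.setdefault((a, b, c), []).append((d, value))
    return store


def read_partition(store, a, b, c):
    """Filter pushed to the store: only one partition is read."""
    matched = store.get((a, b, c), [])
    return matched, len(matched)


def scan_then_filter(store, a, b, c):
    """Filter applied after loading: every row is examined first."""
    examined = 0
    matched = []
    for key, cells in store.items():
        for cell in cells:
            examined += 1
            if key == (a, b, c):
                matched.append(cell)
    return matched, examined


# Hypothetical data: 100 partitions with 10 rows each.
rows = [(f"a{i}", "b", "c", d, f"v{i}-{d}")
        for i in range(100) for d in range(10)]
store = build_store(rows)

hit_direct, cost_direct = read_partition(store, "a7", "b", "c")
hit_scan, cost_scan = scan_then_filter(store, "a7", "b", "c")

print("rows examined with partition key filter:", cost_direct)
print("rows examined with load-then-filter:", cost_scan)
```

Both paths return the same rows, but in this model the load-then-filter path touches every row in the store, which is the overhead a Pig-side filter would add.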

*Problem 2:*

If there is a filter on just a secondary index column, we get timeouts.

Can anyone help me solve the above problems?

Thanks,
Pradeep
