Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-04-02 Thread Jonathan Ellis
The Thrift IF predates vnodes. I agree that's a reasonable alternative.

On Apr 2, 2014 12:47 PM, "Clint Kelly" wrote:
> Hi all,
>
> FWIW the HBase Hadoop InputFormat does not even do this kind of estimation
> of data density over various ranges; it just creates one split for every
> region betwee

Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-04-02 Thread Clint Kelly
Hi all, FWIW the HBase Hadoop InputFormat does not even do this kind of estimation of data density over various ranges; it just creates one split for every region between the start and stop keys of the scan. I'll probably just do something similar by combining token ranges for virtual nodes that
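The idea of combining vnode token ranges into splits can be sketched as follows. This is only an illustration: it assumes the ranges are grouped by the host that owns them (the message above is cut off before it says how they are combined), and the function name `combine_ranges_by_host` and the `max_ranges_per_split` parameter are hypothetical, not part of any Cassandra or Hadoop API.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A token range (start, end] together with the host assumed to own it.
OwnedRange = Tuple[int, int, str]
# A split: one host plus the token ranges a single task would scan there.
Split = Tuple[str, List[Tuple[int, int]]]


def combine_ranges_by_host(ranges: List[OwnedRange],
                           max_ranges_per_split: int) -> List[Split]:
    """Group token ranges by owning host, then chunk each group into splits.

    No data-density estimate is made; each split simply covers up to
    max_ranges_per_split token ranges that live on the same host.
    """
    if max_ranges_per_split < 1:
        raise ValueError("max_ranges_per_split must be at least 1")

    by_host: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for start, end, host in ranges:
        by_host[host].append((start, end))

    splits: List[Split] = []
    for host in sorted(by_host):  # sorted only to make the output stable
        host_ranges = by_host[host]
        for i in range(0, len(host_ranges), max_ranges_per_split):
            splits.append((host, host_ranges[i:i + max_ranges_per_split]))
    return splits
```

With many vnodes per host, a larger `max_ranges_per_split` yields fewer, larger splits; a value of 1 gives one split per token range.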

Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-04-01 Thread Aleksey Yeschenko
This doesn’t belong in CQL the language. However, this could be implemented as a virtual system column family - sooner or later we’d need something like this anyway. Then you’d just run SELECTs against it as if it were a regular column family. -- AY On Wednesday, April 2, 2014 at 00:03 AM

Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-04-01 Thread Tyler Hobbs
Split calculation can't be done client-side because it requires key sampling (which requires reading the index summary). This would have to be added to CQL. Since I can't see any alternatives and this is required for good Hadoop support, would you mind opening a ticket to add support for this?

Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-03-30 Thread Clint Kelly
Hi Shao-Chuan, I understand everything you said above except for how we can estimate the number of rows using the index interval. I understand that the index interval is a setting that controls how often samples from an SSTable index are stored in memory, correct? I was under the impression that
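One possible reading of the estimate in question, offered as an assumption and not as a description of what Cassandra does: if the index summary keeps roughly one sample per `index_interval` rows, then the number of samples falling in a token range, multiplied by the interval, gives a rough row count. The function names and the default of 128 below are illustrative.

```python
import math


def estimate_rows(sample_count: int, index_interval: int = 128) -> int:
    """Rough row estimate: each index sample stands for about index_interval rows."""
    if sample_count < 0 or index_interval < 1:
        raise ValueError("sample_count must be >= 0 and index_interval >= 1")
    return sample_count * index_interval


def estimate_split_count(sample_count: int, rows_per_split: int,
                         index_interval: int = 128) -> int:
    """Number of splits needed to keep each one near rows_per_split rows."""
    if rows_per_split < 1:
        raise ValueError("rows_per_split must be at least 1")
    rows = estimate_rows(sample_count, index_interval)
    # Always return at least one split, even for an apparently empty range.
    return max(1, math.ceil(rows / rows_per_split))
```

The estimate is coarse by construction: it cannot see rows between samples, so its error per SSTable is on the order of one interval.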

Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-03-24 Thread Clint Kelly
Hi Shao-Chuan, That sounds like a good idea, thanks for your response. I think I may have missed the e-mail from Tyler that you reference --- I'll go back and look. FWIW the code that I have written so far is here: https://github.com/wibiclint/cassandra2-hadoop2 It is in rough shape now be

Re: Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-03-24 Thread Shao-Chuan Wang
Tyler mentioned that client.describe_ring(myKeyspace); can be replaced by a query of the system.peers table, which has the ring information. The challenge here is describe_splits_ex, which needs to estimate the number of rows in each sub token range (as you mentioned). From what I understand and tr
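As a rough sketch of the describe_ring replacement mentioned here: the rows returned by `SELECT peer, tokens FROM system.peers` and `SELECT tokens FROM system.local` can be turned into token ranges on the client. The code below takes those rows as plain dictionaries so that it runs without a driver; the helper names are hypothetical, and it considers only the primary owner of each range, ignoring replication.

```python
from typing import Dict, Iterable, List, Tuple

RingEntry = Tuple[int, str]          # (token, host that owns the token)
OwnedRange = Tuple[int, int, str]    # (start, end, host) for the range (start, end]


def build_ring(local_host: str, local_tokens: Iterable[str],
               peer_rows: Iterable[Dict]) -> List[RingEntry]:
    """Merge the local node's tokens with every peer's tokens into a sorted ring.

    peer_rows is assumed to look like rows of system.peers:
    {"peer": "<address>", "tokens": {"<token>", ...}}.
    """
    ring = [(int(token), local_host) for token in local_tokens]
    for row in peer_rows:
        ring.extend((int(token), row["peer"]) for token in row["tokens"])
    ring.sort()
    return ring


def token_ranges(ring: List[RingEntry]) -> List[OwnedRange]:
    """Turn a sorted ring into (start, end] ranges, each owned by the end token's host."""
    ranges = []
    for i, (token, host) in enumerate(ring):
        # For i == 0 this picks the last token, which gives the wrap-around range.
        previous_token = ring[i - 1][0]
        ranges.append((previous_token, token, host))
    return ranges
```

This covers only the ring description; it does nothing about the row estimate that describe_splits_ex provides.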

Replacing thrift calls in Hadoop input-split calculation with Java driver calls.

2014-03-24 Thread Clint Kelly
I just saw this question about thrift in the Hadoop / Cassandra integration in the discussion on the user list about freezing thrift. I have been working on a project to integrate Hadoop 2 and Cassandra 2 and have been trying to move all the way over to the Java driver and away from thrift. I