Hi Aaron,

I've seen the code you describe (working with splits and intersections), but 
that range is derived from keys and works only for ordered partitioners (in 
1.2.15).
I've already had one confirmation that in the C* version I use (1.2.15), 
setting limits with setInputRange(startToken, endToken) doesn't work:
"between 1.2.6 and 2.0.6 the setInputRange(startToken, endToken) is not working"
Can you confirm or refute this?

WBR,
Anton

From: Aaron Morton [mailto:aa...@thelastpickle.com]
Sent: Monday, May 19, 2014 1:58 AM
To: Cassandra User
Subject: Re: Cassandra token range support for Hadoop (ColumnFamilyInputFormat)

The limit is just ignored and the entire column family is scanned.
Which limit?


1. Am I right that there is no way to get some data limited by token range with 
ColumnFamilyInputFormat?
From what I understand, setting the input range is used when calculating the 
splits. The token ranges in the cluster are iterated, and if they intersect 
with the supplied range, the overlapping range (rather than the full token 
range) is used to calculate the split.
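Roughly, the intersection works like this (a minimal sketch of the idea, not Cassandra's actual getSplits code; the class and method names are mine, and wrap-around token ranges are ignored for simplicity):

```java
// Sketch of intersecting a cluster token range with a user-supplied input
// range, as done when computing splits. Not Cassandra's real implementation.
public class RangeIntersect {
    // Returns {start, end} of the overlap, or null if the ranges are disjoint.
    static long[] intersect(long aStart, long aEnd, long bStart, long bEnd) {
        long start = Math.max(aStart, bStart);
        long end = Math.min(aEnd, bEnd);
        return start < end ? new long[] { start, end } : null;
    }

    public static void main(String[] args) {
        // Cluster range 0..100 intersected with requested input range 40..200:
        // only the overlapping part 40..100 would become a split.
        long[] overlap = intersect(0, 100, 40, 200);
        System.out.println(overlap[0] + ".." + overlap[1]); // prints 40..100
    }
}
```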

2. Is there other way to limit the amount of data read from Cassandra with 
Spark and ColumnFamilyInputFormat,
so that this amount is predictable (like 5% of entire dataset)?
If you supply a token range that is 5% of the possible range of values for the 
token, that should be close to a random 5% sample.
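For example, assuming the Murmur3Partitioner (whose token space is -2^63 .. 2^63-1), a 5% slice could be computed like this (a sketch; the class name and the fraction arithmetic are mine, and whether the resulting range is honoured is exactly the version question discussed above):

```java
import java.math.BigInteger;

// Sketch: pick a contiguous slice covering a given fraction of the
// Murmur3Partitioner token space, suitable for passing as strings to
// ConfigHelper.setInputRange(conf, startToken, endToken).
public class FivePercentRange {
    static String[] sliceTokens(double fraction) {
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        BigInteger span = max.subtract(min);
        // Integer arithmetic to a 0.1% granularity to avoid double overflow.
        BigInteger sliceEnd = min.add(
            span.multiply(BigInteger.valueOf((long) (fraction * 1000)))
                .divide(BigInteger.valueOf(1000)));
        return new String[] { min.toString(), sliceEnd.toString() };
    }

    public static void main(String[] args) {
        String[] tokens = sliceTokens(0.05);
        // In a real job: ConfigHelper.setInputRange(conf, tokens[0], tokens[1]);
        System.out.println(tokens[0] + " .. " + tokens[1]);
    }
}
```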


Hope that helps.
Aaron

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 14/05/2014, at 10:46 am, Anton Brazhnyk 
<anton.brazh...@genesys.com> wrote:


Greetings,

I'm reading data from C* with Spark (via ColumnFamilyInputFormat) and I'd like 
to read just part of it, something like Spark's sample() function.
Cassandra's API seems to allow this with its 
ConfigHelper.setInputRange(jobConfiguration, startToken, endToken) method, but 
it doesn't work.
The limit is just ignored and the entire column family is scanned. It seems 
this kind of feature is simply not supported, and the source of 
AbstractColumnFamilyInputFormat.getSplits confirms that (IMO).
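For reference, the configuration I'm attempting looks roughly like this (a sketch; the address, port, keyspace, column family, and token values are placeholders, and the Spark wiring around it is omitted):

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.conf.Configuration;

// Fragment of a Hadoop job configuration for ColumnFamilyInputFormat.
Configuration conf = new Configuration();
ConfigHelper.setInputInitialAddress(conf, "127.0.0.1");
ConfigHelper.setInputRpcPort(conf, "9160");
ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_cf");
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner");
// The call in question: restrict the scan to [startToken, endToken]
// (arbitrary example tokens here).
ConfigHelper.setInputRange(conf, "-4611686018427387904", "0");
```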
Questions:
1. Am I right that there is no way to get some data limited by token range with 
ColumnFamilyInputFormat?
2. Is there other way to limit the amount of data read from Cassandra with 
Spark and ColumnFamilyInputFormat,
so that this amount is predictable (like 5% of entire dataset)?


WBR,
Anton
