Re: get_indexed_slices ~ simple map-reduce

aaron morton Mon, 13 Jun 2011 16:28:54 -0700

From a quick read of the code in o.a.c.db.ColumnFamilyStore.scan()...

Candidate rows are first read by applying the most selected equality predicate.

From those candidate rows...

1) If the SlicePredicate has a SliceRange the query execution will read all 
columns for the candidate row  if the byte size of the largest tracked row is 
less than column_index_size_in_kb config setting (defaults to 64K). Meaning if 
no more than 1 column index page of columns is (probably) going to be read, 
they will all be read. 

2) Otherwise if the query will read the columns specified by the SliceRange. 

3) If the SlicePredicate uses a list of columns names, those columns and the 
ones referenced in the IndexExpressions (except the one selected as the primary 
pivot above) are read from disk. 

If additional columns are needed (in case 2 above) they are read in a separate 
reads from the candidate row. 

Then when applying the SlicePredicate to produce the final projection into the 
result set, all the columns required to satisfy the filter will be in memory.  

So, yes it reads just the columns from disk you you ask for. Unless it thinks 
it will take no more work to read more. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 13 Jun 2011, at 08:34, Michal Augustýn wrote:

> Hi,
> 
> as I wrote, I don't want to install Hadoop etc. - I want just to use
> the Thrift API. The core of my question is how does get_indexed_slices
> function work.
> 
> I know that it must get all keys using equality expression firstly -
> but what about additional expressions? Does Cassandra fetch whole
> filtered rows, or just columns used in additional filtering
> expression?
> 
> Thanks!
> 
> Augi
> 
> 2011/6/12 aaron morton <aa...@thelastpickle.com>:
>> Not exactly sure what you mean here, all data access is through the thrift
>> API unless you code java and embed cassandra in your app.
>> As well as Pig support there is also Hive support in brisk (which will also
>> have Pig support soon) http://www.datastax.com/products/brisk
>> Can you provide some more info on the use case ? Personally if you have a
>> read query you know you need to support, I would consider supporting it in
>> the data model without secondary indexes.
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> On 11 Jun 2011, at 19:23, Michal Augustýn wrote:
>> 
>> Hi all,
>> 
>> I'm thinking of get_indexed_slices function as a simple map-reduce job
>> (that just maps) - am I right?
>> 
>> Well, I would like to be able to run simple queries on values but I
>> don't want to install Hadoop, write map-reduce jobs in Java (the whole
>> application is in C# and I don't want to introduce new development
>> stack - maybe Pig would help) and have some second interface to
>> Cassandra (in addition to Thrift). So secondary indexes seem to be
>> rescue for me. I would have just one indexed column that will have
>> day-timestamp value (~100k items per day) and the equality expression
>> for this column would be in each query (and I would add more ad-hoc
>> expressions).
>> Will this scenario work or is there some issue I could run in?
>> 
>> Thanks!
>> 
>> Augi
>> 
>>

Re: get_indexed_slices ~ simple map-reduce

Reply via email to