Re: Confused about get_slice SliceRange behavior with bloom filter

Aditya Narayan Mon, 14 Feb 2011 02:27:41 -0800

Thanks Sylvain,

I guess I might have misunderstood the meaning of column_index_size_in_kb.
My previous understanding was that it is the threshold size a row has to
pass before its columns get indexed.


If I have understood it correctly now, it is the size of the "blocks
(containing columns) that are kept together on the same index". So if you
make it high, a large number of columns in that block will need to be
deserialized for a single column access. And if you make it lower than
optimal, then the index size will grow, right?

So I guess we should tune it depending on the size of our columns and not
the size of our rows? I have valueless columns for my use case.
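
To make that trade-off concrete, here is a toy model of a per-row column
index. It is only a sketch, not Cassandra's actual code: the function names
are made up for illustration, and the 20 bytes per valueless column is an
assumed figure. It groups consecutive columns into blocks of roughly the
configured size, then counts how many index entries result and how many
distinct blocks a by-name read of random columns would touch.

```python
import bisect
import random


def build_column_index(column_sizes, block_size_bytes):
    """Group consecutive columns into blocks of roughly block_size_bytes.

    Returns a list of (first_column, last_column) positions, one per block.
    Hypothetical helper, loosely modelled on the idea of a column index.
    """
    index = []
    start = 0
    accumulated = 0
    for i, size in enumerate(column_sizes):
        accumulated += size
        if accumulated >= block_size_bytes:
            index.append((start, i))
            start = i + 1
            accumulated = 0
    if start < len(column_sizes):
        # Trailing columns that did not fill a whole block.
        index.append((start, len(column_sizes) - 1))
    return index


def blocks_touched(index, wanted_columns):
    """Count distinct blocks holding at least one requested column."""
    starts = [first for first, _ in index]
    touched = {bisect.bisect_right(starts, col) - 1 for col in wanted_columns}
    return len(touched)


if __name__ == "__main__":
    # Assumption: one million valueless columns of about 20 bytes each.
    sizes = [20] * 1_000_000
    wanted = random.sample(range(len(sizes)), 200)
    for kb in (4, 64, 1024):
        idx = build_column_index(sizes, kb * 1024)
        print(kb, "KB blocks:", len(idx), "index entries,",
              blocks_touched(idx, wanted), "blocks touched")
```

In this model a smaller block size means more index entries but fewer columns
deserialized per block that is hit, while a larger one means the opposite.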




On Mon, Feb 14, 2011 at 2:06 PM, Sylvain Lebresne <sylv...@datastax.com> wrote:

> As Aaron said, if the whole row is under 64k, it won't matter. But since
> you spoke of a very wide row, I'll assume the whole row will be much more
> than 64k.
>
> If so, the row is indexed by block (of 64k, configurable). Then the read
> performance depends on how many of those blocks are needed for the query,
> since each block potentially means a seek (potentially, because some blocks
> could happen to be sequential on disk). So if the columns you ask for are
> really randomly distributed, then yes, the bigger the row is, the bigger
> the chance of having to hit many blocks, and the bigger the chance that
> these blocks are far apart on disk.
>
> --
> Sylvain
>
> On Sun, Feb 13, 2011 at 10:19 PM, Aditya Narayan <ady...@gmail.com> wrote:
>
>> Jonathan,
>> If I ask for around 150-200 columns (totally random, not sequential) from
>> a very wide row that contains a million or more columns, is the read
>> performance of the SliceQuery operation affected by, or does it "depend
>> on, the length of the row"? (For my use case, I would use the column names
>> list for this SliceQuery operation.)
>>
>>
>> Thanks
>> Aditya
>>
>>
>> On Sun, Feb 13, 2011 at 8:41 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>>> On Sun, Feb 13, 2011 at 12:37 AM, E S <tr1skl...@yahoo.com> wrote:
>>> > I've gotten myself really confused by
>>> > http://wiki.apache.org/cassandra/ArchitectureInternals and am hoping
>>> > someone can help me understand what the io behavior of this operation
>>> > would be.
>>> >
>>> > When I do a get_slice for a column range, will it seek to every
>>> > SSTable?  I had thought that it would use the bloom filter on the row
>>> > key so that it would only do a seek to SSTables that have a very high
>>> > probability of containing columns for that row.
>>>
>>> Yes.
>>>
>>> > In the linked doc above, it seems to say that it is only used for
>>> > exact column names.  Am I misunderstanding this?
>>>
>>> Yes.  You may be confusing multi-row behavior with multi-column.
>>>
>>> > On a related note, if instead of using a SliceRange I provide an
>>> > explicit list of columns, will I have to read all SSTables that have
>>> > values for the columns
>>>
>>> Yes.
>>>
>>> > or is it smart enough to stop after finding a value from the most
>>> > recent SSTable?
>>>
>>> There is no way to know which value is most recent without having to
>>> read it first.
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>>
>>
>>
>
