Re: Low cardinality secondary index behaviour

DuyHai Doan Wed, 18 May 2016 05:53:25 -0700

Cassandra 3.0.6 does not have SASI. SASI is available only from C* 3.4, but
I advise C* 3.5/3.6 because some critical bugs were fixed in 3.5.
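
As a quick sanity check before planning an upgrade, the version rule above can be written as a small comparison. This is only a sketch: the helper names (parse_version, supports_sasi, is_recommended_for_sasi) are made up for illustration and are not part of Cassandra or any driver.

```python
# Hypothetical helpers; the version thresholds come from the message above.
SASI_MIN_VERSION = (3, 4)          # SASI is available only from C* 3.4
RECOMMENDED_MIN_VERSION = (3, 5)   # critical SASI bugs were fixed in 3.5


def parse_version(version):
    """Turn a string such as '3.0.6' into (3, 0, 6) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def supports_sasi(version):
    """True if this Cassandra version ships SASI at all."""
    return parse_version(version) >= SASI_MIN_VERSION


def is_recommended_for_sasi(version):
    """True if this version also includes the fixes mentioned for 3.5."""
    return parse_version(version) >= RECOMMENDED_MIN_VERSION
```

With these helpers, supports_sasi("3.0.6") is False, which is why the planned upgrade target would not work for a SASI index.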


On Wed, May 18, 2016 at 1:58 PM, Atul Saroha <atul.sar...@snapdeal.com>
wrote:

> Thanks Tyler,
>
> A SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
> 3.0.6 now.
>
>
> Atul Saroha
>
> On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>
>>
>> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <atul.sar...@snapdeal.com>
>> wrote:
>>
>>> I have a concern about using a secondary index on a field with low
>>> cardinality. Let's say I have a few billion rows, each row can be
>>> classified into one of 1000 categories, and we have a 50-node cluster.
>>>
>>> Now we want to fetch the data for a single category using a secondary
>>> index on the category column. The query is also paginated, with a fetch
>>> size of, say, 5000.
>>>
>>> Since a query on a secondary index works as a scatter-gather operation
>>> driven by the coordinator node, could it lead to out-of-memory errors on
>>> the coordinator, or to too many timeout errors?
>>>
>>
>> Paging will prevent the coordinator from using excessive memory.  With
>> the type of data that you described, timeouts shouldn't be a huge problem
>> because it will only take a few token ranges (assuming you're using vnodes)
>> to get enough matching rows to hit the page size.
>>
>>
>>>
>>> How does pagination (token-level data fetch) behave in the
>>> scatter-gather approach?
>>>
>>
>> Secondary index queries fetch token ranges in sequential order [1],
>> starting with the minimum token.  When you fetch a new page, it resumes
>> from the last token (and primary key) that it returned in the previous page.
>>
>> [1] As an optimization, multiple token ranges will be fetched in parallel
>> based on estimates of how many token ranges it will take to fill the page.
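
The resume behaviour described above can be sketched with a toy in-memory model. This is an illustration only, not how Cassandra is implemented: the function name fetch_page, the representation of a token range as a sorted list of (token, primary_key) tuples, and the use of the last returned row as the paging state are all assumptions, and the parallel-fetch optimization from [1] is left out.

```python
def fetch_page(token_ranges, page_size, resume_after=None):
    """Walk token ranges in ring order and return one page of matching rows.

    token_ranges: list of ranges, each a list of (token, primary_key) tuples
                  sorted by token; ranges are ordered from the minimum token.
    resume_after: the last (token, primary_key) returned by the previous page,
                  or None for the first page.
    Returns (rows, paging_state); paging_state is None once the scan has
    run out of ranges.
    """
    rows = []
    for token_range in token_ranges:
        for row in token_range:
            # Skip everything already returned in earlier pages.
            if resume_after is not None and row <= resume_after:
                continue
            rows.append(row)
            if len(rows) == page_size:
                # Page is full: stop early, later ranges are never touched.
                return rows, row
    return rows, None


# Toy data: three token ranges holding rows that match the indexed value.
ranges = [[(1, "a"), (5, "b")], [(10, "c")], [(20, "d"), (30, "e")]]
page1, state = fetch_page(ranges, page_size=2)
page2, state = fetch_page(ranges, page_size=2, resume_after=state)
```

Here page1 is filled entirely from the first range, and page2 picks up at the first row after (5, "b"), which mirrors the "resume from the last token and primary key" behaviour.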
>>
>>
>>>
>>> Secondly, what if we create an inverted table with category as the
>>> partition key? This would put a lot of data on a single node, which might
>>> lead to a hot-shard issue and slow fetches from that node, since a single
>>> partition would have millions of rows.
>>>
>>> How should we tackle such a low-cardinality index in Cassandra?
>>
>>
>> The data distribution that you described sounds like a reasonable fit for
>> secondary indexes.  However, I would also take into account how frequently
>> you run this query and how fast you need it to be.  Even ignoring the
>> scatter-gather aspects, secondary index queries are still expensive
>> because they fetch many non-contiguous rows from an SSTable.  If you need
>> to run this query very frequently, that may add too much load to your
>> cluster, and some sort of inverted table approach may be more appropriate.
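
One possible shape for "some sort of inverted table approach" that also addresses the hot-partition worry is to split each category across a fixed number of buckets, so the partition key becomes (category, bucket). The thread does not spell this out; bucketing is an assumption here, and the function names and the bucket count of 64 are hypothetical. A minimal sketch of the key computation:

```python
import zlib


def bucketed_partition_key(category, row_id, num_buckets=64):
    """Return the (category, bucket) partition key for one row.

    The bucket is derived from a stable hash of the row id, so the same row
    always lands in the same partition while a category's rows are spread
    over num_buckets partitions instead of one.
    """
    bucket = zlib.crc32(row_id.encode("utf-8")) % num_buckets
    return (category, bucket)


def partitions_for_category(category, num_buckets=64):
    """List every partition key a reader must query to cover one category."""
    return [(category, bucket) for bucket in range(num_buckets)]
```

The trade-off is that reading a whole category now takes num_buckets partition queries rather than one, but each of them is a contiguous single-partition read.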
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
>>
>
>
