Thanks Tyler,

The SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
3.0.6 now.

---------------------------------------------------------------------------------------------------------------------
Atul Saroha
*Lead Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs <ty...@datastax.com> wrote:

>
> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <atul.sar...@snapdeal.com>
> wrote:
>
>> I have a concern about using a secondary index on a field with low
>> cardinality. Let's say I have a few billion rows and each row can be
>> classified into one of 1000 categories. Let's say we have a 50-node cluster.
>>
>> Now we want to fetch the data for a single category using a secondary
>> index on the category column. The query is also paginated, with a fetch
>> size of, say, 5000.
>>
>> Since a query on a secondary index works as a scatter-gather operation
>> driven by the coordinator node, would it lead to out-of-memory errors on
>> the coordinator, or to too many timeout errors?
>>
>
> Paging will prevent the coordinator from using excessive memory.  With the
> type of data that you described, timeouts shouldn't be a huge problem because
> it will only take a few token ranges (assuming you're using vnodes) to get
> enough matching rows to hit the page size.
>
>
>>
>> How does pagination (token-level data fetch) behave in the scatter-gather
>> approach?
>>
>
> Secondary index queries fetch token ranges in sequential order [1],
> starting with the minimum token.  When you fetch a new page, it resumes
> from the last token (and primary key) that it returned in the previous page.
>
> [1] As an optimization, multiple token ranges will be fetched in parallel
> based on estimates of how many token ranges it will take to fill the page.
>
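The paging behaviour described above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions, not Cassandra's implementation or any driver API: `fetch_page`, the `rows` layout, and the `(token, primary_key)` paging state are hypothetical names chosen for illustration, and the parallel range fetching mentioned in [1] is not modelled.

```python
from bisect import bisect_right

def fetch_page(rows, category, page_size, resume_after=None):
    """Toy model of a paged secondary index query.

    rows: list of (token, primary_key, category) tuples, sorted by
          (token, primary_key); this stands in for walking the token
          ranges in sequential order, starting at the minimum token.
    resume_after: (token, primary_key) of the last row returned by the
          previous page, or None for the first page.
    Returns (page, paging_state); paging_state is None once the scan
    reaches the end of the ring without filling the page.
    """
    start = 0
    if resume_after is not None:
        # Resume strictly after the last (token, primary key) returned.
        keys = [(token, pk) for token, pk, _ in rows]
        start = bisect_right(keys, resume_after)

    page = []
    for token, pk, cat in rows[start:]:
        if cat == category:
            page.append((token, pk))
            if len(page) == page_size:
                # Page is full: stop scanning and remember where we were.
                return page, (token, pk)
    return page, None

# Hypothetical data: ten rows, every even token belongs to category "a".
rows = sorted((tok, f"pk{tok}", "a" if tok % 2 == 0 else "b")
              for tok in range(-5, 5))

page1, state1 = fetch_page(rows, "a", 2)
page2, state2 = fetch_page(rows, "a", 2, state1)
page3, state3 = fetch_page(rows, "a", 2, state2)
```

Because the scan stops as soon as the page is full, only as much of the ring as is needed to find `page_size` matching rows is touched per page, which is why a common category fills a page after a few token ranges.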
>
>>
>> Secondly, what if we create an inverted table with the category as the
>> partition key? This will lead to a lot of data on a single node. It might
>> then lead to a hot-shard issue and slow data fetching from a single node,
>> as a single partition has millions of rows.
>>
>> How should we tackle such a low-cardinality index in Cassandra?
>
>
> The data distribution that you described sounds like a reasonable fit for
> secondary indexes.  However, I would also take into account how frequently
> you run this query and how fast you need it to be.  Even ignoring the
> scatter-gather aspects of a secondary index query, they are still expensive
> because they fetch many non-contiguous rows from an SSTable.  If you need
> to run this query very frequently, that may add too much load to your
> cluster, and some sort of inverted table approach may be more appropriate.
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>
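On the hot-partition concern with an inverted table: one common way to keep a single category from landing in one partition is to add a bucket to the partition key. Bucketing is not proposed anywhere in this thread, so the sketch below is an assumption on top of the "inverted table" idea; `NUM_BUCKETS`, `bucket_for`, `partition_key`, and `partitions_for_category` are hypothetical names, and the bucket count is arbitrary.

```python
import zlib

# Assumption: 64 buckets per category; pick a value that keeps each
# (category, bucket) partition at a manageable size for your data.
NUM_BUCKETS = 64

def bucket_for(row_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a row id to a stable bucket number in [0, num_buckets)."""
    # crc32 is deterministic across processes, unlike Python's hash().
    return zlib.crc32(row_id.encode("utf-8")) % num_buckets

def partition_key(category: str, row_id: str) -> tuple:
    """Composite partition key (category, bucket) used on write."""
    return (category, bucket_for(row_id))

def partitions_for_category(category: str, num_buckets: int = NUM_BUCKETS) -> list:
    """All partitions a reader must query to see one whole category."""
    return [(category, bucket) for bucket in range(num_buckets)]
```

The trade-off is that reading a whole category now takes one query per bucket, but each query hits a single, bounded partition, and the buckets are spread across the cluster instead of sitting on one replica set.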
