Thanks Tyler, the SPARSE SASI index solves my use case. Planning to upgrade Cassandra to 3.0.6 now.
---------------------------------------------------------------------------------------------------------------------
Atul Saroha
*Lead Software Engineer*
*M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA

On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs <ty...@datastax.com> wrote:

>
> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <atul.sar...@snapdeal.com>
> wrote:
>
>> I have concern over using secondary index on field with low cardinality.
>> Lets say I have few billion rows and each row can be classified in 1000
>> category. Lets say we have 50 node cluster.
>>
>> Now we want to fetch data for a single category using secondary index
>> over a category. And query is paginated too with fetch size property say
>> 5000.
>>
>> Since query on secondary index works as scatter and gatherer approach by
>> coordinator node. Would it lead to out of memory on coordinator or timeout
>> errors too much.
>>
>
> Paging will prevent the coordinator from using excessive memory. With the
> type of data that you described, timeouts shouldn't be huge problem because
> it will only take a few token ranges (assuming you're using vnodes) to get
> enough matching rows to hit the page size.
>
>
>>
>> How does pagination (token level data fetch) behave in scatter and
>> gatherer approach?
>>
>
> Secondary index queries fetch token ranges in sequential order [1],
> starting with the minimum token. When you fetch a new page, it resumes
> from the last token (and primary key) that it returned in the previous page.
>
> [1] As an optimization, multiple token ranges will be fetched in parallel
> based on estimates of how many token ranges it will take to fill the page.
>
>
>>
>> Secondly, What If we create an inverted table with partition key as
>> category. Then this will led to lots of data on single node. Then it might
>> led to hot shard issue and performance issue of data fetching from single
>> node as a single partition has millions of rows.
>>
>> How should we tackle such low cardinality index in Cassandra?
>
>
> The data distribution that you described sounds like a reasonable fit for
> secondary indexes. However, I would also take into account how frequently
> you run this query and how fast you need it to be. Even ignoring the
> scatter-gather aspects of a secondary index query, they are still expensive
> because they fetch many non-contiguous rows from an SSTable. If you need
> to run this query very frequently, that may add too much load to your
> cluster, and some sort of inverted table approach may be more appropriate.
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>
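
For anyone reading the archive later, here is a rough sketch of the paging behaviour Tyler describes in the quoted thread: token ranges are scanned in order from the minimum token, and a new page resumes after the last (token, primary key) returned. This is a toy in-memory model in Python, not Cassandra's actual implementation. The names `make_ranges` and `fetch_page`, the 100-token ring and the four categories are all made up for illustration, and the parallel range fetch from footnote [1] is left out.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str, str]    # (token, primary_key, category)
PagingState = Tuple[int, str]  # last (token, primary_key) returned


def make_ranges(n_tokens: int = 100, per_range: int = 10) -> List[List[Row]]:
    """Build a hypothetical ring: consecutive token ranges, each sorted by token."""
    rows = [(t, f"k{t:03d}", f"c{t % 4}") for t in range(n_tokens)]
    return [rows[i:i + per_range] for i in range(0, n_tokens, per_range)]


def fetch_page(
    ranges: List[List[Row]],
    category: str,
    page_size: int,
    state: Optional[PagingState] = None,
) -> Tuple[List[PagingState], Optional[PagingState], int]:
    """Scan token ranges in order and stop as soon as the page is full.

    Returns (page, next_state, ranges_scanned). next_state is None once the
    ring is exhausted without filling the page.
    """
    page: List[PagingState] = []
    ranges_scanned = 0
    for rows in ranges:
        # Skip ranges that earlier pages have already fully consumed.
        if state is not None and rows and (rows[-1][0], rows[-1][1]) <= state:
            continue
        ranges_scanned += 1
        for token, pk, cat in rows:
            # Resume strictly after the last row of the previous page.
            if state is not None and (token, pk) <= state:
                continue
            if cat == category:
                page.append((token, pk))
                if len(page) == page_size:
                    return page, (token, pk), ranges_scanned
    return page, None, ranges_scanned


if __name__ == "__main__":
    ring = make_ranges()
    state: Optional[PagingState] = None
    while True:
        page, state, scanned = fetch_page(ring, "c1", 5, state)
        print(f"rows={len(page)} ranges_scanned={scanned} next_state={state}")
        if state is None:
            break
```

In this model each page touches only two or three ranges before it is full, which matches the point that a low-cardinality category fills the page after a few token ranges. The coordinator never holds more than one page of rows at a time.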