Cassandra 3.0.6 does not include SASI. SASI is available only from C* 3.4 onward, but I recommend C* 3.5 or 3.6 because some critical bugs were fixed in 3.5.
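
If it helps when planning the upgrade, the version rule above can be written down as a quick check. This is only a sketch: `major_minor` and `sasi_status` are made-up helper names, not part of any Cassandra tooling, and the thresholds are just the ones stated in this message.

```python
from typing import Tuple

# Thresholds taken from this message only (assumption: they are compared
# on major.minor, ignoring the patch level).
SASI_INTRODUCED = (3, 4)   # first release line that ships SASI
SASI_ADVISED = (3, 5)      # release line carrying the critical fixes


def major_minor(version: str) -> Tuple[int, int]:
    """Parse '3.0.6' or '3.5-SNAPSHOT' into (3, 0) or (3, 5)."""
    core = version.split("-")[0]           # drop any '-SNAPSHOT' style suffix
    major, minor = core.split(".")[:2]     # expects at least 'major.minor'
    return int(major), int(minor)


def sasi_status(version: str) -> str:
    """Classify a Cassandra version string against the two thresholds."""
    v = major_minor(version)
    if v < SASI_INTRODUCED:
        return "unavailable"
    if v < SASI_ADVISED:
        return "available, upgrade advised"
    return "available"


for candidate in ("3.0.6", "3.4", "3.5", "3.6"):
    print(candidate, "->", sasi_status(candidate))
```
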

On Wed, May 18, 2016 at 1:58 PM, Atul Saroha <atul.sar...@snapdeal.com> wrote:

> Thanks Tyler,
>
> SPARSE SASI index solves my use case. Planning to upgrade Cassandra to
> 3.0.6 now.
>
> Atul Saroha
> *Lead Software Engineer*
>
> On Thu, May 12, 2016 at 9:18 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>
>> On Tue, May 10, 2016 at 6:41 AM, Atul Saroha <atul.sar...@snapdeal.com>
>> wrote:
>>
>>> I have a concern about using a secondary index on a field with low
>>> cardinality. Let's say I have a few billion rows and each row can be
>>> classified into one of 1000 categories. Let's say we have a 50-node
>>> cluster.
>>>
>>> Now we want to fetch data for a single category using a secondary
>>> index on the category. The query is also paginated, with a fetch size
>>> of, say, 5000.
>>>
>>> Since a query on a secondary index works as a scatter-and-gather
>>> operation driven by the coordinator node, would it lead to
>>> out-of-memory errors on the coordinator or to too many timeouts?
>>
>> Paging will prevent the coordinator from using excessive memory. With
>> the type of data that you described, timeouts shouldn't be a huge
>> problem because it will only take a few token ranges (assuming you're
>> using vnodes) to get enough matching rows to hit the page size.
>>
>>> How does pagination (token-level data fetch) behave in the
>>> scatter-and-gather approach?
>>
>> Secondary index queries fetch token ranges in sequential order [1],
>> starting with the minimum token. When you fetch a new page, it resumes
>> from the last token (and primary key) that it returned in the previous
>> page.
>>
>> [1] As an optimization, multiple token ranges will be fetched in
>> parallel based on estimates of how many token ranges it will take to
>> fill the page.
>>
>>> Secondly, what if we create an inverted table with category as the
>>> partition key? This will lead to a lot of data on a single node. It
>>> might then lead to a hot-shard issue and to performance problems when
>>> fetching data from a single node, since a single partition has
>>> millions of rows.
>>>
>>> How should we tackle such a low-cardinality index in Cassandra?
>>
>> The data distribution that you described sounds like a reasonable fit
>> for secondary indexes. However, I would also take into account how
>> frequently you run this query and how fast you need it to be. Even
>> ignoring the scatter-gather aspects of secondary index queries, they
>> are still expensive because they fetch many non-contiguous rows from
>> an SSTable. If you need to run this query very frequently, that may
>> add too much load to your cluster, and some sort of inverted table
>> approach may be more appropriate.
>>
>> --
>> Tyler Hobbs
>> DataStax <http://datastax.com/>
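
To make the paging behaviour Tyler describes in the quoted thread a bit more concrete, here is a toy Python model of it: token ranges are scanned in ascending order until the page is full, and the next page resumes from the last (token, primary key) returned. This is not Cassandra's code. `fetch_page`, the row layout, and the in-memory `ranges` list are all hypothetical, and the parallel range fetching from footnote [1] is left out.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str, str]    # (token, primary key, category)
Position = Tuple[int, str]    # (token, primary key) of the last row returned


def fetch_page(ranges: List[List[Row]], category: str, page_size: int,
               resume_after: Optional[Position] = None):
    """Toy model: scan token ranges in order until the page is full.

    `ranges` lists token ranges in ascending token order; each range holds
    rows sorted by (token, primary key). Returns
    (page, next_position, ranges_scanned); next_position is None once the
    scan has run past the last range without filling the page.
    """
    page: List[Row] = []
    ranges_scanned = 0
    for rows in ranges:
        # Skip ranges that lie entirely at or before the resume position.
        if resume_after is not None and rows and rows[-1][:2] <= resume_after:
            continue
        ranges_scanned += 1
        for row in rows:
            position = row[:2]
            if resume_after is not None and position <= resume_after:
                continue  # already returned in an earlier page
            if row[2] == category:
                page.append(row)
                if len(page) == page_size:
                    return page, position, ranges_scanned
    return page, None, ranges_scanned


# Hypothetical data set: 8 token ranges of 10 rows each, 4 categories.
ranges = [[(t, f"pk{t}", f"cat{t % 4}") for t in range(start, start + 10)]
          for start in range(0, 80, 10)]

position = None
all_rows: List[Row] = []
while True:
    page, position, scanned = fetch_page(ranges, "cat1", 5, position)
    all_rows.extend(page)
    print(f"{len(page)} rows from {scanned} range(s)")
    if position is None:
        break
```

In this model each page touches only as many ranges as it needs to collect `page_size` matches, which is the reason given in the thread for why a well-populated category should fill a page after only a few token ranges.
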