Re: How does cassandra page through low cardinality indexes?

DuyHai Doan Thu, 29 May 2014 13:09:38 -0700

Hello Robert

 There is some math involved when considering the performance of
secondary indexes in C*.


 First, the current implementation is a distributed secondary index, meaning
that each node that contains actual data also contains the index data for
that data.

 So for a cluster of *N* nodes with replication factor *R*, fetching just
the index data takes roughly *N/R* reads, since about that many nodes are
needed to cover the whole token range. I'm not considering queries with a
LIMIT clause.

 Once you have the index data, you still need to fetch the "actual" data
it points to. If your query returns *p* partitions, the complexity
is O(N/R + p).
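
 To make the estimate concrete, here is a minimal sketch of that cost model
in Python. The function name and the rounding up of N/R are my own
assumptions for illustration; it only counts reads the way described above
and ignores LIMIT, paging, vnodes and consistency level.

```python
import math


def index_query_reads(n_nodes: int, replication_factor: int, partitions: int) -> int:
    """Rough read count for a secondary index query without LIMIT.

    Hypothetical helper: models the O(N/R + p) estimate only.
    """
    if n_nodes < 1 or not 1 <= replication_factor <= n_nodes:
        raise ValueError("need n_nodes >= 1 and 1 <= replication_factor <= n_nodes")
    if partitions < 0:
        raise ValueError("partitions must be non-negative")
    # About N/R nodes must be contacted to cover the whole token range
    # (rounded up here, which is an assumption of this sketch).
    index_reads = math.ceil(n_nodes / replication_factor)
    # Each matching partition then costs one more read for the actual data.
    return index_reads + partitions


if __name__ == "__main__":
    # 6 nodes, RF = 3, 10 matching partitions: 2 index reads + 10 data reads
    print(index_query_reads(6, 3, 10))
```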

 Now, for a very high cardinality secondary index (an index on user email
to look up a user, for instance), one index entry points to only one actual
user, so p = 1 and the read complexity is O(N/R). If your cluster is big
(N = 100 nodes) there will be a lot of wasteful reads...
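
 As a rough illustration of the waste, the sketch below counts index reads
that come back empty. The replication factor of 3 is an assumption (it is
not given above), as are the function name and the simplification that each
match is found by exactly one of the index reads.

```python
import math


def wasted_index_reads(n_nodes: int, replication_factor: int, matches: int = 1) -> int:
    """Index reads that return nothing, under the simple N/R model.

    Hypothetical helper; assumes each match is found by exactly one
    of the index reads.
    """
    index_reads = math.ceil(n_nodes / replication_factor)
    useful_reads = min(matches, index_reads)
    return index_reads - useful_reads


if __name__ == "__main__":
    # N = 100 nodes, assumed RF = 3, one matching user:
    # 34 index reads, of which 33 find nothing.
    print(wasted_index_reads(100, 3))
```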

 Because of its distributed nature, finding a *good* use case for a
secondary index is quite tricky, partly because it depends on the query
pattern but also on the cluster size and data distribution.

 Apart from the performance aspect, secondary index column families use
SizeTiered compaction, so for a use case with a lot of updates you'll have
plenty of tombstones... I'm not sure how an end user can switch to Leveled
Compaction for a secondary index...

 Regards







On Thu, May 29, 2014 at 9:43 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Fri, May 16, 2014 at 10:53 AM, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> I'm struggling with cassandra secondary indexes since the documentation
>> seems all over the place and I'm having to put together everything from
>> blog posts.
>>
>
> This mostly-complete summary content will eventually make it into a blog
> post :
>
> "
> Secondary Indexes in Cassandra
> ------------------------------------------
>
> Users frequently come into #cassandra or the cassandra-user@ mailing list
> and ask questions about Secondary Indexes. Here is my stock answer.
>
> “Unless you REALLY NEED the feature of atomic update of the secondary
> index with the underlying row, you are almost always better off just making
> your own manual secondary index column family.”
>
> In Cassandra, the unit of distribution is the partition (f/k/a “Row”). If
> your query needs to scan multiple partitions and inspect each of their
> contents, you have probably made a mistake in your data model. For queries
> which interact with sets of partitions one should use executeAsync() w/ the
> new CQL drivers, not multigets.
>
> Advantages of Secondary Indexes :
>
> - Atomic update of secondary index with underlying partition/storage row.
> - Don’t have to be maintained manually, including automated rebuild.
> - Provides the illusion that you are using a RDBMS.
>
> Disadvantages of Secondary Indexes :
>
> - Before 1.2, they do a read-before-write.
> https://issues.apache.org/jira/browse/CASSANDRA-2897
> - A steady trickle of occasionally-serious bugs which do not affect the
> normal read/write path. [1]
> - Bad for low cardinality cases. FIXME : detail (relates to checking each
> node)
> - Bad for high cardinality cases. FIXME : detail (certain cases? what
> about equality/non-equality?)
> - CFstats not exposed via nodetool cfstats before 1.2 :
> https://issues.apache.org/jira/browse/CASSANDRA-4464 ?
> - Lower availability than normal Cassandra read path. FIXME : citation
> - Unsorted results, in token order and not query value order.
> - Can only search on datatypes Cassandra understands.
> - Secondary index is located in the same directory as the primary
> SSTables.
> - Provides the illusion that you are using a RDBMS.
> "
>
> Readers will note that I am not very clear above on which cardinality
> cases they *are* good for, because I consider all the other problems
> sufficient to never use them.
>
> =Rob
> [1] Citations :
>
> https://issues.apache.org/jira/browse/CASSANDRA-5502
>
> https://issues.apache.org/jira/browse/CASSANDRA-5975
>
> https://issues.apache.org/jira/browse/CASSANDRA-2897 - 2i without
> read-before-write
>
> https://issues.apache.org/jira/browse/CASSANDRA-1571 - (0.7) Secondary
> Indexes aren't updated when removing whole row
>
> https://issues.apache.org/jira/browse/CASSANDRA-1747 - (0.7) Truncate is
> not secondary index aware
>
> https://issues.apache.org/jira/browse/CASSANDRA-1813 - (0.7) return
> invalidrequest when client attempts to create secondary index on
> supercolumns
>
> https://issues.apache.org/jira/browse/CASSANDRA-2619 - (0.8) secondary
> index not dropped until restart
>
> https://issues.apache.org/jira/browse/CASSANDRA-2628 - (0.8) Empty Result
> with Secondary Index Queries with "limit 1"
>
> https://issues.apache.org/jira/browse/CASSANDRA-3057 - (0.8) secondary
> index on a column that has a value of size > 64k will fail on flush
>
> https://issues.apache.org/jira/browse/CASSANDRA-3540 - (1.0) Wrong check
> of partitioner for secondary indexes
>
> https://issues.apache.org/jira/browse/CASSANDRA-3545 - (1.1) Fix very low
> Secondary Index performance
>
> https://issues.apache.org/jira/browse/CASSANDRA-4257 - (1.1) CQL3 range
> query with secondary index fails
>
> https://issues.apache.org/jira/browse/CASSANDRA-2897 - (1.2) Secondary
> indexes without read-before-write
>
> https://issues.apache.org/jira/browse/CASSANDRA-4289 - (1.2) Secondary
> Indexes fail following a system restart
>
> https://issues.apache.org/jira/browse/CASSANDRA-4785 - (1.2) Secondary
> Index Sporadically Doesn't Return Rows
>
> https://issues.apache.org/jira/browse/CASSANDRA-4973 - (1.1) Secondary
> Index stops returning rows when caching=ALL
>
> https://issues.apache.org/jira/browse/CASSANDRA-5079 - (1.1, but since
> 0.8) Compaction deletes ExpiringColumns in Secondary Indexes
>
> https://issues.apache.org/jira/browse/CASSANDRA-5732 - (1.2/2.0) Can not
> query secondary index
>
> https://issues.apache.org/jira/browse/CASSANDRA-5540 - (1.2) Concurrent
> secondary index updates remove rows from the index
>
> https://issues.apache.org/jira/browse/CASSANDRA-5599 - (1.2)
> Intermittently, CQL SELECT  with WHERE on secondary indexed field value
> returns null when there are rows
>
> https://issues.apache.org/jira/browse/CASSANDRA-5397 - (1.2) Updates to
> PerRowSecondaryIndex don't use most current values
>
> https://issues.apache.org/jira/browse/CASSANDRA-5161 - (1.2) Slow
> secondary index performance when using VNodes
>
> https://issues.apache.org/jira/browse/CASSANDRA-5851 - (2.0) Fix 2i on
> composite components omissions
>
> https://issues.apache.org/jira/browse/CASSANDRA-5614 - (2.0) W/O
> specified columns ASPCSI does not get notified of deletes
>
> https://issues.apache.org/jira/browse/CASSANDRA-5920 - (2.0) Allow
> secondary indexed columns to be used with IN operator
>
> https://issues.apache.org/jira/browse/CASSANDRA-5975 - (1.2/2.0)
> Filtering on Secondary Index Takes a Long Time Even with Limit 1, Trace Log
> Filled with Looping Messages
>
>
