Hi Magnus,

I think the answer might be on
https://issues.apache.org/jira/browse/CASSANDRA-749. For example,
Jonathan writes:

<quote>
> Is it worth creating a secondary index that only contains local data, versus 
> a distributed secondary index (a normal ColumnFamily?)

I think my initial reasoning was wrong here. I was anti-local-indexes
because "we have to query the full cluster for any index lookup, since
we are throwing away our usual partitioning scheme."

Which is true, but it ignores the fact that, in most cases, you will
have to "query the full cluster" to get the actual matching rows, b/c
the indexed rows will be spread across all machines. So, having local
indexes is better in the common case, since it actually saves a round
trip from querying the index to querying the rows.

Also, having each node index the rows it has locally means you don't
have to worry about sharding a very large index since it happens
automatically.

Finally, it lets us use the local commitlog to keep index + data in sync.
</quote>
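
To make the round-trip argument concrete, here is a toy sketch in Python. It is
only an illustration of the reasoning in the quote, not how Cassandra is
actually implemented: the four-node cluster, the crc32 "partitioner", the sample
users, and the names owner, query_local and query_global are all made up for
this example.

```python
import zlib

NODES = 4  # hypothetical cluster size


def owner(key: str) -> int:
    """Toy partitioner: map a key to the node that stores it."""
    return zlib.crc32(key.encode()) % NODES


# Sample data: user name (row key) -> state (the indexed column).
users = {"alice": "CA", "bob": "NY", "carol": "CA", "dave": "TX", "erin": "CA"}

rows = [dict() for _ in range(NODES)]          # rows held by each node
local_index = [dict() for _ in range(NODES)]   # per node: state -> local row keys
global_index = [dict() for _ in range(NODES)]  # index partitioned by state

for name, state in users.items():
    n = owner(name)
    rows[n][name] = state
    # Local index: each node indexes only the rows it holds itself.
    local_index[n].setdefault(state, set()).add(name)
    # Distributed index: the entry lives on the node that owns the *value*.
    global_index[owner(state)].setdefault(state, set()).add(name)


def query_local(state: str):
    """One fan-out round: every node checks its own index and returns its rows."""
    result = {}
    for n in range(NODES):
        for name in local_index[n].get(state, ()):
            result[name] = rows[n][name]
    return result, 1  # (matching rows, round trips)


def query_global(state: str):
    """Two rounds: ask the index node for keys, then fetch rows from their owners."""
    keys = global_index[owner(state)].get(state, set())  # round trip 1
    result = {k: rows[owner(k)][k] for k in keys}        # round trip 2
    return result, 2


local_rows, local_trips = query_local("CA")
global_rows, global_trips = query_global("CA")
print(sorted(local_rows), local_trips)
print(sorted(global_rows), global_trips)
```

Both lookups return the same users, and both end up touching the nodes that
hold the matching rows. The difference in this model is that the distributed
index needs an extra round trip to the index node first, which is the saving
Jonathan describes.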

Hope that helps,
Martin

On Mon, Sep 5, 2011 at 1:52 AM, Kaj Magnus Lindberg
<kajmagnu...@gmail.com> wrote:
> Hi,
>
> (This is the 2nd time I'm sending this message. I sent it the first
> time a few days ago but it does not appear in the archives.)
>
> I have a follow up question on a question from February 2011. In
> short, I wonder why one won't have to query all Cassandra nodes when
> doing a secondary index lookup -- although each node only indexes data
> that it holds locally.
>
> The question and answer were:
>  ( http://www.mail-archive.com/user@cassandra.apache.org/msg10506.html  )
> === Question ===
> As far as I understand, automatic secondary indexes are generated for
> node-local data.
>   In this case, a query by secondary index involves all nodes storing part
> of the column family to get results (?), so (if I am right) if data is spread
> across 50 nodes then 50 nodes are involved in a single query?
> [...]
> === Answer ===
> In practice, local secondary indexes scale to {RF * the limit of a single
> machine} for -low cardinality- values (ex: users living in a certain state)
> since the first node is likely to be able to answer your question. This also
> means they are good for performing filtering for analytics.
> [...]
>
> === Now I wonder ===
> Why would the first node be likely to be able to answer the question?
> It stores only index entries for users on that particular machine,
>     (says http://wiki.apache.org/cassandra/SecondaryIndexes:
>     "Each node only indexes data that it holds locally" )
> but users might be stored by user name? And would thus be stored on
> many different machines? Even if they happen to live in the same
> state?
>
> Why won't the client need to query the indexes of [all servers that
> store info on users] to find all relevant users, when doing a user
> property lookup?
>
>
> Best regards, KajMagnus
>
