In practice, local secondary indexes scale to roughly {RF * the limit of a single machine} for low-cardinality values (e.g. users living in a certain state), since the first node queried is likely to hold enough matches to answer your question. This also makes them a good fit for filtering in analytics workloads.
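To make the cardinality argument concrete, here is a rough sketch of how many nodes a query has to touch before it has collected enough matches. It is a toy model, not Cassandra's actual read path: the function name, the parameters, and the assumption that nodes are scanned one after another until a row limit is met are all mine, and replication is ignored.

```python
import random


def nodes_contacted(num_nodes, rows_per_node, match_fraction, limit, seed=0):
    """Count how many nodes are scanned, one after another, before
    `limit` matching rows have been collected from node-local indexes.

    Toy model: every node holds `rows_per_node` rows, and each row matches
    the indexed value independently with probability `match_fraction`.
    """
    rng = random.Random(seed)
    found = 0
    for contacted in range(1, num_nodes + 1):
        # A local index only covers the rows stored on that node.
        found += sum(
            1 for _ in range(rows_per_node) if rng.random() < match_fraction
        )
        if found >= limit:
            return contacted
    # Worst case: every node was queried and the limit was still not met.
    return num_nodes


if __name__ == "__main__":
    # Common value (about 1 row in 50 matches): the first node usually suffices.
    print(nodes_contacted(50, 10_000, 0.02, limit=100))
    # Rare value (about 1 row in a million matches): the whole cluster is scanned.
    print(nodes_contacted(50, 10_000, 1e-6, limit=100))
```

With a common value the scan tends to stop at the first node, which is why adding nodes does not add per-query cost; with a rare value the loop runs to the end of the cluster.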
On the other hand, they are not very useful for high-cardinality values (e.g. users born at a particular second), because in the worst case you have to query every node in your cluster, and you are much more likely to hit the worst case with rare values. If you have high-cardinality values, it is currently recommended to build your own secondary indexes from the client side, as you suggested. Triggers may help you perform this distributed indexing in the near future: see CASSANDRA-1311.

On Tue, Feb 22, 2011 at 4:45 PM, Piotr J. <pio...@gmail.com> wrote:
> Hi, As far as I understand automatic secondary indexes are generated for
> node local data.
>
> In this case query by secondary index involve all nodes storing part of
> column family to get results (?) so (if i am right) if data is spread across
> 50 nodes then 50 nodes are involved in single query?
>
> How far can this scale? Is this more scalable than manual secondary indexes
> (inverted index column family)? Few nodes or hundred nodes?
>
> Regards
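For reference, the client-maintained index discussed in this thread (an inverted index column family) can be sketched as below. This is only an in-memory illustration under my own assumptions: `UserStore`, `put`, `find_by_birth`, and the `birth_ts` column are hypothetical names, plain dicts stand in for the two column families, and no Cassandra client API is used. In a real cluster the data write and the index write are separate operations and are not atomic, so the application has to tolerate or repair stale index entries.

```python
from collections import defaultdict


class UserStore:
    """In-memory stand-in for a data column family plus an inverted index."""

    def __init__(self):
        self.users = {}                    # row key -> columns
        self.by_birth = defaultdict(set)   # indexed value -> set of row keys

    def put(self, user_id, birth_ts, **columns):
        """Write a user row and keep the inverted index in step with it."""
        old = self.users.get(user_id)
        if old is not None:
            # Read-before-write: drop the stale index entry for the old value.
            old_keys = self.by_birth[old["birth_ts"]]
            old_keys.discard(user_id)
            if not old_keys:
                del self.by_birth[old["birth_ts"]]
        self.users[user_id] = dict(columns, birth_ts=birth_ts)
        self.by_birth[birth_ts].add(user_id)

    def find_by_birth(self, birth_ts):
        """One lookup on the index row, then one fetch per matching key."""
        keys = sorted(self.by_birth.get(birth_ts, ()))
        return [self.users[k] for k in keys]


if __name__ == "__main__":
    store = UserStore()
    store.put("u1", 100, name="a")
    store.put("u2", 100, name="b")
    store.put("u1", 200, name="a")   # moves u1 from index row 100 to 200
    print(store.find_by_birth(100))
    print(store.find_by_birth(200))
```

The point of the layout is that a lookup by indexed value goes to the single index row for that value rather than to every node, which is what makes it attractive for rare values.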