Re: Why Cassandra secondary indexes are so slow on just 350k rows?

aaron morton Wed, 29 Aug 2012 21:41:21 -0700

>  *from 12 to 20 seconds (!!!) to find 5000 rows*.
More is not always better.


Cassandra must materialise the full 5000 rows and send them all over the wire 
to be materialised on the other side. Try asking for a few hundred at a time 
and see how it goes. 
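
As a rough sketch of what "a few hundred at a time" could look like on the client side (not tested against a live cluster): the helper below groups any lazy row iterator into small batches. The names in_batches, export_in_batches, fetch_rows and export_batch are made up for illustration. With Pycassa, fetch_rows could wrap get_indexed_slices, assuming your version accepts the buffer_size argument to cap rows per request.

```python
from itertools import islice


def in_batches(rows, batch_size):
    """Yield lists of at most batch_size items from any iterable of rows."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def export_in_batches(fetch_rows, export_batch, batch_size=200, limit=5000):
    """Export up to `limit` rows, handing them over batch_size at a time.

    fetch_rows   -- hypothetical callable returning an iterable of rows
    export_batch -- hypothetical callable that receives one list of rows
    Returns the number of rows handed to export_batch.
    """
    exported = 0
    # islice keeps the consumption lazy, so a generator-backed fetch is only
    # advanced as far as the batches that are actually processed.
    for batch in in_batches(islice(fetch_rows(), limit), batch_size):
        export_batch(batch)
        exported += len(batch)
    return exported


# Possible wiring with Pycassa (assumption: buffer_size is supported):
#   fetch_rows = lambda: column_family.get_indexed_slices(clause, buffer_size=200)
#   export_in_batches(fetch_rows, my_export_function, batch_size=200)
```

The batching itself does not make the index lookup cheaper. It only keeps each round trip and each unit of work small, so you can see where the time goes.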

Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/08/2012, at 6:46 PM, Robin Verlangen <ro...@us2.nl> wrote:

> @Edward: I think you should consider a queue for exporting the new rows. Just 
> store the rowkey in a queue (you might want to look at 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html
> ) and process that row every couple of minutes. Then manually delete columns 
> from that queue-row.
> 
> With kind regards,
> 
> Robin Verlangen
> Software engineer
> 
> W http://www.robinverlangen.nl
> E ro...@us2.nl
> 
> Disclaimer: The information contained in this message and attachments is 
> intended solely for the attention and use of the named addressee and may be 
> confidential. If you are not the intended recipient, you are reminded that 
> the information remains the property of the sender. You must not use, 
> disclose, distribute, copy, print or rely on this e-mail. If you have 
> received this message in error, please contact the sender immediately and 
> irrevocably delete this message and any copies.
> 
> 
> 
> 2012/8/29 Robin Verlangen <ro...@us2.nl>
> "What this means is that eventually you will have 1 row in the secondary 
> index table with 350K columns"
> 
> Is this really true? I would have expected that Cassandra used internal index 
> sharding/bucketing?
> 
> With kind regards,
> 
> Robin Verlangen
> Software engineer
> 
> W http://www.robinverlangen.nl
> E ro...@us2.nl
> 
> 
> 
> 
> 2012/8/29 Dave Brosius <dbros...@mebigfatguy.com>
> If I understand you correctly, you are only ever querying for the rows where 
> is_exported = false, and turning them into trues. What this means is that 
> eventually you will have 1 row in the secondary index table with 350K columns 
> that you will never look at.
> 
> It seems to me that perhaps you should just hold your own "manual index" 
> cf that points to non-exported rows, and just delete those columns when they 
> are exported.
> 
> 
> 
> On 08/28/2012 05:23 PM, Edward Kibardin wrote:
> I have a column family with a secondary index. The secondary index is 
> basically a binary field, but I'm using a string for it. The field is called 
> *is_exported* and can be *'true'* or *'false'*. After a request, all loaded 
> rows are updated with *is_exported = 'false'*.
> 
> I'm polling this column family every ten minutes and exporting new rows as 
> they appear.
> 
> But here's the problem: I'm seeing that the time for this query grows pretty 
> much linearly with the amount of data in the column family, and currently it 
> takes *from 12 to 20 seconds (!!!) to find 5000 rows*. From my understanding, 
> an indexed request should not depend on the number of rows in the CF but on 
> the number of rows per index value (cardinality), as it's just another hidden 
> CF like:
> 
>         "true" : rowKey1 rowKey2 rowKey3 ...
>         "false": rowKey1 rowKey2 rowKey3 ...
> 
> I'm using Pycassa to query the data; here is the code I'm using:
> 
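>         import pycassa
>         from pycassa.index import create_index_clause, create_index_expression
> 
>         # read_consistency_level=2 is the numeric value of QUORUM
>         column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name,
>                                              read_consistency_level=2)
>         is_exported_expr = create_index_expression('is_exported', 'false')
>         clause = create_index_clause([is_exported_expr], count=5000)
>         column_family.get_indexed_slices(clause)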
> 
> Am I doing something wrong? I expected this operation to work MUCH faster.
> 
> Any ideas or suggestions?
> 
> Some config info:
>  - Cassandra 1.1.0
>  - RandomPartitioner
>  - I have 2 nodes and replication_factor = 2 (each server has a full data 
> copy)
>  - Using AWS EC2, large instances
>  - Software raid0 on ephemeral drives
> 
> Thanks in advance!
> 
> 
> 
>
