> *from 12 to 20 seconds (!!!) to find 5000 rows*.

More is not always better.
Cassandra must materialise the full 5000 rows and send them all over the wire to be materialised on the other side. Try asking for a few hundred at a time and see how it goes.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/08/2012, at 6:46 PM, Robin Verlangen <ro...@us2.nl> wrote:

> @Edward: I think you should consider a queue for exporting the new rows. Just
> store the rowkey in a queue (you might want to consider looking at
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html )
> and process that row every couple of minutes. Then manually delete columns
> from that queue-row.
>
> With kind regards,
>
> Robin Verlangen
> Software engineer
>
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
> 2012/8/29 Robin Verlangen <ro...@us2.nl>
> "What this means is that eventually you will have 1 row in the secondary
> index table with 350K columns"
>
> Is this really true? I would have expected that Cassandra used internal index
> sharding/bucketing?
>
> 2012/8/29 Dave Brosius <dbros...@mebigfatguy.com>
> If I understand you correctly, you are only ever querying for the rows where
> is_exported = false, and turning them into trues. What this means is that
> eventually you will have 1 row in the secondary index table with 350K columns
> that you will never look at.
>
> It seems to me that perhaps you should just hold your own "manual index"
> cf that points to non-exported rows, and just delete those columns when they
> are exported.
>
> On 08/28/2012 05:23 PM, Edward Kibardin wrote:
> I have a column family with a secondary index. The secondary index is
> basically a binary field, but I'm using a string for it. The field is called
> *is_exported* and can be *'true'* or *'false'*. After each request, all loaded
> rows are updated with *is_exported = 'true'*.
>
> I'm polling this column family every ten minutes and exporting new rows as
> they appear.
>
> But here's the problem: the time for this query grows roughly linearly with
> the amount of data in the column family, and currently it takes *from 12 to 20
> seconds (!!!) to find 5000 rows*. From my understanding, an indexed request
> should depend not on the number of rows in the CF but on the number of rows
> per index value (cardinality), as it's just another hidden CF like:
>
> "true" : rowKey1 rowKey2 rowKey3 ...
> "false": rowKey1 rowKey2 rowKey3 ...
>
> I'm using Pycassa to query the data; here is the code I'm using:
>
>     import pycassa
>     from pycassa.index import create_index_expression, create_index_clause
>
>     column_family = pycassa.ColumnFamily(cassandra_pool,
>         column_family_name, read_consistency_level=2)
>     # Match rows whose indexed column is_exported equals 'false'
>     is_exported_expr = create_index_expression('is_exported', 'false')
>     # Ask for up to 5000 matching rows in a single request
>     clause = create_index_clause([is_exported_expr], count=5000)
>     column_family.get_indexed_slices(clause)
>
> Am I doing something wrong? I expect this operation to work MUCH faster.
>
> Any ideas or suggestions?
>
> Some config info:
> - Cassandra 1.1.0
> - RandomPartitioner
> - I have 2 nodes and replication_factor = 2 (each server has a full data
>   copy)
> - Using AWS EC2, large instances
> - Software raid0 on ephemeral drives
>
> Thanks in advance!
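
Aaron's suggestion to ask for a few hundred rows at a time, rather than 5000 in one request, could be sketched roughly as below. This is only an illustration under stated assumptions, not code from the thread: `export_in_batches`, `fetch_batch`, `export_row`, `mark_exported`, and the default `batch_size` of 200 are all hypothetical names and values. The loop relies on each exported row being flagged before the next fetch, so every fetch returns the next set of unexported rows. The comments show one way the callbacks might be wired to Pycassa.

```python
def export_in_batches(fetch_batch, export_row, mark_exported,
                      batch_size=200, max_rows=5000):
    """Export pending rows in small batches instead of one large request.

    All three callbacks are hypothetical hooks supplied by the caller:
      fetch_batch(count)         -> iterable of (row_key, columns) not yet exported
      export_row(row_key, cols)  -> ship one row to the external system
      mark_exported(row_key)     -> flag the row so later fetches skip it

    With Pycassa the callbacks might look like this (assumed wiring, not run here):
      def fetch_batch(count):
          expr = create_index_expression('is_exported', 'false')
          clause = create_index_clause([expr], count=count)
          return column_family.get_indexed_slices(clause)
      def mark_exported(row_key):
          column_family.insert(row_key, {'is_exported': 'true'})
    """
    exported = 0
    while exported < max_rows:
        # Never ask for more than the remaining budget.
        want = min(batch_size, max_rows - exported)
        batch = list(fetch_batch(want))
        if not batch:
            break  # nothing left to export
        for row_key, columns in batch:
            export_row(row_key, columns)
            mark_exported(row_key)
            exported += 1
    return exported
```

Whether several small requests beat one large one depends on the cluster, so the batch size is worth measuring rather than assuming. The sketch also does nothing about the growing 'true' index row that Dave describes; the manual index or queue approaches address that separately.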