When executing a query like: get events WHERE Firm=434550 AND ds_timestamp>=1341955958200 AND ds_timestamp<=1341955958200; what the 2ndary index implementation will do is: 1) it queries the index for Firm for the row with key 434550 (because that's the only one restricted by an equal clause, and that is why you need at least one equal clause). 2) the query from 1 will return a bunch of events row keys for those events whose Firm=434550. So for each of those row key it queries the corresponding event 3) if a given queried event matches the remaining clauses (here ds_timestamp>=1341955958200 AND ds_timestamp<=1341955958200), it adds it to the result, otherwise it skips.
So what I suspect is happening is that you have *lots* of events matching 'Firm=434550' but only one matches 'ds_timestamp>=1341955958200 AND ds_timestamp<=1341955958200'. And given that by default, it tries to find 100 results, it will scan all the events having 'Firm=434550' before returning. Which it probably cannot do within the timeout. But when you do get events WHERE Firm=434550 AND ds_timestamp>=1341955958200 given that lots of event having 'Firm=434550' probably match ds_timestamp>=1341955958200, it is able to find 100 of them quickly. Lastly, when you do get events WHERE Firm=434550 AND ds_timestamp=1341955958200; the implementation has now both clause that are equal, and based on internal stats it is able to determine that querying the ds_timestamp index will discriminate potential results more efficiently. So it will query the ds_timestamp index instead of the Firm one, which will yield all events whose timestamp is 1341955958200, but since there isn't many such events, it quickly finds the one matching Firm=434550 In other words, you are not doing something wrong, you are just hitting a limitation/weakness of the 2ndary index implementation. So if the query you really want to do is for a specific timestamp, you definitively want to use an equal rather than two non-strict inequalities. But if what you want is query events in a very small but non-discrete window of time, then using 2ndary indexes might just not fit the bill. In that case, one option would be to do a custom/specialized index yourself. If you construct an index where the row key is the Firm and the column name is the ds_timestamp (and the colum value is the event identifier), then finding events having a specific firm for a time window will be (always) efficient. However that's more work on your side since unfortunately Cassandra is not able to automatically create such specialized index itself (at least not yet). -- Sylvain On Wed, Jul 11, 2012 at 10:01 PM, JohnB <j...@tiac.net> wrote: > Hi: > > We are currently using Cassandra 0.8.10 and have run into some strange issues > surrounding > querying for a range of data > > > I ran a couple of get statements via the Cassandra client and found some > interesting results: > > > Consider the following Column Family Definition: > > ColumnFamily: events > Key Validation Class: org.apache.cassandra.db.marshal.BytesType > Default column value validator: > org.apache.cassandra.db.marshal.BytesType > Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type > Row cache size / save period in seconds: 0.0/0 > Row Cache Provider: > org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider > Key cache size / save period in seconds: 200000.0/14400 > Memtable thresholds: 0.2953125/1440/63 (millions of ops/minutes/MB) > GC grace seconds: 864000 > Compaction min/max thresholds: 4/32 > Read repair chance: 1.0 > Replicate on write: true > Built indexes: [events.events_Firm_idx, events.events_OrdType_idx, > events.events_OrderID_idx > , events.events_OrderQty_idx, events.events_Price_idx, > events.events_Symbol_idx, events.events_ds_timestamp_idx] > Column Metadata: > Column Name: Firm > Validation Class: org.apache.cassandra.db.marshal.BytesType > Index Name: events_Firm_idx > Index Type: KEYS > Column Name: OrdType > Validation Class: org.apache.cassandra.db.marshal.BytesType > Index Name: events_OrdType_idx > Index Type: KEYS > Column Name: OrderID > Validation Class: org.apache.cassandra.db.marshal.BytesType > Index Name: events_OrderID_idx > Index Type: KEYS > Column Name: OrderQty > Validation Class: org.apache.cassandra.db.marshal.LongType > Index Name: events_OrderQty_idx > Index Type: KEYS > Column Name: Price > Validation Class: org.apache.cassandra.db.marshal.LongType > Index Name: events_Price_idx > Index Type: KEYS > Column Name: Symbol > Validation Class: org.apache.cassandra.db.marshal.BytesType > Index Name: events_Symbol_idx > Column Name: ds_timestamp > Validation Class: org.apache.cassandra.db.marshal.LongType > Index Name: events_ds_timestamp_idx > Index Type: KEYS > > > get events WHERE Firm=434550 AND ds_timestamp=1341955958200; > > > …and the results are pretty much instantaneous. > > 1 Row Returned. > > [default@FIX] get events WHERE Firm=434550 AND ds_timestamp=1341955958200; > > ------------------- > > RowKey: > 64326430363362302d636164362d313165312d626637622d333836303737306639303133 > => (column=ClOrdID, value=32323833, timestamp=1341955980651010) > => (column=Firm, value=434550, timestamp=1341955980651026) > => (column=OrdType, value=31, timestamp=1341955980651008) > => (column=OrderQty, value=8200, timestamp=1341955980651013) > => (column=Price, value=433561, timestamp=1341955980651019) > => (column=Symbol, value=544e54, timestamp=1341955980651018) > => (column=ds_timestamp, value=1341955958200, timestamp=1341955980651020) > > > If I run the following query: > > get events WHERE Firm=434550 AND ds_timestamp>=1341955958200 AND > ds_timestamp<=1341955958200; > > (which in theory would should return the same 1 row result) > > > > It runs for around 12 seconds, > > > > And I get: > > TimedOutException() > > > > > If I run: > > get events WHERE Firm=434550 AND ds_timestamp>=1341955958200; > > or > > get events WHERE Firm=434550 AND ds_timestamp<=1341955958200; > > The results return quickly. > > > > Curious, I also ran a similar set of queries against the price field: > > get events WHERE Firm=434550 AND Price=433561; > get events WHERE Firm=434550 AND Price>=433561; > get events WHERE Firm=434550 AND Price<=433561; > > These all work fine. > > > While, > > get events WHERE Firm=434550 AND Price=433561 AND Price <= 433561; > > returns an IO Exception. > > > > This feels like it’s attempting to do a full table scan here…. > > What is going on here? > > Am I doing something incorrectly? > > We also see similar behavior when submit the query through our app via the > Thrift API. > > > Thanks, > JohnB >