Querying a table with 5,000 tombstones takes 3 minutes to complete! But querying the same table, with the same data pattern, holding 10,000 live entries takes a fraction of a second!
Details:

1. Created the following table:

   CREATE KEYSPACE test WITH replication =
       {'class': 'SimpleStrategy', 'replication_factor': '1'};
   USE test;
   CREATE TABLE job_index (
       stage text,
       "timestamp" text,
       PRIMARY KEY (stage, "timestamp")
   );

2. Inserted 5,000 entries into the table:

   INSERT INTO job_index (stage, timestamp) VALUES ('a', '00000001');
   INSERT INTO job_index (stage, timestamp) VALUES ('a', '00000002');
   ...
   INSERT INTO job_index (stage, timestamp) VALUES ('a', '00004999');
   INSERT INTO job_index (stage, timestamp) VALUES ('a', '00005000');

3. Flushed the table:

   nodetool flush test job_index

4. Deleted the 5,000 entries:

   DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00000001';
   DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00000002';
   ...
   DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00004999';
   DELETE FROM job_index WHERE stage = 'a' AND timestamp = '00005000';

5. Flushed the table:

   nodetool flush test job_index

6. Querying the table takes 3 minutes to complete:

   cqlsh:test> SELECT * FROM job_index LIMIT 20000;

   tracing: http://pastebin.com/jH2rZN2X

While the query was executing, I saw a lot of GC entries in Cassandra's log:

   DEBUG [ScheduledTasks:1] 2013-07-01 23:47:59,221 GCInspector.java (line 121) GC for ParNew: 30 ms for 6 collections, 263993608 used; max is 2093809664
   DEBUG [ScheduledTasks:1] 2013-07-01 23:48:00,222 GCInspector.java (line 121) GC for ParNew: 29 ms for 6 collections, 186209616 used; max is 2093809664
   DEBUG [ScheduledTasks:1] 2013-07-01 23:48:01,223 GCInspector.java (line 121) GC for ParNew: 29 ms for 6 collections, 108731464 used; max is 2093809664

It seems that something very inefficient is happening in managing tombstones.

For comparison, if I start with a clean table and do the following:

1. insert 5,000 entries
2. flush to disk
3. insert 5,000 new entries
4.
flush to disk

then querying job_index for all 10,000 entries takes a fraction of a second to complete:

   tracing: http://pastebin.com/scUN9JrP

The fact that iterating over 5,000 tombstones takes 3 minutes, while iterating over 10,000 live cells takes a fraction of a second, suggests that something very inefficient is happening in managing tombstones. I would appreciate it if a developer could look into this.

-M
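For anyone trying to reproduce this, the 5,000 INSERT and DELETE statements from steps 2 and 4 can be generated with a short script and piped into cqlsh. This is just my sketch; the script name and function name are not part of the original report, and it assumes the test.job_index table from step 1 already exists on a local single-node cluster:

```python
# gen_cql.py -- emit the INSERT or DELETE statements used in the repro above.
# (Hypothetical helper, not from the original report.)

def statements(kind, n=5000):
    """Yield INSERT or DELETE statements for timestamps '00000001'..n."""
    for i in range(1, n + 1):
        ts = "%08d" % i  # zero-padded to match '00000001' in the report
        if kind == "insert":
            yield f'INSERT INTO test.job_index (stage, "timestamp") VALUES (\'a\', \'{ts}\');'
        else:
            yield f'DELETE FROM test.job_index WHERE stage = \'a\' AND "timestamp" = \'{ts}\';'

if __name__ == "__main__":
    import sys
    kind = sys.argv[1] if len(sys.argv) > 1 else "insert"
    for stmt in statements(kind):
        print(stmt)
```

The repro then becomes: `python gen_cql.py insert | cqlsh`, `nodetool flush test job_index`, `python gen_cql.py delete | cqlsh`, flush again, and time the SELECT.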