Hello,

I don't know if this is the right place to talk about this, but I think it might
help some people running into the same issue, so I will simply share some
feedback here about what I have been running into over the last few months

I've been using a secondary index (yes, this is bad), but the side effects I've
been fighting were not documented or discussed anywhere, which is why I want to
share my experience with it

Background:
- The cluster had 8 nodes (around 30GB per node) handling 4000 reads/s and 4000
writes/s
- The biggest table (event_data) is 200GB cluster-wide and takes something like
1000 writes/s and a few reads/s
- This table had a secondary index (1.2GB cluster-wide) (yes, this is huge)

CREATE TABLE event_data (
  object int,
  created_at timeuuid,
  message text,
  source text,
... few other things...
  PRIMARY KEY ((object), created_at)
) WITH
... few other things...
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

CREATE INDEX event_data_source_index ON event_data (source);

We had many issues with the cluster, and some of them were hard to correlate:
- Unable to add a node: it kept joining forever, at 100% CPU, with no
WARN/ERROR logs
- When running a repair, all my Thrift clients were timing out randomly (wtf)

I switched the joining node to DEBUG logging, and the logs were talking about
nothing but the index being synced, at a very slow pace (logging every indexed
key). I concluded that the index was being synchronized more slowly than it was
actually changing on the other nodes, which made the join effectively
never-ending
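
For reference, how you raise the log level depends on your Cassandra version:
either edit the logging configuration on the joining node, or, on more recent
versions, do it at runtime with nodetool (the class to target below is an
assumption; adjust it to what you want to trace):

nodetool setlogginglevel org.apache.cassandra DEBUG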

I decided to drop the secondary index, and everything is now running fine!
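
For anyone in the same situation, dropping the index is a single statement (the
index name is the one from the schema above):

DROP INDEX event_data_source_index;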

We also changed the compaction strategy from STCS to DTCS, and it rocks!
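
If you want to try the same change, something like the following ALTER should
do it, on a version that ships DTCS (tune the options for your own workload):

ALTER TABLE event_data
  WITH compaction = {'class': 'DateTieredCompactionStrategy'};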

I hope this message will help someone someday,

Edouard COLE

