Hi all,

I use Zipkin (https://github.com/openzipkin/zipkin) to trace my system. After I upgraded to the latest version (3.23, to be specific), our monitoring kept alerting that there is not enough disk space for Cassandra. After some investigation, I found that the biggest file is the index file.

I also found a blog post (http://www.doanduyhai.com/blog/?p=2058), which says:

> As we can see, using CONTAINS mode can increase the disk usage by x4 - x6. Since album titles tends to be a long text, the inflation rate is x6. It will be more if we chose the NonTokenizingAnalyzer because the StandardAnalyzer splits the text into tokens, remove stop words and perform stemming. All this help reducing the total size of the term.
>
> As a conclusion, use CONTAINS mode wisely and be ready to pay the price in term of disk space. There is no way to avoid it. Even with efficient search engines like ElasticSearch or Solr, it is officially recommended to avoid substring search (LIKE %substring%) for the sake of performance.

Zipkin2 creates the index as follows:

    CREATE CUSTOM INDEX IF NOT EXISTS ON zipkin2.span (annotation_query)
    USING 'org.apache.cassandra.index.sasi.SASIIndex'
    WITH OPTIONS = {
        'mode': 'CONTAINS',
        'analyzed': 'true',
        'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
        'case_sensitive': 'false'
    };

I cannot understand why choosing NonTokenizingAnalyzer rather than StandardAnalyzer as the analyzer_class would use more disk space. When I debug the code, only one term is returned when NonTokenizingAnalyzer is used.

I would appreciate any help. Thanks a lot.
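
P.S. To make the question concrete, here is one guess I would like confirmed or corrected. It rests on an assumption I have not verified in the SASI source: that CONTAINS mode indexes every suffix of each term. If that holds, a single long term (NonTokenizingAnalyzer) would contribute far more suffix characters than the same text split into short tokens. A rough Python sketch of that arithmetic follows; the sample value and the whitespace split are hypothetical stand-ins, not real Zipkin data or the real StandardAnalyzer:

```python
def suffix_chars(term: str) -> int:
    """Total characters over all suffixes of one term: n + (n-1) + ... + 1."""
    n = len(term)
    return n * (n + 1) // 2


def index_cost(terms) -> int:
    """Sum of suffix characters over all indexed terms (a crude size proxy)."""
    return sum(suffix_chars(t) for t in terms)


if __name__ == "__main__":
    # Hypothetical annotation_query value, for illustration only.
    value = "http.path=/api/v2/spans error=timeout".lower()

    # NonTokenizingAnalyzer: the whole value is one term.
    non_tokenized = [value]

    # Crude stand-in for StandardAnalyzer: split on whitespace only
    # (the real analyzer also drops stop words and applies stemming).
    tokenized = value.split()

    print("one long term :", index_cost(non_tokenized))
    print("short tokens  :", index_cost(tokenized))
```

Because the cost of one term grows roughly with the square of its length, one term of length n costs more than several terms whose lengths add up to n. This is only a proxy for on-disk size, not a measurement of what SASI actually writes.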
