[ 
https://issues.apache.org/jira/browse/CASSANDRA-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-18278:
-------------------------------------------
    Fix Version/s: 5.x
                       (was: 5.0)

> Add a tool to clean  redundant data for native secondary index 
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-18278
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18278
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Feature/2i Index, Tool/nodetool
>            Reporter: Maxwell Guo
>            Assignee: Maxwell Guo
>            Priority: Normal
>             Fix For: 5.x
>
>
> As we know, Cassandra's secondary index is a local secondary index. When an 
> update changes the value of an indexed column, the old index entry becomes 
> redundant (a stale entry), and it is cleaned out of the index table only 
> when the data is read (a little like read repair).
> So old, useless data can remain in the index table if it is never read. We 
> would like to provide a tool that removes this stale data.
> In the example below, we create a table with a secondary index on column 
> c1, then update the row with the same pk but a different c1 value, flushing 
> after every update. After that we force a major compaction on the index 
> table and dump the index sstable. (The dump tool normally cannot be used on 
> secondary index sstables, but fortunately we use 
> [CASSANDRA-17698|https://issues.apache.org/jira/browse/CASSANDRA-17698].) 
> The dump shows the content of the index sstable.
> Below are the CQL and the dump result.
> {code:java}
> cqlsh> DESC ks.tb
> CREATE TABLE ks.tb (
>     pk int PRIMARY KEY,
>     c1 int
> ) WITH additional_write_policy = '99p'
>     AND allow_auto_snapshot = true
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND cdc = false
>     AND comment = ''
>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '16', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND memtable = 'default'
>     AND crc_check_chance = 1.0
>     AND default_time_to_live = 0
>     AND extensions = {}
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair = 'BLOCKING'
>     AND speculative_retry = '99p';
> CREATE INDEX idx ON ks.tb (c1);
> cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 1);
> cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 2);
> cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 3);
> cqlsh> 
> {code}
> We flush after every update and force a major compaction at the end.
> {code:java}
> ➜  bin git:(trunk) ✗ ./nodetool flush
> ➜  bin git:(trunk) ✗ ./nodetool flush
> ➜  bin git:(trunk) ✗ ./nodetool flush
> ➜  bin git:(trunk) ✗ ./nodetool compact ks tb.idx
> ➜  bin git:(trunk) ✗ ../tools/bin/sstabledump ../data/data/ks/tb-65d902b0b2bc11ed86ed81daebeca99d/.idx/nb-13-big-Data.db
> [
>   {
>     "table kind" : "INDEX",
>     "partition" : {
>       "key" : [ "1" ],
>       "position" : 0
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 18,
>         "clustering" : [ 1 ],
>         "liveness_info" : { "tstamp" : "2023-02-23T03:21:57.638558Z" },
>         "cells" : [ ]
>       }
>     ]
>   },
>   {
>     "table kind" : "INDEX",
>     "partition" : {
>       "key" : [ "2" ],
>       "position" : 29
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 47,
>         "clustering" : [ 1 ],
>         "liveness_info" : { "tstamp" : "2023-02-23T03:22:19.834466Z" },
>         "cells" : [ ]
>       }
>     ]
>   },
>   {
>     "table kind" : "INDEX",
>     "partition" : {
>       "key" : [ "3" ],
>       "position" : 61
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 79,
>         "clustering" : [ 1 ],
>         "liveness_info" : { "tstamp" : "2023-02-23T03:22:27.532174Z" },
>         "cells" : [ ]
>       }
>     ]
>   }
> ]
> {code}
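
To make the dump above concrete, here is a minimal Python sketch of the check such a cleanup tool might perform. It is not the proposed implementation. It assumes that each index partition key is an indexed c1 value and that each row's clustering is the base table's pk, which is how the dump reads. `INDEX_DUMP` is a trimmed copy of the dump, and `BASE_TABLE` and `find_stale_entries` are hypothetical names introduced only for illustration.

```python
import json

# Trimmed copy of the sstabledump output above: one index partition per
# indexed c1 value, each holding a single row whose clustering is the base pk.
INDEX_DUMP = """
[
  {"table kind": "INDEX", "partition": {"key": ["1"]},
   "rows": [{"type": "row", "clustering": [1]}]},
  {"table kind": "INDEX", "partition": {"key": ["2"]},
   "rows": [{"type": "row", "clustering": [1]}]},
  {"table kind": "INDEX", "partition": {"key": ["3"]},
   "rows": [{"type": "row", "clustering": [1]}]}
]
"""

# Hypothetical snapshot of ks.tb after the three inserts: pk -> current c1.
# A real tool would have to read the current value from the base table.
BASE_TABLE = {1: 3}


def find_stale_entries(index_dump, base_table):
    """Return (indexed_value, pk) pairs whose base row no longer matches."""
    stale = []
    for partition in json.loads(index_dump):
        indexed_value = int(partition["partition"]["key"][0])
        for row in partition["rows"]:
            pk = row["clustering"][0]
            # The entry is stale if the base row is gone or now holds
            # a different c1 value than the one the index entry points from.
            if base_table.get(pk) != indexed_value:
                stale.append((indexed_value, pk))
    return stale


if __name__ == "__main__":
    for value, pk in find_stale_entries(INDEX_DUMP, BASE_TABLE):
        print(f"stale index entry: c1={value} -> pk={pk}")
```

With the data above, the entries for c1=1 and c1=2 are reported as stale, because pk=1 currently holds c1=3. The sketch ignores write timestamps and concurrent updates, which a real tool would need to take into account before deleting anything.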



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
