[
https://issues.apache.org/jira/browse/CASSANDRA-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Semb Wever updated CASSANDRA-18278:
-------------------------------------------
Fix Version/s: 5.x
(was: 5.0)
> Add a tool to clean redundant data for native secondary index
> ---------------------------------------------------------------
>
> Key: CASSANDRA-18278
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18278
> Project: Cassandra
> Issue Type: Improvement
> Components: Feature/2i Index, Tool/nodetool
> Reporter: Maxwell Guo
> Assignee: Maxwell Guo
> Priority: Normal
> Fix For: 5.x
>
>
> Cassandra's native secondary index is a local secondary index. Every update
> that changes an indexed column leaves the old index entry behind. These
> redundant entries (stale entries) stay in the index table and are removed
> only when the data is read (a little like read repair).
> So the index table can accumulate old, useless data that is never read, and
> we would like to provide a tool that removes it.
> In the example below, we create a table with a secondary index on the c1
> column, then write the same pk several times with a different c1 value each
> time, flushing after every write. After that we force a major compaction on
> the index table and dump the index sstable. (The dump tool cannot normally
> be used on secondary index sstables, but fortunately we can rely on
> [CASSANDRA-17698|https://issues.apache.org/jira/browse/CASSANDRA-17698].)
> The dump shows the content of the index sstable.
> The CQL and the dump result are below.
> {code:java}
> cqlsh> DESC ks.tb
> CREATE TABLE ks.tb (
> pk int PRIMARY KEY,
> c1 int
> ) WITH additional_write_policy = '99p'
> AND allow_auto_snapshot = true
> AND bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND cdc = false
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
> AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND memtable = 'default'
> AND crc_check_chance = 1.0
> AND default_time_to_live = 0
> AND extensions = {}
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair = 'BLOCKING'
> AND speculative_retry = '99p';
> CREATE INDEX idx ON ks.tb (c1);
> cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 1);
> cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 2);
> cqlsh> INSERT INTO ks.tb(pk, c1)values (1, 3);
> cqlsh>
> {code}
> We flush after every update and force a major compaction on the index table
> at the end.
> {code:java}
> ➜ bin git:(trunk) ✗ ./nodetool flush
> ➜ bin git:(trunk) ✗ ./nodetool flush
> ➜ bin git:(trunk) ✗ ./nodetool flush
> ➜ bin git:(trunk) ✗ ./nodetool compact ks tb.idx
> ➜ bin git:(trunk) ✗ ../tools/bin/sstabledump ../data/data/ks/tb-65d902b0b2bc11ed86ed81daebeca99d/.idx/nb-13-big-Data.db
> [
> {
> "table kind" : "INDEX",
> "partition" : {
> "key" : [ "1" ],
> "position" : 0
> },
> "rows" : [
> {
> "type" : "row",
> "position" : 18,
> "clustering" : [ 1 ],
> "liveness_info" : { "tstamp" : "2023-02-23T03:21:57.638558Z" },
> "cells" : [ ]
> }
> ]
> },
> {
> "table kind" : "INDEX",
> "partition" : {
> "key" : [ "2" ],
> "position" : 29
> },
> "rows" : [
> {
> "type" : "row",
> "position" : 47,
> "clustering" : [ 1 ],
> "liveness_info" : { "tstamp" : "2023-02-23T03:22:19.834466Z" },
> "cells" : [ ]
> }
> ]
> },
> {
> "table kind" : "INDEX",
> "partition" : {
> "key" : [ "3" ],
> "position" : 61
> },
> "rows" : [
> {
> "type" : "row",
> "position" : 79,
> "clustering" : [ 1 ],
> "liveness_info" : { "tstamp" : "2023-02-23T03:22:27.532174Z" },
> "cells" : [ ]
> }
> ]
> }
> ]
> {code}
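
The dump above still holds index entries for c1 = 1, 2 and 3, although the base row pk = 1 now has c1 = 3. As a rough illustration of the check a cleanup tool would need, here is a minimal Python sketch. It is not Cassandra code and not the design proposed in this ticket: `IndexEntry` and `find_stale_entries` are hypothetical names, and the base table is modelled as a plain dict from pk to the current c1 value.

```python
from typing import Dict, List, NamedTuple


class IndexEntry(NamedTuple):
    """One row of the index table (hypothetical model, not a Cassandra type)."""
    indexed_value: int  # partition key of the index table: the indexed c1 value
    base_pk: int        # clustering of the index table: the base table pk


def find_stale_entries(base_rows: Dict[int, int],
                       index_entries: List[IndexEntry]) -> List[IndexEntry]:
    """Return index entries whose indexed value no longer matches the base row."""
    stale = []
    for entry in index_entries:
        # None when the base row no longer exists at all.
        current = base_rows.get(entry.base_pk)
        if current != entry.indexed_value:
            stale.append(entry)
    return stale


# Base table after the three inserts: pk = 1 now holds c1 = 3.
base_rows = {1: 3}
# Index entries as they appear in the sstabledump output above.
index_entries = [IndexEntry(1, 1), IndexEntry(2, 1), IndexEntry(3, 1)]

stale = find_stale_entries(base_rows, index_entries)
print(stale)
```

With the data from this example, the sketch flags the entries for c1 = 1 and c1 = 2 as stale and keeps the entry for c1 = 3. A real tool would also have to deal with timestamps, tombstones and concurrent writes, which this sketch ignores.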