Hello,

@Chris, I mostly agree with you. I will try to clarify what I had in mind,
as it was obviously not well expressed.


> it doesn't matter if the tombstone is overlapped it still need to be kept
> for the gc_grace before purging or it can result in data resurrection.


Yes, I agree. I do not recommend lowering the gc_grace without giving it a
thought. I was saying that this was not part of the ratio calculation as
you explained.



> sstablemetadata cannot reliably or safely know the table parameters that
> are not kept in the sstable so to get an accurate value you have to provide
> a -g or --gc-grace-seconds parameter. I am not sure where the "always
> wrong" comes in as the quantity of data thats being shadowed is not what
> its tracking (although it would be more meaningful for single sstable
> compactions if it did), just when tombstones can be purged.



What I tried to say is that the "estimated droppable tombstone" ratio does
not account for overlaps or gc_grace_seconds. Thus when the ratio shows
0.7, after running compactions you have no guarantee that this number will
be any lower, and you can almost be sure it will not reach 0. Tombstones
will stay around. In that sense, I said a bit too strongly that this value
is always "wrong".

I was not saying it's easy or even possible to get accurate information. I
was rather warning users that, in practice, the number of tombstones
actually dropped is often far from what this "estimated" ratio suggests.

@Ayub

> Firstly I am not seeing any difference when using gc_grace_seconds with
> sstablemetadata.

As we both said (or tried to say), this is expected. Yet, during
compactions, the tombstones will become eligible for eviction (definitive
removal) sooner. You can test it (on a test cluster): the tombstone should
go away with a compaction, but only after 'gc_grace_seconds' has elapsed.
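
For example, here is a rough sketch of such a test, using the 'ks.nmtest'
table from your message (adjust names to your setup; do this on a test
cluster only, as lowering gc_grace_seconds has the side effects discussed
below):

```sql
-- TEST CLUSTER ONLY: lower gc_grace_seconds so the tombstone becomes
-- droppable quickly.
ALTER TABLE ks.nmtest WITH gc_grace_seconds = 60;

-- Then, from the shell:
--   nodetool flush ks nmtest
--   (wait at least 60 seconds)
--   nodetool compact ks nmtest
-- sstabledump on the resulting sstable should no longer show the
-- collection tombstone (assuming no overlaps with other sstables).
```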

To work with this in prod, you should be sure that it won't harm the
cluster...
About gc_grace_seconds, remember that:
- gc_grace_seconds > repair interval (full, whole cluster) - if you are
performing deletes. Which I think might be your case if you're inserting
collections on top of existing ones (instead of updating specific keys in
them). This, as Chris said, can lead to inconsistencies. TTLs are OK (no
repair needed - no more than for other columns without TTLs).
- gc_grace_seconds impacts the hints TTL; Radovan wrote about this exact
topic here:
http://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html

This is a risky path to follow if you are not sure about the impacts.

> Second question, for this test table I see the original record which I
> inserted got its tombstone removed by autocompaction which ran today as its
> gc_grace_seconds is set to one day. But I see some tables whose
> gc_grace_seconds is set to 3 days but when I do sstabledump on them I see
> tombstone entries for them in the json output and sstablemetadata also
> shows them to be estimated tombstones records. I see autocompaction is
> running on the stables of this table and I also manually ran using jmx
> shell but they are still there...any reason why they are not getting
> deleted?


As we also mentioned earlier, there are some conditions that can prevent a
tombstone from actually being dropped, even if it's part of a compaction
after gc_grace_seconds.
In particular, if the tombstone is 'covering'/'shadowing' previously
existing data that still exists in some other sstable(s) not part of the
compaction, then Cassandra cannot safely remove the tombstone, as the
latest still-existing value for those cells would reappear. That's what we
call 'overlaps', and it can prevent tombstones from being purged.

There are some possibilities:
- Major compaction: to make it short, we almost never recommend it unless
you know what you're doing, because it will lead to one big sstable that
would no longer be automatically compacted for a long while, quickly
making things worse after a short-term improvement.
- Compact using JMX, but selecting all the needed sstables. Depending on
your use case / compaction strategy, this can quickly lead either to the
first point above (as at some point all the sstables might be involved) or
to a complex strategy for handling compactions.
- Use a different compaction strategy, or tune the current one, to ease
tombstone removal.
- If the tombstones are not an issue in the read path (thus not creating
latency), nor for disk space, then maybe just ignore them. The tombstones
are there by design (
https://jsravn.com/2015/05/13/cassandra-tombstones-collections/#lists) when
inserting collection objects.
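
On tuning the current strategy: a sketch of the tombstone-related
compaction sub-properties, shown here on your 'ks.nmtest' table (the
threshold value is an example to adjust):

```sql
-- Make single-sstable 'tombstone compactions' more aggressive.
-- 'tombstone_threshold' is the estimated droppable tombstone ratio above
-- which a single-sstable compaction is triggered (default 0.2);
-- 'unchecked_tombstone_compaction' lets it run even when Cassandra
-- estimates that overlaps would prevent purging the tombstones.
ALTER TABLE ks.nmtest WITH compaction = {
    'class': 'SizeTieredCompactionStrategy',
    'tombstone_threshold': '0.2',
    'unchecked_tombstone_compaction': 'true'
};
```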

But if you're not performing deletes, and tombstones appear way before the
TTL time, I would say it's an issue with the insert of the 'map', and I
would suggest focusing on changing the queries/model to fix the massive
tombstone creation in the first place; that will probably make things way
nicer/easier.
The person in the post above used JSON as his way out of this problem.
In the past, I kept the collection but changed the queries from 'insert' to
'update', as described here:
http://cassandra.apache.org/doc/4.0/cql/types.html#id5.
Using this form worked without creating tombstones, if I remember
correctly. I would guess this is because when you 'update', you accept that
the previously set values in the map remain unchanged if they are not
specified, so Cassandra doesn't have to 'clean' first. For an insert, it
would be weird to insert a map and find two keys from the previous
'insert'; since Cassandra does not read before writing, it cleans (with a
tombstone) and writes on top.
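
To illustrate with your 'nmtest' table, the two write forms compare like
this (a sketch; note that fully re-assigning the map in an 'update' also
creates a tombstone - only the append / per-key forms avoid it):

```sql
-- Re-inserting the whole map: Cassandra first writes a range tombstone
-- covering the old collection, then writes the new values.
INSERT INTO ks.nmtest (reservation_id, order_id, order_details)
VALUES ('5', '5', {'key': 'value'});

-- Appending specific keys to the map: no tombstone is written.
UPDATE ks.nmtest
SET order_details = order_details + {'key2': 'value2'}
WHERE reservation_id = '5' AND order_id = '5';

-- Setting a single key: no tombstone either.
UPDATE ks.nmtest
SET order_details['key1'] = 'new_value'
WHERE reservation_id = '5' AND order_id = '5';
```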

Good luck,

C*heers,
-----------------------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



On Sun, Jan 27, 2019 at 09:02, Ayub M <hia...@gmail.com> wrote:

> Thanks Alain/Chris.
>
> Firstly I am not seeing any difference when using gc_grace_seconds with
> sstablemetadata.
>
> CREATE TABLE ks.nmtest (
>     reservation_id text,
>     order_id text,
>     c1 int,
>     order_details map<text, text>,
>     PRIMARY KEY (reservation_id, order_id)
> ) WITH CLUSTERING ORDER BY (order_id ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND comment = ''
>     AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 1.0
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 86400
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
>
> [root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
> sstabledump mc-11-big-Data.db
> WARN  08:27:32,793 memtable_cleanup_threshold has been deprecated and
> should be removed from cassandra.yaml
> [
>   {
>     "partition" : {
>       "key" : [ "4" ],
>       "position" : 0
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 40,
>         "clustering" : [ "4" ],
>         "cells" : [
>           { "name" : "order_details", "path" : [ "key1" ], "value" :
> "value1", "tstamp" : "2019-01-27T08:26:49.633240Z" }
>         ]
>       }
>     ]
>   },
>   {
>     "partition" : {
>       "key" : [ "5" ],
>       "position" : 41
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 82,
>         "clustering" : [ "5" ],
>         "liveness_info" : { "tstamp" : "2019-01-27T08:23:29.782506Z" },
>         "cells" : [
>           { "name" : "c1", "value" : 5 },
>           { "name" : "order_details", "deletion_info" : { "marked_deleted"
> : "2019-01-27T08:23:29.782505Z", "local_delete_time" :
> "2019-01-27T08:23:29Z" } },
>           { "name" : "order_details", "path" : [ "key" ], "value" :
> "value" }
>         ]
>       }
>     ]
>   }
>
> Partition 5 is a newly inserted record, no matter what gc_grace_seconds
> value I pass it still shows this record as estimated tombstone.
>
> [root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
> sstablemetadata mc-11-big-Data.db | grep "Estimated tombstone drop times"
> -A3
> Estimated tombstone drop times:
> 1548577440:         1
> Count               Row Size        Cell Count
>
> [root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
> sstablemetadata  --gc_grace_seconds 86400 mc-11-big-Data.db | grep
> "Estimated tombstone drop times" -A4
> Estimated tombstone drop times:
> 1548577440:         1
> Count               Row Size        Cell Count
>
> Second question, for this test table I see the original record which I
> inserted got its tombstone removed by autocompaction which ran today as its
> gc_grace_seconds is set to one day. But I see some tables whose
> gc_grace_seconds is set to 3 days but when I do sstabledump on them I see
> tombstone entries for them in the json output and sstablemetadata also
> shows them to be estimated tombstones records. I see autocompaction is
> running on the stables of this table and I also manually ran using jmx
> shell but they are still there...any reason why they are not getting
> deleted?
>
> sstablemetadata  --gc_grace_seconds 259200 mc-732-big-Data.db | grep
> "Estimated tombstone drop times" -A10
>
> WARN  07:28:03,086 memtable_cleanup_threshold has been deprecated and
> should be removed from cassandra.yaml
>
> Estimated tombstone drop times:
>
> 1537475340:         7
>
> 1537476150:        14
>
> 1537476360:         7
>
> 1537476660:         7
>
>
> one record from son file having old tombstone markers....
>
>   {
>
>     "partition" : {
>
>       "key" : [ "2945132807" ],
>
>       "position" : 9036596
>
>     },
>
> "rows" : [
>
>       {
>
>         "type" : "row",
>
>         "position" : 9037781,
>
>         "clustering" : [ "2018-08-15 00:00:00.000Z", "233359" ],
>
>         "liveness_info" : { "tstamp" : "2018-09-26T14:52:54.255395Z" },
>
>         "cells" : [
>
>    .....
>
>           { "name" : "col1", "deletion_info" : { "marked_deleted" :
> "2018-09-26T14:52:54.255394Z", "local_delete_time" : "2018-09-26T14:52:54Z"
> } },
>
>           { "name" : "col1", "path" : [ "zczxc" ], "value" : "ZXczx" },
>
>           { "name" : "col1", "path" : [ "ZXCxzc" ], "value" : "ZCzxc" },
>
>           { "name" : "col2", "deletion_info" : { "marked_deleted" :
> "2018-09-26T14:52:54.255394Z", "local_delete_time" : "2018-09-26T14:52:54Z"
> } },
>
>           { "name" : "col2", "path" : [ "zcxxc" ], "value" : false },
>
>           { "name" : "col2", "path" : [ "hjhkjh" ], "value" : false },
>
>           { "name" : "col2", "path" : [ "LEGACY" ], "value" : true },
>
>           { "name" : "col2", "path" : [ "NON_SITE_SPECIFIC" ], "value" :
> true },
>
>           { "name" : "issuance_data", "deletion_info" : { "marked_deleted"
> : "2018-09-26T14:52:54.255394Z", "local_delete_time" :
> "2018-09-26T14:52:54Z" } },
>
>   .....
>
