Re: Old tombstones not being cleaned up

kurt greaves Tue, 20 Feb 2018 16:14:49 -0800

Typically you'll end up with an unrepaired SSTable and a repaired SSTable.
You'll only end up with one if there's absolutely no unrepaired data (which
is very unlikely).
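
To tell which of the two SSTables is which, one option is the "Repaired at" field printed by sstablemetadata (it appears in the output quoted further down, where 0 indicates an unrepaired SSTable). A minimal Python sketch, assuming the tool's output has been captured as text; the helper names are made up for illustration:

```python
import re


def repaired_at(metadata_text: str) -> int:
    """Return the 'Repaired at' value from captured sstablemetadata output."""
    match = re.search(r"^Repaired at:\s*(\d+)\s*$", metadata_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Repaired at' line found")
    return int(match.group(1))


def is_repaired(metadata_text: str) -> bool:
    """A value of 0 marks the SSTable as unrepaired."""
    return repaired_at(metadata_text) > 0


# Excerpt of the output quoted later in this thread.
sample = """SSTable Level: 0
Repaired at: 0
"""

print(is_repaired(sample))
```

Running this over the metadata of each SSTable left after a major compaction should show one repaired and one unrepaired file if incremental repair is involved.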


On 13 February 2018 at 09:54, Bo Finnerup Madsen <bo.gunder...@gmail.com>
wrote:

> Hi Eric,
>
> I had not seen your talk, it was very informative, thank you! :)
>
> Based on your talk, I can see how tombstones might not get removed
> during normal operations under certain conditions. But I am not sure our
> scenario fits those conditions.
>
> We have fewer than 100,000 live rows in the table in question, and when
> flushed the table is roughly 60 MB. Using "nodetool compact" I did several
> full compactions of the table. However, I always ended up with two
> sstables as Jeff mentions, so perhaps there is some kind of issue with the
> incremental repair...
>
>
> On Mon, 12 Feb 2018 at 15:46, Eric Stevens <migh...@gmail.com> wrote:
>
>> Hi,
>>
>> Just in case you haven't seen it, I gave a talk last year at the summit.
>> In the first part of the talk I speak for a while about the lifecycle of a
>> tombstone, and how they don't always get cleaned up when you might expect.
>>
>> https://youtu.be/BhGkSnBZgJA
>>
>> It looks like you're deleting cluster keys on a partition that you always
>> append to?  If so those tombstones can never be cleaned up - see the talk.
>> I don't know if this is what's affecting you or not, but it might be
>> worthwhile to consider.
>>
>> On Mon, Feb 12, 2018, 3:17 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> When you force compacted, did you end up with 1 sstable or 2?
>>>
>>> If 2, did you ever run (incremental) repair on some of the data? If so,
>>> it moves the repaired sstable to a different compaction manager, which
>>> means it won’t purge the tombstone if it shadows data in the unrepaired set
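
The purge rule described above can be sketched as a toy model. This is a simplified illustration in Python, not Cassandra's actual code: the class names, the `partition_keys` set (Cassandra consults bloom filters instead) and the choice of "now" are assumptions made for the example. The idea is that a tombstone past gc_grace can only be dropped if no SSTable outside the current compaction may hold older data for the same partition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    repaired: bool              # informational only; repaired and unrepaired sets compact separately
    min_timestamp: int          # oldest write timestamp in the SSTable (microseconds)
    partition_keys: frozenset   # hypothetical stand-in for a bloom filter lookup


@dataclass(frozen=True)
class Tombstone:
    partition_key: str
    timestamp: int              # markedForDeleteAt (microseconds)
    local_deletion_time: int    # seconds since the epoch


def can_purge(tombstone, compacting, all_sstables, now, gc_grace_seconds):
    """Simplified model of whether a compaction may drop a tombstone."""
    # The tombstone must be older than gc_grace.
    if tombstone.local_deletion_time >= now - gc_grace_seconds:
        return False
    # No SSTable outside this compaction may hold data the tombstone could shadow.
    for sstable in all_sstables:
        if sstable in compacting:
            continue
        if (tombstone.partition_key in sstable.partition_keys
                and sstable.min_timestamp <= tombstone.timestamp):
            return False
    return True


repaired = SSTable("repaired", True, 1507819135148000, frozenset({"foo"}))
unrepaired = SSTable("unrepaired", False, 1488976211688000, frozenset({"foo"}))
tombstone = Tombstone("foo", 1507819135148000, 1507819135)
now = 1517468644  # assumed "now", taken from the maximum timestamp quoted below

# Compacting only the repaired set: the unrepaired SSTable blocks the purge.
print(can_purge(tombstone, [repaired], [repaired, unrepaired], now, 3600))
# Compacting both together: nothing outside the compaction can be shadowed.
print(can_purge(tombstone, [repaired, unrepaired], [repaired, unrepaired], now, 3600))
```

In this model a major compaction that keeps the repaired and unrepaired sets apart never purges a tombstone whose partition also has older data in the other set, which matches the two-SSTable outcome discussed in this thread.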
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 12, 2018, at 12:46 AM, Bo Finnerup Madsen <bo.gunder...@gmail.com>
>>> wrote:
>>>
>>> Well for anyone having the same issue, I "fixed" it by dropping and
>>> re-creating the table.
>>>
>>> On Fri, 2 Feb 2018 at 07:29, Steinmaurer, Thomas <
>>> thomas.steinmau...@dynatrace.com> wrote:
>>>
>>>> Right. In this case, cleanup should have done the necessary work here.
>>>>
>>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> *From:* Bo Finnerup Madsen [mailto:bo.gunder...@gmail.com]
>>>> *Sent:* Friday, 02 February 2018 06:59
>>>>
>>>>
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: Old tombstones not being cleaned up
>>>>
>>>>
>>>>
>>>> We did start with a 3 node cluster and a RF of 3, then added another 3
>>>> nodes and again another 3 nodes. So it is a good guess :)
>>>>
>>>> But I have run both repair and cleanup against the table on all nodes,
>>>> would that not have removed any stray partitions?
>>>>
>>>> On Thu, 1 Feb 2018 at 22:31, Steinmaurer, Thomas <
>>>> thomas.steinmau...@dynatrace.com> wrote:
>>>>
>>>> Did you start with a 9 node cluster from the beginning or did you
>>>> extend / scale out your cluster (with vnodes) beyond the replication
>>>> factor?
>>>>
>>>>
>>>>
>>>> If the latter applies and if you are deleting by explicit deletes and not
>>>> via TTL, then nodes might not see the deletes anymore, as a node might not
>>>> own the partition anymore after a topology change (e.g. scale out beyond
>>>> the keyspace RF).
>>>>
>>>>
>>>>
>>>> Just a very wild guess.
>>>>
>>>>
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> *From:* Bo Finnerup Madsen [mailto:bo.gunder...@gmail.com]
>>>> *Sent:* Thursday, 01 February 2018 22:14
>>>>
>>>>
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: Old tombstones not being cleaned up
>>>>
>>>>
>>>>
>>>> We do not use TTL anywhere...records are inserted and deleted
>>>> "manually" by our software.
>>>>
>>>> On Thu, 1 Feb 2018 at 18:29, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>
>>>> Changing the default TTL doesn’t change the TTL on the existing data,
>>>> only new data. It’s only set if you don’t supply one yourself.
>>>>
>>>>
>>>>
>>>> On Wed, Jan 31, 2018 at 11:35 PM Bo Finnerup Madsen <
>>>> bo.gunder...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> We are running a small 9 node Cassandra v2.1.17 cluster. The cluster
>>>> generally runs fine, but we have one table that is causing OOMs because of
>>>> an enormous number of tombstones.
>>>>
>>>> Looking at the data in the table (sstable2json), the oldest of the
>>>> tombstones are almost a year old. The table was initially created with a
>>>> gc_grace_seconds of 10 days, but I have now lowered it to 1 hour.
>>>>
>>>> I have run a full repair of the table across all nodes. I have forced
>>>> several major compactions of the table by using "nodetool compact", and
>>>> also tried to switch from LeveledCompactionStrategy to
>>>> SizeTieredCompactionStrategy and back.
>>>>
>>>>
>>>>
>>>> What could cause cassandra to keep these tombstones?
>>>>
>>>>
>>>>
>>>> sstable2json:
>>>>
>>>> {"key": "foo",
>>>>
>>>>  "cells": [["0000082f-25ef-4324-bb8a-8cf013c823c1:_","0000082f-25ef-4324-bb8a-8cf013c823c1:!",1507819135148000,"t",1507819135],
>>>>
>>>>            ["000010f3-c05d-4ab9-9b8a-e6ebd8f5818a:_","000010f3-c05d-4ab9-9b8a-e6ebd8f5818a:!",1503661731697000,"t",1503661731],
>>>>
>>>>            ["00001d7a-ce95-4c74-b67e-f8cdffec4f85:_","00001d7a-ce95-4c74-b67e-f8cdffec4f85:!",1509542102909000,"t",1509542102],
>>>>
>>>>            ["00001dd3-ae22-4f6e-944a-8cfa147cde68:_","00001dd3-ae22-4f6e-944a-8cfa147cde68:!",1512418006838000,"t",1512418006],
>>>>
>>>>            ["000022cc-d69c-4596-89e5-3e976c0cb9a8:_","000022cc-d69c-4596-89e5-3e976c0cb9a8:!",1497377448737001,"t",1497377448],
>>>>
>>>>            ["00002777-4b1a-4267-8efc-c43054e63170:_","00002777-4b1a-4267-8efc-c43054e63170:!",1491014691515001,"t",1491014691],
>>>>
>>>>            ["000061e8-f48b-4484-96f1-f8b6a3ed8f9f:_","000061e8-f48b-4484-96f1-f8b6a3ed8f9f:!",1500820300544000,"t",1500820300],
>>>>
>>>>            ["000063da-f165-449b-b65d-2b7869368734:_","000063da-f165-449b-b65d-2b7869368734:!",1512806634968000,"t",1512806634],
>>>>
>>>>            ["0000656f-f8b5-472b-93ed-1a893002f027:_","0000656f-f8b5-472b-93ed-1a893002f027:!",1514554716141000,"t",1514554716],
>>>>
>>>> ...
>>>>
>>>> {"key": "bar",
>>>>
>>>>  "metadata": {"deletionInfo": {"markedForDeleteAt":1517402198585982,"localDeletionTime":1517402198}},
>>>>
>>>>  "cells": [["000af8c2-ffe9-4217-9032-61a1cd21781d:_","000af8c2-ffe9-4217-9032-61a1cd21781d:!",1495094965916000,"t",1495094965],
>>>>
>>>>            ["005b96cb-7eb3-4ec3-bfa2-8573e46892f4:_","005b96cb-7eb3-4ec3-bfa2-8573e46892f4:!",1516360186865000,"t",1516360186],
>>>>
>>>>            ["005ec167-aa61-4868-a3ae-a44b00099eb6:_","005ec167-aa61-4868-a3ae-a44b00099eb6:!",1516671840920002,"t",1516671840],
>>>>
>>>> ....
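
To put dates on the tombstones in the dump above, the last element of each cell can be read as the local deletion time in epoch seconds. A small Python sketch, assuming the five-element layout seen in the dump (start, end, deletion timestamp in microseconds, "t", local deletion time in seconds); the function names and the value used for "now" are assumptions for illustration:

```python
from datetime import datetime, timezone

# Two of the range tombstones from the sstable2json dump above.
cells = [
    ["000022cc-d69c-4596-89e5-3e976c0cb9a8:_", "000022cc-d69c-4596-89e5-3e976c0cb9a8:!",
     1497377448737001, "t", 1497377448],
    ["00002777-4b1a-4267-8efc-c43054e63170:_", "00002777-4b1a-4267-8efc-c43054e63170:!",
     1491014691515001, "t", 1491014691],
]


def deletion_time(cell):
    """Local deletion time of a tombstone cell as a UTC datetime."""
    return datetime.fromtimestamp(cell[4], tz=timezone.utc)


def past_gc_grace(cell, now_seconds, gc_grace_seconds):
    """True if the tombstone is older than gc_grace at the given time."""
    return cell[4] + gc_grace_seconds < now_seconds


# Assumed "now": the maximum timestamp reported by sstablemetadata, in seconds.
now_seconds = 1517468644

for cell in cells:
    age_days = (now_seconds - cell[4]) / 86400
    print(deletion_time(cell).isoformat(), round(age_days), past_gc_grace(cell, now_seconds, 3600))
```

With gc_grace_seconds at 3600, both tombstones are far past the grace period, so age alone does not explain why they survive compaction.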
>>>>
>>>>
>>>>
>>>> sstablemetadata:
>>>>
>>>> sstablemetadata /data/cassandra/data/xxx/yyy-9ed502c0734011e6a128fdafd829b1c6/ddp-yyy-ka-2741-Data.db
>>>>
>>>> SSTable: /data/cassandra/data/xxx/yyy-9ed502c0734011e6a128fdafd829b1c6/ddp-yyy-ka-2741
>>>>
>>>> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>>>>
>>>> Bloom Filter FP chance: 0.100000
>>>>
>>>> Minimum timestamp: 1488976211688000
>>>>
>>>> Maximum timestamp: 1517468644066000
>>>>
>>>> SSTable max local deletion time: 2147483647
>>>>
>>>> Compression ratio: 0.5121956624389545
>>>>
>>>> Estimated droppable tombstones: 18.00161766553587
>>>>
>>>> SSTable Level: 0
>>>>
>>>> Repaired at: 0
>>>>
>>>> ReplayPosition(segmentId=1517168739626, position=22690189)
>>>>
>>>> Estimated tombstone drop times:%n
>>>>
>>>> 1488976211:         1
>>>>
>>>> 1489906506:      4706
>>>>
>>>> 1490174752:      6111
>>>>
>>>> 1490449759:      6554
>>>>
>>>> 1490735410:      6559
>>>>
>>>> 1491016789:      6369
>>>>
>>>> 1491347982:     10216
>>>>
>>>> 1491680214:     13502
>>>>
>>>> ...
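
The drop-time histogram above can be summed to estimate how many tombstones are already older than gc_grace. A rough Python sketch, assuming each histogram line has the form "epoch_seconds: count" as shown; the function names and the chosen "now" are assumptions, and the histogram itself is only an estimate:

```python
def parse_drop_times(text):
    """Parse 'epoch_seconds: count' lines into a dict, skipping anything else."""
    histogram = {}
    for line in text.splitlines():
        left, sep, right = line.partition(":")
        if sep and left.strip().isdigit() and right.strip().isdigit():
            histogram[int(left)] = int(right)
    return histogram


def droppable_count(histogram, now_seconds, gc_grace_seconds):
    """Sum the tombstones whose drop time is before now minus gc_grace."""
    cutoff = now_seconds - gc_grace_seconds
    return sum(count for drop_time, count in histogram.items() if drop_time < cutoff)


# The histogram lines quoted above (the real output continues beyond these).
sample = """
1488976211:         1
1489906506:      4706
1490174752:      6111
1490449759:      6554
1490735410:      6559
1491016789:      6369
1491347982:     10216
1491680214:     13502
"""

histogram = parse_drop_times(sample)
# Assumed "now": the maximum timestamp reported by sstablemetadata, in seconds.
print(droppable_count(histogram, 1517468644, 3600))
```

Every bucket shown falls well before the cutoff, which is consistent with the very high "Estimated droppable tombstones" figure reported for this SSTable.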
>>>>
>>>>
>>>>
>>>> desc:
>>>>
>>>> CREATE TABLE xxx.yyy (
>>>>
>>>>     ti text,
>>>>
>>>>     uuid text,
>>>>
>>>>     json_data text,
>>>>
>>>>     PRIMARY KEY (ti, uuid)
>>>>
>>>> ) WITH CLUSTERING ORDER BY (uuid ASC)
>>>>
>>>>     AND bloom_filter_fp_chance = 0.1
>>>>
>>>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>>>
>>>>     AND comment = ''
>>>>
>>>>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>>>
>>>>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>>
>>>>     AND dclocal_read_repair_chance = 0.1
>>>>
>>>>     AND default_time_to_live = 0
>>>>
>>>>     AND gc_grace_seconds = 3600
>>>>
>>>>     AND max_index_interval = 2048
>>>>
>>>>     AND memtable_flush_period_in_ms = 0
>>>>
>>>>     AND min_index_interval = 128
>>>>
>>>>     AND read_repair_chance = 0.0
>>>>
>>>>     AND speculative_retry = '99.0PERCENTILE';
>>>>
>>>>
>>>>
>>>> jmx props(picture):
>>>>
>>>> [image: image001.png]
>>>>
>>>> The contents of this e-mail are intended for the named addressee only.
>>>> It contains information that may be confidential. Unless you are the named
>>>> addressee or an authorized designee, you may not copy or use it, or
>>>> disclose it to anyone else. If you received it in error please notify us
>>>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>>>> number FN 91482h) is a company registered in Linz whose registered office
>>>> is at 4040 Linz, Austria, Freistädterstraße 313
>>>>
>>>
