Re: Choosing a compaction strategy (TWCS)

Voytek Jarnot Wed, 21 Dec 2016 07:40:10 -0800

Just want to bump this thread if possible... having trouble ferreting out
the specifics of TWCS configuration, and Google's not being particularly
helpful.


If tombstone compactions are disabled by default in TWCS, does one enable
them by setting values for tombstone_compaction_interval and
tombstone_threshold?  Or am I way off - is there more to it?
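
For reference, a minimal sketch of what setting those two subproperties might look like. The table name `audit.events` and the values 0.8 / 86400 are placeholders (they mirror the 80% / 24 hour example quoted further down). Whether explicitly setting these options is all it takes to re-enable tombstone compactions under TWCS is an assumption here and is not confirmed anywhere in this thread. The helper only renders CQL text; it does not talk to a cluster.

```python
def twcs_alter_statement(table, window_days, tombstone_threshold, tombstone_interval_s):
    """Render an ALTER TABLE statement that sets TWCS plus tombstone subproperties.

    `table` is a hypothetical keyspace.table name; nothing is sent to Cassandra.
    """
    options = {
        "class": "TimeWindowCompactionStrategy",
        "compaction_window_unit": "DAYS",
        "compaction_window_size": str(window_days),
        # Assumption: setting these explicitly opts in to tombstone compactions.
        "tombstone_threshold": str(tombstone_threshold),
        "tombstone_compaction_interval": str(tombstone_interval_s),
    }
    rendered = ", ".join(f"'{key}': '{value}'" for key, value in options.items())
    return f"ALTER TABLE {table} WITH compaction = {{{rendered}}};"


print(twcs_alter_statement("audit.events", 7, 0.8, 86400))
```

The rendered statement would still need to be checked against the documentation for the Cassandra version in use before running it.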



On Sat, Dec 17, 2016 at 11:08 AM, Voytek Jarnot <voytek.jar...@gmail.com>
wrote:

> Thanks again.
>
> I swear I'd look this up instead, but my google-fu is failing me
> completely ... That said, I presume that they're enabled by setting values
> for tombstone_compaction_interval and tombstone_threshold?  Or is there
> more to it?
>
> On Fri, Dec 16, 2016 at 10:41 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
> wrote:
>
>> With the caveat that tombstone compactions are disabled by default in
>> TWCS (and DTCS)
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Dec 16, 2016, at 8:34 PM, Voytek Jarnot <voytek.jar...@gmail.com>
>> wrote:
>>
>> Gotcha.  "never compacted" has an implicit asterisk referencing
>> tombstone_compaction_interval and tombstone_threshold, sounds like.  More
>> of a "never compacted" via strategy selection, but eligible for
>> tombstone-triggered compaction.
>>
>> On Fri, Dec 16, 2016 at 10:07 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>> wrote:
>>
>>> Tombstone compaction subproperties can handle tombstone removal for you:
>>> you’ll set a ratio of tombstones worth compacting away (for example, 80%)
>>> and an interval to prevent continuous compaction (for example, 24 hours).
>>> Then, anytime there’s no other work to do, if there’s an sstable over 24
>>> hours old that’s at least 80% tombstones, it’ll compact it in a
>>> single-sstable compaction.
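
The rule described above can be sketched as a simple predicate. This is only an illustration of the rule as worded in the message, not Cassandra's code: the `SSTableInfo` type and its fields are hypothetical, and the real check likely considers more than these two numbers (for example, overlap with other sstables).

```python
from dataclasses import dataclass


@dataclass
class SSTableInfo:
    """Hypothetical summary of one sstable, for illustration only."""
    age_seconds: float
    droppable_tombstone_ratio: float  # 0.0 .. 1.0


def worth_tombstone_compaction(sstable, threshold=0.8, interval_seconds=24 * 3600):
    """Return True if the sstable is older than the interval and is at
    least `threshold` tombstones, per the description in the thread."""
    old_enough = sstable.age_seconds > interval_seconds
    mostly_tombstones = sstable.droppable_tombstone_ratio >= threshold
    return old_enough and mostly_tombstones


# A two-day-old sstable that is 85% tombstones qualifies;
# a one-hour-old one does not, whatever its ratio.
print(worth_tombstone_compaction(SSTableInfo(2 * 86400, 0.85)))
print(worth_tombstone_compaction(SSTableInfo(3600, 0.95)))
```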
>>>
>>>
>>>
>>> -          Jeff
>>>
>>>
>>>
>>> *From: *Voytek Jarnot <voytek.jar...@gmail.com>
>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Date: *Friday, December 16, 2016 at 7:34 PM
>>>
>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Subject: *Re: Choosing a compaction strategy (TWCS)
>>>
>>>
>>>
>>> Thanks again, Jeff.
>>>
>>>
>>>
>>> Thinking about this some more, I'm wondering if I'm overthinking or if
>>> there's a potential issue:
>>>
>>>
>>>
>>> If my compaction_window_size is 7 (DAYS), and I've got TTLs of 7 days on
>>> some (relatively small percentage) of my records - am I going to be leaving
>>> tombstones around all over the place?  My noob-read on this is that TWCS
>>> will not compact tables comprised of records older than 7 days (
>>> https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlHowDataMaintain.html#dmlHowDataMaintain__twcs),
>>> but Cassandra will not evict my tombstones until 7 days + consideration for
>>> gc_grace_seconds have passed ... resulting in no tombstone removal (?).
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Dec 16, 2016 at 1:17 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>> wrote:
>>>
>>> The issue is that your partitions will likely be in 2 sstables instead
>>> of “theoretically” 1. In practice, they’re probably going to bleed into 2
>>> anyway (a memtable flush to sstable isn’t going to happen exactly when the
>>> window expires), so I bet there’s no meaningful impact.
>>>
>>>
>>>
>>> -          Jeff
>>>
>>>
>>>
>>> *From: *Voytek Jarnot <voytek.jar...@gmail.com>
>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Date: *Friday, December 16, 2016 at 11:12 AM
>>>
>>>
>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Subject: *Re: Choosing a compaction strategy (TWCS)
>>>
>>>
>>>
>>> Thank you Jeff - always nice to hear straight from the source.
>>>
>>>
>>>
>>> Any issues you can see with 3 (my calendar-week bucket not aligning with
>>> the arbitrary 7-day window)? Or am I confused (I'd put money on this
>>> option, but I've been wrong once or twice before)?
>>>
>>>
>>>
>>> On Fri, Dec 16, 2016 at 12:50 PM, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>> wrote:
>>>
>>> I skipped over the more important question - loading data in. Two
>>> options:
>>>
>>> 1) Load data in order through the normal write path and use “USING
>>> TIMESTAMP” to set the timestamp, or
>>>
>>> 2) Use CQLSSTableWriter and “USING TIMESTAMP” to create sstables,
>>> then sstableloader them into the cluster.
>>>
>>>
>>>
>>> Either way, try not to mix writes of old data and new data in the
>>> “normal” write path at the same time, even if you write “USING TIMESTAMP”,
>>> because it’ll get mixed in the memtable and flushed into the same sstable.
>>> It won’t kill you, but if you can avoid it, avoid it.
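
As a rough sketch of the first option, the snippet below converts an original event time into the microseconds-since-epoch value that `USING TIMESTAMP` expects and renders a parameterized INSERT. The table and column names (`year`, `week`, `event_time`, `payload`) are hypothetical, chosen only to echo the year/week partitioning mentioned in the original question; no driver calls are shown.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_write_timestamp_micros(event_time):
    """Convert a timezone-aware datetime to microseconds since the Unix epoch."""
    delta = event_time - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds


def insert_with_timestamp(table, event_time):
    """Render an INSERT that carries the original event time as the write timestamp.

    The column list is a placeholder schema, not one taken from the thread.
    """
    micros = to_write_timestamp_micros(event_time)
    return (
        f"INSERT INTO {table} (year, week, event_time, payload) "
        f"VALUES (?, ?, ?, ?) USING TIMESTAMP {micros};"
    )


print(insert_with_timestamp("audit.events", datetime(2016, 12, 16, tzinfo=timezone.utc)))
```

Loading the oldest rows first, as suggested above, keeps rows from the same time range together when they are flushed.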
>>>
>>>
>>>
>>> -                      Jeff
>>>
>>>
>>>
>>>
>>>
>>> *From: *Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>> *Date: *Friday, December 16, 2016 at 10:47 AM
>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Subject: *Re: Choosing a compaction strategy (TWCS)
>>>
>>>
>>>
>>> With a 10 year retention, just ignore the target sstable count (I should
>>> remove that guidance, to be honest), and go for a 1 week window to match
>>> your partition size. 520 sstables on disk isn’t going to hurt you as long
>>> as you’re not reading from all of them, and with a partition-per-week the
>>> bloom filter is going to make things nice and easy for you.
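
The 520 figure follows from simple arithmetic, sketched below under the assumption that each fully compacted window ends up as roughly one sstable. The function name is made up for this example.

```python
import math


def expected_windows(retention_days, window_days):
    """Number of time windows needed to cover the retention period."""
    return math.ceil(retention_days / window_days)


# 10 years counted as 52 weeks each, with a 7-day window.
print(expected_windows(10 * 52 * 7, 7))  # 520
```

For comparison, a 30-day window over the same retention would give about 122 windows, still above the "fewer than 50" guidance being set aside here.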
>>>
>>>
>>>
>>> -          Jeff
>>>
>>>
>>>
>>>
>>>
>>> *From: *Voytek Jarnot <voytek.jar...@gmail.com>
>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Date: *Friday, December 16, 2016 at 10:37 AM
>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Subject: *Choosing a compaction strategy (TWCS)
>>>
>>>
>>>
>>> Scenario:
>>>
>>> Converting an Oracle table to Cassandra, one Oracle table to 4 Cassandra
>>> tables, basically time-series - think log or auditing.  Retention is 10
>>> years, but greater than 95% of reads will occur on data written within the
>>> last year. A 7-day TTL is used on a small percentage of the records; the
>>> majority do not use a TTL. Other than the aforementioned TTL and the
>>> 10-year purge, no updates or deletes are done.
>>>
>>>
>>>
>>> Seems like TWCS is the right choice, but I have a few questions/concerns:
>>>
>>>
>>>
>>> 1) I'll be bulk loading a few years of existing data upon deployment -
>>> any issues with that?  I assume using "with timestamp" when inserting this
>>> data will be mandatory if I choose TWCS?
>>>
>>>
>>>
>>> 2) I read here (https://github.com/jeffjirsa/twcs/)
>>> that "You should target fewer than 50 buckets per table based on your TTL."
>>> That's going to be a tough goal with a 10 year retention ... can anyone
>>> speak to how important this target really is?
>>>
>>>
>>>
>>> 3) If I'm bucketing my data with week/year (i.e., partition on year,
>>> week - so today would be in 2016, 50), it seems like a natural fit for
>>> compaction_window_size would be 7 days, but I'm thinking my calendar-based
>>> weeks will never align with TWCS 7-day-period weeks anyway - am I missing
>>> something there?
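
The misalignment in question 3 can be illustrated with a small sketch. It assumes that TWCS cuts windows by truncating the epoch timestamp to a multiple of the window length; that is an assumption about the implementation, not something stated in this thread. Under that assumption, 7-day windows begin on Thursdays at 00:00 UTC (1 January 1970 was a Thursday), while ISO calendar weeks begin on Mondays. The function names are invented for this example.

```python
from datetime import datetime, timedelta, timezone


def twcs_window_start(ts, window_days=7):
    """Start of the epoch-aligned window containing `ts` (assumed TWCS behaviour)."""
    seconds = int(ts.timestamp())
    window_seconds = window_days * 86400
    return datetime.fromtimestamp(seconds - seconds % window_seconds, tz=timezone.utc)


def iso_week_start(ts):
    """Monday 00:00 of the ISO calendar week containing `ts`."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=midnight.weekday())


now = datetime(2016, 12, 16, 12, 0, tzinfo=timezone.utc)
print(now.isocalendar()[:2])    # (year, ISO week) partition bucket
print(iso_week_start(now))      # start of the calendar week
print(twcs_window_start(now))   # start of the assumed compaction window
```

For 16 December 2016 the calendar week starts on Monday the 12th and the assumed compaction window on Thursday the 15th, so a weekly partition would span two windows.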
>>>
>>>
>>>
>>> I'd appreciate any other thoughts on compaction and/or twcs.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>
>>
>
