On Thu, Oct 8, 2020 at 5:31 AM Attila Wind <attilaw@swf.technology> wrote:

> Hey Guys,
>
> We have already started to feel that, however awesome Cassandra
> performance is in the beginning, over time
> - as more and more data is present in the tables,
> - more and more deletes create tombstones,
> - the cluster gets, here and there, not that well balanced
> performance can drop quickly and significantly...
>
> After a ~1 year learning curve we had to realize that from time to time we
> run into things like "running repairs" and "running compactions", and that
> understanding tombstones (row and range), TTLs, etc. becomes critical as
> data is growing.
> But on the other hand we also often see lots of warnings... Like "if you
> start Cassandra Reaper you can not stop doing that" ...
>
> I feel a bit confused now, and so far never ran into an article which
> really deeply explains: why?
> Why this? Why that? Why not this?
>
I know you're asking in general, but let me describe why it's hard. For
repair in particular, there's a ton of nuance. There are two types of
repair (full and incremental), and then different scopes (-pr for primary
range, start/end tokens for a subrange, repairing all the ranges on a
host, etc).
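
To make the type/scope combinations concrete, here is a minimal sketch of how
they map onto nodetool arguments. It assumes a Cassandra version where plain
"nodetool repair" is incremental and -full asks for a full repair; the helper
name repair_command and the keyspace name are made up for illustration, and
treating -pr and a token subrange as mutually exclusive is a choice of this
sketch, not a statement about what nodetool accepts.

```python
from typing import List, Optional


def repair_command(
    keyspace: str,
    full: bool = False,
    primary_range: bool = False,
    start_token: Optional[int] = None,
    end_token: Optional[int] = None,
) -> List[str]:
    """Build a nodetool repair argv list (hypothetical helper, illustration only)."""
    # A subrange needs both ends.
    if (start_token is None) != (end_token is None):
        raise ValueError("subrange repair needs both a start and an end token")
    # This sketch treats -pr and a subrange as alternative scopes.
    if primary_range and start_token is not None:
        raise ValueError("choose either -pr or a token subrange in this sketch")

    cmd = ["nodetool", "repair"]
    if full:
        cmd.append("-full")  # type: full instead of incremental
    if primary_range:
        cmd.append("-pr")  # scope: only this node's primary range
    if start_token is not None:
        cmd += ["-st", str(start_token), "-et", str(end_token)]  # scope: subrange
    cmd.append(keyspace)
    return cmd


# Full repair of the primary range, then a full subrange repair.
print(" ".join(repair_command("my_keyspace", full=True, primary_range=True)))
print(" ".join(repair_command("my_keyspace", full=True, start_token=0, end_token=1000)))
```

With no flags at all, the same helper produces the "all the ranges on a host"
case, which under the assumption above runs as an incremental repair.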

With full repair, you compare all the data in a token range, stream
differences, and you're done. If you run the same command 30 seconds later,
it has to do the exact same amount of work.

Incremental repair uses clean/dirty (repaired/unrepaired) bits on data
files, and optimizes so you don't have to scan as much data on subsequent
runs. This ALSO means you have 2 different sets of data files - clean and
dirty - and they won't ever compact together until you promote dirty files
to clean files! THAT is the magic bit of knowledge that most people don't
describe when they say "once you start running incremental repair, you
can't stop".
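
A toy model may help here. This is only a sketch of the bookkeeping, not how
Cassandra implements it (real incremental repair also splits files and talks
to other replicas); the class and function names are invented for the
example. It shows the two things described above: full repair scans
everything every time, incremental repair scans only the dirty files, and
compaction only ever looks inside one of the two pools.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SSTable:
    name: str
    size_mb: int
    repaired: bool = False  # the "clean" bit; new data files start out dirty


def full_repair_scan_mb(sstables: List[SSTable]) -> int:
    """Full repair: every file is scanned, regardless of earlier runs."""
    return sum(t.size_mb for t in sstables)


def incremental_repair(sstables: List[SSTable]) -> int:
    """Incremental repair: scan only dirty files, then promote them to clean."""
    scanned = 0
    for t in sstables:
        if not t.repaired:
            scanned += t.size_mb
            t.repaired = True
    return scanned


def compaction_pools(sstables: List[SSTable]) -> Dict[bool, List[SSTable]]:
    """Compaction candidates are grouped by the bit; the pools never mix."""
    pools: Dict[bool, List[SSTable]] = {True: [], False: []}
    for t in sstables:
        pools[t.repaired].append(t)
    return pools


tables = [SSTable("a", 100), SSTable("b", 50)]
print(full_repair_scan_mb(tables))   # 150, and 150 again on every later run
print(incremental_repair(tables))    # 150 the first time
print(incremental_repair(tables))    # 0, nothing is dirty any more
tables.append(SSTable("c", 10))      # newly written data is dirty
pools = compaction_pools(tables)
print([t.name for t in pools[True]], [t.name for t in pools[False]])
```

In the last line, "c" sits alone in the dirty pool and cannot be compacted
with "a" or "b" until another incremental repair promotes it. If those
repairs stop, the dirty pool just keeps growing on its own.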

If you're using Reaper for full subrange repairs, you could stop at any
time. But if you're using it for incremental repair, and you stop, you need
to unset all the repaired bits on the data files or you end up with data
that can't be compacted.
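
For the "unset all the repaired bits" step, the tool I have in mind is
sstablerepairedset, which ships with Cassandra. Below is a rough sketch that
only builds the command line for one table's data files; the helper name and
the file names are hypothetical, and you should check the flags against the
docs for your version. As far as I know the node has to be stopped while the
tool runs.

```python
from typing import Iterable, List


def unset_repaired_command(files: Iterable[str]) -> List[str]:
    """Build an sstablerepairedset argv that marks data files as unrepaired.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    # Only the -Data.db component of each sstable is passed to the tool.
    data_files = sorted(f for f in files if f.endswith("-Data.db"))
    if not data_files:
        raise ValueError("no *-Data.db files given")
    return ["sstablerepairedset", "--really-set", "--is-unrepaired", *data_files]


# Example file names are made up for illustration.
listing = ["nb-2-big-Data.db", "nb-1-big-Data.db", "nb-1-big-Index.db"]
print(" ".join(unset_repaired_command(listing)))
```

Building the argv separately from running it makes it easy to eyeball the
file list first, which seems prudent for something that rewrites sstable
metadata.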

The time it takes to type out every single one of these surprising edge
cases / nuances is just too high for anyone to do it for free. Some books
try, but many of them are incomplete or out of date. It's a shame.

One day, hopefully, the database matures to a point where you don't need to
know how repair works in order to run a cluster. Oct 8 2020 is not that
day.


>
> So I think the time has come for us in the team to start focusing on these
> topics now. Invest time in better understanding. Really learn what "repair"
> means, and all consequences of it, etc
>
> So
> Does anyone have any "you must read it" recommendations around these "long
> term maintenance" topics?
>
Unfortunately, not really. There are some notes here
https://cassandra.apache.org/doc/latest/operating/index.html but they're
imperfect. It may be good for people to keep adding docs.

> I mean really well explained blog post(s), article(s), book(s). Not some
> "half done" or  "I quickly write a post because it was too long ago when I
> blogged something..." things  :-)
>
> Good pointers would be appreciated!
>
> thanks
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>
>
