Thanks for bringing this up, Juha, and good catch.

We actually disable WAL for routine writes by default when using RocksDB
and have never encountered segmentation faults. However, according to the
history in FLINK-8922, a segmentation fault occurs during restore if WAL
is disabled, so I guess the root cause lies in RocksDB batch writes
(org.rocksdb.WriteBatch). IMHO this is a RocksDB bug: it should work with
WAL disabled for both single and batch writes.

+1 for opening a new JIRA to figure out the root cause, fix it, and
disable WAL during restore by default. Checking the fixes around
WriteBatch in later RocksDB versions might help locate the issue more
quickly. Thanks for volunteering to take this on. I will follow up and
help review any findings or PR submissions.
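
For the experimental flag you mention, here is a minimal sketch of how the
switch could be read, assuming a plain java.util.Properties configuration.
The key name and the RestoreWalFlag class are hypothetical and are not
existing Flink options. The returned value would be passed to
org.rocksdb.WriteOptions#setDisableWAL for the restore writes.

```java
import java.util.Properties;

// Hypothetical helper, not part of Flink: decides whether WAL should be
// disabled for the writes issued while restoring RocksDB state.
final class RestoreWalFlag {

    // Assumed key name, for illustration only.
    static final String KEY = "state.backend.rocksdb.restore.disable-wal";

    private RestoreWalFlag() {
    }

    // Defaults to false, so WAL stays enabled unless the flag is set
    // explicitly. The result is meant to be handed to
    // org.rocksdb.WriteOptions#setDisableWAL(boolean).
    static boolean disableWalDuringRestore(Properties conf) {
        String raw = conf.getProperty(KEY, "false");
        return Boolean.parseBoolean(raw.trim());
    }
}
```

Defaulting to false keeps the current behaviour until the segmentation
fault is understood, and setting the key to "true" opts a single job in.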

Best Regards,
Yu


On Wed, 16 Sep 2020 at 13:58, Juha Mynttinen <juha.myntti...@king.com>
wrote:

> Hello there,
>
> I'd like to revisit a previously discussed topic: disabling WAL in
> RocksDB recovery.
>
> WAL is clearly not needed during recovery: it is never read, so there
> is no need to write it.
>
> AFAIK the last change to WAL during recovery was an attempt to remove it,
> which was later reverted (
> https://issues.apache.org/jira/browse/FLINK-8922). If I interpret the
> comments in the ticket correctly, the outcome was that a) WAL was kept
> during recovery, and b) it is still unknown why removing WAL causes a
> segfault.
>
> What can be seen in the ticket is that having WAL causes a significant
> performance penalty. Thus, getting rid of WAL would be a very nice
> performance improvement. I think it'd be worth creating a new JIRA
> ticket, at least as a reminder that WAL should be removed?
>
> I'm planning to add an experimental flag to remove WAL in the environment
> where I'm using Flink and to try it out. If the flag is made
> configurable, WAL can always be re-enabled if removing it causes issues.
>
> Thoughts?
>
> Regards,
> Juha
>
>
