Good,

I opened this JIRA for the issue 
https://issues.apache.org/jira/browse/FLINK-19303. The discussion can be moved 
there.

Regards,
Juha
________________________________
From: Yu Li <car...@gmail.com>
Sent: Friday, September 18, 2020 3:58 PM
To: Juha Mynttinen <juha.myntti...@king.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Disable WAL in RocksDB recovery

Thanks for bringing this up Juha, and good catch.

We actually disable the WAL for routine writes by default when using RocksDB 
and have never encountered segmentation faults. However, according to the 
history in FLINK-8922, a segmentation fault occurs during restore if the WAL is 
disabled, so I guess the root cause lies in the RocksDB batch write 
(org.rocksdb.WriteBatch). IMHO this is a RocksDB bug (it should work when the 
WAL is disabled, for both single and batch writes).

+1 for opening a new JIRA to figure out the root cause, fix it, and disable the 
WAL during restore by default (checking the fixes around WriteBatch in later 
RocksDB versions might help locate the issue more quickly). Thanks for 
volunteering to take this on. I will follow up and help review any findings or 
PR submissions.

Best Regards,
Yu


On Wed, 16 Sep 2020 at 13:58, Juha Mynttinen <juha.myntti...@king.com> wrote:
Hello there,

I'd like to bring up a previously discussed topic: disabling WAL in RocksDB 
recovery.

It's clear that the WAL is not needed during recovery: it is never read, so 
there's no need to write it.

AFAIK the last thing that was done with the WAL during recovery was an attempt 
to remove it, which was later reverted 
(https://issues.apache.org/jira/browse/FLINK-8922). If I interpret the comments 
in the ticket correctly, what happened was that a) the WAL was kept in the 
recovery, and b) it's unknown why removing the WAL causes a segfault.

What can be seen in the ticket is that having the WAL causes a significant 
performance penalty. Thus, getting rid of the WAL would be a very nice 
performance improvement. I think it'd be worth creating a new JIRA ticket, at 
least as a reminder that the WAL should be removed?

I'm planning to add an experimental flag that removes the WAL in the 
environment where I'm using Flink, and to try it out. If the flag is made 
configurable, the WAL can always be re-enabled if removing it causes issues.
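
As a rough illustration of the configurable flag, here is a minimal sketch. It 
assumes a hypothetical configuration key 
(state.backend.rocksdb.restore.disable-wal) and a hypothetical helper class 
(RestoreWalFlag); neither exists in Flink. The only real API it refers to is 
RocksDB's org.rocksdb.WriteOptions#setDisableWAL(boolean), and only in a 
comment, so the sketch runs without RocksDB on the classpath.

```java
import java.util.Properties;

// Sketch only: the key and class names are hypothetical, not part of Flink.
final class RestoreWalFlag {
    // Hypothetical configuration key for the experimental switch.
    static final String KEY = "state.backend.rocksdb.restore.disable-wal";

    // Returns true when WAL writes should be skipped during restore.
    // Defaults to false, i.e. the current behaviour with the WAL enabled.
    static boolean disableWalOnRestore(Properties config) {
        return Boolean.parseBoolean(config.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty(KEY, "true");
        boolean disableWal = disableWalOnRestore(config);
        // With RocksDB's Java API, this value would be passed to
        // org.rocksdb.WriteOptions#setDisableWAL(boolean) on the options
        // used for the restore writes.
        System.out.println("disable WAL during restore: " + disableWal);
    }
}
```

Defaulting to false keeps today's behaviour unless someone opts in, so the WAL 
can be re-enabled simply by removing the setting.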

Thoughts?

Regards,
Juha
