Great, thanks for the follow-up.

Best Regards,
Yu

On Mon, 21 Sep 2020 at 15:04, Juha Mynttinen <juha.myntti...@king.com> wrote:

> Good,
>
> I opened this JIRA for the issue:
> https://issues.apache.org/jira/browse/FLINK-19303. The discussion can be
> moved there.
>
> Regards,
> Juha
> ------------------------------
> *From:* Yu Li <car...@gmail.com>
> *Sent:* Friday, September 18, 2020 3:58 PM
> *To:* Juha Mynttinen <juha.myntti...@king.com>
> *Cc:* user@flink.apache.org <user@flink.apache.org>
> *Subject:* Re: Disable WAL in RocksDB recovery
>
> Thanks for bringing this up Juha, and good catch.
>
> We actually disable WAL for routine writes by default when using
> RocksDB and have never encountered segmentation fault issues. However,
> from the history in FLINK-8922, a segmentation fault occurs during
> restore if WAL is disabled, so I guess the root cause lies in RocksDB
> batch write (org.rocksdb.WriteBatch). And IMHO this is a RocksDB bug (it
> should work well when WAL is disabled, whether under single or batch
> write).
>
> +1 for opening a new JIRA to figure out the root cause, fix it, and
> disable WAL during restore by default (maybe checking the fixes around
> WriteBatch in later RocksDB versions could help locate the issue more
> quickly), and thanks for volunteering to take on the effort. I will
> follow up and help review any findings / PR submission.
>
> Best Regards,
> Yu
>
>
> On Wed, 16 Sep 2020 at 13:58, Juha Mynttinen <juha.myntti...@king.com>
> wrote:
>
> Hello there,
>
> I'd like to bring up a previously discussed topic - disabling WAL in
> RocksDB recovery.
>
> It's clear that WAL is not needed during the process, the reason being
> that the WAL is never read, so there's no need to write it.
>
> AFAIK the last thing that was done with WAL during recovery was an
> attempt to remove it, followed by a revert of that removal
> (https://issues.apache.org/jira/browse/FLINK-8922).
> If I interpret the comments in the ticket correctly, what happened was
> that a) WAL was kept in the recovery, and b) it's unknown why removing
> WAL causes a segfault.
>
> What can be seen in the ticket is that having WAL causes a significant
> performance penalty. Thus, getting rid of WAL would be a very nice
> performance improvement. I think it'd be worth creating a new JIRA
> ticket, at least as a reminder that WAL should be removed?
>
> I'm planning to add an experimental flag to remove WAL in the
> environment where I'm using Flink, and to try it out. If the flag is
> made configurable, WAL can always be re-enabled if removing it causes
> issues.
>
> Thoughts?
>
> Regards,
> Juha
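
The configurable flag described in the original message could look roughly like the Java sketch below. It is not taken from Flink: the class `RestoreWalFlag` and the key `experimental.rocksdb.restore.disable-wal` are hypothetical names invented for illustration. The sketch only shows the opt-in behaviour, where WAL stays enabled unless the flag is explicitly set, so that it can always be re-enabled if removing it causes issues.

```java
import java.util.Properties;

/**
 * Sketch of an opt-in switch for skipping the WAL while restoring state.
 * The class and the key name are hypothetical; neither exists in Flink.
 */
final class RestoreWalFlag {

    // Hypothetical configuration key, invented for this illustration.
    static final String KEY = "experimental.rocksdb.restore.disable-wal";

    private RestoreWalFlag() {}

    /**
     * Returns true only when the flag is explicitly set to "true"
     * (case-insensitive). A missing or unrecognised value keeps the WAL
     * enabled, matching the behaviour described in the thread.
     */
    static boolean disableWalOnRestore(Properties config) {
        return Boolean.parseBoolean(config.getProperty(KEY, "false").trim());
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        // Flag absent: the WAL stays on.
        System.out.println(disableWalOnRestore(config));

        config.setProperty(KEY, "true");
        // In real restore code the result would be passed to
        // org.rocksdb.WriteOptions#setDisableWAL(boolean) before the
        // restore batches are written.
        System.out.println(disableWalOnRestore(config));
    }
}
```

Defaulting to "WAL enabled" keeps the experiment reversible: if the segfault from FLINK-8922 reappears, removing the flag restores the previous behaviour without a code change.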