Re: Reconsider default WAL mode: we need something between LOG_ONLY and FSYNC

Valentin Kulichenko Fri, 23 Mar 2018 15:41:38 -0700

Dmitry,

Thanks for the clarification. So it sounds like if we fix all the other modes
as discussed here, NONE would be the only one allowing corruption. I also
don't see much sense in it, and I think we should clearly state this in the
docs, as well as print out a warning if NONE mode is used. Eventually, if it's
confirmed that there are no reasonable use cases for it, we can deprecate
it.
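
A minimal sketch of the warning proposed above. It is not Ignite code: the `WalMode` enum below is a hypothetical stand-in that only borrows the mode names used in this thread, and `startupWarning` is an illustrative helper, not an existing API.

```java
import java.util.Optional;

final class WalModeWarning {
    /** Hypothetical stand-in for the WAL modes named in this thread. */
    enum WalMode { FSYNC, LOG_ONLY, BACKGROUND, NONE }

    /**
     * Returns a warning for a mode that cannot guarantee a recoverable
     * storage, or an empty Optional for every other mode.
     */
    static Optional<String> startupWarning(WalMode mode) {
        if (mode == WalMode.NONE) {
            return Optional.of("WAL mode NONE: the write-ahead log is disabled; "
                + "the storage may be unrecoverable after a crash during a checkpoint.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Print a warning only for the modes that need one.
        for (WalMode mode : WalMode.values()) {
            startupWarning(mode).ifPresent(w -> System.err.println("[WARN] " + w));
        }
    }
}
```

Returning an `Optional` keeps the check separate from whatever logger is actually used at node start.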


-Val

On Fri, Mar 23, 2018 at 3:26 PM, Dmitry Pavlov <dpavlov....@gmail.com>
wrote:

> Hi Val,
>
> NONE means that the WAL is disabled and not written at all. Use this
> mode at your own risk: restoring state after a crash in the middle of a
> checkpoint may not succeed. I do not see much sense in it, especially in
> production.
>
> BACKGROUND is a fully functional WAL mode, but it allows some delay before
> the flush to disk.
>
> Sincerely,
> Dmitriy Pavlov
>
> Sat, Mar 24, 2018 at 1:07, Valentin Kulichenko <
> valentin.kuliche...@gmail.com>:
>
> > I agree. In my view, any possibility to get a corrupted storage is a bug
> > which needs to be fixed.
> >
> > BTW, can someone explain the semantics of NONE mode? How does it differ
> > from BACKGROUND from the user's perspective? Is there any particular use
> > case where it can be used?
> >
> > -Val
> >
> > On Fri, Mar 23, 2018 at 2:49 AM, Dmitry Pavlov <dpavlov....@gmail.com>
> > wrote:
> >
> > > Hi Ivan,
> > >
> > > IMO we have to add extra FSYNCS for BACKGROUND WAL. Agree?
> > >
> > > Sincerely,
> > > Dmitriy Pavlov
> > >
> > > Fri, Mar 23, 2018 at 12:23, Ivan Rakov <ivan.glu...@gmail.com>:
> > >
> > > > Igniters, there's another important question about this matter.
> > > > Do we want to add extra FSYNCS for BACKGROUND WAL mode? I think that we
> > > > have to do it: it will cause a similar performance drop, but if we
> > > > consider LOG_ONLY broken without these fixes, BACKGROUND is broken as
> > > > well.
> > > >
> > > > Best Regards,
> > > > Ivan Rakov
> > > >
> > > > On 23.03.2018 10:27, Ivan Rakov wrote:
> > > > > The fixes are quite simple.
> > > > > I expect them to be merged into master within a week in the worst case.
> > > > >
> > > > > Best Regards,
> > > > > Ivan Rakov
> > > > >
> > > > > On 22.03.2018 17:49, Denis Magda wrote:
> > > > >> Ivan,
> > > > >>
> > > > >> How quickly are you going to merge the fix into master? Many
> > > > >> persistence-related optimizations have already stacked up. Probably,
> > > > >> we can release them sooner if the community agrees.
> > > > >>
> > > > >> --
> > > > >> Denis
> > > > >>
> > > > >> On Thu, Mar 22, 2018 at 5:22 AM, Ivan Rakov <ivan.glu...@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Thanks all!
> > > > >>> We seem to have reached a consensus on this issue. I'll just add the
> > > > >>> necessary fsyncs under IGNITE-7754.
> > > > >>>
> > > > >>> Best Regards,
> > > > >>> Ivan Rakov
> > > > >>>
> > > > >>>
> > > > >>> On 22.03.2018 15:13, Ilya Lantukh wrote:
> > > > >>>
> > > > >>>> +1 for fixing LOG_ONLY. If the current implementation doesn't protect
> > > > >>>> from data corruption, it doesn't make sense.
> > > > >>>>
> > > > >>>> On Wed, Mar 21, 2018 at 10:38 PM, Denis Magda <dma...@apache.org>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> +1 for the fix of LOG_ONLY
> > > > >>>>>
> > > > >>>>> On Wed, Mar 21, 2018 at 11:23 AM, Alexey Goncharuk <
> > > > >>>>> alexey.goncha...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>>> +1 for fixing LOG_ONLY to enforce corruption safety given the
> > > > >>>>>> provided performance results.
> > > > >>>>>>
> > > > >>>>>> 2018-03-21 18:20 GMT+03:00 Vladimir Ozerov <voze...@gridgain.com>:
> > > > >>>>>>
> > > > >>>>>>> +1 for accepting the drop in LOG_ONLY. 7% is not that much, and not a
> > > > >>>>>>> drop at all given that we are fixing a bug: had we implemented it
> > > > >>>>>>> correctly in the first place, we would never have noticed any "drop".
> > > > >>>>>>> I do not understand why anyone would want to use the current broken
> > > > >>>>>>> mode.
> > > > >>>>>>>
> > > > >>>>>>> On Wed, Mar 21, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov....@gmail.com>
> > > > >>>>>>> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi, I think option 1 is better. As Val said, any mode that allows
> > > > >>>>>>>> corruption does not make much sense.
> > > > >>>>>>>>
> > > > >>>>>>>> What Ivan describes here as a drop is still a significant performance
> > > > >>>>>>>> boost relative to the old DEFAULT mode (now FSYNC).
> > > > >>>>>>>>
> > > > >>>>>>>> Sincerely,
> > > > >>>>>>>> Dmitriy Pavlov
> > > > >>>>>>>>
> > > > >>>>>>>> Wed, Mar 21, 2018 at 17:56, Ivan Rakov <ivan.glu...@gmail.com>:
> > > > >>>>>>>>
> > > > >>>>>>>>> I've attached benchmark results to the JIRA ticket.
> > > > >>>>>>>>> We observe a ~7% drop in the "fair" LOG_ONLY_SAFE mode, regardless of
> > > > >>>>>>>>> whether WAL compaction is enabled. It's a pretty significant drop: WAL
> > > > >>>>>>>>> compaction itself gives only a ~3% drop.
> > > > >>>>>>>>> I see two options here:
> > > > >>>>>>>>> 1) Change LOG_ONLY behavior. That implies that we'll be ready to
> > > > >>>>>>>>> release AI 2.5 with a 7% drop.
> > > > >>>>>>>>> 2) Introduce LOG_ONLY_SAFE, make it the default, and add a release
> > > > >>>>>>>>> note to AI 2.5 saying that we added power-loss durability in the
> > > > >>>>>>>>> default mode, but users may fall back to the previous LOG_ONLY in
> > > > >>>>>>>>> order to retain performance.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Thoughts?
> > > > >>>>>>>>>
> > > > >>>>>>>>> Best Regards,
> > > > >>>>>>>>> Ivan Rakov
> > > > >>>>>>>>>
> > > > >>>>>>>>> On 20.03.2018 16:00, Ivan Rakov wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> Val,
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> If a storage is in a corrupted state, does it mean that it needs to
> > > > >>>>>>>>>>> be completely removed and the cluster needs to be restarted without
> > > > >>>>>>>>>>> data?
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Yes, there's a chance that in LOG_ONLY all local data will be lost,
> > > > >>>>>>>>>> but only in the *power loss / OS crash* case.
> > > > >>>>>>>>>> kill -9, JVM crash, death of a critical system thread and all other
> > > > >>>>>>>>>> cases that usually take place are variations of *process crash*.
> > > > >>>>>>>>>> All WAL modes (except NONE, of course) ensure corruption safety in
> > > > >>>>>>>>>> case of process crash.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> If so, I'm not sure any mode that allows corruption makes much sense
> > > > >>>>>>>>>>> to me.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> It depends on the performance impact of enforcing power-loss
> > > > >>>>>>>>>> corruption safety. The price of full protection from power loss is
> > > > >>>>>>>>>> high: FSYNC is way slower (2-10 times) than other WAL modes. The
> > > > >>>>>>>>>> question is whether ensuring weaker guarantees (corruption can't
> > > > >>>>>>>>>> happen, but loss of the last updates can) will affect performance as
> > > > >>>>>>>>>> badly as strong guarantees.
> > > > >>>>>>>>>> I'll share benchmark results soon.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Best Regards,
> > > > >>>>>>>>>> Ivan Rakov
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> On 20.03.2018 5:09, Valentin Kulichenko wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Guys,
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> What do we mean by "data corruption" here? If a storage is in a
> > > > >>>>>>>>>>> corrupted state, does it mean that it needs to be completely removed
> > > > >>>>>>>>>>> and the cluster needs to be restarted without data? If so, I'm not
> > > > >>>>>>>>>>> sure any mode that allows corruption makes much sense to me. How am I
> > > > >>>>>>>>>>> supposed to use a database if virtually any failure can end with
> > > > >>>>>>>>>>> complete loss of data?
> > > > >>>>>>>>>>> In any case, this definitely should not be the default behavior. If
> > > > >>>>>>>>>>> a user ever switches to a corruption-unsafe mode, there should be a
> > > > >>>>>>>>>>> clear warning about this.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> -Val
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Fri, Mar 16, 2018 at 1:06 AM, Ivan Rakov <ivan.glu...@gmail.com>
> > > > >>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Ticket to track changes:
> > > > >>>>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-7754
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Best Regards,
> > > > >>>>>>>>>>>> Ivan Rakov
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On 16.03.2018 10:58, Dmitriy Setrakyan wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Fri, Mar 16, 2018 at 12:55 AM, Ivan Rakov <ivan.glu...@gmail.com>
> > > > >>>>>>>>>>>>> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Vladimir,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Unlike BACKGROUND, LOG_ONLY provides strict write guarantees
> > > > >>>>>>>>>>>>>> unless power loss has happened.
> > > > >>>>>>>>>>>>>> Seems like we need to measure the performance difference to decide
> > > > >>>>>>>>>>>>>> whether we need a separate WAL mode. If it is invisible, we'll
> > > > >>>>>>>>>>>>>> just fix these bugs without introducing a new mode; if it is
> > > > >>>>>>>>>>>>>> perceptible, we'll continue the discussion about introducing
> > > > >>>>>>>>>>>>>> LOG_ONLY_SAFE.
> > > > >>>>>>>>>>>>>> Makes sense?
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Yes, this sounds like the right approach.
> > > > >>>>>>>>>>>>>
> > > > >>>>
> > > > >
> > > >
> > > >
> > >
> >
>
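
The distinction the thread keeps returning to, surviving a process crash versus surviving a power loss, comes down to whether a write is forced to the storage device. The sketch below illustrates that general mechanism with `FileChannel.force` from the JDK. It is not Ignite's WAL implementation, and `WalAppendSketch` and its `append` method are hypothetical names.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class WalAppendSketch {
    /**
     * Appends one record to the given file and returns the resulting file size.
     * Without fsync the bytes are handed to the OS, so they normally survive a
     * process crash but may be lost on power loss. With fsync the channel is
     * forced to the storage device before the method returns.
     */
    static long append(Path walFile, String record, boolean fsync) throws IOException {
        try (FileChannel ch = FileChannel.open(walFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap((record + "\n").getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            if (fsync) {
                ch.force(true); // flush both file content and metadata to the device
            }
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        Path wal = Files.createTempFile("wal-sketch", ".log");
        System.out.println("size after buffered append: " + append(wal, "update-1", false));
        System.out.println("size after forced append:   " + append(wal, "update-2", true));
        Files.delete(wal);
    }
}
```

The forced call is the expensive one, which is the trade-off the benchmark numbers in this thread are measuring.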
