Igniters,

Looks like commit:

d0adb61ecd9af0d9907e480ec747ea1465f97cd7 is the first bad commit
> commit d0adb61ecd9af0d9907e480ec747ea1465f97cd7
> Author: Ivan Rakov <ivan.glu...@gmail.com>
> Date:   Tue Mar 27 20:11:52 2018 +0300
>     IGNITE-7754 WAL in LOG_ONLY mode doesn't execute fsync on checkpoint
> begin - Fixes #3656.


was the cause of performance drop ( > 10% vs AI 2.4.0) on the following
benchmarks (LOG_ONLY):

   - atomic-put  (16 %)
   - atomic-putAll (14 %)
   - tx-putAll (11 %)

As I understand it is greater than initial assessment.

Thoughts?

2018-03-27 20:13 GMT+03:00 Dmitry Pavlov <dpavlov....@gmail.com>:

> Ivan, sure :)
>
> Thank you for this contribution, merged to master.
>
> вт, 27 мар. 2018 г. в 20:08, Ivan Rakov <ivan.glu...@gmail.com>:
>
> > Dmitry,
> >
> > Firstly PR contained dirty fix for performance measurement, but now it
> > contains good fix. :) Sorry for inconvenience.
> > I've renamed the PR.
> >
> > Best Regards,
> > Ivan Rakov
> >
> > On 27.03.2018 19:40, Dmitry Pavlov wrote:
> > > Hi Eduard, thank you for review.
> > >
> > > Hi Ivan,
> > >
> > > I'm confused on PR naming
> > > https://github.com/apache/ignite/pull/3656
> > >
> > > Could you rename?
> > >
> > > Sincerely,
> > > Dmitriy Pavlov
> > >
> > > вт, 27 мар. 2018 г. в 19:38, Eduard Shangareev <
> > eduard.shangar...@gmail.com
> > >> :
> > >> Ivan, I have reviewed your changes, looks good.
> > >>
> > >> On Tue, Mar 27, 2018 at 2:56 PM, Ivan Rakov <ivan.glu...@gmail.com>
> > wrote:
> > >>
> > >>> Igniters,
> > >>>
> > >>> I've completed development of https://issues.apache.org/jira
> > >>> /browse/IGNITE-7754. TeamCity state is ok. Please, review my changes.
> > >>> Please note that it will be possible to track time of WAL fsync on
> > >>> checkpoint begin by *walCpRecordFsyncDuration *metric in "Checkpoint
> > >>> started" message.
> > >>>
> > >>> Also, I've created https://issues.apache.org/jira/browse/IGNITE-8057
> > >> with
> > >>> description of possible further improvement of WAL fsync on
> checkpoint
> > >>> begin.
> > >>>
> > >>> Best Regards,
> > >>> Ivan Rakov
> > >>>
> > >>>
> > >>> On 26.03.2018 23:45, Valentin Kulichenko wrote:
> > >>>
> > >>>> Ivan,
> > >>>>
> > >>>> It's all good then :) Thanks!
> > >>>>
> > >>>> -Val
> > >>>>
> > >>>> On Mon, Mar 26, 2018 at 1:50 AM, Ivan Rakov <ivan.glu...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>> Val,
> > >>>>> There's no any sense to use WalMode.NONE in production environment,
> > >> it's
> > >>>>> kept for testing and debugging purposes (including possible user
> > >>>>> activities
> > >>>>> like capacity planning).
> > >>>>> We already print a warning at node start in case WalMode.NONE is
> set:
> > >>>>>
> > >>>>> U.quietAndWarn(log,"Started write-ahead log manager in NONE mode,
> > >>>>>
> > >>>>>> persisted data may be lost in " +
> > >>>>>>        "a case of unexpected node failure. Make sure to deactivate
> > the
> > >>>>>> cluster before shutdown.");
> > >>>>>>
> > >>>>>> Best Regards,
> > >>>>> Ivan Rakov
> > >>>>>
> > >>>>>
> > >>>>> On 24.03.2018 1:40, Valentin Kulichenko wrote:
> > >>>>>
> > >>>>> Dmitry,
> > >>>>>> Thanks for clarification. So it sounds like if we fix all other
> > modes
> > >> as
> > >>>>>> we
> > >>>>>> discuss here, NONE would be the only one allowing corruption. I
> also
> > >>>>>> don't
> > >>>>>> see much sense in this and I think we should clearly state this in
> > the
> > >>>>>> doc,
> > >>>>>> as well print out a warning if NONE mode is used. Eventually, if
> > it's
> > >>>>>> confirmed that there are no reasonable use cases for it, we can
> > >>>>>> deprecate
> > >>>>>> it.
> > >>>>>>
> > >>>>>> -Val
> > >>>>>>
> > >>>>>> On Fri, Mar 23, 2018 at 3:26 PM, Dmitry Pavlov <
> > dpavlov....@gmail.com
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>> Hi Val,
> > >>>>>>
> > >>>>>>> NONE means that the WAL log is disabled and not written at all.
> Use
> > >> of
> > >>>>>>> the
> > >>>>>>> mode is at your own risk. It is possible that restore state after
> > the
> > >>>>>>> crash
> > >>>>>>> at the middle of checkpoint will not succeed. I do not see much
> > sence
> > >>>>>>> in
> > >>>>>>> it, especially in production.
> > >>>>>>>
> > >>>>>>> BACKGROUND is full functional WAL mode, but allows some delay
> > before
> > >>>>>>> flush
> > >>>>>>> to disk.
> > >>>>>>>
> > >>>>>>> Sincerely,
> > >>>>>>> Dmitriy Pavlov
> > >>>>>>>
> > >>>>>>> сб, 24 мар. 2018 г. в 1:07, Valentin Kulichenko <
> > >>>>>>> valentin.kuliche...@gmail.com>:
> > >>>>>>>
> > >>>>>>> I agree. In my view, any possibility to get a corrupted storage
> is
> > a
> > >>>>>>> bug
> > >>>>>>>
> > >>>>>>>> which needs to be fixed.
> > >>>>>>>>
> > >>>>>>>> BTW, can someone explain semantics of NONE mode? What is the
> > >>>>>>>> difference
> > >>>>>>>> from BACKGROUND from user's perspective? Is there any particular
> > use
> > >>>>>>>> case
> > >>>>>>>> where it can be used?
> > >>>>>>>>
> > >>>>>>>> -Val
> > >>>>>>>>
> > >>>>>>>> On Fri, Mar 23, 2018 at 2:49 AM, Dmitry Pavlov <
> > >> dpavlov....@gmail.com
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> Hi Ivan,
> > >>>>>>>>
> > >>>>>>>>> IMO we have to add extra FSYNCS for BACKGROUND WAL. Agree?
> > >>>>>>>>>
> > >>>>>>>>> Sincerely,
> > >>>>>>>>> Dmitriy Pavlov
> > >>>>>>>>>
> > >>>>>>>>> пт, 23 мар. 2018 г. в 12:23, Ivan Rakov <ivan.glu...@gmail.com
> >:
> > >>>>>>>>>
> > >>>>>>>>> Igniters, there's another important question about this matter.
> > >>>>>>>>>
> > >>>>>>>>>> Do we want to add extra FSYNCS for BACKGROUND WAL mode? I
> think
> > >> that
> > >>>>>>>>>> we
> > >>>>>>>> have to do it: it will cause similar performance drop, but if we
> > >>>>>>>>
> > >>>>>>>>> consider LOG_ONLY broken without these fixes, BACKGROUND is
> > broken
> > >> as
> > >>>>>>>>>> well.
> > >>>>>>>>> Best Regards,
> > >>>>>>>>>> Ivan Rakov
> > >>>>>>>>>>
> > >>>>>>>>>> On 23.03.2018 10:27, Ivan Rakov wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> Fixes are quite simple.
> > >>>>>>>>>>> I expect them to be merged in master in a week in worst case.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best Regards,
> > >>>>>>>>>>> Ivan Rakov
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 22.03.2018 17:49, Denis Magda wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Ivan,
> > >>>>>>>>>>>> How quick are you going to merge the fix into the master?
> Many
> > >>>>>>>>>>>> persistence
> > >>>>>>>>>>>> related optimizations have already stacked up. Probably, we
> > can
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> release
> > >>>>>>>>>> them sooner if the community agrees.
> > >>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>>> Denis
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Thu, Mar 22, 2018 at 5:22 AM, Ivan Rakov <
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> ivan.glu...@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>> Thanks all!
> > >>>>>>>>>>>>> We seem to have reached a consensus on this issue. I'll
> just
> > >> add
> > >>>>>>>>>>>>> necessary
> > >>>>>>>>>>>>> fsyncs under IGNITE-7754.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>> Ivan Rakov
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 22.03.2018 15:13, Ilya Lantukh wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> +1 for fixing LOG_ONLY. If current implementation doesn't
> > >>>>>>>>>>>>> protect
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>> from
> > >>>>>>>>> data
> > >>>>>>>>>>> corruption, it doesn't make sence.
> > >>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 10:38 PM, Denis Magda <
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> dma...@apache.org>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>> +1 for the fix of LOG_ONLY
> > >>>>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:23 AM, Alexey Goncharuk <
> > >>>>>>>>>>>>>>> alexey.goncha...@gmail.com> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> +1 for fixing LOG_ONLY to enforce corruption safety given
> > the
> > >>>>>>>>>>>>>>> provided
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> performance results.
> > >>>>>>>>>>>>>>>> 2018-03-21 18:20 GMT+03:00 Vladimir Ozerov <
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> voze...@gridgain.com
> > >>>>>>>>>>>>>> :
> > >>>>>>>>>> +1 for accepting drop in LOG_ONLY. 7% is not that much and
> > >>>>>>>>>>>> not a
> > >>>>>>>>>>>>>> drop
> > >>>>>>>>> at
> > >>>>>>>>>>>>>>>> all, provided that we fixing a bug. I.e. should we
> > implement
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> it
> > >>>>>>>>>>>>>> correctly
> > >>>>>>>>> in the first place we would never notice any "drop".
> > >>>>>>>>>>>>>>>> I do not understand why someone would like to use
> current
> > >>>>>>>>>>>>>>>>> broken
> > >>>>>>>>>>>>>>> mode.
> > >>>>>>>>>> On Wed, Mar 21, 2018 at 6:11 PM, Dmitry Pavlov
> > >>>>>>>>>>>>>>>>> <dpavlov....@gmail.com>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Hi, I think option 1 is better. As Val said any mode
> that
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> allows
> > >>>>>>>>>>>>>>> corruption
> > >>>>>>>>>> does not make much sense.
> > >>>>>>>>>>>>>>>>>> What Ivan mentioned here as drop, in relation to old
> > mode
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> DEFAULT
> > >>>>>>>>>>>>>>>> (FSYNC
> > >>>>>>>>>>> now), is still significant perfromance boost.
> > >>>>>>>>>>>>>>>>> Sincerely,
> > >>>>>>>>>>>>>>>>>> Dmitriy Pavlov
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> ср, 21 мар. 2018 г. в 17:56, Ivan Rakov <
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com
> > >>>>>>>>>>>>>>>> :
> > >>>>>>>>>> I've attached benchmark results to the JIRA ticket.
> > >>>>>>>>>>>> We observe ~7% drop in "fair" LOG_ONLY_SAFE mode,
> > >>>>>>>>>>>>>>>>>>> independent
> > >>>>>>>>>>>>>>>>> of
> > >>>>>>>>> WAL
> > >>>>>>>>>>> compaction enabled flag. It's pretty significant drop: WAL
> > >>>>>>>>>>>>>>>>> compaction
> > >>>>>>>>>>>>>>>>> itself gives only ~3% drop.
> > >>>>>>>>>>>>>>>>> I see two options here:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> 1) Change LOG_ONLY behavior. That implies that we'll
> be
> > >>>>>>>>>>>>>>>>>>> ready
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>> release
> > >>>>>>>>>>> AI 2.5 with 7% drop.
> > >>>>>>>>>>>>>>>>>> 2) Introduce LOG_ONLY_SAFE, make it default, add
> release
> > >>>>>>>>>>>>>>>>>>> note
> > >>>>>>>>>>>>>>>>> to AI
> > >>>>>>>>> 2.5
> > >>>>>>>>>>>>>>>>>> that we added power loss durability in default mode,
> but
> > >>>>>>>>>>>>>>>>> user
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> may
> > >>>>>>>>>>>>>>> fallback to previous LOG_ONLY in order to retain
> > >>>>>>>>>> performance.
> > >>>>>>>>>>>>>>>>> Thoughts?
> > >>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>>>> Ivan Rakov
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On 20.03.2018 16:00, Ivan Rakov wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Val,
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> If a storage is in
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> corrupted state, does it mean that it needs to be
> > >>>>>>>>>>>>>>>>>>>>> completely
> > >>>>>>>>>>>>>>>>>>> removed
> > >>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>> cluster needs to be restarted without data?
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Yes, there's a chance that in LOG_ONLY all local data
> > >> will
> > >>>>>>>>>>>>>>>>>>>> be
> > >>>>>>>>>>>>>>>>>> lost,
> > >>>>>>>>>> but only in *power loss**/ OS crash* case.
> > >>>>>>>>>>>>>>>>> kill -9, JVM crash, death of critical system thread and
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> all
> > >>>>>>>>>>>>>>>>>> other
> > >>>>>>>>> cases that usually take place are variations of *process
> > >>>>>>>>>>>>>>>>>>>> crash*.
> > >>>>>>>>>>>>>>>>>> All
> > >>>>>>>>>>> WAL modes (except NONE, of course) ensure corruption-safety
> > >>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>> case
> > >>>>>>>>> of
> > >>>>>>>>>>>>>>>>> process crash.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> If so, I'm not sure any mode
> > >>>>>>>>>>>>>>>>>>>> that allows corruption makes much sense to me.
> > >>>>>>>>>>>>>>>>>>>>> It depends on performance impact of enforcing
> > >> power-loss
> > >>>>>>>>>>>>>>>>>>>> corruption
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> safety. Price of full protection from power loss is
> > high
> > >> -
> > >>>>>>>>>>>>>>>>> FSYNC
> > >>>>>>>>>>>>>> is
> > >>>>>>>>> way slower (2-10 times) than other WAL modes. The question is
> > >>>>>>>>>>>>>>>>> whether
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> ensuring weaker guarantees (corruption can't happen,
> but
> > >>>>>>>>>>>>>>>>>> loss
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>> last
> > >>>>>>>>>> updates can) will affect performance as badly as strong
> > >>>>>>>>>>>>>>>>>> guarantees.
> > >>>>>>>>>>>>>>>>>> I'll share benchmark results soon.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Ivan Rakov
> > >>>>>>>>>>>>>>>>>>>> On 20.03.2018 5:09, Valentin Kulichenko wrote:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Guys,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> What do we understand under "data corruption" here?
> > If
> > >> a
> > >>>>>>>>>>>>>>>>>>>>> storage
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> is
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>>>> corrupted state, does it mean that it needs to be
> > >> completely
> > >>>>>>>>>>>>>>>>>> removed
> > >>>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> cluster needs to be restarted without data? If so, I'm
> > not
> > >>>>>>>>>>>>>>>>>> sure
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> any
> > >>>>>>>>>> mode
> > >>>>>>>>>>>>>>>>>> that allows corruption makes much sense to me. How am
> I
> > >>>>>>>>>>>>>>>>>> supposed
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>> use a
> > >>>>>>>>>>>>>>>>>> database, if virtually any failure can end with
> complete
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> loss of
> > >>>>>>>>>>>>>>>>>>>>> data?
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> In any case, this definitely should not be a default
> > >>>>>>>>>>>>>>>>>> behavior.
> > >>>>>>>>>>>>>>>> If
> > >>>>>>>>> user ever
> > >>>>>>>>>>>>>>>>>> switches to corruption-unsafe mode, there should be a
> > >>>>>>>>>>>>>>>>>> clear
> > >>>>>>>>>>>>>>>>>>> warning
> > >>>>>>>>> about
> > >>>>>>>>>>>>>>>>>> this.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> -Val
> > >>>>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 1:06 AM, Ivan Rakov <
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com>
> > >>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>> Ticket to track changes:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-7754
> > >>>>>>>>>>>>>>>>>>>>>> Best Regards,
> > >>>>>>>>>>>>>>>>>>>>>> Ivan Rakov
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On 16.03.2018 10:58, Dmitriy Setrakyan wrote:
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 12:55 AM, Ivan Rakov <
> > >>>>>>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>> Vladimir,
> > >>>>>>>>>>>>>>>>>>>> Unlike BACKGROUND, LOG_ONLY provides strict write
> > >>>>>>>>>>>>>>>>>>>>>>> guarantees
> > >>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>> unless power
> > >>>>>>>>>>> loss has happened.
> > >>>>>>>>>>>>>>>>>>>>>>>> Seems like we need to measure performance
> > difference
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>>>>>> decide
> > >>>>>>>>> whether do
> > >>>>>>>>>>>>>>>>>>>>> we need separate WAL mode. If it will be invisible,
> > >>>>>>>>>>>>>>>>>> we'll
> > >>>>>>>>>>>>>>>>>>>>>> just
> > >>>>>>>>>> fix
> > >>>>>>>>>>>>>>>>>>>>> these
> > >>>>>>>>>>>>>>>>>> bugs without introducing new mode; if it will be
> > >>>>>>>>>>>>>>>>>>>> perceptible,
> > >>>>>>>>>>>>>>>>>>>>>>>> we'll
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>> continue the discussion about introducing
> > >>>>>>>>>>>>>>>>>>>>>> LOG_ONLY_SAFE.
> > >>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>> Makes sense?
> > >>>>>>>>>>>>>>>>>>>> Yes, this sounds like the right approach.
> > >>>>>>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>>>>>
> >
> >
>



-- 
Best Regards,
Ilya Suntsov
email: isunt...@gridgain.com
*GridGain Systems*
www.gridgain.com

Reply via email to