Hi Eduard, thank you for review. Hi Ivan,
I'm confused on PR naming https://github.com/apache/ignite/pull/3656 Could you rename? Sincerely, Dmitriy Pavlov вт, 27 мар. 2018 г. в 19:38, Eduard Shangareev <eduard.shangar...@gmail.com >: > Ivan, I have reviewed your changes, looks good. > > On Tue, Mar 27, 2018 at 2:56 PM, Ivan Rakov <ivan.glu...@gmail.com> wrote: > > > Igniters, > > > > I've completed development of https://issues.apache.org/jira > > /browse/IGNITE-7754. TeamCity state is ok. Please, review my changes. > > Please note that it will be possible to track time of WAL fsync on > > checkpoint begin by *walCpRecordFsyncDuration *metric in "Checkpoint > > started" message. > > > > Also, I've created https://issues.apache.org/jira/browse/IGNITE-8057 > with > > description of possible further improvement of WAL fsync on checkpoint > > begin. > > > > Best Regards, > > Ivan Rakov > > > > > > On 26.03.2018 23:45, Valentin Kulichenko wrote: > > > >> Ivan, > >> > >> It's all good then :) Thanks! > >> > >> -Val > >> > >> On Mon, Mar 26, 2018 at 1:50 AM, Ivan Rakov <ivan.glu...@gmail.com> > >> wrote: > >> > >> Val, > >>> > >>> There's no any sense to use WalMode.NONE in production environment, > it's > >>> kept for testing and debugging purposes (including possible user > >>> activities > >>> like capacity planning). > >>> We already print a warning at node start in case WalMode.NONE is set: > >>> > >>> U.quietAndWarn(log,"Started write-ahead log manager in NONE mode, > >>> > >>>> persisted data may be lost in " + > >>>> "a case of unexpected node failure. Make sure to deactivate the > >>>> cluster before shutdown."); > >>>> > >>>> Best Regards, > >>> Ivan Rakov > >>> > >>> > >>> On 24.03.2018 1:40, Valentin Kulichenko wrote: > >>> > >>> Dmitry, > >>>> > >>>> Thanks for clarification. So it sounds like if we fix all other modes > as > >>>> we > >>>> discuss here, NONE would be the only one allowing corruption. I also > >>>> don't > >>>> see much sense in this and I think we should clearly state this in the > >>>> doc, > >>>> as well print out a warning if NONE mode is used. Eventually, if it's > >>>> confirmed that there are no reasonable use cases for it, we can > >>>> deprecate > >>>> it. > >>>> > >>>> -Val > >>>> > >>>> On Fri, Mar 23, 2018 at 3:26 PM, Dmitry Pavlov <dpavlov....@gmail.com > > > >>>> wrote: > >>>> > >>>> Hi Val, > >>>> > >>>>> NONE means that the WAL log is disabled and not written at all. Use > of > >>>>> the > >>>>> mode is at your own risk. It is possible that restore state after the > >>>>> crash > >>>>> at the middle of checkpoint will not succeed. I do not see much sence > >>>>> in > >>>>> it, especially in production. > >>>>> > >>>>> BACKGROUND is full functional WAL mode, but allows some delay before > >>>>> flush > >>>>> to disk. > >>>>> > >>>>> Sincerely, > >>>>> Dmitriy Pavlov > >>>>> > >>>>> сб, 24 мар. 2018 г. в 1:07, Valentin Kulichenko < > >>>>> valentin.kuliche...@gmail.com>: > >>>>> > >>>>> I agree. In my view, any possibility to get a corrupted storage is a > >>>>> bug > >>>>> > >>>>>> which needs to be fixed. > >>>>>> > >>>>>> BTW, can someone explain semantics of NONE mode? What is the > >>>>>> difference > >>>>>> from BACKGROUND from user's perspective? Is there any particular use > >>>>>> case > >>>>>> where it can be used? > >>>>>> > >>>>>> -Val > >>>>>> > >>>>>> On Fri, Mar 23, 2018 at 2:49 AM, Dmitry Pavlov < > dpavlov....@gmail.com > >>>>>> > > >>>>>> wrote: > >>>>>> > >>>>>> Hi Ivan, > >>>>>> > >>>>>>> IMO we have to add extra FSYNCS for BACKGROUND WAL. Agree? > >>>>>>> > >>>>>>> Sincerely, > >>>>>>> Dmitriy Pavlov > >>>>>>> > >>>>>>> пт, 23 мар. 2018 г. в 12:23, Ivan Rakov <ivan.glu...@gmail.com>: > >>>>>>> > >>>>>>> Igniters, there's another important question about this matter. > >>>>>>> > >>>>>>>> Do we want to add extra FSYNCS for BACKGROUND WAL mode? I think > that > >>>>>>>> > >>>>>>>> we > >>>>>>> > >>>>>> have to do it: it will cause similar performance drop, but if we > >>>>>> > >>>>>>> consider LOG_ONLY broken without these fixes, BACKGROUND is broken > as > >>>>>>>> > >>>>>>>> well. > >>>>>>> > >>>>>>> Best Regards, > >>>>>>>> Ivan Rakov > >>>>>>>> > >>>>>>>> On 23.03.2018 10:27, Ivan Rakov wrote: > >>>>>>>> > >>>>>>>> Fixes are quite simple. > >>>>>>>>> I expect them to be merged in master in a week in worst case. > >>>>>>>>> > >>>>>>>>> Best Regards, > >>>>>>>>> Ivan Rakov > >>>>>>>>> > >>>>>>>>> On 22.03.2018 17:49, Denis Magda wrote: > >>>>>>>>> > >>>>>>>>> Ivan, > >>>>>>>>>> > >>>>>>>>>> How quick are you going to merge the fix into the master? Many > >>>>>>>>>> persistence > >>>>>>>>>> related optimizations have already stacked up. Probably, we can > >>>>>>>>>> > >>>>>>>>>> release > >>>>>>>>> > >>>>>>>> them sooner if the community agrees. > >>>>>>>> > >>>>>>>>> -- > >>>>>>>>>> Denis > >>>>>>>>>> > >>>>>>>>>> On Thu, Mar 22, 2018 at 5:22 AM, Ivan Rakov < > >>>>>>>>>> > >>>>>>>>>> ivan.glu...@gmail.com> > >>>>>>>>> > >>>>>>>> wrote: > >>>>>> > >>>>>>> Thanks all! > >>>>>>>>>> > >>>>>>>>>>> We seem to have reached a consensus on this issue. I'll just > add > >>>>>>>>>>> necessary > >>>>>>>>>>> fsyncs under IGNITE-7754. > >>>>>>>>>>> > >>>>>>>>>>> Best Regards, > >>>>>>>>>>> Ivan Rakov > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On 22.03.2018 15:13, Ilya Lantukh wrote: > >>>>>>>>>>> > >>>>>>>>>>> +1 for fixing LOG_ONLY. If current implementation doesn't > >>>>>>>>>>> protect > >>>>>>>>>>> > >>>>>>>>>> from > >>>>>> > >>>>>>> data > >>>>>>>> > >>>>>>>>> corruption, it doesn't make sence. > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, Mar 21, 2018 at 10:38 PM, Denis Magda < > >>>>>>>>>>>> > >>>>>>>>>>>> dma...@apache.org> > >>>>>>>>>>> > >>>>>>>>>> wrote: > >>>>>> > >>>>>>> +1 for the fix of LOG_ONLY > >>>>>>>>>>>> > >>>>>>>>>>>> On Wed, Mar 21, 2018 at 11:23 AM, Alexey Goncharuk < > >>>>>>>>>>>>> alexey.goncha...@gmail.com> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> +1 for fixing LOG_ONLY to enforce corruption safety given the > >>>>>>>>>>>>> provided > >>>>>>>>>>>>> > >>>>>>>>>>>>> performance results. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> 2018-03-21 18:20 GMT+03:00 Vladimir Ozerov < > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> voze...@gridgain.com > >>>>>>>>>>>>> > >>>>>>>>>>>> : > >>>>>>> > >>>>>>>> +1 for accepting drop in LOG_ONLY. 7% is not that much and > >>>>>>>>> > >>>>>>>>>> not a > >>>>>>>>>>>>> > >>>>>>>>>>>> drop > >>>>>> > >>>>>>> at > >>>>>>>>>>>>>> all, provided that we fixing a bug. I.e. should we implement > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> it > >>>>>>>>>>>>> > >>>>>>>>>>>> correctly > >>>>>> > >>>>>>> in the first place we would never notice any "drop". > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> I do not understand why someone would like to use current > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> broken > >>>>>>>>>>>>>> > >>>>>>>>>>>>> mode. > >>>>>>> > >>>>>>>> On Wed, Mar 21, 2018 at 6:11 PM, Dmitry Pavlov > >>>>>>>>>>>>>>> <dpavlov....@gmail.com> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi, I think option 1 is better. As Val said any mode that > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> allows > >>>>>>>>>>>>>> > >>>>>>>>>>>>> corruption > >>>>>>> > >>>>>>>> does not make much sense. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> What Ivan mentioned here as drop, in relation to old mode > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> DEFAULT > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> (FSYNC > >>>>>>>> > >>>>>>>>> now), is still significant perfromance boost. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Sincerely, > >>>>>>>>>>>>>>>> Dmitriy Pavlov > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> ср, 21 мар. 2018 г. в 17:56, Ivan Rakov < > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> ivan.glu...@gmail.com > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> : > >>>>>>> > >>>>>>>> I've attached benchmark results to the JIRA ticket. > >>>>>>>>> > >>>>>>>>>> We observe ~7% drop in "fair" LOG_ONLY_SAFE mode, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> independent > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> of > >>>>>> > >>>>>>> WAL > >>>>>>>> > >>>>>>>>> compaction enabled flag. It's pretty significant drop: WAL > >>>>>>>>>>>>>>> compaction > >>>>>>>>>>>>>>> itself gives only ~3% drop. > >>>>>>>>>>>>>>> I see two options here: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 1) Change LOG_ONLY behavior. That implies that we'll be > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> ready > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> to > >>>>>> > >>>>>>> release > >>>>>>>> > >>>>>>>>> AI 2.5 with 7% drop. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> 2) Introduce LOG_ONLY_SAFE, make it default, add release > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> note > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> to AI > >>>>>> > >>>>>>> 2.5 > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> that we added power loss durability in default mode, but > >>>>>>>>>>>>>>> user > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> may > >>>>>>>>>>>>>> > >>>>>>>>>>>>> fallback to previous LOG_ONLY in order to retain > >>>>>>> > >>>>>>>> performance. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Thoughts? > >>>>>> > >>>>>>> Best Regards, > >>>>>>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On 20.03.2018 16:00, Ivan Rakov wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Val, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> If a storage is in > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> corrupted state, does it mean that it needs to be > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> completely > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> removed > >>>>>>> > >>>>>>>> and > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> cluster needs to be restarted without data? > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Yes, there's a chance that in LOG_ONLY all local data > will > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> be > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> lost, > >>>>>>> > >>>>>>>> but only in *power loss**/ OS crash* case. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> kill -9, JVM crash, death of critical system thread and > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> all > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> other > >>>>>> > >>>>>>> cases that usually take place are variations of *process > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> crash*. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> All > >>>>>>>> > >>>>>>>>> WAL modes (except NONE, of course) ensure corruption-safety > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> in > >>>>>>>>>>>>>> > >>>>>>>>>>>>> case > >>>>>> > >>>>>>> of > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> process crash. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> If so, I'm not sure any mode > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> that allows corruption makes much sense to me. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> It depends on performance impact of enforcing > power-loss > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> corruption > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> safety. Price of full protection from power loss is high > - > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> FSYNC > >>>>>>>>>>>>> > >>>>>>>>>>>> is > >>>>>> > >>>>>>> way slower (2-10 times) than other WAL modes. The question is > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> whether > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> ensuring weaker guarantees (corruption can't happen, but > >>>>>>>>>>>>>>>> loss > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> of > >>>>>>>>>>>>>> > >>>>>>>>>>>>> last > >>>>>>> > >>>>>>>> updates can) will affect performance as badly as strong > >>>>>>>>>>>>>>>> guarantees. > >>>>>>>>>>>>>>>> I'll share benchmark results soon. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Best Regards, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> On 20.03.2018 5:09, Valentin Kulichenko wrote: > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Guys, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> What do we understand under "data corruption" here? If > a > >>>>>>>>>>>>>>>>>>> storage > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> in > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> corrupted state, does it mean that it needs to be > completely > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> removed > >>>>>>>>>>>>>>>>> and > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> cluster needs to be restarted without data? If so, I'm not > >>>>>>>>>>>>>>>> sure > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> any > >>>>>>> > >>>>>>>> mode > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> that allows corruption makes much sense to me. How am I > >>>>>>>>>>>>>>>> supposed > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> to > >>>>>>>> > >>>>>>>>> use a > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> database, if virtually any failure can end with complete > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> loss of > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> data? > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> In any case, this definitely should not be a default > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> behavior. > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>> If > >>>>>> > >>>>>>> user ever > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> switches to corruption-unsafe mode, there should be a > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> clear > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> warning > >>>>>> > >>>>>>> about > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> this. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> -Val > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 1:06 AM, Ivan Rakov < > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Ticket to track changes: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/IGNITE-7754 > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Best Regards, > >>>>>>>>>>>>>>>>>>>> Ivan Rakov > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On 16.03.2018 10:58, Dmitriy Setrakyan wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Fri, Mar 16, 2018 at 12:55 AM, Ivan Rakov < > >>>>>>>>>>>>>>>>>>>> ivan.glu...@gmail.com > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Vladimir, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Unlike BACKGROUND, LOG_ONLY provides strict write > >>>>>>>>>>>>>>>>>>>>> guarantees > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> unless power > >>>>>>>> > >>>>>>>>> loss has happened. > >>>>>>>>>>>>>>>>>>>>>> Seems like we need to measure performance difference > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> to > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> decide > >>>>>> > >>>>>>> whether do > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> we need separate WAL mode. If it will be invisible, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> we'll > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> just > >>>>>>> > >>>>>>>> fix > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> these > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> bugs without introducing new mode; if it will be > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> perceptible, > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> we'll > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> continue the discussion about introducing > >>>>>>>>>>>>>>>>>>>> LOG_ONLY_SAFE. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Makes sense? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Yes, this sounds like the right approach. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > > >