Re: [PROPOSAL] Change approach to store checkpoint recovery data

Anton Vinogradov Thu, 09 Nov 2023 09:05:26 -0800

Alex, I agree with the proposal.

On Thu, Nov 9, 2023 at 5:31 PM Alex Plehanov <plehanov.a...@gmail.com>
wrote:


> Anton,
>
> Async physical logging is the target and the most promising solution.
>
> In this scenario:
> 1. Implement logical and physical records split.
> 2. Implement async physical logging (actually, already implemented as PoC).
> 3. Drop the solution implemented in (1) after some time, if the solution
> implemented in (2) has no critical issues.
> we do some useless work, which we assume will be dropped soon.
>
> Instead, I propose:
> 1. Implement async physical logging
> 2. Drop the old physical logging implementation if (1) has no critical
> issues after some time.
> 3. Or implement the logical and physical records split, if critical issues
> are found in (1).
> In this case, we proceed to the alternative approach only if the main
> approach fails.
>
> On Thu, Nov 9, 2023 at 13:18, Anton Vinogradov <a...@apache.org> wrote:
> >
> > In this case, we can split the logs into logical and physical in the
> > initial fix.
> > This should not cause any negative side effects.
> > And then implement async physical logging as an alternative solution?
> >
> > On Thu, Nov 9, 2023 at 12:52 PM Alex Plehanov <plehanov.a...@gmail.com>
> > wrote:
> >
> > > Anton,
> > >
> > > My concern is not only about compatibility. The new approach to
> > > storing recovery data is not a silver bullet; it has drawbacks as well.
> > > Also, we can't be sure that the new approach is applicable to all
> > > environments: increased checkpoint time can lead to throttling or even
> > > OOM in some cases. So, in my opinion, it's better to keep both
> > > approaches and allow users to choose between them. We should keep both
> > > approaches for at least one release after the new approach is
> > > enabled by default. In case of a critical problem, users can raise the
> > > issue and switch to the old approach.
> > >
> > > On Fri, Nov 3, 2023 at 16:33, Anton Vinogradov <a...@apache.org> wrote:
> > > >
> > > > Sounds good to me, except the compatibility proposal.
> > > > No need to keep the old behaviour. No one will update the node after
> > > > the crash.
> > > > The update must happen only after a plain node stop; let's just check
> > > > this instead of growing the code complexity.
> > > >
> > > > On Thu, Nov 2, 2023 at 4:55 PM Alex Plehanov <plehanov.a...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello, Igniters!
> > > > >
> > > > > I'd like to discuss the way we store checkpoint recovery data.
> > > > > Currently, we write extra data to WAL files to protect against
> > > > > failures during checkpoints. Later, we read and write WAL files with
> > > > > this extra data a couple of times, causing excessive disk load, which
> > > > > can lead to a performance drop.
> > > > > We can try to improve this by changing the approach for storing
> > > > > checkpoint recovery data. I've prepared the IEP [1] with my proposals.
> > > > > The main idea is to move checkpoint recovery data from WAL physical
> > > > > records to some file written right before the checkpoint. Please have
> > > > > a look at the IEP for more information.
> > > > > I've implemented a PoC [2] for the described ideas. We will benchmark
> > > > > this PoC soon and I will share the results.
> > > > >
> > > > > WDYT about this proposal?
> > > > >
> > > > > [1]:
> > > > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-113+Change+approach+to+store+checkpoint+recovery+data
> > > > > [2]: https://github.com/apache/ignite/pull/11024/files
> > > > >
> > >
>
