Alex, I agree with the proposal.

On Thu, Nov 9, 2023 at 5:31 PM Alex Plehanov <plehanov.a...@gmail.com> wrote:
> Anton,
>
> Async physical logging is the target and the most promising solution.
>
> In this scenario:
> 1. Implement the logical and physical records split.
> 2. Implement async physical logging (actually, already implemented as a PoC).
> 3. Drop the solution implemented in (1) after some time, if the solution
> implemented in (2) has no critical issues.
> We would be doing useless work that we already assume will be dropped soon.
>
> Instead, I propose:
> 1. Implement async physical logging.
> 2. Drop the old physical logging implementation if (1) has no critical
> issues after some time.
> 3. Or implement the logical and physical records split if critical issues
> are found in (1).
> In this case, we proceed to the alternative approach only if the main
> approach fails.
>
> On Thu, Nov 9, 2023 at 13:18, Anton Vinogradov <a...@apache.org> wrote:
> >
> > In this case, we can split logs into logical and physical ones in the
> > initial fix. This should not cause any negative side effects.
> > And then implement async physical logging as an alternative solution?
> >
> > On Thu, Nov 9, 2023 at 12:52 PM Alex Plehanov <plehanov.a...@gmail.com>
> > wrote:
> >
> > > Anton,
> > >
> > > My concern is not only about compatibility. The new approach to storing
> > > recovery data is not a silver bullet; it has drawbacks as well.
> > > Also, we can't be sure that the new approach is applicable to all
> > > environments: increased checkpoint time can lead to throttling or even
> > > OOM in some cases. So, in my opinion, it's better to keep both
> > > approaches and allow users to configure which one is used. We should
> > > keep both approaches for at least one release after the new approach is
> > > enabled by default. In case of a critical problem, users can raise an
> > > issue and switch to the old approach.
> > >
> > > On Fri, Nov 3, 2023 at 16:33, Anton Vinogradov <a...@apache.org> wrote:
> > > >
> > > > Sounds good to me, except for the compatibility proposal.
> > > > There is no need to keep the old behaviour. No one will update the
> > > > node after a crash.
> > > > An update must happen only after a plain node stop, so let's just
> > > > check this instead of growing the code complexity.
> > > >
> > > > On Thu, Nov 2, 2023 at 4:55 PM Alex Plehanov <plehanov.a...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello, Igniters!
> > > > >
> > > > > I'd like to discuss the way of storing checkpoint recovery data.
> > > > > Currently, we write extra data to WAL files to protect against
> > > > > failures during checkpoints. Later, we read and write WAL files with
> > > > > this extra data a couple of times, causing excessive disk load, which
> > > > > can lead to a performance drop.
> > > > > We can try to improve this by changing the approach to storing
> > > > > checkpoint recovery data. I've prepared the IEP [1] with my proposals.
> > > > > The main idea is to move checkpoint recovery data from WAL physical
> > > > > records to a separate file written right before the checkpoint.
> > > > > Please have a look at the IEP for more information.
> > > > > I've implemented a PoC [2] for the described ideas. We will benchmark
> > > > > this PoC soon and I will share the results.
> > > > >
> > > > > WDYT about this proposal?
> > > > >
> > > > > [1]: https://cwiki.apache.org/confluence/display/IGNITE/IEP-113+Change+approach+to+store+checkpoint+recovery+data
> > > > > [2]: https://github.com/apache/ignite/pull/11024/files