Re: [Discussion] Race on heartbeat update in checkpoint thread

Ilya Kazakov Wed, 28 Jul 2021 19:49:19 -0700

Guys. I have discussed just these changes. Look at my PR, please. What do
you think?


https://issues.apache.org/jira/browse/IGNITE-15192
https://github.com/apache/ignite/pull/9280/files

чт, 22 июл. 2021 г. в 22:30, Andrey Mashenkov <andrey.mashen...@gmail.com>:

> Hi guys,
>
> I think updateHeartBeat() method was misused in the future listener and
> this must be fixed.
>
> Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the
> worker itself.
> It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc,
> which said "Notifying dispatcher that work is in progress and thread didn't
> freeze."
> GridWorker.heartbeatTs field is marked volatile just to provide consistent
> visibility guarantee to a watcher.
>
> So, CheckpointPagesWriter violates this contract, when runs in another
> thread(s).
> But it works fine just because the main (Checkpointer) thread waits for
> all the Writers to finish their work before it can fall into the blocking
> section, therefore there is no race possible.
>
> As for the broken case, there are 2 possible solutions,
> 1. Main thread must wait for all children's tasks to finish before going
> into the blocking section.
> 2. Or make updateHeartbeat consistent with the blockingSectionBegin/End.
> Seems, there is no need to mark the method as synchronized, but use call
> AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v :
> U.currentTimeMillis()).
>
> I'd suggest
> * Remove the "thread" mentioning from WorkProgressDispatcher interface to
> avoid confusion with the current usage and make WorkProgressDispatcher more
> general.
> * Avoid unsafe heartbeatUpdater leak to the outside via wrapping
> CheckpointContextImpl.heartbeatUpdater instance to perform consistent
> heartbeat update from p.2 above.
>
> Does it make sense?
>
> On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov <plehanov.a...@gmail.com>
> wrote:
>
>> Ilya,
>>
>> Race only affects the future listener (which only updates heartbeat), but
>> not checkpoint listeners, so no other bugs possible caused by this race
>> except wrong heartbeat update.
>>
>> As I've written before, I think it's not a good idea to wait for other
>> threads by the critical thread in the blocking section. If these threads
>> are not monitored we can never detect lack of progress and never trigger
>> the failure handler. Perhaps a good solution will be to register async
>> threads as critical threads (additionally to waiting for these threads in
>> the blocing section) and update their own heartbeat. But the current fix
>> is
>> required too, to avoid such problems for other critical threads in the
>> future.
>>
>> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov <kazakov.i...@gmail.com>:
>>
>> > 1.
>> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin():
>> >
>> > This
>> > ------------------------------------------
>> > for (CheckpointListener lsnr : dbLsnrs)
>> >     lsnr.beforeCheckpointBegin(ctx0);
>> >
>> > ctx0.awaitPendingTasksFinished();
>> > ------------------------------------------
>> >
>> > And this:
>> >
>> > ------------------------------------------
>> > // Listeners must be invoked before we write checkpoint record to WAL.
>> > for (CheckpointListener lsnr : dbLsnrs)
>> >     lsnr.onMarkCheckpointBegin(ctx0);
>> >
>> > ctx0.awaitPendingTasksFinished();
>> > ------------------------------------------
>> >
>> > Inside lsnr.beforeCheckpointBegin(ctx0) and
>> > lsnr.onMarkCheckpointBegin(ctx0) we call
>> CheckpointContextImpl.executor()
>> > which submit heartbeat update tasks in threadpool. But due to a bug in
>> > registration, ctx0.awaitPendingTasksFinished() do not work correctly.
>> > Checpoint thread does not wait for all tasks to complete and moves on.
>> >
>> > This could lead to other bugs because as written in the comment "//
>> > Listeners must be invoked before we write checkpoint record to WAL."
>> >
>> > 2.
>> > About the fix.
>> > Yes, the fix resolves the issue, but does not resolve the root cause - a
>> > race between checkpoint thread and threads run in asyncRunner. Also, as
>> I
>> > understand, there should be no attempts to update heartbeat inside
>> > blockingSection, but the fix does not exclude such attempts but blocks
>> them.
>> >
>> > 3.
>> > But my main point is that it looks strange to update the heartbeat of
>> > thread A from thread B. It's like doing artificial respiration and chest
>> > compressions. Thread A is waiting on async tasks completion, but these
>> > tasks are updating progress of thread A. I suppose that blockingSection
>> was
>> > designed for such situations when the thread is waiting for something
>> and
>> > does not perform any progress.
>> >
>> > вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky <ivanda...@gmail.com>:
>> >
>> >> +1 For current fix. Code is clean and understandable. I suppose that
>> the
>> >> current fix is a correct variant to update heartbeatTs.
>> >>
>> >> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov <plehanov.a...@gmail.com>:
>> >>
>> >> > Hello, Ilya
>> >> >
>> >> > > But anyway, I propose to remove the update of the heartbeat from
>> other
>> >> > threads altogether and wrap the call to listeners in a
>> blockingSection.
>> >> > I don't quite understand your proposal. Which call to listeners do
>> you
>> >> > mean? If we wrap the listener into the blocking section the result
>> will
>> >> be
>> >> > the same.
>> >> > Alternatively, I think we can wrap awaitPendingTasksFinished into the
>> >> > blocking section, this will also solve the problem, but in this case
>> we
>> >> can
>> >> > never detect lack of progress by async executor threads and
>> checkpointer
>> >> > thread hangs due to this reason.
>> >> > What's wrong with the current fix? It solves the current problem and
>> I
>> >> hope
>> >> > will protect us from problems like this in the future.
>> >> >
>> >> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov <kazakov.i...@gmail.com>:
>> >> >
>> >> > > Hi Igniters, hi Alexey.
>> >> > >
>> >> > > I want to discuss this issue:
>> >> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught
>> it
>> >> > too.
>> >> > >
>> >> > > I was able to determine where there is a race.
>> >> > >
>> >> > > The update of the heartbeat happens asynchronously into the
>> listener
>> >> > code.
>> >> > > But we always wait in the checkpoint thread for all pending async
>> >> > > tasks. And this is reasonable.
>> >> > >
>> >> > > for (CheckpointListener lsnr : dbLsnrs)
>> >> > >   lsnr.beforeCheckpointBegin(ctx0);
>> >> > >
>> >> > > ctx0.awaitPendingTasksFinished();
>> >> > >
>> >> > > The race was because of inappropriate order of future
>> registration. In
>> >> > > CheckpointContextImpl.executor () (inside listeners execution)
>> >> > >
>> >> > > GridFutureAdapter<?> res = new GridFutureAdapter<>();
>> >> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat());
>> >> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res));
>> >> > > pendingTaskFuture.add(res);
>> >> > >
>> >> > > Here we create a task, submit a task to the executor, and only
>> after
>> >> this
>> >> > > do we register the task. Thus we got a situation where checkpointer
>> >> > thread
>> >> > > was moving on after ctx0.awaitPendingTasksFinished(); and still,
>> >> > > the unregistered asyncRunner task was moving on in parallel.
>> >> > >
>> >> > > But anyway, I propose to remove the update of the heartbeat from
>> other
>> >> > > threads altogether and wrap the call to listeners in a
>> >> blockingSection.
>> >> > >
>> >> > > As I understand heartbeat was designed just to indicate
>> self-progress
>> >> by
>> >> > a
>> >> > > worker. If a worker can not indicate self-progress we should wrap
>> such
>> >> > code
>> >> > > into blockingSections. In case of listeners, worker can not
>> indicate
>> >> > > self-progress, thus let's wrap it into blockingSection.
>> >> > >
>> >> > > Guys, what do you think about this?
>> >> > >
>> >> > > -------------
>> >> > > Ilya Kazakov
>> >> > >
>> >> >
>> >>
>> >>
>> >> --
>> >> Sincerely yours, Ivan Daschinskiy
>> >>
>> >
>>
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>

Re: [Discussion] Race on heartbeat update in checkpoint thread

Reply via email to