Guys. I have discussed just these changes. Look at my PR, please. What do you think?
https://issues.apache.org/jira/browse/IGNITE-15192 https://github.com/apache/ignite/pull/9280/files чт, 22 июл. 2021 г. в 22:30, Andrey Mashenkov <andrey.mashen...@gmail.com>: > Hi guys, > > I think updateHeartBeat() method was misused in the future listener and > this must be fixed. > > Actually, GridWorker.heartbeatTs Javadoc says that field is updated by the > worker itself. > It is consistent with WorkProgressDispatcher.updateHearbeat() javadoc, > which said "Notifying dispatcher that work is in progress and thread didn't > freeze." > GridWorker.heartbeatTs field is marked volatile just to provide consistent > visibility guarantee to a watcher. > > So, CheckpointPagesWriter violates this contract, when runs in another > thread(s). > But it works fine just because the main (Checkpointer) thread waits for > all the Writers to finish their work before it can fall into the blocking > section, therefore there is no race possible. > > As for the broken case, there are 2 possible solutions, > 1. Main thread must wait for all children's tasks to finish before going > into the blocking section. > 2. Or make updateHeartbeat consistent with the blockingSectionBegin/End. > Seems, there is no need to mark the method as synchronized, but use call > AtomicLongFieldUpdater.getAndUpdate(this, v -> v == Long.MAX_VALUE ? v : > U.currentTimeMillis()). > > I'd suggest > * Remove the "thread" mentioning from WorkProgressDispatcher interface to > avoid confusion with the current usage and make WorkProgressDispatcher more > general. > * Avoid unsafe heartbeatUpdater leak to the outside via wrapping > CheckpointContextImpl.heartbeatUpdater instance to perform consistent > heartbeat update from p.2 above. > > Does it make sense? > > On Wed, Jul 21, 2021 at 11:12 AM Alex Plehanov <plehanov.a...@gmail.com> > wrote: > >> Ilya, >> >> Race only affects the future listener (which only updates heartbeat), but >> not checkpoint listeners, so no other bugs possible caused by this race >> except wrong heartbeat update. >> >> As I've written before, I think it's not a good idea to wait for other >> threads by the critical thread in the blocking section. If these threads >> are not monitored we can never detect lack of progress and never trigger >> the failure handler. Perhaps a good solution will be to register async >> threads as critical threads (additionally to waiting for these threads in >> the blocing section) and update their own heartbeat. But the current fix >> is >> required too, to avoid such problems for other critical threads in the >> future. >> >> ср, 21 июл. 2021 г. в 06:29, Ilya Kazakov <kazakov.i...@gmail.com>: >> >> > 1. >> > I mean calling listeners in CheckpointWorkflow.markCheckpointBegin(): >> > >> > This >> > ------------------------------------------ >> > for (CheckpointListener lsnr : dbLsnrs) >> > lsnr.beforeCheckpointBegin(ctx0); >> > >> > ctx0.awaitPendingTasksFinished(); >> > ------------------------------------------ >> > >> > And this: >> > >> > ------------------------------------------ >> > // Listeners must be invoked before we write checkpoint record to WAL. >> > for (CheckpointListener lsnr : dbLsnrs) >> > lsnr.onMarkCheckpointBegin(ctx0); >> > >> > ctx0.awaitPendingTasksFinished(); >> > ------------------------------------------ >> > >> > Inside lsnr.beforeCheckpointBegin(ctx0) and >> > lsnr.onMarkCheckpointBegin(ctx0) we call >> CheckpointContextImpl.executor() >> > which submit heartbeat update tasks in threadpool. But due to a bug in >> > registration, ctx0.awaitPendingTasksFinished() do not work correctly. >> > Checpoint thread does not wait for all tasks to complete and moves on. >> > >> > This could lead to other bugs because as written in the comment "// >> > Listeners must be invoked before we write checkpoint record to WAL." >> > >> > 2. >> > About the fix. >> > Yes, the fix resolves the issue, but does not resolve the root cause - a >> > race between checkpoint thread and threads run in asyncRunner. Also, as >> I >> > understand, there should be no attempts to update heartbeat inside >> > blockingSection, but the fix does not exclude such attempts but blocks >> them. >> > >> > 3. >> > But my main point is that it looks strange to update the heartbeat of >> > thread A from thread B. It's like doing artificial respiration and chest >> > compressions. Thread A is waiting on async tasks completion, but these >> > tasks are updating progress of thread A. I suppose that blockingSection >> was >> > designed for such situations when the thread is waiting for something >> and >> > does not perform any progress. >> > >> > вт, 20 июл. 2021 г. в 21:43, Ivan Daschinsky <ivanda...@gmail.com>: >> > >> >> +1 For current fix. Code is clean and understandable. I suppose that >> the >> >> current fix is a correct variant to update heartbeatTs. >> >> >> >> вт, 20 июл. 2021 г. в 16:13, Alex Plehanov <plehanov.a...@gmail.com>: >> >> >> >> > Hello, Ilya >> >> > >> >> > > But anyway, I propose to remove the update of the heartbeat from >> other >> >> > threads altogether and wrap the call to listeners in a >> blockingSection. >> >> > I don't quite understand your proposal. Which call to listeners do >> you >> >> > mean? If we wrap the listener into the blocking section the result >> will >> >> be >> >> > the same. >> >> > Alternatively, I think we can wrap awaitPendingTasksFinished into the >> >> > blocking section, this will also solve the problem, but in this case >> we >> >> can >> >> > never detect lack of progress by async executor threads and >> checkpointer >> >> > thread hangs due to this reason. >> >> > What's wrong with the current fix? It solves the current problem and >> I >> >> hope >> >> > will protect us from problems like this in the future. >> >> > >> >> > вт, 20 июл. 2021 г. в 15:28, Ilya Kazakov <kazakov.i...@gmail.com>: >> >> > >> >> > > Hi Igniters, hi Alexey. >> >> > > >> >> > > I want to discuss this issue: >> >> > > https://issues.apache.org/jira/browse/IGNITE-15099. I have caught >> it >> >> > too. >> >> > > >> >> > > I was able to determine where there is a race. >> >> > > >> >> > > The update of the heartbeat happens asynchronously into the >> listener >> >> > code. >> >> > > But we always wait in the checkpoint thread for all pending async >> >> > > tasks. And this is reasonable. >> >> > > >> >> > > for (CheckpointListener lsnr : dbLsnrs) >> >> > > lsnr.beforeCheckpointBegin(ctx0); >> >> > > >> >> > > ctx0.awaitPendingTasksFinished(); >> >> > > >> >> > > The race was because of inappropriate order of future >> registration. In >> >> > > CheckpointContextImpl.executor () (inside listeners execution) >> >> > > >> >> > > GridFutureAdapter<?> res = new GridFutureAdapter<>(); >> >> > > res.listen(fut -> heartbeatUpdater.updateHeartbeat()); >> >> > > asyncRunner.execute(U.wrapIgniteFuture(cmd, res)); >> >> > > pendingTaskFuture.add(res); >> >> > > >> >> > > Here we create a task, submit a task to the executor, and only >> after >> >> this >> >> > > do we register the task. Thus we got a situation where checkpointer >> >> > thread >> >> > > was moving on after ctx0.awaitPendingTasksFinished(); and still, >> >> > > the unregistered asyncRunner task was moving on in parallel. >> >> > > >> >> > > But anyway, I propose to remove the update of the heartbeat from >> other >> >> > > threads altogether and wrap the call to listeners in a >> >> blockingSection. >> >> > > >> >> > > As I understand heartbeat was designed just to indicate >> self-progress >> >> by >> >> > a >> >> > > worker. If a worker can not indicate self-progress we should wrap >> such >> >> > code >> >> > > into blockingSections. In case of listeners, worker can not >> indicate >> >> > > self-progress, thus let's wrap it into blockingSection. >> >> > > >> >> > > Guys, what do you think about this? >> >> > > >> >> > > ------------- >> >> > > Ilya Kazakov >> >> > > >> >> > >> >> >> >> >> >> -- >> >> Sincerely yours, Ivan Daschinskiy >> >> >> > >> > > > -- > Best regards, > Andrey V. Mashenkov >