Re: PR25394 patch: debuginfod groom vs. scan race condition

2020-01-20 Thread Mark Wielaard
Hi Frank,

On Sun, 2020-01-19 at 20:43 -0500, Frank Ch. Eigler wrote:
> This was a doozie.  As the commit text hints, I haven't found a way
> of
> testing this on an elfutils/tests snack-sized data set (except perhaps
> by injecting long sleeps?), but the changes work on the larger
> archives such as the debuginfod.systemtap.org set.
> 
> 
> commit 34e67018914cf9ebbef07065965755b6554fd66e (HEAD -> master, 
> origin/fche/debuginfod-PR25394)
> Author: Frank Ch. Eigler 
> Date:   Sun Jan 19 20:33:32 2020 -0500
> 
> PR25394: debuginfod mutex between grooming and scanning
> 
> Extended the work-queue concept with "idlers" - other threads that
> block on the work queue until it becomes empty (rather than normal
> consumers that block on it until it becomes non-empty).  Use this
> facility for the groomer thread to avoid working at the same time as
> the scanner threads.  Use this for the fts traversal thread for
> similar reasons.  One user-visible effect: response to SIGUSR1 and
> SIGUSR2 will wait until the work queue runs empty, but the man page
> was unspecific so does not need changing.
> 
> It's not obvious how to test this with a tests/ dataset so small that
> scanning takes negligible time, so the former races are very tight.
> 
> P.S. We also evaluated using sqlite level transactions to isolate the
> scanner thread groups-of-operations from the groomer.  These
> experiments failed to produce a nominally concurrent debuginfod,
> having triggered "database locked" type errors.  So we remain
> single-threaded (fully serialized) at the sqlite API level.
> 
> diff --git a/debuginfod/ChangeLog b/debuginfod/ChangeLog
> index 35130b2a5d85..83b94a1108a6 100644
> --- a/debuginfod/ChangeLog
> +++ b/debuginfod/ChangeLog
> @@ -1,3 +1,14 @@
> +2020-01-19  Frank Ch. Eigler  
> +
> + * debuginfod.cxx (scanq): Rework to let groomer/fts threads
> + synchronize with an empty workqueue, and lock out workqueue
> + consumers.
> + (thread_groom): Adopt new scanq idle APIs to lock out scanners.
> + (thread_main_fts_source_paths): Adopt new scanq idler API to
> + avoid being restarted while scanners haven't even finished yet.
> + (thread_main_*): Increment thread_work_total metric only after
> + a work cycle is completed, not when it begins.

Nice work. And I think I almost understand it. I am just a little
confused about thread_main_fts_source_paths also being an "idler":

>   while (! interrupted)
>  {
> -  set_metric("thread_timer", "role","traverse", rescan_timer);
> -  // set_metric("thread_forced_total", "role","traverse", 
> forced_rescan_count);
> -  if (rescan_s && rescan_timer > rescan_s)
> -rescan_timer = 0;
> +  sleep (1);
> +  scanq.wait_idle(); // don't start a new traversal while scanners 
> haven't finished the job
> +  scanq.done_idle(); // release the hounds
> +  if (interrupted) break;

I guess the comment "release the hounds" gets me confused. I don't know 
precisely what that means. Is this needed for correctness, or is it simply to 
have an hint that the scanners have ran because you register as an idler and so 
at this point you could only run if no scanner is running? But given you 
immediately tell them you are done a scanner could in theory start their work 
up again while you are doing your own traversal?

Thanks,

Mark


Re: PR25394 patch: debuginfod groom vs. scan race condition

2020-01-20 Thread Frank Ch. Eigler
Hi -

> Nice work. And I think I almost understand it. I am just a little
> confused about thread_main_fts_source_paths also being an "idler":

Yup, it's a bit subtle.


> >   while (! interrupted)
> >  {
> > -  set_metric("thread_timer", "role","traverse", rescan_timer);
> > -  // set_metric("thread_forced_total", "role","traverse", 
> > forced_rescan_count);
> > -  if (rescan_s && rescan_timer > rescan_s)
> > -rescan_timer = 0;
> > +  sleep (1);
> > +  scanq.wait_idle(); // don't start a new traversal while scanners 
> > haven't finished the job
> > +  scanq.done_idle(); // release the hounds
> > +  if (interrupted) break;

> I guess the comment "release the hounds" gets me confused. I don't
> know precisely what that means. 

https://www.youtube.com/watch?time_continue=28&v=ORvkElqw0QM

> Is this needed for correctness, or is it simply to have an hint that
> the scanners have ran because you register as an idler and so at
> this point you could only run if no scanner is running?

It's because we don't want to start a new fts traversal pass while the
scanners haven't even finished the last one.  I've seen a case where
the rescan interval was so short that the scanner backlog just kept
growing and growing and growing.  (Changing the work queue to a set
from a dequeue also helps with duplicate elimination.)  So in this
way, it just returns to the timing behaviour of the pre-workqueue
implementation: we wait a certain amount of time AFTER the scanning is
ALL DONE, before starting a new scan.


> But given you immediately tell them you are done a scanner could in
> theory start their work up again while you are doing your own
> traversal?

Yes, because that's ideal for concurrency: while one thread fts(3)'s
and enumerates the results, the scanner threads can process the
already enumerated files.

- FChE



Re: PR25394 patch: debuginfod groom vs. scan race condition

2020-01-20 Thread Mark Wielaard
Hi Frank,

On Mon, 2020-01-20 at 12:10 -0500, Frank Ch. Eigler wrote:
> > Nice work. And I think I almost understand it. I am just a little
> > confused about thread_main_fts_source_paths also being an "idler":
> 
> Yup, it's a bit subtle.
> 
> > >   while (! interrupted)
> > >  {
> > > -  set_metric("thread_timer", "role","traverse", rescan_timer);
> > > -  // set_metric("thread_forced_total", "role","traverse", 
> > > forced_rescan_count);
> > > -  if (rescan_s && rescan_timer > rescan_s)
> > > -rescan_timer = 0;
> > > +  sleep (1);
> > > +  scanq.wait_idle(); // don't start a new traversal while scanners 
> > > haven't finished the job
> > > +  scanq.done_idle(); // release the hounds
> > > +  if (interrupted) break;
> > I guess the comment "release the hounds" gets me confused. I don't
> > know precisely what that means. 
> 
> https://www.youtube.com/watch?time_continue=28&v=ORvkElqw0QM

Aha, so it is as if we are going to start to run (chasing files),
except there is also the rescan timeout which might still stop the
hounds from really being released :)

> > Is this needed for correctness, or is it simply to have an hint that
> > the scanners have ran because you register as an idler and so at
> > this point you could only run if no scanner is running?
> 
> It's because we don't want to start a new fts traversal pass while the
> scanners haven't even finished the last one.  I've seen a case where
> the rescan interval was so short that the scanner backlog just kept
> growing and growing and growing.  (Changing the work queue to a set
> from a dequeue also helps with duplicate elimination.)  So in this
> way, it just returns to the timing behaviour of the pre-workqueue
> implementation: we wait a certain amount of time AFTER the scanning is
> ALL DONE, before starting a new scan.

That all makes sense. Thanks.

> > But given you immediately tell them you are done a scanner could in
> > theory start their work up again while you are doing your own
> > traversal?
> 
> Yes, because that's ideal for concurrency: while one thread fts(3)'s
> and enumerates the results, the scanner threads can process the
> already enumerated files.

OK. And if any idlers (grooming) would wake up that is no problem,
because that simply means the work queue was empty and adding work to
the queue can be done whether either the grooming thread (x)or scanner
threads are running concurrently.

Please do push if you feel confident about it being correct. I think it
is now that I understand how it actually works.

Thanks,

Mark


Buildbot failure in Wildebeest Builder on whole buildset

2020-01-20 Thread buildbot
The Buildbot has detected a failed build on builder whole buildset while 
building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/1/builds/462

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: centos-x86_64

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/2/builds/457

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: debian-amd64

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/3/builds/461

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: fedora-x86_64

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/4/builds/458

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: debian-i386

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed compile (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/10/builds/429

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: fedora-s390x

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/15/builds/254

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: debian-armhf

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed compile (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/16/builds/251

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: centos-aarch64

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The Buildbot



Buildbot failure in Wildebeest Builder on whole buildset

2020-01-20 Thread buildbot
The Buildbot has detected a failed build on builder whole buildset while 
building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/11/builds/417

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: fedora-ppc64le

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The BuildbotThe Buildbot has detected a failed build on builder whole 
buildset while building elfutils.
Full details are available at:
https://builder.wildebeest.org/buildbot/#builders/12/builds/415

Buildbot URL: https://builder.wildebeest.org/buildbot/

Worker for this Build: fedora-ppc64

Build Reason: 
Blamelist: Frank Ch. Eigler 

BUILD FAILED: failed test (failure)

Sincerely,
 -The Buildbot



[Bug debuginfod/25394] groom vs. scan race condition

2020-01-20 Thread fche at redhat dot com
https://sourceware.org/bugzilla/show_bug.cgi?id=25394

Frank Ch. Eigler  changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |FIXED

--- Comment #1 from Frank Ch. Eigler  ---
commit 34e67018914cf9ebbef07065965755b6554fd66e
let's try to put out of our minds the four subsequent cleanup patches

-- 
You are receiving this mail because:
You are on the CC list for the bug.