On Sat, Apr 8, 2023 at 1:30 PM Melanie Plageman
<melanieplage...@gmail.com> wrote:
>
> On Sat, Apr 8, 2023 at 12:33 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> >
> > Thomas Munro <thomas.mu...@gmail.com> writes:
> > > I committed the main patch.
> >
> > BTW, it was easy to miss in all the buildfarm noise from
> > last-possible-minute patches, but chimaera just showed something
> > that looks like a bug in this code [1]:
> >
> > 2023-04-08 12:25:28.709 UTC [18027:321] pg_regress/join_hash LOG:  
> > statement: savepoint settings;
> > 2023-04-08 12:25:28.709 UTC [18027:322] pg_regress/join_hash LOG:  
> > statement: set local max_parallel_workers_per_gather = 2;
> > 2023-04-08 12:25:28.710 UTC [18027:323] pg_regress/join_hash LOG:  
> > statement: explain (costs off)
> >              select  count(*) from simple r full outer join simple s on 
> > (r.id = 0 - s.id);
> > 2023-04-08 12:25:28.710 UTC [18027:324] pg_regress/join_hash LOG:  
> > statement: select  count(*) from simple r full outer join simple s on (r.id 
> > = 0 - s.id);
> > TRAP: failed Assert("BarrierParticipants(&batch->batch_barrier) == 1"), 
> > File: "nodeHash.c", Line: 2118, PID: 19147

So, after staring at this for awhile, I suspect this assertion is just
plain wrong. BarrierArriveAndDetachExceptLast() contains this code:

    if (barrier->participants > 1)
    {
        --barrier->participants;
        SpinLockRelease(&barrier->mutex);

        return false;
    }
    Assert(barrier->participants == 1);

So in between this assertion and the one we tripped,

    if (!BarrierArriveAndDetachExceptLast(&batch->batch_barrier))
    {
    ...
        return false;
    }

    /* Now we are alone with this batch. */
    Assert(BarrierPhase(&batch->batch_barrier) == PHJ_BATCH_SCAN);
    Assert(BarrierParticipants(&batch->batch_barrier) == 1);

Another worker attached to the batch barrier, saw that it was in
PHJ_BATCH_SCAN, marked it done and detached. This is fine.
BarrierArriveAndDetachExceptLast() is meant to ensure no one waits
(deadlock hazard) and that at least one worker stays to do the unmatched
scan. It doesn't hurt anything for another worker to join and find out
there is no work to do.

We should simply delete this assertion.

- Melanie


Reply via email to