On 9/8/21 8:05 AM, Dilip Kumar wrote:
On Tue, Sep 7, 2021 at 8:41 PM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:

    Hi,

    The numbers presented in this thread seem very promising - clearly
    there's significant potential for improvements. I'll run similar
    benchmarks too, to get a better understanding of this.


Thanks for showing interest.


    Can you share some basic details about the hardware you used?
    Particularly the CPU model - I guess this might explain some of the
    results, e.g. if CPU caches are ~1MB, that'd explain why setting
    tup_queue_size to 1MB improves things, but 4MB is a bit slower.
    Similarly, number of cores might explain why 4 workers perform better
    than 8 or 16 workers.


I have attached the output of lscpu.  I think batching the data before updating the shared memory counter will be a win because we avoid frequent cache misses, and IMHO the benefit will be even greater on machines with more CPU sockets.
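To make the batching idea concrete, here is a minimal hand-written sketch using C11 atomics - this is not the actual shm_mq.c code, and the 4kB threshold, the struct and every name in it are purely illustrative:

#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative threshold: publish to shared memory every 4kB. */
#define PUBLISH_THRESHOLD 4096

typedef struct sketch_queue
{
    _Atomic uint64_t bytes_written;   /* shared counter the leader reads */
    /* ring buffer, queue size, latches, ... omitted */
} sketch_queue;

/*
 * Instead of bumping the shared counter (and touching its contended
 * cacheline) after every tuple, accumulate the byte count in a
 * backend-local variable and publish it only once PUBLISH_THRESHOLD
 * bytes have piled up, or when the caller forces a flush.
 */
void
sketch_send_bytes(sketch_queue *q, uint64_t *local_pending,
                  const void *data, size_t nbytes, int force_flush)
{
    (void) data;                  /* copying into the ring buffer omitted */

    *local_pending += nbytes;

    if (*local_pending >= PUBLISH_THRESHOLD || force_flush)
    {
        atomic_fetch_add(&q->bytes_written, *local_pending);
        *local_pending = 0;
        /* a real implementation would also wake the waiting receiver here */
    }
}

The shared cacheline is then touched roughly once per 4kB of data instead of once per tuple, which is where the cache-miss savings come from; the trade-off is the extra latency discussed further down in the thread.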

    Now, this is mostly expected, but the consequence is that maybe things
    like queue size should be tunable/dynamic, not hard-coded?


Actually, my intention behind changing the tuple queue size was just to observe the behavior: do we really have a problem of workers stalling on the queue while sending tuples? The perf report showed some load on WaitLatch on the worker side, so I ran this experiment.  I saw some benefit, but it was not really huge.  I am not sure whether we want to just increase the tuple queue size or make it tunable, but if we want to support redistribute operators sometime in the future then maybe we should make it grow dynamically at runtime, perhaps using DSA or DSA + shared files.
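For reference, the per-worker tuple queue size is currently a compile-time constant in src/backend/executor/execParallel.c, so making it tunable would essentially mean replacing that constant with a variable - the sketch below is hypothetical, e.g. the variable could be filled in from a GUC:

/* Current hard-coded default in execParallel.c (64kB per worker). */
#define PARALLEL_TUPLE_QUEUE_SIZE 65536

/* Hypothetical tunable replacement, e.g. set from a GUC at plan startup. */
static int parallel_tuple_queue_size = PARALLEL_TUPLE_QUEUE_SIZE;

Since the queues are carved out of the parallel DSM segment, one per worker, the shared memory cost of a Gather grows as queue size times number of workers, which is worth keeping in mind before making this user-settable.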


Thanks. I ran a couple more benchmarks with different queue sizes (16kB, 64kB, 256kB and 1MB), and according to the results the queue size makes almost no difference. It might matter for some queries, but I wouldn't bother tuning this until we have a plausible example - increasing the queue size is not free either.

So it was worth checking, but I'd just leave it as 64kB for now. We may revisit this later in a separate patch/thread.

    As for the patches, I think the proposed changes are sensible, but I
    wonder what queries might get slower. For example, the batching
    (updating the counter only once every 4kB) pretty much transfers
    data in larger chunks with higher latency. So what if the query needs
    only a small chunk, like a LIMIT query? Similarly, this might mean the
    upper parts of the plan have to wait longer for the data, and thus
    can't start some async operation (like sending it to an FDW, or
    something like that). I do admit those are theoretical concerns; I
    haven't tried creating such a query.


Yeah, I was thinking about such cases too. Basically, this design can increase the startup cost of the Gather node. I will also try to construct such cases and test them.


    FWIW I've tried applying both patches at the same time, but there's a
    conflict in shm_mq_sendv - not a complex one, but I'm not sure what's
    the correct solution. Can you share a "combined" patch?


Actually, both patches are the same thing; "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" is just a cleaner version of the first patch.  I did not send a patch for the configurable tuple queue size, because I used it only for testing and never intended to propose it.  Most of my latest performance data was collected with only "v1-0001-Optimize-parallel-tuple-send-shm_mq_send_bytes.patch" and the default tuple queue size.

But I am attaching both patches in case you want to play around with them.


Ah, silly me. I should have noticed the second patch is just a refined version of the first one.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

