On Wed, Sep 8, 2021 at 3:08 AM Andres Freund <and...@anarazel.de> wrote:

> Looking at this profile made me wonder if this was a build without
> optimizations. The pg_atomic_read_u64()/pg_atomic_read_u64_impl() calls
> should be inlined. And while perf can reconstruct inlined functions when
> using --call-graph=dwarf, they show up like "pg_atomic_read_u64 (inlined)"
> for me.

Yeah, for profiling I generally build without optimizations so that I can
see all the functions in the stack. So the profile results are from an
unoptimized build, but the performance results are from an optimized build.

> FWIW, I see times like this
>
> postgres[4144648][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t;
> ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
> │                                                  QUERY PLAN                                                  │
> ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
> │ Gather  (cost=1000.00..6716686.33 rows=200000000 width=208) (actual rows=200000000 loops=1)                  │
> │   Workers Planned: 2                                                                                         │
> │   Workers Launched: 2                                                                                        │
> │   ->  Parallel Seq Scan on t  (cost=0.00..6715686.33 rows=83333333 width=208) (actual rows=66666667 loops=3) │
> │ Planning Time: 0.043 ms                                                                                      │
> │ Execution Time: 24954.012 ms                                                                                 │
> └──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
> (6 rows)

Is this with or without the patch? I mean, can we see a comparison showing
whether the patch improved anything in your environment?

> Looking at a profile I see the biggest bottleneck in the leader (which is
> the bottleneck as soon as the worker count is increased) to be reading the
> length word of the message. I do see shm_mq_receive_bytes() in the profile,
> but the costly part there is the "read % (uint64) ringsize" - divisions are
> slow. We could just compute a mask instead of the size.

Yeah, that could be done. I can test with this change as well to see how
much we gain from it.
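
To make the mask idea concrete, here is a minimal standalone sketch, not the
actual shm_mq code. It assumes the ring size is a power of two, which nothing
in this thread establishes; without that, the mask gives wrong offsets. The
names ring_offset_mod, ring_offset_mask and is_power_of_two are hypothetical
and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if n is a power of two (and not zero). */
static inline int
is_power_of_two(uint64_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Current style: ring offset via a 64-bit division, "read % ringsize". */
static inline uint64_t
ring_offset_mod(uint64_t read, uint64_t ringsize)
{
    return read % ringsize;
}

/*
 * Mask style: ringmask is precomputed once as ringsize - 1.
 * ASSUMPTION: ringsize is a power of two; otherwise this is not
 * equivalent to the modulo above.
 */
static inline uint64_t
ring_offset_mask(uint64_t read, uint64_t ringmask)
{
    return read & ringmask;
}
```

With ringsize = 65536 the mask is 65535, and both functions return the same
offset for any read position. The gain, if any, would still have to be
measured with the optimized build.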

> We also should probably split the read-mostly data in shm_mq (ring_size,
> detached, ring_offset, receiver, sender) into a separate cacheline from the
> read/write data. Or perhaps copy more info into the handle, particularly
> the ringsize (or mask).

Good suggestion, I will do some experiments around this.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com