On Wed, Aug 28, 2019 at 08:17:47PM +0300, Alexey Kondratov wrote:
Hi Tomas,
Interesting. Any idea where the extra overhead in this particular
case comes from? It's hard to deduce that from the single flame graph,
when I don't have anything to compare it with (i.e. the flame graph
for the "normal" case).
I guess that the bottleneck is in disk operations. You can check the
logical_repl_worker_new_perf.svg flame graph: disk reads (~9%) and
writes (~26%) take around 35% of CPU time in total. For comparison,
please see the attached flame graph for the following transaction:
INSERT INTO large_text
SELECT (SELECT string_agg('x', ',')
FROM generate_series(1, 2000)) FROM generate_series(1, 1000000);
Execution Time: 44519.816 ms
Time: 98333,642 ms (01:38,334)
where disk I/O is only ~7-8% in total. So we get very roughly the same
~x4-5 performance drop here. JFYI, I am using a machine with an SSD
for tests.
Therefore, it may help to write changes on the receiver in bigger
chunks, rather than each change separately.
I/O is certainly a possible culprit, although we should be
using buffered I/O and there certainly are no fsyncs here. So I'm
not sure why it would be cheaper to do the writes in batches.
BTW does this mean you see the overhead on the apply side? Or are you
running this on a single machine, and it's difficult to decide?
I run this on a single machine, but the walsender and the worker are
each utilizing almost 100% of a CPU all the time, and at the
apply side I/O syscalls take about 1/3 of CPU time. I am still not
sure, but for me this result links the performance drop with problems
at the receiver side.
Writing in batches was just a hypothesis, and to validate it I have
performed a test with a large txn consisting of a smaller number of
wide rows. This test does not exhibit any significant performance
drop, even though it was streamed too. So the hypothesis seems valid.
Anyway, I do not have other reasonable ideas beside that right now.
I've recently revisited this patch and tried to improve its
performance. As a result I've implemented a new POC version of the
applier (attached). Almost everything in the streaming logic stayed
intact, but the apply worker is significantly different.
As I wrote earlier, I still claim that spilling changes to disk at the
applier side adds significant overhead, but it is possible to get rid
of it. In my additional patch I do the following:
1) Maintain a pool of additional background workers (bgworkers)
connected to the main logical apply worker via shm_mq's. Each worker
is dedicated to processing a specific streamed transaction.
2) When we receive a streamed change for some transaction, we look up
its dedicated bgworker in an HTAB (xid -> bgworker), take one from the
idle list, or spawn a new one.
3) We pass all changes (between STREAM START/STOP) to that bgworker
via shm_mq_send without waiting in between. However, we wait for the
bgworker to apply the entire chunk of changes at STREAM STOP, since we
don't want transaction reordering.
4) When the transaction is committed/aborted, the worker is added to
the idle list and waits for a reassignment message.
5) I have reused the same apply_dispatch machinery in the bgworkers,
since most of the actions are practically identical.
Thus, we do not spill anything at the applier side, so transaction
changes are processed by bgworkers just as normal backends do. At the
same time, changes processing is strictly serial, which prevents
transaction reordering and possible conflicts/anomalies. Even though
we trade off performance in favor of stability, the result is rather
impressive. I have used a similar query for testing as before:
EXPLAIN (ANALYZE, BUFFERS) INSERT INTO large_test (num1, num2, num3)
SELECT round(random()*10), random(), random()*142
FROM generate_series(1, 1000000) s(i);
with 1kk (1000000), 3kk and 5kk rows; logical_work_mem = 64MB and
synchronous_standby_names = 'FIRST 1 (large_sub)'. The table schema is
as follows:
CREATE TABLE large_test (
id serial primary key,
num1 bigint,
num2 double precision,
num3 double precision
);
Here are the results:
-------------------------------------------------------------------
| N | Time on master, sec | Total xact time, sec | Ratio |
-------------------------------------------------------------------
| On commit (master, v13) |
-------------------------------------------------------------------
| 1kk | 6.5 | 17.6 | x2.74 |
-------------------------------------------------------------------
| 3kk | 21 | 55.4 | x2.64 |
-------------------------------------------------------------------
| 5kk | 38.3 | 91.5 | x2.39 |
-------------------------------------------------------------------
| Stream + spill |
-------------------------------------------------------------------
| 1kk | 5.9 | 18 | x3 |
-------------------------------------------------------------------
| 3kk | 19.5 | 52.4 | x2.7 |
-------------------------------------------------------------------
| 5kk | 33.3 | 86.7 | x2.86 |
-------------------------------------------------------------------
| Stream + BGW pool |
-------------------------------------------------------------------
| 1kk | 6 | 12 | x2 |
-------------------------------------------------------------------
| 3kk | 18.5 | 30.5 | x1.65 |
-------------------------------------------------------------------
| 5kk | 35.6 | 53.9 | x1.51 |
-------------------------------------------------------------------
It seems that the overhead added by the synchronous replica is 2-3
times lower compared with Postgres master and streaming with spilling.
Therefore, the original patch eliminated the delay before the sender
starts processing a large transaction, while this additional patch
speeds up the applier side.
Although the overall speedup is surely measurable, there is still room
for improvement:
1) Currently bgworkers are only spawned on demand, without any initial
pool, and are never stopped. Maybe we should create a small pool on
replication start and stop some of the idle bgworkers if they exceed
some limit?
2) Probably we can somehow track whether an incoming change conflicts
with any of the xacts being processed, so that we wait for specific
bgworkers only in that case?
3) Since the communication between the main logical apply worker and
each bgworker from the pool is a 'single producer --- single consumer'
problem, it is probably possible to wait and set/check flags
without locks, using just atomics.
What do you think about this concept in general? Any concerns and
criticism are welcome!
Hi Alexey,
I'm unable to do any in-depth review of the patch over the next two weeks
or so, but I think the idea of having a pool of apply workers is sound and
can be quite beneficial for some workloads.
I don't think it matters very much whether the workers are started at the
beginning or allocated ad hoc, that's IMO a minor implementation detail.
There's one huge challenge, however, that I don't see mentioned in
your message or in the patch (after a cursory reading): ensuring the
same commit order, without introducing deadlocks that would not exist
in single-process apply.
Surely, we want to end up with the same commit order as on the
upstream, otherwise we might easily get different data on the
subscriber. So when we pass the large transaction to a separate
process, then this process has to wait for the other processes
applying transactions that committed first. And similarly, other
processes have to wait for this process, depending on the commit
order. I might have missed something, but I don't see anything like
that in your patch.
Essentially, this means there needs to be some sort of wait between those
apply processes, enforcing the commit order.
That however means we can easily introduce deadlocks into workloads
where serial apply would not have that issue - imagine multiple large
transactions touching the same set of rows. We may ship them to
different bgworkers, and those processes may deadlock.
Of course, the deadlock detector will come around (assuming the wait
is done in a way visible to the detector) and will abort one of the
processes. But we don't know it'll abort the right one - it may easily
abort the apply process that needs to commit first, while everyone
else is waiting for it. Which stalls the apply forever.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services