On 28/08/18 03:39, Daniel Wood wrote:
> Having quit Amazon, where I was doing Postgres development, I've
> started looking at various things I might work on for fun. One
> thought is to start with something easy like the scalability of
> GetSnapshotData(). :-)

Cool! :-)

> I recently found it interesting to examine performance while running
> near 1 million pgbench selects per sec on a 48 core/96 HT Skylake
> box. I noticed that additional sessions trying to connect were timing
> out when they got stuck in ProcArrayAdd trying to get the
> ProcArrayLock in EXCLUSIVE mode. FYI, scale 10000 with 2048 clients.
>
> The question is whether it is possible that the problem with
> GetSnapshotData() has reached a critical point, with respect to
> snapshot scaling, on the newest high end systems.

Yeah, GetSnapshotData() certainly becomes a bottleneck in certain workloads.

> What I'd like is a short cut to any of the current discussions of
> various ideas to improve snapshot scaling. I have some of my own
> ideas but want to review things before posting them.

The main solution we've been discussing on -hackers over the last few years is changing the way snapshots work, to use a Commit Sequence Number. If we assign each transaction a CSN, then a snapshot is just a single integer, and GetSnapshotData() just needs to read the current value of the CSN counter. CSNs have problems of their own, of course :-). If you search the archives for "CSN", you'll find several threads on that.
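
To make the "snapshot is just a single integer" idea concrete, here is a minimal sketch in C. All names (CommitSeqNo, csn_snapshot_take, csn_commit, csn_is_visible) are made up for illustration and are not PostgreSQL code. It also assumes the CSN of a transaction is already known to the caller; how that lookup would work is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not PostgreSQL's actual API. */
typedef uint64_t CommitSeqNo;

/* Assumed marker for a transaction that has not committed yet. */
#define CSN_IN_PROGRESS ((CommitSeqNo) 0)

/* Last CSN handed out; starts at 1 so that 0 can mean "in progress". */
static _Atomic CommitSeqNo csn_counter = 1;

/* Taking a snapshot is a single atomic read of the counter. */
CommitSeqNo
csn_snapshot_take(void)
{
    return atomic_load(&csn_counter);
}

/* Committing bumps the counter and returns the new value as the CSN. */
CommitSeqNo
csn_commit(void)
{
    return atomic_fetch_add(&csn_counter, 1) + 1;
}

/* A transaction is visible if it committed at or before the snapshot. */
bool
csn_is_visible(CommitSeqNo xact_csn, CommitSeqNo snapshot)
{
    assert(snapshot != CSN_IN_PROGRESS);    /* counter never hands out 0 */
    return xact_csn != CSN_IN_PROGRESS && xact_csn <= snapshot;
}
```

A snapshot taken before a commit does not see that transaction, while one taken afterwards does, and no per-backend array is scanned in either case.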

Other less invasive ideas have also been thrown around. For example, when one backend acquires a snapshot, it could store a copy of that in shared memory. The next call to GetSnapshotData() could then just memcpy() the cached snapshot. Transaction commit would need to invalidate the cached copy. This helps if you have a lot of reads and few writes.
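
A rough single-process sketch of the cached-snapshot idea, again in C with made-up names (Snapshot, xact_begin, xact_end, get_snapshot, MAX_PROCS). It assumes a toy array of running transaction ids instead of the real proc array, and it omits all locking; a real version would have to protect both the cache and the validity flag in shared memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_PROCS 8     /* hypothetical, tiny on purpose */

/* Simplified snapshot: xids in [xmin, xmax) listed in xip are in progress. */
typedef struct Snapshot
{
    uint32_t xmin;
    uint32_t xmax;
    int      xcnt;
    uint32_t xip[MAX_PROCS];
} Snapshot;

static uint32_t running_xids[MAX_PROCS];    /* 0 marks a free slot */
static uint32_t next_xid = 1;

static Snapshot cached_snapshot;            /* would live in shared memory */
static bool     cache_valid = false;
static int      slow_path_builds = 0;       /* counts full scans, for demo */

/* Start a transaction; returns its xid, or 0 if no slot is free. */
uint32_t
xact_begin(void)
{
    for (int i = 0; i < MAX_PROCS; i++)
    {
        if (running_xids[i] == 0)
        {
            running_xids[i] = next_xid;
            return next_xid++;
        }
    }
    return 0;
}

/* Finish a transaction; this is what invalidates the cached snapshot. */
void
xact_end(uint32_t xid)
{
    for (int i = 0; i < MAX_PROCS; i++)
        if (running_xids[i] == xid)
            running_xids[i] = 0;
    cache_valid = false;
}

/* Copy the cached snapshot if valid, otherwise rebuild it by scanning. */
void
get_snapshot(Snapshot *out)
{
    if (!cache_valid)
    {
        Snapshot *s = &cached_snapshot;

        s->xmax = next_xid;
        s->xmin = s->xmax;
        s->xcnt = 0;
        for (int i = 0; i < MAX_PROCS; i++)
        {
            uint32_t xid = running_xids[i];

            if (xid == 0)
                continue;
            s->xip[s->xcnt++] = xid;
            if (xid < s->xmin)
                s->xmin = xid;
        }
        cache_valid = true;
        slow_path_builds++;
    }
    memcpy(out, &cached_snapshot, sizeof(Snapshot));   /* fast path */
}
```

In this sketch, starting a transaction does not invalidate the cache, because any xid assigned later is at or above the cached xmax and is therefore still treated as in progress. Only xact_end() forces the next caller onto the slow path.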

- Heikki
