On 22.01.2021 5:32, Tomas Vondra wrote:
On 1/21/21 3:17 AM, Masahiko Sawada wrote:
On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
Hi,
I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.
Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:
branch                 1      16      32      64      96
----------------------------------------------------------------
master              7291   87704  165310  150437  224186
ntt                 7912  106095  213206  212410  237819
simple-no-buffers   7654   96544  115416   95828  103065
NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply
removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
Note: The patch simply replaces the old implementation with mmap. That's
good enough for experiments like this - testing, benchmarking etc. - but
we probably want to keep the old implementation for setups without PMEM.
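For illustration, the write path in this mode boils down to something like
the following (a minimal sketch using the libpmem API from PMDK; the
function name, segment path and offset handling are made up and error
handling is omitted - this is not the actual patch code):

#include <stddef.h>
#include <libpmem.h>

/*
 * Illustrative sketch only: map an existing WAL segment that lives on a
 * PMEM (DAX) filesystem and append one record directly to it, with no
 * WAL buffers in between.
 */
static void
wal_append_on_pmem(const char *segpath, size_t offset,
                   const void *record, size_t reclen)
{
    size_t  mapped_len;
    int     is_pmem;

    /* len = 0 maps the whole pre-allocated segment file */
    char   *seg = pmem_map_file(segpath, 0, 0, 0, &mapped_len, &is_pmem);

    /* copy the record straight into the mmap-ed segment */
    pmem_memcpy_nodrain(seg + offset, record, reclen);

    /* what XLogFlush() then boils down to: flush + drain the written range */
    pmem_persist(seg + offset, reclen);

    pmem_unmap(seg, mapped_len);
}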
Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what the root cause is, but I have a couple of hypotheses:
1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate it.
2) PMEM is slower than DRAM - From what I know, PMEM is much faster
than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter
for
WAL buffers?
3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.
I've added some simple profiling to measure the number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.
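For reference, the instrumentation is essentially just per-backend counters
around the pmem calls, along these lines (a simplified sketch with made-up
names, not the actual -DXLOG_DEBUG_STATS code):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* per-backend counters for one operation type (simplified) */
typedef struct XLogOpStats
{
    uint64_t    cnt;        /* number of calls */
    uint64_t    time_ns;    /* total time, nanoseconds */
    uint64_t    len;        /* total bytes processed */
} XLogOpStats;

static XLogOpStats memcpy_stats;
static uint64_t    total_ops;

static uint64_t
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/* every measured call (memcpy, persist, map, ...) is wrapped like this */
static void
timed_memcpy(void *dst, const void *src, size_t len)
{
    uint64_t    start = now_ns();

    memcpy(dst, src, len);      /* stand-in for pmem_memcpy etc. */

    memcpy_stats.cnt++;
    memcpy_stats.time_ns += now_ns() - start;
    memcpy_stats.len += len;

    /* dump the accumulated stats every 1M operations */
    if (++total_ops % 1000000 == 0)
        fprintf(stderr, "memcpy cnt %" PRIu64 " time %" PRIu64 " len %" PRIu64 "\n",
                memcpy_stats.cnt, memcpy_stats.time_ns, memcpy_stats.len);
}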
Typical stats from a concurrent run looks like this:
xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182
The times are in nanoseconds, so this says the backend did 100 mmap
and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(
It might also be interesting if we can see how much time is spent in each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().
Yeah, we could extend it to that, that's a fairly mechanical thing. But
maybe that could be visible in a regular perf profile. Also, I suppose
most of the time will be used by the pmem calls, shown in the stats.
My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not
the
right approach.
I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).
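With that approach the flush path looks roughly like this (again just a
sketch with hypothetical names, assuming the current segment is already
mmap-ed with pmem_map_file as in the earlier snippet):

#include <stddef.h>
#include <libpmem.h>

/* WAL page size; matches the usual XLOG_BLCKSZ default */
#define XLOG_BLCKSZ 8192

/*
 * Sketch only: copy the completed WAL buffer pages (in DRAM) into the
 * pmem-mapped segment, then issue a single drain at the end.
 */
static void
flush_wal_buffers_to_pmem(char *mapped_seg, size_t seg_off,
                          const char *wal_buffers, int first_page, int npages)
{
    for (int i = 0; i < npages; i++)
    {
        const char *src = wal_buffers + (size_t) (first_page + i) * XLOG_BLCKSZ;
        char       *dst = mapped_seg + seg_off + (size_t) i * XLOG_BLCKSZ;

        /* flushes the stores (CLWB/CLFLUSHOPT) but does not fence yet */
        pmem_memcpy_nodrain(dst, src, XLOG_BLCKSZ);
    }

    /* single fence to make all the copies above durable */
    pmem_drain();
}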
So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:
branch                 1      16      32      64      96
----------------------------------------------------------------
master              7291   87704  165310  150437  224186
ntt                 7912  106095  213206  212410  237819
simple-no-buffers   7654   96544  115416   95828  103065
with-wal-buffers    7477   95454  181702  140167  214715
So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.
At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.
While looking at the two methods, NTT and simple-no-buffer, I realized
that in XLogFlush() the NTT patch flushes (by pmem_flush() and
pmem_drain()) WAL without acquiring WALWriteLock, whereas the
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
differences between those two methods, since WALWriteLock serializes
the operations. With PMEM, can multiple backends concurrently flush
the records if the memory regions do not overlap? If so, flushing
WAL without WALWriteLock would be a big benefit.
That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the
WAL directly to PMEM. So it's a bit confusing, because it's only
really concerned about making sure it's flushed.
And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard
from people with more PMEM experience.
TBH I'm not convinced the code in the "simple-no-buffer" patch (coming
from the 0002 patch) is actually correct. Essentially, consider a
backend that needs to do a flush, but does not have a segment mapped. So
it maps it and calls pmem_drain() on it.
But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().
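For what it's worth, my reading of the libpmem man pages is that
pmem_persist() is just a flush of the given range followed by a drain, and
that pmem_drain() on its own is only a fence - it does not write back any
cache lines. Roughly (illustration only, not PostgreSQL code):

#include <stddef.h>
#include <libpmem.h>

static void
persist_range(void *addr, size_t len)
{
    /* pmem_persist(addr, len) is equivalent to: */
    pmem_flush(addr, len);  /* write back the stores to the given range */
    pmem_drain();           /* fence: wait for the flushes issued by this CPU */

    /*
     * So a backend that merely maps a segment and calls pmem_drain(), without
     * flushing the range itself, gets no guarantee that WAL written (but not
     * yet flushed) by other backends is durable - each writer has to flush
     * and drain (i.e. pmem_persist) its own range.
     */
}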
So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look
like this:
branch                 1      16      32      64      96
----------------------------------------------------------------
master              6635   88524  171106  163387  245307
ntt                 7909  106826  217364  223338  242042
simple-no-buffers   7871  101575  199403  188074  224716
with-wal-buffers    7643  101056  206911  223860  261712
So yeah, there's a clear difference. It changes the values for "master"
a bit, but both the "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers one is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.
I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.
The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments and is useful for common instances (not everyone
wants to run with 1GB WAL segments)? Or do we need to adopt the design
with a single large file, mapped just once?
Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and the NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.
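For completeness, the libpmemblk route would look roughly like this (a
sketch only - the pool path, the idea of addressing WAL pages by block
number, and the 8kB block size are my assumptions, and I haven't tried it):

#include <libpmemblk.h>

/*
 * Sketch: a block pool holding WAL pages, one 8kB page per block.
 * The pool would be created once with pmemblk_create(path, 8192, size, mode),
 * then opened by each backend.
 */
static PMEMblkpool *
open_wal_pool(const char *poolpath)
{
    return pmemblk_open(poolpath, 8192);    /* 8192 = WAL page size */
}

/* write one WAL page; libpmemblk makes the block write atomic w.r.t. crashes */
static void
write_wal_page(PMEMblkpool *pbp, long long pageno, const void *page)
{
    pmemblk_write(pbp, page, pageno);
}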
I think the performance improvement by the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we don't do that yet.
Well, not sure. I think the question is still open whether it's
actually safe to run on DAX, which does not have atomic writes of 512B
sectors, and I think we rely on that e.g. for pg_control. But maybe for
WAL that's not an issue.
Also, I'm interested in why the throughput of the NTT patch saturated at
32 clients, which is earlier than master's (96 clients). How
many CPU cores are there on the machine you used?
From what I know, this is somewhat expected for PMEM devices, for a
bunch of reasons:
1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%),
so it takes fewer processes to saturate it.
2) Internally, the PMEM has a 256B buffer for writes, used for write
combining etc. With too many processes sending writes, the stream starts
to look more random, which is harmful for throughput.
When combined, this means the performance starts dropping at a certain
number of threads, and the optimal number of threads is rather low
(something like 5-10). This is very different behavior compared to DRAM.
There's a nice overview and measurements in this paper:
Building blocks for persistent memory / How to get the most out of
your new memory?
Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
Kemper
https://link.springer.com/article/10.1007/s00778-020-00622-9
I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:
    The read-write asymmetry of PMem implies the necessity of avoiding
    writes as much as possible for PMem.
So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).
I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit its full ability we might need some drastic changes to the
logging protocol while considering storing data on PMEM.
True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers
with writes directly to PMEM is not economical, and aggregating data in a
DRAM buffer is better :-(
regards
I have heard from several DBMS experts that the appearance of huge and
cheap non-volatile memory could revolutionize database system architecture.
If the whole database fits in non-volatile memory, then we do not need
buffers, WAL, ...
But although multi-terabyte NVM announcements were made by IBM several
years ago, I do not know of any successful DBMS prototypes with such a new
architecture.
I tried to understand why...
It was very interesting to me to read this thread, which actually
started in 2016 with the "Non-volatile Memory Logging" presentation at PGCon.
As far as I understand from Tomas's results, right now using PMEM for WAL
doesn't provide a substantial increase in performance.
But the main advantage of PMEM, from my point of view, is that it allows
us to avoid write-ahead logging altogether!
Certainly we need to change our algorithms to make it possible. Speaking
about Postgres, we have to rewrite all indexes + heap
and throw away buffer manager + WAL.
What can be used instead of a standard B-Tree?
For example, there is a description of the multiword-CAS approach:
http://justinlevandoski.org/papers/mwcas.pdf
and a BzTree implementation on top of it:
https://www.cc.gatech.edu/~jarulraj/papers/2018.bztree.vldb.pdf
There is a free BzTree implementation on GitHub:
g...@github.com:sfu-dis/bztree.git
I tried to adapt it for Postgres. It was not so easy because:
1. It was written in modern C++ (-std=c++14)
2. It supports multithreading, but not multiprocess access
So I had to patch the code of this library instead of just using it:
g...@github.com:postgrespro/bztree.git
I have not yet tested the most interesting case: access to PMEM through
PMDK. And I do not have hardware for such tests.
But the first results already seem interesting: PMwCAS is a kind of
lockless algorithm, and it shows much better scaling on a NUMA host
compared with standard Postgres.
I have done a simple parallel insertion test: multiple clients are
inserting data with random keys.
To make the competition with vanilla Postgres more honest, I used an
unlogged table:
create unlogged table t(pk int, payload int);
create index on t using bztree(pk);
randinsert.sql:
insert into t (payload,pk) values
(generate_series(1,1000),random()*1000000000);
pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres
So each client is inserting one million records.
The target system has 160 virtual and 80 real cores with 256GB of RAM.
Results (TPS) are the following:
N     nbtree   bztree
1        540      455
10       993     2237
100     1479     5025
So bztree is more than 3 times faster for 100 clients.
Just for comparison: the result for inserting into this table without an
index is 10k TPS.
I am then going to try to play with PMEM.
If the results are promising, then it is possible to think about
reimplementing the heap and a WAL-less Postgres!
I am sorry that my post has no direct relation to the topic of this
thread (Non-volatile WAL buffer).
It just seems to me that it is better to use PMEM to eliminate WAL
altogether instead of optimizing it.
Certainly, I realize that WAL plays a very important role in Postgres:
archiving and replication are based on WAL. So even if we can live
without WAL, it is still not clear whether we really want to live
without it.
One more idea: using the multiword-CAS approach requires us to express
changes as editing sequences.
Such an editing sequence is essentially a ready-made WAL record. So
implementors of access methods do not have to do double work: update the
data structure in memory and create the corresponding WAL records.
Moreover, PMwCAS operations are atomic: we can replay or revert them in
case of a fault. So there is no need for FPW (full page writes), which
has a very noticeable impact on WAL size and database performance.
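To illustrate the point (just a conceptual sketch, not the PMwCAS algorithm
itself and not the bztree API): an editing sequence is a list of (address,
old value, new value) triples, which is exactly the information a physical
WAL record would carry, and which can be replayed or reverted after a fault:

#include <stdint.h>

/* one word-sized edit: the same triple a physical WAL record would carry */
typedef struct WordEdit
{
    uint64_t   *addr;       /* target word (in PMEM) */
    uint64_t    old_val;    /* value before the operation */
    uint64_t    new_val;    /* value after the operation */
} WordEdit;

/* redo: apply all edits of a completed operation (idempotent) */
static void
replay_edits(const WordEdit *edits, int n)
{
    for (int i = 0; i < n; i++)
        *edits[i].addr = edits[i].new_val;
}

/* undo: revert all edits of an operation that did not complete */
static void
revert_edits(const WordEdit *edits, int n)
{
    for (int i = n - 1; i >= 0; i--)
        *edits[i].addr = edits[i].old_val;
}

Atomicity and durability of the descriptor itself is what PMwCAS provides
on top of this.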
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company