Dear everyone, Tomas,

First of all, the "v4" patchset for non-volatile WAL buffer attached to the previous mail is actually v5... Please read "v4" as "v5."
Then, to Tomas: Thank you for the crash report you gave on Nov 27, 2020, regarding the msync patchset. I applied the latest msync patchset v3 attached to the previous mail to master 411ae64 (as of Jan 18, 2021), then tested it, and I got no error when running pgbench -i -s 500. Please try it if necessary.

Best regards,
Takashi

Jan 26, 2021 (Tue) 17:52 Takashi Menjo <takashi.me...@gmail.com>:

> Dear everyone,
>
> Sorry but I forgot to attach my patchsets... Please see the files attached to this mail. Please also note that they contain some fixes.
>
> Best regards,
> Takashi
>
> Jan 26, 2021 (Tue) 17:46 Takashi Menjo <takashi.me...@gmail.com>:
>
>> Dear everyone,
>>
>> I'm sorry for the late reply. I have rebased my two patchsets onto the latest master 411ae64. The one patchset prefixed with v4 is for non-volatile WAL buffer; the other prefixed with v3 is for msync.
>>
>> I will reply to your helpful feedback one by one within a few days. Please wait for a moment.
>>
>> Best regards,
>> Takashi
>>
>> 01/25/2021(Mon) 11:56 Masahiko Sawada <sawada.m...@gmail.com>:
>>
>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>>> >
>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra <tomas.von...@enterprisedb.com> wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I think I've managed to get the 0002 patch [1] rebased to master and working (with help from Masahiko Sawada). It's not clear to me how it could have worked as submitted - my theory is that an incomplete patch was submitted by mistake, or something like that.
>>> > >>
>>> > >> Unfortunately, the benchmark results were kinda disappointing. For a pgbench on scale 500 (fits into shared buffers), an average of three 5-minute runs looks like this:
>>> > >>
>>> > >>   branch               1      16      32      64      96
>>> > >>   ----------------------------------------------------------------
>>> > >>   master            7291   87704  165310  150437  224186
>>> > >>   ntt               7912  106095  213206  212410  237819
>>> > >>   simple-no-buffers 7654   96544  115416   95828  103065
>>> > >>
>>> > >> NTT refers to the patch from September 10, pre-allocating a large WAL file on PMEM, and simple-no-buffers is the simpler patch simply removing the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
>>> > >>
>>> > >> Note: The patch is just replacing the old implementation with mmap. That's good enough for experiments like this, but we probably want to keep the old one for setups without PMEM. But it's good enough for testing, benchmarking etc.
>>> > >>
>>> > >> Unfortunately, the results for this simple approach are pretty bad. Not only compared to the "ntt" patch, but even to master. I'm not entirely sure what the root cause is, but I have a couple of hypotheses:
>>> > >>
>>> > >> 1) bug in the patch - That's clearly a possibility, although I've tried to eliminate it.
>>> > >>
>>> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than NVMe storage, but still much slower than DRAM (both in terms of latency and bandwidth, see [2] for some data). It's not terrible, but the latency is maybe 2-3x higher - not a huge difference, but may matter for WAL buffers?
>>> > >>
>>> > >> 3) PMEM does not handle parallel writes well - If you look at [2], Figure 4(b), you'll see that the throughput actually *drops* as the number of threads increases. That's pretty strange / annoying, because that's how we write into WAL buffers - each thread writes its own data, so parallelism is not something we can get rid of.
>>> > >>
>>> > >> I've added some simple profiling, to measure the number of calls / time for each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data for each backend, and logs the counts every 1M ops.
>>> > >>
>>> > >> Typical stats from a concurrent run look like this:
>>> > >>
>>> > >>   xlog stats cnt 43000000
>>> > >>   map cnt 100 time 5448333 unmap cnt 100 time 3730963
>>> > >>   memcpy cnt 985964 time 1550442272 len 15150499
>>> > >>   memset cnt 0 time 0 len 0
>>> > >>   persist cnt 13836 time 10369617 len 16292182
>>> > >>
>>> > >> The times are in nanoseconds, so this says the backend did 100 mmap and unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls, taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy copying about 15MB of data. That's quite a lot :-(
>>> > >
>>> > > It might also be interesting if we can see how much time is spent on each logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().
>>> > >
>>> >
>>> > Yeah, we could extend it to that, that's a fairly mechanical thing. But maybe that could be visible in a regular perf profile. Also, I suppose most of the time will be used by the pmem calls, shown in the stats.
>>> >
>>> > >>
>>> > >> My conclusion from this is that eliminating WAL buffers and writing WAL directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the right approach.
>>> > >>
>>> > >> I suppose we should keep WAL buffers, and then just write the data to mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does, except that it allocates one huge file on PMEM and writes to that (instead of the traditional WAL segments).
>>> > >>
>>> > >> So I decided to try how it'd work with writing to regular WAL segments, mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch does that, and the results look a bit nicer:
>>> > >>
>>> > >>   branch               1      16      32      64      96
>>> > >>   ----------------------------------------------------------------
>>> > >>   master            7291   87704  165310  150437  224186
>>> > >>   ntt               7912  106095  213206  212410  237819
>>> > >>   simple-no-buffers 7654   96544  115416   95828  103065
>>> > >>   with-wal-buffers  7477   95454  181702  140167  214715
>>> > >>
>>> > >> So, much better than the version without WAL buffers, somewhat better than master (except for 64/96 clients), but still not as good as NTT.
>>> > >>
>>> > >> At this point I was wondering how the NTT patch could be faster when it's doing roughly the same thing. I'm sure there are some differences, but it seemed strange. The main difference seems to be that it only maps one large file, and only once. OTOH the alternative "simple" patch maps segments one by one, in each backend. Per the debug stats the map/unmap calls are fairly cheap, but maybe it interferes with the memcpy somehow.
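For reference, the libpmem calls counted above (map / memcpy / persist) correspond to a write path roughly like the minimal standalone sketch below. This is not code from either patch; the mount point, segment file name, offset, and sizes are made-up examples, and error handling is minimal.

/*
 * Minimal sketch of writing WAL-like data directly to an mmap-ed segment
 * on PMEM with libpmem.  NOT code from either patch; the DAX mount point,
 * file name, and offset are illustrative assumptions.
 * Build (with PMDK installed): cc pmem_wal_sketch.c -lpmem
 */
#include <stdio.h>
#include <string.h>
#include <libpmem.h>

#define SEG_SIZE ((size_t) 16 * 1024 * 1024)	/* 16MB, the default WAL segment size */

int
main(void)
{
	size_t		mapped_len;
	int			is_pmem;
	char	   *seg;
	const char *record = "pretend this is a WAL record";
	size_t		off = 8192;		/* pretend insert position within the segment */

	/* Map a segment-sized file on the DAX filesystem, creating it if needed. */
	seg = pmem_map_file("/mnt/pmem/000000010000000000000001", SEG_SIZE,
						PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
	if (seg == NULL)
	{
		perror("pmem_map_file");
		return 1;
	}
	if (!is_pmem)
		fprintf(stderr, "warning: mapping is not PMEM; pmem_msync() would be needed instead\n");

	/*
	 * Copy the record into the mapped segment.  The _nodrain variant copies
	 * and flushes the affected cache lines but does not wait for them; one
	 * pmem_drain() afterwards waits for all pending flushes.  (For a single
	 * range, pmem_memcpy_persist() or pmem_persist() would do both steps.)
	 */
	pmem_memcpy_nodrain(seg + off, record, strlen(record));
	pmem_drain();

	pmem_unmap(seg, mapped_len);
	return 0;
}

The map, memcpy, and persist counters in the stats above should roughly correspond to these calls, which is why the per-backend mmap/unmap cost and the pmem_memcpy cost show up separately.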
>>> > >>
>>> > >
>>> > > While looking at the two methods, NTT and simple-no-buffer, I realized that in XLogFlush(), the NTT patch flushes (by pmem_flush() and pmem_drain()) WAL without acquiring WALWriteLock whereas the simple-no-buffer patch acquires WALWriteLock to do that (pmem_persist()). I wonder if this also affected the performance differences between those two methods since WALWriteLock serializes the operations. With PMEM, can multiple backends concurrently flush the records if the memory regions do not overlap? If so, flushing WAL without WALWriteLock would be a big benefit.
>>> > >
>>> >
>>> > That's a very good question - it's quite possible the WALWriteLock is not really needed, because the processes are actually "writing" the WAL directly to PMEM. So it's a bit confusing, because it's only really concerned about making sure it's flushed.
>>> >
>>> > And yes, multiple processes certainly can write to PMEM at the same time, in fact it's a requirement to get good throughput I believe. My understanding is we need ~8 processes, at least that's what I heard from people with more PMEM experience.
>>>
>>> Thanks, that's good to know.
>>>
>>> >
>>> > TBH I'm not convinced the code in the "simple-no-buffer" patch (coming from the 0002 patch) is actually correct. Essentially, consider a backend that needs to do a flush, but does not have a segment mapped. So it maps it and calls pmem_drain() on it.
>>> >
>>> > But does that actually flush anything? Does it properly flush changes done by other processes that may not have called pmem_drain() yet? I find this somewhat suspicious and I'd bet all processes that did write something have to call pmem_drain().
>>>
>>> Yeah, in terms of experiments at least it's good to find out that the approach of mmapping each WAL segment is not good for performance.
>>>
>>> >
>>> > >> So I did an experiment by increasing the size of the WAL segments. I chose to try with 512MB and 1024MB, and the results with 1GB look like this:
>>> > >>
>>> > >>   branch               1      16      32      64      96
>>> > >>   ----------------------------------------------------------------
>>> > >>   master            6635   88524  171106  163387  245307
>>> > >>   ntt               7909  106826  217364  223338  242042
>>> > >>   simple-no-buffers 7871  101575  199403  188074  224716
>>> > >>   with-wal-buffers  7643  101056  206911  223860  261712
>>> > >>
>>> > >> So yeah, there's a clear difference. It changes the values for "master" a bit, but both the "simple" patches (with and without WAL buffers) are much faster. The with-wal-buffers one is almost equal to the NTT patch, which was using a 96GB file. I presume larger WAL segments would get even closer, if we supported them.
>>> > >>
>>> > >> I'll continue investigating this, but my conclusion so far seems to be that we can't really replace WAL buffers with PMEM - that seems to perform much worse.
>>> > >>
>>> > >> The question is what to do about the segment size. Can we reduce the overhead of mmap-ing individual segments, so that this works even for smaller WAL segments, to make this useful for common instances (not everyone wants to run with 1GB WAL)? Or do we need to adopt the design with a large file, mapped just once?
>>> > >>
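To make the two flush paths compared in this exchange easier to see, here is a small hedged sketch. It is not code from either patch: a pthread mutex stands in for WALWriteLock, and "wal" is assumed to point into a region mapped with pmem_map_file() as in the earlier sketch.

/*
 * Sketch of the two flush styles discussed above.  NOT code from either
 * patch: walwrite_lock is a pthread stand-in for WALWriteLock.
 */
#include <stddef.h>
#include <pthread.h>
#include <libpmem.h>

static pthread_mutex_t walwrite_lock = PTHREAD_MUTEX_INITIALIZER;

/* Style A ("simple-no-buffer"): serialize the flush behind one lock. */
static void
flush_with_lock(char *wal, size_t start, size_t end)
{
	pthread_mutex_lock(&walwrite_lock);
	pmem_persist(wal + start, end - start);		/* flush + drain in one call */
	pthread_mutex_unlock(&walwrite_lock);
}

/*
 * Style B ("ntt"): each backend flushes only the bytes it wrote, then
 * drains, with no global lock.  Note that pmem_drain() only waits for
 * flushes already issued; it does not flush cache lines that were never
 * passed to pmem_flush()/pmem_memcpy_*(), which is exactly the concern
 * about other processes' unflushed writes raised above.
 */
static void
flush_lock_free(char *wal, size_t start, size_t end)
{
	pmem_flush(wal + start, end - start);		/* queue cache-line flushes */
	pmem_drain();								/* wait for them to finish  */
}

Whether style B is actually safe in PostgreSQL terms - i.e., whether a backend can trust that every other writer has flushed its own bytes before a flush LSN is reported - is the open question in the quoted discussion.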
>>> > >> Another question is whether it's even worth the extra complexity. On 16MB segments the difference between master and the NTT patch seems to be non-trivial, but increasing the WAL segment size kinda reduces that. So maybe just using file I/O on a PMEM DAX filesystem seems good enough. Alternatively, maybe we could switch to libpmemblk, which should eliminate the filesystem overhead at least.
>>> > >
>>> > > I think the performance improvement by the NTT patch with the 16MB WAL segment, the most common WAL segment size, is very good (150437 vs. 212410 with 64 clients). But maybe evaluating writing WAL segment files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if we don't do that yet.
>>> > >
>>> >
>>> > Well, not sure. I think the question is still open whether it's actually safe to run on DAX, which does not have atomic writes of 512B sectors, and I think we rely on that e.g. for pg_control. But maybe for WAL that's not an issue.
>>>
>>> I think we can use the Block Translation Table (BTT) driver that provides atomic sector updates.
>>>
>>> > > Also, I'm interested in why the throughput of the NTT patch saturated at 32 clients, which is earlier than master's (96 clients). How many CPU cores are there on the machine you used?
>>> > >
>>> >
>>> > From what I know, this is somewhat expected for PMEM devices, for a bunch of reasons:
>>> >
>>> > 1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so it takes fewer processes to saturate it.
>>> >
>>> > 2) Internally, the PMEM has a 256B buffer for writes, used for combining etc. With too many processes sending writes, the write stream starts to look more random, which is harmful for throughput.
>>> >
>>> > When combined, this means the performance starts dropping at a certain number of threads, and the optimal number of threads is rather low (something like 5-10). This is very different behavior compared to DRAM.
>>>
>>> Makes sense.
>>>
>>> > There's a nice overview and measurements in this paper:
>>> >
>>> > Building blocks for persistent memory / How to get the most out of your new memory?
>>> > Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons Kemper
>>> >
>>> > https://link.springer.com/article/10.1007/s00778-020-00622-9
>>>
>>> Thank you. I'll read it.
>>>
>>> > >> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a huge read-write asymmetry (the writes being way slower), and their recommendation (in "Observation 3") is:
>>> > >>
>>> > >>   The read-write asymmetry of PMem implies the necessity of avoiding writes as much as possible for PMem.
>>> > >>
>>> > >> So maybe we should not be trying to use PMEM for WAL, which is pretty write-heavy (and in most cases even write-only).
>>> > >
>>> > > I think using PMEM for WAL is cost-effective but it leverages only the low-latency (sequential) write, not other abilities such as fine-grained access and low-latency random write. If we want to exploit all of its abilities we might need some drastic changes to the logging protocol while considering storing data on PMEM.
>>> > >
>>> >
>>> > True. I think it's worth investigating whether it's sensible to use PMEM for this purpose. It may turn out that replacing the DRAM WAL buffers with writes directly to PMEM is not economical, and aggregating data in a DRAM buffer is better :-(
>>>
>>> Yes. I think it might be interesting to do an analysis of the bottlenecks of the NTT patch by perf etc. If the bottlenecks are moved to other places by removing WALWriteLock during flush, it's probably a good sign for further performance improvements. IIRC WALWriteLock is one of the main bottlenecks on OLTP workloads, although my memory might already be out of date.
>>>
>>> Regards,
>>>
>>> --
>>> Masahiko Sawada
>>> EDB: https://www.enterprisedb.com/
>>
>> --
>> Takashi Menjo <takashi.me...@gmail.com>
>
> --
> Takashi Menjo <takashi.me...@gmail.com>

--
Takashi Menjo <takashi.me...@gmail.com>