On Sat, Feb 13, 2021 at 12:18 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
> >
> > On 1/25/21 3:56 AM, Masahiko Sawada wrote:
> > >>
> > >> ...
> > >>
> > >> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > >>> ...
> > >>>
> > >>> While looking at the two methods, NTT and simple-no-buffer, I realized
> > >>> that in XLogFlush() the NTT patch flushes (by pmem_flush() and
> > >>> pmem_drain()) WAL without acquiring WALWriteLock, whereas the
> > >>> simple-no-buffer patch acquires WALWriteLock to do that
> > >>> (pmem_persist()). I wonder if this also affected the performance
> > >>> differences between those two methods, since WALWriteLock serializes
> > >>> the operations. With PMEM, can multiple backends concurrently flush
> > >>> records if the memory regions don't overlap? If so, flushing WAL
> > >>> without WALWriteLock would be a big benefit.
> > >>>
> > >>
> > >> That's a very good question - it's quite possible the WALWriteLock is
> > >> not really needed, because the processes are actually "writing" the WAL
> > >> directly to PMEM. So it's a bit confusing, because it's only really
> > >> concerned about making sure it's flushed.
> > >>
> > >> And yes, multiple processes certainly can write to PMEM at the same
> > >> time; in fact I believe it's a requirement to get good throughput. My
> > >> understanding is we need ~8 processes, at least that's what I heard
> > >> from people with more PMEM experience.
> > >
> > > Thanks, that's good to know.
> > >
> > >>
> > >> TBH I'm not convinced the code in "simple-no-buffer" (coming from the
> > >> 0002 patch) is actually correct. Essentially, consider a backend that
> > >> needs to do a flush but does not have a segment mapped. So it maps it
> > >> and calls pmem_drain() on it.
> > >>
> > >> But does that actually flush anything?
> > >> Does it properly flush changes
> > >> done by other processes that may not have called pmem_drain() yet? I
> > >> find this somewhat suspicious and I'd bet all processes that did write
> > >> something have to call pmem_drain().
> >
> > For the record, from what I learned / was told by engineers with PMEM
> > experience, calling pmem_drain() should properly flush changes done by
> > other processes. So it should be sufficient to do that in XLogFlush(),
> > from a single process.
> >
> > My understanding is that we have about three challenges here:
> >
> > (a) we still need to track how far we flushed, so this needs to be
> > protected by some lock anyway (although perhaps a much smaller section
> > of the function)
> >
> > (b) pmem_drain() flushes all the changes, so it flushes even the
> > "future" part of the WAL after the requested LSN, which may negatively
> > affect performance, I guess. So I wonder if pmem_persist() would be a
> > better fit, as it allows specifying a range that should be persisted.
> >
> > (c) As mentioned before, PMEM behaves differently with concurrent
> > access, i.e. it reaches peak throughput with a relatively low number of
> > threads writing data, and then the throughput drops quite quickly. I'm
> > not sure if the same thing applies to pmem_drain() too - if it does, we
> > may need something like we have for insertions, i.e. a handful of locks
> > allowing a limited number of concurrent inserts.
>
> Thanks. That's a good summary.
>
> >
> > >
> > > Yeah, in terms of experiments at least it's good to find out that the
> > > approach of mmapping each WAL segment is not good for performance.
> > >
> >
> > Right. The problem with small WAL segments seems to be that each mmap
> > causes the TLB to be thrown away, which means a lot of expensive cache
> > misses. As the mmap needs to be done by each backend writing WAL, this
> > is particularly bad with small WAL segments. The NTT patch works around
> > that by doing just a single mmap.
> >
> > I wonder if we could pre-allocate and mmap small segments, keep them
> > mapped, and just rename the underlying files when recycling them. That'd
> > keep the regular segment files, as expected by various tools, etc. The
> > question is what would happen when we temporarily need more WAL, etc.
> >
> > >>>
> > >>> ...
> > >>>
> > >>> I think the performance improvement by the NTT patch with the 16MB
> > >>> WAL segment, the most common WAL segment size, is very good (150437
> > >>> vs. 212410 with 64 clients). But maybe evaluating writing WAL
> > >>> segment files on a PMEM DAX filesystem is also worthwhile, as you
> > >>> mentioned, if we don't do that yet.
> > >>>
> > >>
> > >> Well, not sure. I think the question is still open whether it's
> > >> actually safe to run on DAX, which does not have atomic writes of
> > >> 512B sectors, and I think we rely on that e.g. for pg_control. But
> > >> maybe for WAL that's not an issue.
> > >
> > > I think we can use the Block Translation Table (BTT) driver that
> > > provides atomic sector updates.
> > >
> >
> > But we have benchmarked that, see my message from 2020/11/26, which
> > shows this table:
> >
> >          master/btt    master/dax    ntt       simple
> >     -----------------------------------------------------------
> >     1    5469          7402          7977      6746
> >     16   48222         80869         107025    82343
> >     32   73974         158189        214718    158348
> >     64   85921         154540        225715    164248
> >     96   150602        221159        237008    217253
> >
> > Clearly, BTT is quite expensive. Maybe there's a way to tune that at
> > the filesystem/kernel level; I haven't tried that.
>
> I missed your mail. Yeah, BTT seems to be quite expensive.
>
> >
> > >>
> > >>>> I'm also wondering if WAL is the right usage for PMEM. Per [2]
> > >>>> there's a huge read-write asymmetry (the writes being way slower),
> > >>>> and their recommendation (in "Observation 3") is:
> > >>>>
> > >>>>     The read-write asymmetry of PMem implies the necessity of
> > >>>>     avoiding writes as much as possible for PMem.
> > >>>>
> > >>>> So maybe we should not be trying to use PMEM for WAL, which is
> > >>>> pretty write-heavy (and in most cases even write-only).
> > >>>
> > >>> I think using PMEM for WAL is cost-effective, but it leverages only
> > >>> the low-latency (sequential) write, not other abilities such as
> > >>> fine-grained access and low-latency random write. If we want to
> > >>> exploit all its abilities we might need some drastic changes to the
> > >>> logging protocol while considering storing data on PMEM.
> > >>>
> > >>
> > >> True. I think it's worth investigating whether it's sensible to use
> > >> PMEM for this purpose. It may turn out that replacing the DRAM WAL
> > >> buffers with writes directly to PMEM is not economical, and
> > >> aggregating data in a DRAM buffer is better :-(
> > >
> > > Yes. I think it might be interesting to do an analysis of the
> > > bottlenecks of the NTT patch by perf etc. If the bottlenecks move to
> > > other places by removing WALWriteLock during flush, it's probably a
> > > good sign for further performance improvements. IIRC WALWriteLock is
> > > one of the main bottlenecks on OLTP workloads, although my memory
> > > might already be out of date.
> > >
> >
> > I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
> > issue - the problem is that writing the WAL to persistent storage
> > itself is expensive, and we're waiting for that.
> >
> > So it's not clear to me if removing the lock (and allowing multiple
> > processes to do pmem_drain() concurrently) can actually help,
> > considering pmem_drain() should flush writes from other processes
> > anyway.
> >
> > But as I said, that is just my theory - I might be entirely wrong; it'd
> > be good to hack XLogFlush() a bit and try it out.
> >
>
> I've done some performance benchmarks with master and the NTT v4
> patch. Let me share the results.
>
> pgbench setup:
> * scale factor = 2000
> * duration = 600 sec
> * clients = 32, 64, 96
>
> NVWAL setup:
> * nvwal_size = 50GB
> * max_wal_size = 50GB
> * min_wal_size = 50GB
>
> The whole database fits in shared_buffers, and the WAL segment file
> size is 16MB.
>
> The results are:
>
>         master    NTT       master-unlogged
>   32    113209    67107     154298
>   64    144880    54289     178883
>   96    151405    50562     180018
>
> "master-unlogged" is the same setup as "master" except for using
> unlogged tables (via the --unlogged-tables pgbench option). The TPS
> increased by about 20% compared to the "master" case (i.e., the logged
> table case). The reason why I experimented with the unlogged table
> case as well is that we can think of these results as the ideal
> performance if we were able to write WAL records in 0 sec. IOW, even
> if the PMEM patch significantly improved WAL logging performance, I
> think it could not exceed this level. But the hope is that if we
> currently have a performance bottleneck in WAL logging (e.g., locking
> and writing WAL), removing or minimizing WAL logging would bring a
> chance to further improve performance by eliminating the newly
> emerging bottleneck.
>
> As we can see from the above results, apparently the performance of
> the "ntt" case was not good in this evaluation. I've not reviewed the
> patch in-depth yet, but something might be wrong with the v4 patch, or
> the PMEM configuration I did on my environment is wrong.
I've reconfigured PMEM and done the same benchmark. I got the following
results (only the "ntt" case changed):

      master    NTT       master-unlogged
32    113209    144829    154298
64    144880    164899    178883
96    151405    166096    180018

I got much better performance with the "ntt" patch. I think it was
wrong that I created a partition on PMEM (i.e., created the filesystem
on /dev/pmem1p1) in the last evaluation. Sorry for confusing you,
Menjo-san.

FWIW here are the top 5 wait events in the new "ntt" case:

 event_type |        event         | sum
------------+----------------------+------
 Client     | ClientRead           | 8462
 LWLock     | WALInsert            | 1049
 LWLock     | ProcArray            |  627
 IPC        | ProcArrayGroupUpdate |  481
 LWLock     | XactSLRU             |  247

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/