Hi,

Now I have caught up with this thread. I see that many of you are interested in performance profiling.
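In case it helps anyone reproduce the kind of per-operation counters discussed further down in this thread (call count, accumulated time in nanoseconds, bytes copied), here is a minimal sketch in C. It is not taken from any of the patches: the names OpStats, timed_memcpy and avg_ns are hypothetical, and plain memcpy stands in for pmem_memcpy so that it builds without libpmem.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical per-backend counters for one kind of operation. */
typedef struct OpStats
{
    uint64_t cnt;      /* number of calls */
    uint64_t time_ns;  /* accumulated wall-clock time in nanoseconds */
    uint64_t len;      /* accumulated bytes processed */
} OpStats;

/* Monotonic clock reading in nanoseconds. */
static uint64_t
now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t) ts.tv_sec * 1000000000ULL + (uint64_t) ts.tv_nsec;
}

/*
 * Copy len bytes and account for the call in *st.
 * Assumption: plain memcpy is used here as a stand-in for pmem_memcpy.
 */
static void
timed_memcpy(OpStats *st, void *dst, const void *src, size_t len)
{
    uint64_t start;

    assert(st != NULL);
    start = now_ns();
    memcpy(dst, src, len);
    st->time_ns += now_ns() - start;
    st->cnt++;
    st->len += len;
}

/* Average time per call in nanoseconds; 0 if nothing was recorded. */
static double
avg_ns(const OpStats *st)
{
    return st->cnt == 0 ? 0.0 : (double) st->time_ns / (double) st->cnt;
}
```

A backend would keep one OpStats per operation, call timed_memcpy(&stats, dst, src, len) wherever it copies WAL data, and log cnt, time_ns, len and avg_ns(&stats) every N operations.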
I am sharing my slides from SNIA SDC 2020 [1]. In the slides, I profiled XLogInsert and XLogFlush (mainly the latter) for my non-volatile WAL buffer patchset. I found that the patchset eliminated the time spent in XLogWrite and in locking/unlocking WALWriteLock. Instead, XLogInsert and WaitXLogInsertionsToFinish took more (or a little more) time than before, because memcpy-ing to PMEM (Optane PMem) is slower than to DRAM. For details, please see the slides.

Best regards,
Takashi

[1] https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020

On Tue, Jan 26, 2021 at 18:50 Takashi Menjo <takashi.me...@gmail.com> wrote:

> Dear everyone, Tomas,
>
> First of all, the "v4" patchset for non-volatile WAL buffer attached to
> the previous mail is actually v5... Please read "v4" as "v5."
>
> Then, to Tomas:
> Thank you for the crash report you gave on Nov 27, 2020, regarding the
> msync patchset. I applied the latest msync patchset v3, attached to the
> previous mail, to master 411ae64 (on Jan 18, 2021), then tested it, and I
> got no error when running pgbench -i -s 500. Please try it if necessary.
>
> Best regards,
> Takashi
>
>
> On Tue, Jan 26, 2021 at 17:52 Takashi Menjo <takashi.me...@gmail.com> wrote:
>
>> Dear everyone,
>>
>> Sorry but I forgot to attach my patchsets... Please see the files
>> attached to this mail. Please also note that they contain some fixes.
>>
>> Best regards,
>> Takashi
>>
>>
>> On Tue, Jan 26, 2021 at 17:46 Takashi Menjo <takashi.me...@gmail.com> wrote:
>>
>>> Dear everyone,
>>>
>>> I'm sorry for the late reply. I rebased my two patchsets onto the latest
>>> master 411ae64. The patchset prefixed with v4 is for the non-volatile WAL
>>> buffer; the other, prefixed with v3, is for msync.
>>>
>>> I will reply to your valuable feedback one by one within days. Please
>>> wait for a moment.
>>>
>>> Best regards,
>>> Takashi
>>>
>>>
>>> On Mon, Jan 25, 2021 at 11:56 Masahiko Sawada <sawada.m...@gmail.com> wrote:
>>>
>>>> On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
>>>> <tomas.von...@enterprisedb.com> wrote:
>>>> >
>>>> > On 1/21/21 3:17 AM, Masahiko Sawada wrote:
>>>> > > On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
>>>> > > <tomas.von...@enterprisedb.com> wrote:
>>>> > >>
>>>> > >> Hi,
>>>> > >>
>>>> > >> I think I've managed to get the 0002 patch [1] rebased to master and
>>>> > >> working (with help from Masahiko Sawada). It's not clear to me how it
>>>> > >> could have worked as submitted - my theory is that an incomplete patch
>>>> > >> was submitted by mistake, or something like that.
>>>> > >>
>>>> > >> Unfortunately, the benchmark results were kinda disappointing. For a
>>>> > >> pgbench on scale 500 (fits into shared buffers), an average of three
>>>> > >> 5-minute runs looks like this:
>>>> > >>
>>>> > >>   branch                  1       16       32       64       96
>>>> > >>   ----------------------------------------------------------------
>>>> > >>   master               7291    87704   165310   150437   224186
>>>> > >>   ntt                  7912   106095   213206   212410   237819
>>>> > >>   simple-no-buffers    7654    96544   115416    95828   103065
>>>> > >>
>>>> > >> NTT refers to the patch from September 10, pre-allocating a large WAL
>>>> > >> file on PMEM, and simple-no-buffers is the simpler patch simply removing
>>>> > >> the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.
>>>> > >>
>>>> > >> Note: The patch is just replacing the old implementation with mmap.
>>>> > >> That's good enough for experiments like this, but we probably want to
>>>> > >> keep the old one for setups without PMEM. But it's good enough for
>>>> > >> testing, benchmarking etc.
>>>> > >>
>>>> > >> Unfortunately, the results for this simple approach are pretty bad.
>>>> > >> Not only compared to the "ntt" patch, but even to master.
>>>> > >> I'm not entirely sure what the root cause is, but I have a couple
>>>> > >> of hypotheses:
>>>> > >>
>>>> > >> 1) bug in the patch - That's clearly a possibility, although I've
>>>> > >> tried to eliminate this possibility.
>>>> > >>
>>>> > >> 2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
>>>> > >> NVMe storage, but still much slower than DRAM (both in terms of latency
>>>> > >> and bandwidth, see [2] for some data). It's not terrible, but the
>>>> > >> latency is maybe 2-3x higher - not a huge difference, but may matter for
>>>> > >> WAL buffers?
>>>> > >>
>>>> > >> 3) PMEM does not handle parallel writes well - If you look at [2],
>>>> > >> Figure 4(b), you'll see that the throughput actually *drops* as the
>>>> > >> number of threads increases. That's pretty strange / annoying, because
>>>> > >> that's how we write into WAL buffers - each thread writes its own data,
>>>> > >> so parallelism is not something we can get rid of.
>>>> > >>
>>>> > >> I've added some simple profiling, to measure number of calls / time for
>>>> > >> each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
>>>> > >> for each backend, and logs the counts every 1M ops.
>>>> > >>
>>>> > >> Typical stats from a concurrent run look like this:
>>>> > >>
>>>> > >>   xlog stats cnt 43000000
>>>> > >>      map cnt 100 time 5448333 unmap cnt 100 time 3730963
>>>> > >>      memcpy cnt 985964 time 1550442272 len 15150499
>>>> > >>      memset cnt 0 time 0 len 0
>>>> > >>      persist cnt 13836 time 10369617 len 16292182
>>>> > >>
>>>> > >> The times are in nanoseconds, so this says the backend did 100 mmap and
>>>> > >> unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
>>>> > >> taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
>>>> > >> copying about 15MB of data.
>>>> > >> That's quite a lot :-(
>>>> > >
>>>> > > It might also be interesting if we can see how much time is spent on
>>>> > > each logging function, such as XLogInsert(), XLogWrite(), and
>>>> > > XLogFlush().
>>>> > >
>>>> >
>>>> > Yeah, we could extend it to that, that's a fairly mechanical thing. But
>>>> > maybe that could be visible in a regular perf profile. Also, I suppose
>>>> > most of the time will be used by the pmem calls, shown in the stats.
>>>> >
>>>> > >>
>>>> > >> My conclusion from this is that eliminating WAL buffers and writing WAL
>>>> > >> directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
>>>> > >> right approach.
>>>> > >>
>>>> > >> I suppose we should keep WAL buffers, and then just write the data to
>>>> > >> mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
>>>> > >> except that it allocates one huge file on PMEM and writes to that
>>>> > >> (instead of the traditional WAL segments).
>>>> > >>
>>>> > >> So I decided to try how it'd work with writing to regular WAL segments,
>>>> > >> mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
>>>> > >> and the results look a bit nicer:
>>>> > >>
>>>> > >>   branch                  1       16       32       64       96
>>>> > >>   ----------------------------------------------------------------
>>>> > >>   master               7291    87704   165310   150437   224186
>>>> > >>   ntt                  7912   106095   213206   212410   237819
>>>> > >>   simple-no-buffers    7654    96544   115416    95828   103065
>>>> > >>   with-wal-buffers     7477    95454   181702   140167   214715
>>>> > >>
>>>> > >> So, much better than the version without WAL buffers, somewhat better
>>>> > >> than master (except for 64/96 clients), but still not as good as NTT.
>>>> > >>
>>>> > >> At this point I was wondering how the NTT patch could be faster when
>>>> > >> it's doing roughly the same thing. I'm sure there are some differences,
>>>> > >> but it seemed strange.
>>>> > >> The main difference seems to be that it only maps one large file,
>>>> > >> and only once. OTOH the alternative "simple" patch maps segments one
>>>> > >> by one, in each backend. Per the debug stats the map/unmap calls are
>>>> > >> fairly cheap, but maybe it interferes with the memcpy somehow.
>>>> > >>
>>>> > >
>>>> > > While looking at the two methods: NTT and simple-no-buffer, I realized
>>>> > > that in XLogFlush(), NTT patch flushes (by pmem_flush() and
>>>> > > pmem_drain()) WAL without acquiring WALWriteLock whereas
>>>> > > simple-no-buffer patch acquires WALWriteLock to do that
>>>> > > (pmem_persist()). I wonder if this also affected the performance
>>>> > > differences between those two methods since WALWriteLock serializes
>>>> > > the operations. With PMEM, multiple backends can concurrently flush
>>>> > > the records if the memory region is not overlapped? If so, flushing
>>>> > > WAL without WALWriteLock would be a big benefit.
>>>> > >
>>>> >
>>>> > That's a very good question - it's quite possible the WALWriteLock is
>>>> > not really needed, because the processes are actually "writing" the WAL
>>>> > directly to PMEM. So it's a bit confusing, because it's only really
>>>> > concerned about making sure it's flushed.
>>>> >
>>>> > And yes, multiple processes certainly can write to PMEM at the same
>>>> > time, in fact it's a requirement to get good throughput I believe. My
>>>> > understanding is we need ~8 processes, at least that's what I heard from
>>>> > people with more PMEM experience.
>>>>
>>>> Thanks, that's good to know.
>>>>
>>>> >
>>>> > TBH I'm not convinced the code in the "simple-no-buffer" patch (coming
>>>> > from the 0002 patch) is actually correct. Essentially, consider the
>>>> > backend needs to do a flush, but does not have a segment mapped. So it
>>>> > maps it and calls pmem_drain() on it.
>>>> >
>>>> > But does that actually flush anything?
>>>> > Does it properly flush changes
>>>> > done by other processes that may not have called pmem_drain() yet? I
>>>> > find this somewhat suspicious and I'd bet all processes that did write
>>>> > something have to call pmem_drain().
>>>>
>>>> Yeah, in terms of experiments at least it's good to find out that the
>>>> approach of mmapping each WAL segment does not perform well.
>>>>
>>>> >
>>>> > >> So I did an experiment by increasing the size of the WAL segments. I
>>>> > >> chose to try with 512MB and 1024MB, and the results with 1GB look
>>>> > >> like this:
>>>> > >>
>>>> > >>   branch                  1       16       32       64       96
>>>> > >>   ----------------------------------------------------------------
>>>> > >>   master               6635    88524   171106   163387   245307
>>>> > >>   ntt                  7909   106826   217364   223338   242042
>>>> > >>   simple-no-buffers    7871   101575   199403   188074   224716
>>>> > >>   with-wal-buffers     7643   101056   206911   223860   261712
>>>> > >>
>>>> > >> So yeah, there's a clear difference. It changes the values for "master"
>>>> > >> a bit, but both the "simple" patches (with and without WAL buffers) are
>>>> > >> much faster. The with-wal-buffers is almost equal to the NTT patch,
>>>> > >> which was using a 96GB file. I presume larger WAL segments would get
>>>> > >> even closer, if we supported them.
>>>> > >>
>>>> > >> I'll continue investigating this, but my conclusion so far seems to be
>>>> > >> that we can't really replace WAL buffers with PMEM - that seems to
>>>> > >> perform much worse.
>>>> > >>
>>>> > >> The question is what to do about the segment size. Can we reduce the
>>>> > >> overhead of mmap-ing individual segments, so that this works even for
>>>> > >> smaller WAL segments, to make this useful for common instances (not
>>>> > >> everyone wants to run with 1GB WAL)? Or do we need to adopt the
>>>> > >> design with a large file, mapped just once?
>>>> > >>
>>>> > >> Another question is whether it's even worth the extra complexity. On
>>>> > >> 16MB segments the difference between master and NTT patch seems to be
>>>> > >> non-trivial, but increasing the WAL segment size kinda reduces that. So
>>>> > >> maybe just using File I/O on PMEM DAX filesystem is good enough.
>>>> > >> Alternatively, maybe we could switch to libpmemblk, which should
>>>> > >> eliminate the filesystem overhead at least.
>>>> > >
>>>> > > I think the performance improvement by NTT patch with the 16MB WAL
>>>> > > segment, the most common WAL segment size, is very good (150437 vs.
>>>> > > 212410 with 64 clients). But maybe evaluating writing WAL segment
>>>> > > files on PMEM DAX filesystem is also worthwhile, as you mentioned, if
>>>> > > we haven't done that yet.
>>>> > >
>>>> >
>>>> > Well, not sure. I think the question is still open whether it's actually
>>>> > safe to run on DAX, which does not have atomic writes of 512B sectors,
>>>> > and I think we rely on that e.g. for pg_config. But maybe for WAL that's
>>>> > not an issue.
>>>>
>>>> I think we can use the Block Translation Table (BTT) driver that
>>>> provides atomic sector updates.
>>>>
>>>> >
>>>> > > Also, I'm interested in why the throughput of NTT patch saturated at
>>>> > > 32 clients, which is earlier than the master's one (96 clients). How
>>>> > > many CPU cores are there on the machine you used?
>>>> > >
>>>> >
>>>> > From what I know, this is somewhat expected for PMEM devices, for a
>>>> > bunch of reasons:
>>>> >
>>>> > 1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
>>>> > it takes fewer processes to saturate it.
>>>> >
>>>> > 2) Internally, the PMEM has a 256B buffer for writes, used for combining
>>>> > etc. With too many processes sending writes, the access pattern starts
>>>> > to look more random, which is harmful for throughput.
>>>> >
>>>> > When combined, this means the performance starts dropping at a certain
>>>> > number of threads, and the optimal number of threads is rather low
>>>> > (something like 5-10). This is very different behavior compared to DRAM.
>>>>
>>>> Makes sense.
>>>>
>>>> >
>>>> > There's a nice overview and measurements in this paper:
>>>> >
>>>> > Building blocks for persistent memory / How to get the most out of your
>>>> > new memory?
>>>> > Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
>>>> > Kemper
>>>> >
>>>> > https://link.springer.com/article/10.1007/s00778-020-00622-9
>>>>
>>>> Thank you. I'll read it.
>>>>
>>>> >
>>>> > >> I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
>>>> > >> huge read-write asymmetry (the writes being way slower), and their
>>>> > >> recommendation (in "Observation 3") is
>>>> > >>
>>>> > >>   The read-write asymmetry of PMem implies the necessity of avoiding
>>>> > >>   writes as much as possible for PMem.
>>>> > >>
>>>> > >> So maybe we should not be trying to use PMEM for WAL, which is pretty
>>>> > >> write-heavy (and in most cases even write-only).
>>>> > >
>>>> > > I think using PMEM for WAL is cost-effective, but it leverages only
>>>> > > the low-latency (sequential) write, and not other abilities such as
>>>> > > fine-grained access and low-latency random write. If we want to
>>>> > > exploit all its abilities we might need some drastic changes to the
>>>> > > logging protocol while considering storing data on PMEM.
>>>> > >
>>>> >
>>>> > True. I think it's worth investigating whether it's sensible to use PMEM
>>>> > for this purpose. It may turn out that replacing the DRAM WAL buffers
>>>> > with writes directly to PMEM is not economical, and aggregating data in
>>>> > a DRAM buffer is better :-(
>>>>
>>>> Yes. I think it might be interesting to do an analysis of the
>>>> bottlenecks of NTT patch by perf etc.
>>>> If bottlenecks are moved to
>>>> other places by removing WALWriteLock during flush, it's probably a
>>>> good sign for further performance improvements. IIRC WALWriteLock is
>>>> one of the main bottlenecks on OLTP workload, although my memory might
>>>> already be out of date.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> Masahiko Sawada
>>>> EDB: https://www.enterprisedb.com/
>>>>
>>>
>>>
>>> --
>>> Takashi Menjo <takashi.me...@gmail.com>
>>>
>>
>>
>> --
>> Takashi Menjo <takashi.me...@gmail.com>
>>
>
>
> --
> Takashi Menjo <takashi.me...@gmail.com>
>

--
Takashi Menjo <takashi.me...@gmail.com>