On Sat, Feb 13, 2021 at 12:18 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
> <tomas.von...@enterprisedb.com> wrote:
> >
> > On 1/25/21 3:56 AM, Masahiko Sawada wrote:
> > >>
> > >> ...
> > >>
> > >> On 1/21/21 3:17 AM, Masahiko Sawada wrote:
> > >>> ...
> > >>>
> > >>> While looking at the two methods, NTT and simple-no-buffer, I realized
> > >>> that in XLogFlush() the NTT patch flushes (by pmem_flush() and
> > >>> pmem_drain()) WAL without acquiring WALWriteLock, whereas the
> > >>> simple-no-buffer patch acquires WALWriteLock to do that
> > >>> (pmem_persist()). I wonder if this also affected the performance
> > >>> differences between those two methods, since WALWriteLock serializes
> > >>> the operations. With PMEM, can multiple backends concurrently flush
> > >>> records if the memory regions don't overlap? If so, flushing WAL
> > >>> without WALWriteLock would be a big benefit.
> > >>>
> > >>
> > >> That's a very good question - it's quite possible the WALWriteLock is
> > >> not really needed, because the processes are actually "writing" the WAL
> > >> directly to PMEM. So it's a bit confusing, because it's only really
> > >> concerned about making sure it's flushed.
> > >>
> > >> And yes, multiple processes certainly can write to PMEM at the same
> > >> time; in fact I believe it's a requirement to get good throughput. My
> > >> understanding is we need ~8 processes, at least that's what I heard
> > >> from people with more PMEM experience.
> > >
> > > Thanks, that's good to know.
> > >
> > >>
> > >> TBH I'm not convinced the code in "simple-no-buffer" (coming from the
> > >> 0002 patch) is actually correct. Essentially, consider a backend that
> > >> needs to do a flush but does not have a segment mapped. So it maps it
> > >> and calls pmem_drain() on it.
> > >>
> > >> But does that actually flush anything?
> > >> Does it properly flush changes
> > >> done by other processes that may not have called pmem_drain() yet? I
> > >> find this somewhat suspicious and I'd bet all processes that did write
> > >> something have to call pmem_drain().
> >
> > For the record, from what I learned / was told by engineers with PMEM
> > experience, calling pmem_drain() should properly flush changes done by
> > other processes. So it should be sufficient to do that in XLogFlush(),
> > from a single process.
> >
> > My understanding is that we have about three challenges here:
> >
> > (a) we still need to track how far we flushed, so this needs to be
> > protected by some lock anyway (although perhaps a much smaller section
> > of the function)
> >
> > (b) pmem_drain() flushes all the changes, so it flushes even the
> > "future" part of the WAL after the requested LSN, which may negatively
> > affect performance, I guess. So I wonder if pmem_persist() would be a
> > better fit, as it allows specifying a range that should be persisted.
> >
> > (c) As mentioned before, PMEM behaves differently with concurrent
> > access, i.e. it reaches peak throughput with a relatively low number of
> > threads writing data, and then the throughput drops quite quickly. I'm
> > not sure if the same thing applies to pmem_drain() too - if it does, we
> > may need something like we have for insertions, i.e. a handful of locks
> > allowing a limited number of concurrent inserts.
>
> Thanks. That's a good summary.
>
> >
> > >
> > > Yeah, in terms of experiments at least it's good to find out that the
> > > approach of mmapping each WAL segment is not good for performance.
> > >
> >
> > Right. The problem with small WAL segments seems to be that each mmap
> > causes the TLB to be thrown away, which means a lot of expensive cache
> > misses. As the mmap needs to be done by each backend writing WAL, this
> > is particularly bad with small WAL segments. The NTT patch works around
> > that by doing just a single mmap.
> >
> > I wonder if we could pre-allocate and mmap small segments, keep them
> > mapped, and just rename the underlying files when recycling them. That'd
> > keep the regular segment files, as expected by various tools, etc. The
> > question is what would happen when we temporarily need more WAL, etc.
> >
> > >>>
> > >>> ...
> > >>>
> > >>> I think the performance improvement by the NTT patch with the 16MB
> > >>> WAL segment, the most common WAL segment size, is very good (150437
> > >>> vs. 212410 with 64 clients). But maybe evaluating writing WAL
> > >>> segment files on a PMEM DAX filesystem is also worthwhile, as you
> > >>> mentioned, if we don't do that yet.
> > >>>
> > >>
> > >> Well, not sure. I think the question is still open whether it's
> > >> actually safe to run on DAX, which does not have atomic writes of
> > >> 512B sectors, and I think we rely on that e.g. for pg_control. But
> > >> maybe for WAL that's not an issue.
> > >
> > > I think we can use the Block Translation Table (BTT) driver that
> > > provides atomic sector updates.
> > >
> >
> > But we have benchmarked that, see my message from 2020/11/26, which
> > shows this table:
> >
> >          master/btt    master/dax    ntt       simple
> >     -----------------------------------------------------------
> >     1    5469          7402          7977      6746
> >     16   48222         80869         107025    82343
> >     32   73974         158189        214718    158348
> >     64   85921         154540        225715    164248
> >     96   150602        221159        237008    217253
> >
> > Clearly, BTT is quite expensive. Maybe there's a way to tune that at
> > the filesystem/kernel level; I haven't tried that.
>
> I missed your mail. Yeah, BTT seems to be quite expensive.
>
> >
> > >>
> > >>>> I'm also wondering if WAL is the right usage for PMEM. Per [2]
> > >>>> there's a huge read-write asymmetry (the writes being way slower),
> > >>>> and their recommendation (in "Observation 3") is:
> > >>>>
> > >>>>     The read-write asymmetry of PMem implies the necessity of
> > >>>>     avoiding writes as much as possible for PMem.
> > >>>>
> > >>>> So maybe we should not be trying to use PMEM for WAL, which is
> > >>>> pretty write-heavy (and in most cases even write-only).
> > >>>
> > >>> I think using PMEM for WAL is cost-effective, but it leverages only
> > >>> the low-latency (sequential) write, not other abilities such as
> > >>> fine-grained access and low-latency random write. If we want to
> > >>> exploit all its abilities we might need some drastic changes to the
> > >>> logging protocol while considering storing data on PMEM.
> > >>>
> > >>
> > >> True. I think it's worth investigating whether it's sensible to use
> > >> PMEM for this purpose. It may turn out that replacing the DRAM WAL
> > >> buffers with writes directly to PMEM is not economical, and
> > >> aggregating data in a DRAM buffer is better :-(
> > >
> > > Yes. I think it might be interesting to do an analysis of the
> > > bottlenecks of the NTT patch by perf etc. If the bottlenecks move to
> > > other places by removing WALWriteLock during flush, it's probably a
> > > good sign for further performance improvements. IIRC WALWriteLock is
> > > one of the main bottlenecks on OLTP workloads, although my memory
> > > might already be out of date.
> > >
> >
> > I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
> > issue - the problem is that writing the WAL to persistent storage
> > itself is expensive, and we're waiting for that.
> >
> > So it's not clear to me if removing the lock (and allowing multiple
> > processes to do pmem_drain() concurrently) can actually help,
> > considering pmem_drain() should flush writes from other processes
> > anyway.
> >
> > But as I said, that is just my theory - I might be entirely wrong; it'd
> > be good to hack XLogFlush() a bit and try it out.
> >
>
> I've done some performance benchmarks with master and the NTT v4
> patch. Let me share the results.
>
> pgbench setup:
> * scale factor = 2000
> * duration = 600 sec
> * clients = 32, 64, 96
>
> NVWAL setup:
> * nvwal_size = 50GB
> * max_wal_size = 50GB
> * min_wal_size = 50GB
>
> The whole database fits in shared_buffers, and the WAL segment file
> size is 16MB.
>
> The results are:
>
>         master    NTT       master-unlogged
>   32    113209    67107     154298
>   64    144880    54289     178883
>   96    151405    50562     180018
>
> "master-unlogged" is the same setup as "master" except for using
> unlogged tables (via the --unlogged-tables pgbench option). The TPS
> increased by about 20% compared to the "master" case (i.e., the logged
> table case). The reason why I experimented with the unlogged table
> case as well is that we can think of these results as the ideal
> performance if we were able to write WAL records in 0 sec. IOW, even
> if the PMEM patch significantly improved WAL logging performance, I
> think it could not exceed this level. But the hope is that if we
> currently have a performance bottleneck in WAL logging (e.g., locking
> and writing WAL), removing or minimizing WAL logging would bring a
> chance to further improve performance by eliminating the newly
> emerging bottleneck.
>
> As we can see from the above results, apparently the performance of
> the "ntt" case was not good in this evaluation. I've not reviewed the
> patch in-depth yet, but something might be wrong with the v4 patch, or
> the PMEM configuration I did on my environment is wrong.
I've reconfigured PMEM and done the same benchmark. I got the following
results (only the "ntt" case changed):

      master    NTT       master-unlogged
32    113209    144829    154298
64    144880    164899    178883
96    151405    166096    180018

I got much better performance with the "ntt" patch. I think it was
wrong that I created a partition on PMEM (i.e., created the filesystem
on /dev/pmem1p1) in the last evaluation. Sorry for confusing you,
Menjo-san.

FWIW here are the top 5 wait events in the new "ntt" case:

 event_type |        event         | sum
------------+----------------------+------
 Client     | ClientRead           | 8462
 LWLock     | WALInsert            | 1049
 LWLock     | ProcArray            |  627
 IPC        | ProcArrayGroupUpdate |  481
 LWLock     | XactSLRU             |  247

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/