Hi Tomas,

> Hello Takashi-san,
>
> On 3/5/21 9:08 AM, Takashi Menjo wrote:
> > Hi Tomas,
> >
> > Thank you so much for your report. I have read it with great
> > interest.
> >
> > Your conclusion sounds reasonable to me. The patchset you call
> > "NTT / segments" performed as well as the "NTT / buffer" patchset.
> > I had been worried that calling mmap/munmap for each WAL segment
> > file could have a lot of overhead. Based on your performance tests,
> > however, the overhead looks smaller than I thought. In addition, the
> > "NTT / segments" patchset is more compatible with the current PG and
> > more friendly to DBAs, because it uses WAL segment files and does
> > not introduce any other new WAL-related file.
>
> I agree. I was actually a bit surprised it performs this well, mostly
> in line with the "NTT / buffer" patchset. I've seen significant issues
> with our simple experimental patches, which however went away with
> larger WAL segments. But the "NTT / segments" patch does not have that
> issue, so either our patches were doing something wrong, or perhaps
> there was some other issue (not sure why larger WAL segments would
> improve that).
>
> Do these results match your benchmarks? Or are you seeing
> significantly different behavior?

I ran a performance test for "NTT / segments" under the same conditions
as my previous report [1] and added its results there. The updated
graph is attached to this mail. Note that some legends have been
renamed: "Mapped WAL file" is now "NTT / simple", and "Non-volatile WAL
buffer" is now "NTT / buffer". The graph shows that "NTT / segments"
performs as well as "NTT / buffer", which matches the results you
reported.
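As background for anyone skimming the thread: "NTT / segments" maps
each WAL segment file directly on a DAX filesystem, bypassing the
kernel page cache and write()/fsync(). A minimal sketch of what such a
mapping looks like on Linux (an illustration only, not the actual patch
code; the names here are mine):

    #define _GNU_SOURCE         /* MAP_SYNC, MAP_SHARED_VALIDATE;
                                 * older glibc may need
                                 * <linux/mman.h> instead */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define WAL_SEGMENT_SIZE (16 * 1024 * 1024) /* default 16MB */

    static char *
    map_wal_segment(const char *path)
    {
        int     fd;
        void   *addr;

        fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;

        /*
         * MAP_SYNC guarantees that CPU cache flushes alone make stores
         * to the mapping durable; MAP_SHARED_VALIDATE makes mmap fail
         * (rather than silently ignore MAP_SYNC) when the file is not
         * on a DAX-mounted filesystem.
         */
        addr = mmap(NULL, WAL_SEGMENT_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        close(fd);              /* the mapping survives the close */

        return (addr == MAP_FAILED) ? NULL : (char *) addr;
    }

With such a mapping, inserting WAL is a plain memory copy plus cache
flush instructions, so the mmap/munmap pair at each segment boundary is
the main extra cost, which is consistent with the small overhead seen
in the benchmarks above.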
> Do you have any thoughts regarding the impact of full-page writes? So
> far all the benchmarks we did focused on small OLTP transactions on
> data sets that fit into RAM. The assumption was that that's the
> workload that would benefit from this, but maybe that's missing
> something important about workloads producing much larger WAL records?
> Say, workloads working with large BLOBs, bulk loads etc.

I'd say that more work is needed for workloads producing a large amount
of WAL (a large number of records, a large size per record, or both).
Based on the case Gang reported, which I have tried to reproduce in
this thread [2][3], the current insert-and-flush method can be
unsuitable for such workloads. That case was for "NTT / buffer", but I
think it also applies to "NTT / segments".

> The other question is whether simply placing WAL on DAX (without any
> code changes) is safe. If it's not, then all the "speedups" are
> computed with respect to unsafe configuration and so are useless. And
> BTT should be used instead, which would of course produce very
> different results.

I think it's safe, thanks to the checksum in the header of each WAL
record (xl_crc in struct XLogRecord). In DAX mode, user data (here, a
WAL record) is written to the PMEM device in units smaller than the
traditional 512-byte disk sector (probably a byte or a cache line), so
a torn write, where some bytes of a sector persist and others do not,
can occur on a crash. AFAICS, however, the WAL record checksum also
covers such a torn-write case.
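To spell out why the checksum covers it (a simplified, self-contained
sketch of the idea; not the actual xlogreader code, and crc32c here
stands in for any CRC-32C implementation):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Simplified stand-in for the on-disk XLogRecord header. */
    typedef struct
    {
        uint32_t    xl_tot_len; /* total record length, header included */
        uint32_t    xl_crc;     /* checksum of the rest of the record */
        /* payload follows the header */
    } WalRecordHeader;

    extern uint32_t crc32c(const void *data, size_t len);

    static bool
    record_is_valid(const WalRecordHeader *rec)
    {
        const char *payload = (const char *) (rec + 1);
        size_t      len = rec->xl_tot_len - sizeof(WalRecordHeader);

        /*
         * If even one cache line of the record failed to persist
         * before the crash, the recomputed checksum will not match the
         * stored one, and replay treats this point as end-of-WAL.
         */
        return rec->xl_crc == crc32c(payload, len);
    }

Recovery therefore stops at the first record whose stored and
recomputed checksums disagree, just as it does for a torn sector write
on conventional storage.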
> > I also think that supporting both file I/O and mmap is better than
> > supporting only mmap. I will continue my work on the "NTT /
> > segments" patchset to support both ways.
>
> +1
>
> > In the following, I will answer the "Issues & Questions" you
> > reported.
> >
> >> While testing the "NTT / segments" patch, I repeatedly managed to
> >> crash the cluster with errors like this:
> >>
> >> 2021-02-28 00:07:21.221 PST client backend [3737139] WARNING:
> >> creating logfile segment just before mapping; path
> >> "pg_wal/00000001000000070000002F"
> >> 2021-02-28 00:07:21.670 PST client backend [3737142] WARNING:
> >> creating logfile segment just before mapping; path
> >> "pg_wal/000000010000000700000030"
> >> ...
> >> 2021-02-28 00:07:21.698 PST client backend [3737145] WARNING:
> >> creating logfile segment just before mapping; path
> >> "pg_wal/000000010000000700000030"
> >> 2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could
> >> not open file "pg_wal/000000010000000700000030": No such file or
> >> directory
> >>
> >> I do believe this is a thinko in the 0008 patch, which does
> >> XLogFileInit in XLogFileMap. Notice there are multiple "creating
> >> logfile" messages with the ..0030 segment, followed by the failure.
> >> AFAICS XLogFileMap may be called from multiple backends, so they
> >> may call XLogFileInit concurrently, likely triggering some sort of
> >> race condition. It's a fairly rare issue, though - I've only seen
> >> it twice in ~20 runs.
> >
> > Thank you for your report. I found that it is rather the 0009 patch
> > that has an issue, and that it can also cause WAL loss. I should
> > have set use_existent to true; otherwise, InstallXlogFileSegment and
> > BasicOpenFile in XLogFileInit can race. I had wrongly assumed that
> > use_existent could be false because I was creating a brand-new file
> > with XLogFileInit.
> >
> > I will fix the issue.
>
> OK, thanks for looking into this.
>
> >> The other question I have is about WALInsertLockUpdateInsertingAt.
> >> 0003 removes this function, but leaves behind some of the other
> >> bits working with insert locks and insertingAt. But it does not
> >> explain how it works without WaitXLogInsertionsToFinish() - how
> >> does it ensure that when we commit something, all the preceding WAL
> >> is "complete" (i.e. written by other backends etc.)?
> >
> > By waiting for *all* the WALInsertLocks to be released, no matter
> > whether each of them precedes or follows the current insertion.
> >
> > That would have worked functionally, but on reflection it is not
> > good for performance, because XLogFileMap in GetXLogBuffer (where
> > WaitXLogInsertionsToFinish was removed) can block, as it can
> > eventually call write() in XLogFileInit.
> >
> > I will restore the WALInsertLockUpdateInsertingAt function and the
> > related code for mmap.
>
> OK. I'm still not entirely sure I understand if the current version is
> correct, but I'll wait for the reworked version.
>
> kind regards

Best regards,
Takashi

[1] https://www.postgresql.org/message-id/caownp3ofofosftmeikqcbmp0ywdjn0kvb4ka_0tj+urq7dt...@mail.gmail.com
[2] https://www.postgresql.org/message-id/byapr11mb344801ff81e9c92a081d3e10e6...@byapr11mb3448.namprd11.prod.outlook.com
[3] https://www.postgresql.org/message-id/CAOwnP3NHAbVFOfAawZPs5ezn57_7fcX%3DKaaQ5YMgirc9rNrijQ%40mail.gmail.com

-- 
Takashi Menjo <takashi.me...@gmail.com>