Hi Luc, Happy New Year.

Looking at BufferAllocExtend() in v1-0002-WIP-buffer-alloc-specialized-for-relation-extensi.patch, it seems there is duplicate code with the existing BufferAlloc(). It would be good to do some refactoring here by extracting the common code into a helper function.
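Something along these lines, purely as an illustration (the helper name, signature, and exact split are my guesses, not a concrete proposal):

/*
 * Illustration only: hoist the logic shared by BufferAlloc() and
 * BufferAllocExtend() -- forming the buffer tag, probing the buffer
 * table, finding and evicting a victim buffer -- into one static
 * helper, so the two entry points differ only in lock management.
 */
static BufferDesc *
BufferAllocCommon(SMgrRelation smgr, char relpersistence,
                  ForkNumber forkNum, BlockNumber blockNum,
                  BufferAccessStrategy strategy, bool *foundPtr,
                  LWLock **heldPartitionLock);

BufferAlloc() would then call this with *heldPartitionLock set to NULL and release the partition lock before returning, while BufferAllocExtend() would pass in the lock still held from the previous block and keep holding it on return.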
Thanks.

On Fri, Jan 1, 2021 at 6:07 AM Luc Vlaming <l...@swarm64.com> wrote:
> Hi,
>
> In an effort to speed up bulk data loading/transforming, I noticed that
> considerable time is spent in the relation extension lock. I know there
> are already many other efforts to increase the chances of using bulk
> loading [1], [2], [3], [4], efforts to make loading more parallel [5],
> efforts to remove some syscalls [6], as well as completely new I/O
> systems [7] and some more extreme measures like disabling WAL logging
> [8].
>
> Whilst they will all help, they will ultimately be stopped by the
> relation extension lock. Also, from the tests I've done so far it seems
> that, at least for bulk loading, pwrite() can actually carry us rather
> far as long as we are not doing it under a global lock. Moreover, the
> solution provided here might be an alternative to [7], because the
> results are quite promising even with WAL enabled.
>
> Attached are two WIP patches in the hope of getting feedback.
> The first changes the way we do bulk loading: each backend now gets a
> standalone set of blocks allocated that are local to that backend. This
> helps both with reducing contention and with some OS writeback
> mechanisms.
> The second patch reduces the time spent locking the buffer partitions
> by shifting the logic around so that each set of 128 blocks uses the
> same buffer partition (see sketch below), and by adding a custom
> function to get buffer blocks specifically for extension whilst keeping
> the previous partition lock, thereby reducing the amount of time we
> spend on futexes.
>
> The design:
> - Add a set of buffers into the BulkInsertState that can be used by the
>   backend without any locks (see sketch below).
> - Add a ReadBufferExtendBulk() function which extends the relation by
>   BULK_INSERT_BATCH_SIZE blocks at once.
> - Rework FileWrite() to take a parameter indicating that we are writing
>   only to extend the file, to speed up relation extension. If
>   supported, it uses ftruncate(), as this is much faster (also faster
>   than posix_fallocate() on my system) and, according to the man page
>   (https://linux.die.net/man/2/ftruncate), the extended space reads as
>   zeros (see sketch below). To be cleaned up later, possibly into a
>   dedicated FileExtend() function.
> - Rework mdextend() to take a page count.
> - Make a specialized version of BufferAlloc() called
>   BufferAllocExtend() which keeps the lock on the last buffer partition
>   and tries to reuse it, so that there are far fewer futex calls (see
>   sketch below).
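To make sure I read the first two design points correctly, here is a minimal sketch of how the backend-local block set could hang off the BulkInsertState; the bulk_* fields and GetBulkBuffer() are my own guesses, not taken from the patches (only strategy and current_buf exist in HEAD today):

typedef struct BulkInsertStateData
{
    BufferAccessStrategy strategy;
    Buffer      current_buf;    /* current insertion target */

    /* hypothetical additions for backend-local, lock-free blocks */
    Buffer      bulk_bufs[BULK_INSERT_BATCH_SIZE];
    int         bulk_next;      /* next unused slot in bulk_bufs */
    int         bulk_count;     /* number of filled slots */
} BulkInsertStateData;

static Buffer
GetBulkBuffer(Relation relation, BulkInsertState bistate)
{
    if (bistate->bulk_next >= bistate->bulk_count)
    {
        /* Batch exhausted: extend the relation by a whole batch at
         * once, refilling bulk_bufs without per-page locking.
         * ReadBufferExtendBulk() is from the patch; signature guessed. */
        ReadBufferExtendBulk(relation, bistate);
        bistate->bulk_next = 0;
    }
    return bistate->bulk_bufs[bistate->bulk_next++];
}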
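The ftruncate() point in the third design item is easy to demonstrate standalone. A minimal sketch of extend-via-ftruncate() with a pwrite() fallback; extend_file(), the file name, and the batch size are invented for this example and are not the patch's FileWrite() changes:

/*
 * Per POSIX, the region added by ftruncate() reads back as zeros, so
 * it can stand in for pwrite()ing zero-filled blocks.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192                 /* PostgreSQL's default block size */
#define EXTEND_BATCH 128            /* assumed batch size, for the demo */

static int
extend_file(int fd, off_t nblocks_old, int nblocks_add)
{
    off_t       new_size = (nblocks_old + nblocks_add) * (off_t) BLCKSZ;

    /* Fast path: one syscall, and the new space reads as zeros. */
    if (ftruncate(fd, new_size) == 0)
        return 0;

    /* Fallback: write zero-filled blocks one pwrite() at a time. */
    {
        static const char zeros[BLCKSZ];

        for (int i = 0; i < nblocks_add; i++)
        {
            off_t   off = (nblocks_old + i) * (off_t) BLCKSZ;

            if (pwrite(fd, zeros, BLCKSZ, off) != BLCKSZ)
                return -1;
        }
    }
    return 0;
}

int
main(void)
{
    int         fd = open("relfile.tmp", O_CREAT | O_RDWR | O_TRUNC, 0600);

    if (fd < 0 || extend_file(fd, 0, EXTEND_BATCH) != 0)
    {
        perror("extend_file");
        return EXIT_FAILURE;
    }
    close(fd);
    return EXIT_SUCCESS;
}

One caveat worth testing: on many filesystems ftruncate() just creates a hole (a sparse file), so the allocation cost may simply be deferred to the first real write of each block, which may be why posix_fallocate() is the more common choice despite being slower here.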
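For the 128-blocks-per-partition mapping mentioned for the second patch, one way it could work (a guess at the mechanism, not the patch's code) is to derive the mapping hash from a copy of the buffer tag with the low 7 bits of the block number cleared:

/*
 * Sketch, as if in buf_table.c: every run of 128 consecutive blocks
 * then shares a hash value, and hence a mapping partition; dynahash
 * still tells the entries apart by comparing the full BufferTag, so
 * lookups stay correct.
 */
#define BLOCKS_PER_PARTITION 128

static inline uint32
BufTableHashCodeBatched(const BufferTag *tagPtr)
{
    BufferTag   maskedTag = *tagPtr;

    maskedTag.blockNum &= ~((BlockNumber) (BLOCKS_PER_PARTITION - 1));
    return get_hash_value(SharedBufHash, (const void *) &maskedTag);
}

The trade-off would be more hash collisions within a partition's buckets, which seems acceptable for a 128-block window.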
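And the partition-lock reuse in BufferAllocExtend() could then look roughly like this (again a sketch with guessed variable names, not the patch):

/*
 * When extending block by block, keep the previously acquired
 * mapping-partition lock and only swap locks when the new block
 * hashes to a different partition.
 */
LWLock     *heldLock = NULL;

for (BlockNumber blockNum = firstBlock;
     blockNum < firstBlock + nblocks;
     blockNum++)
{
    BufferTag   newTag;
    uint32      newHash;
    LWLock     *newLock;

    INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
    newHash = BufTableHashCode(&newTag);
    newLock = BufMappingPartitionLock(newHash);

    if (newLock != heldLock)
    {
        if (heldLock != NULL)
            LWLockRelease(heldLock);
        LWLockAcquire(newLock, LW_EXCLUSIVE);
        heldLock = newLock;
    }

    /* ... enter the new block into the buffer table under heldLock ... */
}

if (heldLock != NULL)
    LWLockRelease(heldLock);

Combined with the 128-blocks-per-partition mapping, up to 128 consecutive extension blocks would need a single LWLockAcquire()/LWLockRelease() pair instead of 128, which is presumably where the reduction in futex calls comes from.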
> Many things are still to be done; some are:
> - Reintroduce the FSM, and possibly optimize the lock usage there. In
>   other words: this patch currently can only start the server and run
>   COPY FROM and read queries.
> - Look into the WAL logging. Whilst it seems to be fairly optimal
>   already wrt the locking and such, I noticed there seems to be an
>   alternating pattern between the bgwriter and the workers; setting
>   some parameters bigger (wal_writer_flush_after, wal_writer_delay,
>   wal_buffers) helped a lot.
> - Work nicely with the code from [6] so that register_dirty_segment()
>   is indeed not needed anymore; or alternatively, optimize that code
>   so that fewer locks are needed.
> - Make use of [9] in the fallback case in FileWrite(), when
>   ftruncate()/fallocate() is not available, so that the buffer size
>   can be reduced.
>
> First results are below; all tests loaded the same 1 GB lineitem CSV
> 32 times into the same table. Tests were done on an NVMe disk, and the
> more parallel ones also on a tmpfs disk, to see potential disk
> bottlenecks and, e.g., the potential wrt using NVDIMM.
>
> =================================
> Using an unlogged table:
> NVME, UNLOGGED table, 4 parallel streams: HEAD 171s, patched 113s
> NVME, UNLOGGED table, 8 parallel streams: HEAD 113s, patched 67s
> NVME, UNLOGGED table, 16 parallel streams: HEAD 112s, patched 42s
> NVME, UNLOGGED table, 32 parallel streams: HEAD 121s, patched 36s
> tmpfs, UNLOGGED table, 16 parallel streams: HEAD 96s, patched 38s
> tmpfs, UNLOGGED table, 32 parallel streams: HEAD 104s, patched 25s
> =================================
> Using a normal table, wal_level = minimal, 16 MB WAL segments:
> NVME, 4 parallel streams: HEAD 237s, patched 157s
> NVME, 8 parallel streams: HEAD 200s, patched 142s
> NVME, 16 parallel streams: HEAD 171s, patched 145s
> NVME, 32 parallel streams: HEAD 199s, patched 146s
> tmpfs, 16 parallel streams: HEAD 131s, patched 89s
> tmpfs, 32 parallel streams: HEAD 148s, patched 98s
> =================================
> Using a normal table, wal_level = minimal, 256 MB WAL segments,
> wal_init_zero = off, wal_buffers = 262143,
> wal_writer_delay = 10000ms, wal_writer_flush_after = 512MB:
> NVME, 4 parallel streams: HEAD 201s, patched 159s
> NVME, 8 parallel streams: HEAD 157s, patched 109s
> NVME, 16 parallel streams: HEAD 150s, patched 78s
> NVME, 32 parallel streams: HEAD 158s, patched 70s
> tmpfs, 16 parallel streams: HEAD 106s, patched 54s
> tmpfs, 32 parallel streams: HEAD 113s, patched 44s
> =================================
>
> Thoughts?
>
> Cheers,
> Luc
> Swarm64
>
> [1] https://www.postgresql.org/message-id/flat/CAJcOf-f%3DUX1uKbPjDXf%2B8gJOoEPz9VCzh7pKnknfU4sG4LXj0A%40mail.gmail.com#49fe9f2ffcc9916cc5ed3a712aa5f28f
> [2] https://www.postgresql.org/message-id/flat/CALj2ACWjymmyTvvhU20Er-LPLaBjzBQxMJwr4nzO7XWmOkxhsg%40mail.gmail.com#be34b5b7861876fc0fd7edb621c067fa
> [3] https://www.postgresql.org/message-id/flat/CALj2ACXg-4hNKJC6nFnepRHYT4t5jJVstYvri%2BtKQHy7ydcr8A%40mail.gmail.com
> [4] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40mail.gmail.com
> [5] https://www.postgresql.org/message-id/flat/CALDaNm3GaZyYPpGu-PpF0SEkJg-eaW3TboHxpxJ-2criv2j_eA%40mail.gmail.com#07292ce654ef58fae7f257a4e36afc41
> [6] https://www.postgresql.org/message-id/flat/20200203132319.x7my43whtefeznz7%40alap3.anarazel.de#85a2a0ab915cdf079862d70505abe3db
> [7] https://www.postgresql.org/message-id/flat/20201208040227.7rlzpfpdxoau4pvd%40alap3.anarazel.de#b8ea4a3b7f37e88ddfe121c4b3075e7b
> [8] https://www.postgresql.org/message-id/flat/CAD21AoA9oK1VOoUuKW-jEO%3DY2nt5kCQKKFgeQwwRUMoh6BE-ow%40mail.gmail.com#0475248a5ff7aed735be41fd4034ae36
> [9] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BHf_R_ih1pkBMTWn%3DSTyKMOM2Ks47Y_UqqfU1wRc1VvA%40mail.gmail.com#7a53ad72331e423ba3c6a50e6dc1259f