On Mon, 7 Nov 2022 at 17:00, Jonathan Wakely wrote:
>
> On Thu, 6 Oct 2022 at 20:03, Charles-Francois Natali via Libstdc++ <libstd...@gcc.gnu.org> wrote:
> >
> > `basic_filebuf::xsputn` would bypass the buffer when passed a chunk of size 1024 and above, seemingly as an optimisation.
> >
> > This can have a significant performance impact if the overhead of a `write` syscall is non-negligible, e.g. on a slow disk, on network filesystems, or simply during IO contention, because instead of flushing every `BUFSIZ` chars (by default), we can flush every 1024 chars. The impact is even greater with custom, larger buffers, e.g. for network filesystems, because the code could issue `write` for example 1000x more often than necessary with respect to the buffer size. It also introduces a significant discontinuity in performance when writing chunks of size 1024 and above.
> >
> > See this reproducer, which writes a fixed number of chunks to a file opened with `O_SYNC` - to replicate a high-latency `write` - for varying chunk sizes:
> >
> > ```
> > $ cat test_fstream_flush.cpp
> > #include <cassert>
> > #include <ostream>
> > #include <string>
> > #include <vector>
> > #include <ext/stdio_filebuf.h>
> > #include <fcntl.h>
> >
> > int
> > main(int argc, char* argv[])
> > {
> >   assert(argc == 3);
> >
> >   const auto* path = argv[1];
> >   const auto chunk_size = std::stoul(argv[2]);
> >
> >   // O_SYNC forces every write to hit the disk, replicating a high-latency write().
> >   const auto fd =
> >     open(path, O_CREAT | O_TRUNC | O_WRONLY | O_SYNC | O_CLOEXEC, 0666);
> >   assert(fd >= 0);
> >
> >   auto filebuf = __gnu_cxx::stdio_filebuf<char>(fd, std::ios_base::out);
> >   auto stream = std::ostream(&filebuf);
> >
> >   const auto chunk = std::vector<char>(chunk_size);
> >
> >   for (auto i = 0; i < 1'000; ++i) {
> >     stream.write(chunk.data(), chunk.size());
> >   }
> >
> >   return 0;
> > }
> > ```
> >
> > ```
> > $ g++ -o /tmp/test_fstream_flush test_fstream_flush.cpp -std=c++17
> > $ for i in $(seq 1021 1025); do echo -e "\n$i"; time /tmp/test_fstream_flush /tmp/foo $i; done
> >
> > 1021
> >
> > real    0m0.997s
> > user    0m0.000s
> > sys     0m0.038s
> >
> > 1022
> >
> > real    0m0.939s
> > user    0m0.005s
> > sys     0m0.032s
> >
> > 1023
> >
> > real    0m0.954s
> > user    0m0.005s
> > sys     0m0.034s
> >
> > 1024
> >
> > real    0m7.102s
> > user    0m0.040s
> > sys     0m0.192s
> >
> > 1025
> >
> > real    0m7.204s
> > user    0m0.025s
> > sys     0m0.209s
> > ```
> >
> > See the huge drop in performance at the 1024-byte boundary.
>
> I've finally found time to properly look at this, sorry for the delay.
>
> I thought I was unable to reproduce these numbers, then I realised I'd already installed a build with the patch, so I was measuring the patched performance for both my "before" and "after" tests. Oops!
>
> My concern is that the patch doesn't only affect files on remote filesystems. I assume the original 1024-byte chunking behaviour is there for a reason: for large writes the performance might be better if we just write directly instead of buffering and then writing again. Assuming we have a fast disk, writing straight to disk avoids copying in and out of the buffer. But if we have a slow disk, it's better to buffer and reduce the total number of disk writes. I'm concerned that the patch optimizes the slow-disk case potentially at a cost for the fast-disk case.
>
> I wonder whether it would make sense to check whether the buffer size has been manually changed, i.e. epptr() - pbase() != _M_buf_size. If the buffer has been explicitly set by the user, then we should assume they really want it to be used, and so not bypass it for writes >= 1024.
>
> In the absence of a better idea, I think I'm going to commit the patch as-is. I don't see it causing any measurable slowdown for very large writes on fast disks, and it's certainly a huge improvement for slow disks.
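To make the check suggested in the quoted message above a bit more concrete, here is a rough, standalone sketch of such a gate. This is not the actual basic_filebuf::xsputn code and not what was committed; the names (should_bypass_buffer, buf_capacity, default_buf_size) are hypothetical and only illustrate the heuristic of "keep buffering when the user installed their own buffer":

```c++
#include <ios>       // std::streamsize
#include <iostream>

// Hypothetical illustration only: decide whether a write of n chars should
// skip the stream buffer and go straight to write().
bool should_bypass_buffer(std::streamsize n,
                          std::streamsize buf_capacity,     // roughly epptr() - pbase()
                          std::streamsize default_buf_size) // the filebuf's default size
{
  // If the user explicitly set a (typically larger) buffer, honour it:
  // keep buffering so writes are coalesced up to the full buffer size.
  if (buf_capacity != default_buf_size)
    return false;

  // Otherwise keep the historical behaviour: chunks of 1024 chars or more
  // bypass the buffer and are written out directly.
  return n >= 1024;
}

int main()
{
  std::cout << should_bypass_buffer(4096, 8192, 8192) << '\n';    // default buffer -> bypass
  std::cout << should_bypass_buffer(4096, 1 << 20, 8192) << '\n'; // custom buffer  -> keep buffering
}
```

The only point of the sketch is the distinction between "user asked for a specific buffer" and "default buffer"; everything else about where such a check would sit in the real code is left open, as in the discussion above.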
The patch has been pushed to trunk now, thanks for the contribution. I removed the testcase and results from the commit message as they don't need to be in the git log. I added a link to your email into bugzilla though, so we can still find it easily.
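For anyone who wants to exercise the custom-buffer scenario mentioned earlier in the thread, a minimal user-side sketch follows. The path and buffer size are arbitrary examples, and note that with libstdc++ pubsetbuf generally has to be called before the file is opened for the buffer to take effect:

```c++
#include <fstream>
#include <vector>

int main()
{
  // Install a 1 MiB buffer; with libstdc++ this must happen before the
  // file is opened, otherwise the call has no effect.
  std::vector<char> buf(1 << 20);
  std::ofstream out;
  out.rdbuf()->pubsetbuf(buf.data(), static_cast<std::streamsize>(buf.size()));
  out.open("/tmp/foo");   // arbitrary example path

  // With the patch, writes smaller than the buffer should now be coalesced
  // into it instead of triggering a write() every 1024 chars.
  std::vector<char> chunk(4096);
  for (int i = 0; i < 1000; ++i)
    out.write(chunk.data(), static_cast<std::streamsize>(chunk.size()));
}
```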