Hello Trent, many thanks for the insights. We always suggest that our users choose 4kN rather than 512e HDDs.
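If you are unsure what a given drive currently reports, the sector format is easy to check before (re)formatting; for example (the device name below is just a placeholder):

lsblk -o NAME,MODEL,LOG-SEC,PHY-SEC
smartctl -i /dev/sda | grep -i 'sector size'

A 512e drive shows 512 bytes logical / 4096 bytes physical, a native 4kN drive 4096 / 4096.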
As we recently found out, WD Support offers a tool called HUGO that can reformat 512e drives to 4kN with "hugo format -m <model_number> -n max --fastformat -b 4096" in seconds. Maybe that helps someone who has bought the wrong disk.

--
Martin Verges
Managing director
Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:00, Trent Lloyd <trent.ll...@canonical.com> wrote:

> I was recently investigating a performance problem for a reasonably sized OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS HDDs) and NVMe journals. The primary workload is Windows guests backed by Cinder RBD volumes. This specific deployment is Ceph Jewel (FileStore + SimpleMessenger); while that release is EOL, the issue is reproducible on current versions and also on BlueStore, although for different reasons than FileStore.
>
> Generally the cluster was suffering from very poor outlier performance. The numbers change a little depending on the exact situation, but roughly 80% of I/O completed in a "reasonable" time of 0-200ms, while 5-20% of operations took excessively long - anywhere from 500ms through to 10-20+ seconds. The commit and apply latency metrics nevertheless looked normal, and in fact this latency was hard to spot at all in the performance metrics available in Jewel.
>
> Previously I had, somewhat simplistically, considered FileStore to have a "commit" (to journal) stage, where the data is written to the journal and it is OK to return to the client, and an "apply" (to disk) stage, where it is flushed to disk and confirmed so that the data can be purged from the journal. However, there is really a third stage in the middle where FileStore submits the I/O to the operating system, and this is done before the lock on the object is released. Until that submission succeeds, no other operation can write to the same object (generally a 4MB area of the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of ms or more inside pwritev(), blocking inside the kernel. Normally we expect pwritev() to be a buffered write into the page cache that returns quite fast, but in this case the kernel was, in a few percent of cases, blocking with the stack trace included at the end of this e-mail [1]. My reading of that stack is that inside __block_write_begin_int there is a call to out_of_line_wait_on_bit, which is really an inlined call to wait_on_buffer; it occurs in linux/fs/buffer.c around lines 2000-2024, under the comment "If we issued read requests - let them complete." (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)
>
> My interpretation of that code is that for Linux to store a write in the page cache, it has to have the entire 4K page, as that is the granularity at which it tracks dirty state and it needs the whole page to later submit back to the disk. Since we wrote only part of the page, and the page wasn't already in the cache, the kernel has to fetch the remainder of the page from the disk. When this happens, it blocks waiting for that read to complete before returning from the pwritev() call - hence our normally buffered write blocks.
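> The same read-modify-write stall is easy to see outside of Ceph. A rough sketch (the file name, sizes and offset are just examples; the file should live on a normal disk-backed filesystem, and dropping caches needs root):
>
> # create a test file and make sure it is on disk
> dd if=/dev/urandom of=./pagetest bs=1M count=16
> sync
> # drop the page cache so the file's pages are cold
> echo 1 > /proc/sys/vm/drop_caches
> # a single 512-byte buffered write into the middle of a 4K page: the kernel
> # must first read the rest of that page back from disk before write() returns
> time dd if=/dev/zero of=./pagetest bs=512 count=1 seek=9 conv=notrunc
>
> With a cold cache that small dd typically takes several milliseconds on a spinning disk instead of returning near-instantly, which matches the pwritev() behavior described above.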
> This holds up the tp_fstore_op thread, of which there are (by default) only 2-4 trying to process several hundred operations per second. Additionally, the size of the osd_op_queue is bounded, and operations do not clear out of this queue until the tp_fstore_op thread is done, which ultimately means that not only are these partial writes delayed, but they also knock on to delay other writes queued behind them because of the constrained thread pools.
>
> What was further confusing is that I could easily reproduce this in a test deployment using an rbd benchmark that was only writing to a total disk size of 256MB, which I would have expected to easily fit in the page cache:
>
> rbd create -p rbd --size=256M bench2
> rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M --io-pattern rand
>
> This is explained by the fact that on secondary OSDs (at least - there was some refactoring of fadvise which I have not yet fully understood), FileStore issues fadvise(FADV_DONTNEED) on the objects after write, which causes the kernel to immediately discard them from the page cache without any regard to how recently or frequently they were used. The motivation for this appears to be that a secondary OSD services only writes (not reads), so we can optimize memory usage by throwing away this object and, in theory, leave more room in the page cache for objects we are primary for and expect to actually service client reads from. Unfortunately this behavior does not take partial writes into account: we now pathologically throw away the cached copy instantly, so that a write even 1 second later has to fetch the page from disk again. I also found that this FADV_DONTNEED is issued not only during filestore sync but also by the WBThrottle - and as this cluster was quite busy, it was constantly flushing writes, leading to the cache being discarded almost instantly.
>
> Changing filestore_fadvise to False on this cluster led to a significant performance increase, as the pages could now be cached in memory in many cases (a config sketch follows further below). The number of reads from disk was reduced from around 40/second to 2/second, and the number of slow (>200ms) write operations was reduced by 75%.
>
> I wrote a script to parse ceph-osd logs with debug_filestore=10 or 15 to report the time spent inside write() as well as to count and report the operations that are unaligned and/or slow. It's a bit rough, but you can find it here: https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
>
> This does not solve the problem entirely - a filestore thread can still be blocked when the page is not cached - but at least the pathological case of the page never being in the cache is removed. Understanding this problem, I looked at the situation for BlueStore. BlueStore suffers from a similar issue in that performance is quite poor due both to fadvise and to the fact that it checksums data in 4k blocks and so needs to read the rest of the block in, despite not having the limitations of the Linux page cache to deal with.
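> Coming back to the filestore_fadvise change mentioned above: for anyone who wants to try the same thing, something along these lines should work (I would still plan on restarting, or at least closely monitoring, the OSDs afterwards):
>
> # persistent, in ceph.conf on the OSD hosts (FileStore OSDs only)
> [osd]
> filestore fadvise = false
>
> # or injected at runtime
> ceph tell osd.* injectargs '--filestore_fadvise=false'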
> I have not yet investigated the BlueStore implementation much further, other than to note the following doc, which describes how such writes are handled and a possible future improvement to submit partial writes into the WAL before reading the rest of the block - apparently not done currently, and it would be a great optimization: http://docs.ceph.com/docs/mimic/dev/bluestore/
>
> Moving on to a full solution for this issue: we can tell Windows guests to send 4k-aligned I/O where possible by setting the physical_block_size hint on the disk. This support was added mainly for the incoming new series of hard drives which also have 4k blocks internally and need to do a similar read-modify-write operation when a smaller write is done. With this hint, Windows tries to align I/O to 4k as much as possible; at the most basic level, for example, when a new file is created it pads the write out to the nearest 4k block. You can read more about that support here:
> https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
>
> In a basic test - booting a Windows 2016 instance and then installing several months of Windows Updates - the number of partial writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 / 2880217), and many of the remaining ones occurred during early boot and don't re-occur once the VM is running.
>
> I have submitted a patch to the OpenStack Cinder RBD driver to support setting this parameter. You can find it here: https://review.opendev.org/#/c/658283/
>
> I did not have much luck finding information about any of this online when I searched, so this e-mail serves largely to document my findings for others. But I am also looking for input from anyone on anything I may have missed, confirmation that my analysis is sound, review of my Cinder patch, etc.
>
> There is also likely scope to make the same change so that other Ceph consumers, such as the new(ish) iSCSI gateway, report physical_block_size=4096 as well.
>
> Regards,
> Trent
>
> [1] fstore_op pwritev blocking stack trace - if anyone is interested in the perf data, flamegraph, etc., I'd be happy to share.
>
> tp_fstore_op
> ceph::buffer::list::write_fd
> pwritev64
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_pwritev
> do_pwritev
> vfs_writev
> do_iter_write
> do_iter_readv_writev
> xfs_file_write_iter
> xfs_file_buffered_aio_write
> iomap_file_buffered_write
> iomap_apply
> iomap_write_actor
> iomap_write_begin.constprop.18
> __block_write_begin_int
> out_of_line_wait_on_bit
> __wait_on_bit
> bit_wait_io
> io_schedule
> schedule
> __schedule
> finish_task_switch
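> For anyone who wants to capture similar stacks on their own cluster, roughly the following is enough (the pid/tid values are of course placeholders, and exact flags may vary); the in-kernel stack of a blocked tp_fstore_op thread can also be read directly:
>
> # sample the OSD process with call graphs for 30 seconds
> perf record -g -p <osd-pid> -- sleep 30
> perf report
>
> # or dump the kernel stack of a specific blocked thread (as root)
> cat /proc/<osd-pid>/task/<tp_fstore_op-tid>/stack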
_______________________________________________ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com