Yes, we recommend this as a precaution to get the best possible IO performance across all workloads and usage scenarios. 512e brings no advantage and in some cases can even be a performance disadvantage. By the way, 4kN and 512e cost exactly the same at our dealers.
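
As a quick way to check what a drive currently reports, a sketch (assuming a Linux host and a drive at /dev/sdX; the device name is just a placeholder). 512e drives report 512-byte logical / 4096-byte physical sectors, 4kN drives report 4096 bytes for both:

    # logical vs. physical sector size for all block devices
    lsblk -o NAME,PHY-SEC,LOG-SEC
    # the same for a single drive (prints logical, then physical, size)
    blockdev --getss --getpbsz /dev/sdX
    # smartctl also prints a "Sector Sizes:" line for SATA/SAS drives
    smartctl -i /dev/sdX
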
Whether this really makes a difference in the individual case of virtual disks backed by the underlying physical disks, I can't say.

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:54, Trent Lloyd <trent.ll...@canonical.com> wrote:

> Note that the issue I am talking about here is how a "virtual" Ceph RBD disk is presented to a virtual guest, and specifically for Windows guests (Linux guests are not affected). I am not at all talking about how the physical disks are presented to Ceph itself (although Martin was, and he wasn't clear whether changing these underlying physical disks to 4kN was for Ceph or for other environments).
>
> I would not expect having your underlying physical disk presented to Ceph itself as 512b/512e or 4kN to have a significant impact on performance, for the reason that Linux systems generally send 4k-aligned I/O anyway (regardless of what the underlying disk reports for physical_block_size). There may be some exceptions to that, such as applications performing Direct I/O to the disk. If anyone knows otherwise, it would be great to hear specific details.
>
> Regards,
> Trent
>
> On Fri, May 10, 2019 at 4:40 PM Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>
>> Hmmm, so if I have (WD) drives that list this in smartctl output, I should try to reformat them to 4k, which will give me better performance?
>>
>> Sector Sizes: 512 bytes logical, 4096 bytes physical
>>
>> Do you have a link to this download? I can only find some .cz site with the rpms.
>>
>>
>> -----Original Message-----
>> From: Martin Verges [mailto:martin.ver...@croit.io]
>> Sent: Friday, 10 May 2019 10:21
>> To: Trent Lloyd
>> Cc: ceph-users
>> Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix
>>
>> Hello Trent,
>>
>> many thanks for the insights. We always suggest 4kN over 512e HDDs to our users.
>>
>> As we recently found out, WD Support offers a tool called HUGO to reformat 512e to 4kN drives with "hugo format -m <model_number> -n max --fastformat -b 4096" in seconds.
>> Maybe that helps someone who has bought the wrong disk.
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.ver...@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>>
>> On Fri, 10 May 2019 at 10:00, Trent Lloyd <trent.ll...@canonical.com> wrote:
>>
>> I recently was investigating a performance problem for a reasonably sized OpenStack deployment having around 220 OSDs (3.5" 7200 RPM SAS HDD) with NVMe journals. The primary workload is Windows guests backed by Cinder RBD volumes.
>> This specific deployment is Ceph Jewel (FileStore + SimpleMessenger), which while EOL, the issue is reproducible on current versions and also on BlueStore, although for different reasons than FileStore.
>> Generally the Ceph cluster was suffering from very poor outlier performance. The numbers change a little depending on the exact situation, but roughly 80% of I/O was happening in a "reasonable" time of 0-200ms, while 5-20% of I/O operations were taking excessively long, anywhere from 500ms through to 10-20+ seconds. However, the normal metrics for commit and apply latency looked fine, and in fact this latency was hard to spot in the performance metrics available in Jewel.
>>
>> Previously I had more simply considered FileStore to have the "commit" (to journal) stage, where the write lands in the journal and it is OK to return to the client, and then the "apply" (to disk) stage, where it is flushed to disk and confirmed so that the data can be purged from the journal. However, there is really a third stage in the middle, where FileStore submits the I/O to the operating system, and this is done before the lock on the object is released. Until that succeeds, another operation cannot write to the same object (generally a 4MB area of the disk).
>>
>> I found that the fstore_op threads would get stuck for hundreds of ms or more inside of pwritev(), which was blocking inside the kernel. Normally we expect pwritev() to be buffered I/O into the page cache and to return quite fast, however in this case the kernel was, in a few percent of cases, blocking with the stack trace included at the end of the e-mail [1]. My finding from that stack is that inside __block_write_begin_int we see a call to out_of_line_wait_on_bit, which is really an inlined call to wait_on_buffer and occurs in linux/fs/buffer.c in the section around lines 2000-2024 with the comment "If we issued read requests - let them complete." (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)
>>
>> My interpretation of that code is that for Linux to store a write in the page cache, it has to have the entire 4K page, as that is the granularity at which it tracks the dirty state and it needs the entire 4K page to later submit back to the disk. Since we wrote only part of the page, and the page wasn't already in the cache, it has to fetch the remainder of the page from the disk. When this happens, it blocks waiting for this read to complete before returning from the pwritev() call - hence our normally buffered write blocks. This holds up the tp_fstore_op thread, of which there are (by default) only 2-4 such threads trying to process several hundred operations per second. Additionally, the size of the osd_op_queue is bounded, and operations do not clear out of this queue until the tp_fstore_op thread is done, which ultimately means that not only are these partial writes delayed, but they knock on to delay other writes behind them because of the constrained thread pools.
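
A rough way to see this read-modify-write effect outside of Ceph is a small experiment on an otherwise idle XFS (or ext4) filesystem. A sketch only; /mnt/test and /dev/sdb are placeholder paths, and the exact behaviour depends on kernel version and filesystem:

    # create a test file and make sure it is fully written out to disk
    dd if=/dev/urandom of=/mnt/test/file bs=1M count=1 conv=fsync
    # evict it from the page cache
    sync && echo 3 > /proc/sys/vm/drop_caches
    # now do a 512-byte buffered write into the middle of a 4K page;
    # watch "iostat -x 1 /dev/sdb" in another terminal - the write only
    # returns once the surrounding page has been read back in
    time dd if=/dev/zero of=/mnt/test/file bs=512 count=1 seek=9 conv=notrunc
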
>> What was further confusing is that I could easily reproduce this in a test deployment using an rbd benchmark that was only writing to a total disk size of 256MB, which I would easily have expected to fit in the page cache:
>>
>> rbd create -p rbd --size=256M bench2
>> rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M --io-pattern rand
>>
>> This is explained by the fact that on secondary OSDs (at least - there was some refactoring of fadvise which I have not fully understood as of yet), FileStore is using fadvise FADVISE_DONTNEED on the objects after write, which causes the kernel to immediately discard them from the page cache without any regard to their statistics of being recently/frequently used. The motivation for this addition appears to be that on a secondary OSD we don't service reads (only writes), and so therefore we can optimize memory usage by throwing away this object, in theory leaving more room in the page cache for objects which we are primary for and expect to actually service client reads from. Unfortunately this behavior does not take into account partial writes, where we now pathologically throw away the cached copy instantly, such that a write even 1 second later will have to fetch the page from disk again. I also found that this FADVISE_DONTNEED is issued not only during filestore sync but also by the WBThrottle - which, as this cluster was quite busy, was constantly flushing writes, leading to the cache being discarded almost instantly.
>>
>> Changing filestore_fadvise to False on this cluster led to a significant performance increase, as it could now cache the pages in memory in many cases. The number of reads from disk was reduced from around 40/second to 2/second, and the number of slow write operations (>200ms) was reduced by 75%.
>>
>> I wrote a script to parse ceph-osd logs with debug_filestore=10 or 15 to report the time spent inside of write(), as well as to count and report on the number of operations that are unaligned and also slow. It's a bit rough, but you can find it here: https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
>>
>> It does not solve the problem entirely, in that a filestore thread can still be blocked in the case where the page is not cached - but at least the pathological case of never having it in the cache is removed.
>>
>> Understanding this problem, I looked at the situation for BlueStore. BlueStore suffers from a similar issue in that the performance is quite poor, due both to fadvise and to the fact that it is check-summing the data in 4k blocks and so needs to read the rest of the block in, despite not having the limitations of the Linux page cache to deal with. I have not yet fully investigated the BlueStore implementation, other than to note the following doc talking about how such writes are handled and a possible future improvement to submit partial writes into the WAL before reading the rest of the block, which is apparently not done currently (and would be a great optimization): http://docs.ceph.com/docs/mimic/dev/bluestore/
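
For reference, a sketch of how the filestore_fadvise change described above can be applied (option name as it exists in Jewel-era FileStore; verify the name and default on your own version, and note that it trades page-cache memory for fewer re-reads):

    # ceph.conf on the OSD hosts, takes effect after an OSD restart
    [osd]
    filestore_fadvise = false

    # or injected at runtime without restarting the OSDs
    ceph tell osd.* injectargs '--filestore_fadvise=false'
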
>> Moving on to a full solution for this issue: we can tell Windows guests to send 4k-aligned I/O where possible by setting the physical_block_size hint on the disk. This support was added mainly for the incoming new series of hard drives which also have 4k blocks internally and need to do a similar read-modify-update operation when a smaller write is done. In this case Windows tries to align the I/O to 4k as much as possible; at the most basic level, for example, when a new file is created it will pad out the write to the nearest 4k block. You can read more about support for that here: https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
>>
>> On a basic test, booting a Windows 2016 instance and then installing several months of Windows Updates, the number of partial writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 / 2880217) - many of which were during early boot and don't recur once the VM is running.
>>
>> I have submitted a patch to the OpenStack Cinder RBD driver to support setting this parameter. You can find that here: https://review.opendev.org/#/c/658283/
>>
>> I did not have much luck finding information about any of this online when I searched, so this e-mail serves largely to document my findings for others. But I am also looking for input from anyone on anything I have missed, confirmation that my analysis is sound, review of my Cinder patch, etc.
>>
>> There is also likely scope to make the same change - reporting physical_block_size=4096 - for other Ceph consumers such as the new(ish) iSCSI gateway, etc.
>>
>> Regards,
>> Trent
>>
>> [1] fstore_op pwritev blocking stack trace - if anyone is interested in the perf data, flamegraph, etc., I'd be happy to share.
>>
>> tp_fstore_op
>> ceph::buffer::list::write_fd
>> pwritev64
>> entry_SYSCALL_64_after_hwframe
>> do_syscall_64
>> sys_pwritev
>> do_pwritev
>> vfs_writev
>> do_iter_write
>> do_iter_readv_writev
>> xfs_file_write_iter
>> xfs_file_buffered_aio_write
>> iomap_file_buffered_write
>> iomap_apply
>> iomap_write_actor
>> iomap_write_begin.constprop.18
>> __block_write_begin_int
>> out_of_line_wait_on_bit
>> __wait_on_bit
>> bit_wait_io
>> io_schedule
>> schedule
>> __schedule
>> finish_task_switch
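
Coming back to the physical_block_size hint discussed above: for anyone who wants to try this by hand outside of the Cinder patch, libvirt exposes the hint via a <blockio> element on the guest disk. A minimal sketch for an RBD-backed virtio disk follows; the monitor host, volume name and target device are placeholders, and the XML that Nova generates from the patch may differ in detail:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source protocol='rbd' name='volumes/volume-1234'>
        <host name='ceph-mon1' port='6789'/>
      </source>
      <!-- guest keeps 512-byte logical sectors but is hinted to align I/O to 4K -->
      <blockio logical_block_size='512' physical_block_size='4096'/>
      <target dev='vdb' bus='virtio'/>
    </disk>

Inside the guest, the reported physical sector size can be checked with "fsutil fsinfo ntfsinfo C:".
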
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com