Yes, we recommend this as a precaution to get the best possible IO
performance for all workloads and usage scenarios. 512e brings no
advantage and in some cases means a performance disadvantage. By the
way, 4kN and 512e drives cost exactly the same at our dealers.

Whether the underlying physical disks really make a difference for
virtual disks in any individual case, I can't say.
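
If you want to check what a drive currently reports, smartctl shows it
(a quick sketch, assuming smartmontools is installed; the exact output
formatting varies by drive):

    smartctl -i /dev/sdX | grep -i 'sector size'
    # 512e: Sector Sizes:     512 bytes logical, 4096 bytes physical
    # 4kN:  Sector Size:      4096 bytes logical/physical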

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:54, Trent Lloyd <
trent.ll...@canonical.com> wrote:

> Note that the issue I am talking about here is how a "Virtual" Ceph RBD
> disk is presented to a virtual guest, and specifically for Windows guests
> (Linux guests are not affected). I am not at all talking about how the
> physical disks are presented to Ceph itself (although Martin was; it
> wasn't clear whether his suggestion to change the underlying physical
> disks to 4kN was for Ceph or for other environments).
>
> I would not expect having your underlying physical disks presented to
> Ceph itself as 512b/512e or 4kN to have a significant impact on
> performance, for the reason that Linux systems generally send 4k-aligned
> I/O anyway (regardless of what the underlying disk reports for
> physical_block_size). There may be some exceptions to that, such as
> applications performing Direct I/O to the disk. If anyone knows
> otherwise, it would be great to hear specific details.
>
> Regards,
> Trent
>
> On Fri, May 10, 2019 at 4:40 PM Marc Roos <m.r...@f1-outsourcing.eu>
> wrote:
>
>>
>> Hmmm, so if I have (WD) drives that list this in smartctl output, I
>> should try to reformat them to 4k, which will give me better
>> performance?
>>
>> Sector Sizes:     512 bytes logical, 4096 bytes physical
>>
>> Do you have a link to this download? I can only find some .cz site with
>> the RPMs.
>>
>>
>> -----Original Message-----
>> From: Martin Verges [mailto:martin.ver...@croit.io]
>> Sent: Friday, 10 May 2019 10:21
>> To: Trent Lloyd
>> Cc: ceph-users
>> Subject: Re: [ceph-users] Poor performance for 512b aligned "partial"
>> writes from Windows guests in OpenStack + potential fix
>>
>> Hello Trent,
>>
>> many thanks for the insights. We always suggest our users use 4kN
>> rather than 512e HDDs.
>>
>> As we recently found out, WD Support offers a tool called HUGO that
>> reformats 512e drives to 4kN in seconds with "hugo format -m
>> <model_number> -n max --fastformat -b 4096".
>> Maybe that helps someone who has bought the wrong disk.
>>
>> --
>> Martin Verges
>> Managing director
>>
>> Mobile: +49 174 9335695
>> E-Mail: martin.ver...@croit.io
>> Chat: https://t.me/MartinVerges
>>
>> croit GmbH, Freseniusstr. 31h, 81247 Munich
>> CEO: Martin Verges - VAT-ID: DE310638492
>> Com. register: Amtsgericht Munich HRB 231263
>>
>> Web: https://croit.io
>> YouTube: https://goo.gl/PGE1Bx
>>
>>
>>
>> On Fri, 10 May 2019 at 10:00, Trent Lloyd
>> <trent.ll...@canonical.com> wrote:
>>
>>
>>         I was recently investigating a performance problem for a
>> reasonably sized OpenStack deployment having around 220 OSDs (3.5" 7200
>> RPM SAS HDDs) with NVMe journals. The primary workload is Windows guests
>> backed by Cinder RBD volumes.
>>         This specific deployment is Ceph Jewel (FileStore +
>> SimpleMessenger), which is EOL, but the issue is reproducible on current
>> versions and also on BlueStore, although for different reasons than on
>> FileStore.
>>
>>
>>         Generally the Ceph cluster was suffering from very poor outlier
>> performance. The numbers change a little depending on the exact
>> situation, but roughly 80% of I/O happened in a "reasonable" time of
>> 0-200ms, while 5-20% of I/O operations took excessively long, anywhere
>> from 500ms through to 10-20+ seconds. However the usual commit and apply
>> latency metrics looked normal, and in fact this latency was hard to spot
>> in the performance metrics available in Jewel.
>>
>>         Previously I had a simpler model of FileStore: a "commit" (to
>> journal) stage, where the write lands in the journal and it is OK to
>> return to the client, and an "apply" (to disk) stage, where it is
>> flushed to disk and confirmed so that the data can be purged from the
>> journal. However there is really a third stage in the middle, where
>> FileStore submits the I/O to the operating system, and this is done
>> before the lock on the object is released. Until that succeeds, another
>> operation cannot write to the same object (generally a 4MB area of the
>> disk).
>>
>>         I found that the fstore_op threads would get stuck for hundreds
>> of milliseconds or more inside of pwritev(), blocking inside the kernel.
>> Normally we expect pwritev() to be a buffered write into the page cache
>> that returns quite fast, however in this case the kernel was, in a few
>> percent of cases, blocking with the stack trace included at the end of
>> the e-mail [1]. My finding from that stack is that inside
>> __block_write_begin_int we see a call to out_of_line_wait_on_bit, which
>> is really an inlined call to wait_on_buffer, occurring in
>> linux/fs/buffer.c in the section around lines 2000-2024 with the comment
>> "If we issued read requests - let them complete."
>> (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)
>>
>>         My interpretation of that code is that for Linux to store a
>> write in the page cache, it has to have the entire 4K page, as that is
>> the granularity at which it tracks the dirty state and it needs the
>> entire 4K page to later submit back to the disk. Since we wrote only
>> part of the page, and the page wasn't already in the cache, it has to
>> fetch the remainder of the page from the disk. When this happens, it
>> blocks waiting for this read to complete before returning from the
>> pwritev() call - hence our normally buffered write blocks. This holds up
>> the tp_fstore_op thread, of which there are (by default) only 2-4 such
>> threads trying to process several hundred operations per second.
>> Additionally the size of the osd_op_queue is bounded, and operations do
>> not clear out of this queue until the tp_fstore_op thread is done, which
>> ultimately means that not only are these partial writes delayed, but
>> they also delay other writes queued behind them because of the
>> constrained thread pools.
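>>
>>         This read-modify-write behavior is easy to demonstrate outside
>> of Ceph (a rough sketch, assuming GNU coreutils dd and a scratch file on
>> the same filesystem; watch the read column of "iostat -x 1" in a second
>> terminal):
>>
>>         # populate a small file, then advise the kernel to drop its
>>         # cached pages
>>         dd if=/dev/zero of=testfile bs=1M count=4 conv=fdatasync
>>         dd of=testfile oflag=nocache conv=notrunc,fdatasync count=0
>>         # a 512-byte buffered write now forces a 4K read to fill the
>>         # rest of the page before pwritev() can return
>>         dd if=/dev/zero of=testfile bs=512 count=1 conv=notrunc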
>>
>>         What was further confusing is that I could easily reproduce
>> this in a test deployment using an rbd benchmark that was only writing
>> to a total disk size of 256MB, which I would have expected to easily fit
>> in the page cache:
>>
>>         rbd create -p rbd --size=256M bench2
>>         rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 \
>>           --io-total 256M --io-pattern rand
>>
>>         This is explained by the fact that on secondary OSDs (at least;
>> there was some refactoring of fadvise which I have not fully understood
>> as of yet), FileStore issues fadvise FADVISE_DONTNEED on the objects
>> after write, which causes the kernel to immediately discard them from
>> the page cache without any regard to their statistics of being
>> recently/frequently used. The motivation for this addition appears to be
>> that on a secondary OSD we don't service reads (only writes), and so
>> therefore we can optimize memory usage by throwing away this object, in
>> theory leaving more room in the page cache for objects which we are
>> primary for and expect to actually service client reads from.
>> Unfortunately this behavior does not take into account partial writes,
>> where we now pathologically throw away the cached copy instantly, such
>> that a write even 1 second later will have to fetch the page from disk
>> again. I also found that this FADVISE_DONTNEED is issued not only during
>> FileStore sync but also by the WBThrottle - which, as this cluster was
>> quite busy, was constantly flushing writes, leading to the cache being
>> discarded almost instantly.
>>
>>         Changing filestore_fadvise to false on this cluster led to a
>> significant performance increase, as it could now cache the pages in
>> memory in many cases. The number of reads from disk was reduced from
>> around 40/second to 2/second, and the number of slow write (>200ms)
>> operations was reduced by 75%.
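>>
>>         For reference, that change can be applied to a running cluster
>> with something like the following (a sketch; the option can also be set
>> in ceph.conf and the OSDs restarted):
>>
>>         ceph tell osd.* injectargs '--filestore_fadvise=false'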
>>
>>         I wrote a script to parse ceph-osd logs with debug_filestore=10
>> or 15, reporting the time spent inside write() as well as counting the
>> operations that are unaligned and also slow. It's a bit rough, but you
>> can find it here:
>> https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
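>>
>>         To generate suitable logs for it, the debug level can be raised
>> temporarily in the same way (again just a sketch):
>>
>>         ceph tell osd.* injectargs '--debug_filestore=10'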
>>
>>         It does not solve the problem entirely, in that a filestore
>> thread can still be blocked when the page is not cached - but at least
>> the pathological case of it never being in the cache is removed.
>>         Understanding this problem, I looked at the situation for
>> BlueStore. BlueStore suffers from a similar issue in that the
>> performance is quite poor, due both to fadvise and to the fact that it
>> checksums the data in 4k blocks and so needs to read the rest of the
>> block in, despite not having the limitations of the Linux page cache to
>> deal with. I have not yet fully investigated the BlueStore
>> implementation, other than to note the following doc talking about how
>> such writes are handled and a possible future improvement to submit
>> partial writes into the WAL before reading the rest of the block, which
>> is apparently not done currently (and would be a great optimization):
>> http://docs.ceph.com/docs/mimic/dev/bluestore/
>>
>>
>>
>>         Moving on to a full solution for this issue: we can tell
>> Windows guests to send 4k-aligned I/O where possible by setting the
>> physical_block_size hint on the disk. This support was added mainly for
>> the incoming new series of hard drives which also have 4k blocks
>> internally and need to do a similar 'read-modify-write' operation when a
>> smaller write is done. In this case Windows tries to align the I/O to 4k
>> as much as possible; at the most basic level, for example, when a new
>> file is created, it will pad the write out to the nearest 4k block
>> boundary. You can read more about support for that here:
>>
>> https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
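>>
>>         For reference, the hint is plumbed through to the guest via
>> block device properties QEMU already exposes (a sketch of the QEMU side
>> only, not the Cinder patch itself; libvirt models the same thing with
>> its <blockio> element):
>>
>>         qemu-system-x86_64 ... \
>>           -drive file=rbd:rbd/bench2,format=raw,if=none,id=drive0 \
>>           -device virtio-blk-pci,drive=drive0,logical_block_size=512,physical_block_size=4096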
>>
>>         On a basic test, booting a Windows 2016 instance and then
>> installing several months of Windows Updates, the number of partial
>> writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 / 2880217)
>> - many of which were during early boot and do not re-occur once the VM
>> is running.
>>
>>         I have submitted a patch to the OpenStack Cinder RBD driver to
>> support setting this parameter. You can find that here:
>>         https://review.opendev.org/#/c/658283/
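>>
>>         Once the hint is in place, it can be verified from inside a
>> Windows guest with the built-in fsutil; the "Bytes Per Physical Sector"
>> field should read 4096:
>>
>>         fsutil fsinfo ntfsinfo C: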
>>
>>
>>         I did not have much luck finding information about any of this
>> online when I searched, so this e-mail serves largely to document my
>> findings for others. But I am also looking for input from anyone on
>> anything I have missed, confirmation that my analysis is sound, review
>> of my Cinder patch, etc.
>>
>>         There is also likely scope to make the same change to report a
>> physical_block_size=4096 in other Ceph consumers such as the new(ish)
>> iSCSI gateway, etc.
>>
>>         Regards,
>>         Trent
>>
>>
>>         [1] fstore_op pwritev blocking stack trace - if anyone is
>> interested in the perf data, flamegraph, etc - I'd be happy to share.
>>
>>         tp_fstore_op
>>
>>         ceph::buffer::list::write_fd
>>         pwritev64
>>         entry_SYSCALL_64_after_hwframe
>>         do_syscall_64
>>         sys_pwritev
>>         do_pwritev
>>         vfs_writev
>>         do_iter_write
>>         do_iter_readv_writev
>>         xfs_file_write_iter
>>         xfs_file_buffered_aio_write
>>         iomap_file_buffered_write
>>         iomap_apply
>>         iomap_write_actor
>>         iomap_write_begin.constprop.18
>>         __block_write_begin_int
>>         out_of_line_wait_on_bit
>>         __wait_on_bit
>>         bit_wait_io
>>         io_schedule
>>         schedule
>>         __schedule
>>         finish_task_switch
>>
>>
>>
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
