Hello,

I'm not yet sure if I'm allowed to share the files, but if you find one
of those, you can verify it against these md5sums:

27d2223d66027d8e989fc07efb2df514  hugo-6.8.0.i386.deb.zip
b7db78c3927ef3d53eb2113a4e369906  hugo-6.8.0.i386.rpm.zip
9a53ed8e201298de6da7ac6a7fd9dba0  hugo-6.8.0.i386.tar.gz.zip
2deaa31186adb36b92016a252b996e70  HUGO-6.8.0.win32.zip
cd031ca8bf47b8976035d08125a2c591  HUGO-6.8.0.win64.zip
b9d90bb70415c4c5ec29dc04180c65a8  HUGO-6.8.0.winArm64.zip
6d4fc696de0b0f95b54fccdb096e634f  hugo-6.8.0.x86_64.deb.zip
12f8e39dc3cdd6c03e4eb3809a37ce65  hugo-6.8.0.x86_64.rpm.zip
545527fbb28af0c0ff4611fa20be0460  hugo-6.8.0.x86_64.tar.gz.zip
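
For example, to check one of the archives against this list (a minimal
sketch; the name of the saved checksum file is arbitrary):

    md5sum HUGO-6.8.0.win64.zip
    # or save the nine lines above to a file and let md5sum compare:
    md5sum -c hugo-6.8.0.md5sums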

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:40, Marc Roos <m.r...@f1-outsourcing.eu> wrote:

>
> Hmmm, so if I have (WD) drives that list this in smartctl output, I
> should try to reformat them to 4k, which will give me better
> performance?
>
> Sector Sizes:     512 bytes logical, 4096 bytes physical
>
> Do you have a link to this download? I can only find some .cz site
> with the rpms.
>
>
> -----Original Message-----
> From: Martin Verges [mailto:martin.ver...@croit.io]
> Sent: Friday, 10 May 2019 10:21
> To: Trent Lloyd
> Cc: ceph-users
> Subject: Re: [ceph-users] Poor performance for 512b aligned "partial"
> writes from Windows guests in OpenStack + potential fix
>
> Hello Trent,
>
> many thanks for the insights. We always suggest that our users use
> 4kN rather than 512e HDDs.
>
> As we recently found out, WD Support offers a tool called HUGO that
> reformats 512e drives to 4kN in seconds with "hugo format -m
> <model_number> -n max --fastformat -b 4096".
> Maybe that helps someone who has bought the wrong disk.
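>
> A quick way to confirm the result (a sketch; /dev/sdX stands in for
> the real device) is to re-read the sector sizes:
>
>     smartctl -i /dev/sdX | grep 'Sector Sizes'
>
> After the reformat it should report 4096 bytes logical and physical.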
>
> --
> Martin Verges
> Managing director
>
> Mobile: +49 174 9335695
> E-Mail: martin.ver...@croit.io
> Chat: https://t.me/MartinVerges
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
>
> Web: https://croit.io
> YouTube: https://goo.gl/PGE1Bx
>
>
>
> On Fri, 10 May 2019 at 10:00, Trent Lloyd
> <trent.ll...@canonical.com> wrote:
>
>
>         I was recently investigating a performance problem in a
> reasonably sized OpenStack deployment with around 220 OSDs (3.5" 7200
> RPM SAS HDDs) with NVMe journals. The primary workload is Windows
> guests backed by Cinder RBD volumes.
>         This specific deployment runs Ceph Jewel (FileStore +
> SimpleMessenger). While Jewel is EOL, the issue is reproducible on
> current versions and also on BlueStore, though for different reasons
> than on FileStore.
>
>
>         Generally the Ceph cluster was suffering from very poor
> outlier performance. The numbers change a little depending on the
> exact situation, but roughly 80% of I/O completed in a "reasonable"
> 0-200ms, while 5-20% of I/O operations took excessively long:
> anywhere from 500ms up to 10-20+ seconds. However, the usual metrics
> for commit and apply latency were normal, and in fact this latency
> was hard to spot in the performance metrics available in Jewel.
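>
>         (For reference, the cluster-wide view of those per-OSD
> numbers - a sketch; the command exists in Jewel and later - is:
>
>         ceph osd perf
>
> which prints each OSD's commit and apply latency. As noted, these
> all looked normal here.)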
>
>         Previously I had a simplified view of FileStore: a "commit"
> (to journal) stage, where the write is persisted to the journal and
> it is OK to return to the client, and an "apply" (to disk) stage,
> where it is flushed to disk and confirmed so that the data can be
> purged from the journal. However, there is really a third stage in
> the middle, where FileStore submits the I/O to the operating system,
> and this is done before the lock on the object is released. Until
> that submission succeeds, another operation cannot write to the same
> object (generally a 4MB area of the disk).
>
>         I found that the fstore_op threads would get stuck for
> hundreds of milliseconds or more inside pwritev(), blocking inside
> the kernel. Normally we expect pwritev() to be buffered I/O into the
> page cache and to return quite fast; in this case, however, the
> kernel was in a few percent of cases blocking with the stack trace
> included at the end of this e-mail [1]. My reading of that stack is
> that inside __block_write_begin_int we see a call to
> out_of_line_wait_on_bit, which is really an inlined call to
> wait_on_buffer and occurs in linux/fs/buffer.c around lines 2000-2024
> under the comment "If we issued read requests - let them complete."
> (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)
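>
>         (If you want to catch such stacks yourself - a sketch,
> assuming a single ceph-osd process and root access - you can dump the
> kernel stack of every OSD thread while a stall is happening:
>
>         for t in /proc/$(pidof ceph-osd)/task/*; do cat $t/stack; done
>
> Threads blocked in the path above show io_schedule / wait_on_bit
> frames like those in [1].)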
>
>         My interpretation of that code is that for Linux to store a
> write in the page cache, it has to have the entire 4K page: that is
> the granularity at which it tracks the dirty state, and it needs the
> whole 4K page to later submit back to the disk. Since we wrote only
> part of the page, and the page wasn't already in the cache, the
> kernel has to fetch the remainder of the page from the disk. When
> this happens, it blocks waiting for this read to complete before
> returning from the pwritev() call - hence our normally buffered write
> blocks. This holds up the tp_fstore_op threads, of which there are
> (by default) only 2-4 trying to process several hundred operations
> per second. Additionally, the size of the osd_op_queue is bounded,
> and operations do not clear out of this queue until the tp_fstore_op
> thread is done - which ultimately means that not only are these
> partial writes delayed, but they knock on to delay other writes
> behind them because of the constrained thread pools.
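>
>         (The underlying kernel behaviour is easy to demonstrate
> outside of Ceph - a minimal sketch; the file name is arbitrary and
> drop_caches needs root:
>
>         dd if=/dev/urandom of=testfile bs=4096 count=1024
>         sync; echo 3 > /proc/sys/vm/drop_caches
>         dd if=/dev/zero of=testfile bs=512 count=1 seek=9 conv=notrunc
>
> The second dd is a "buffered" 512-byte write, yet it stalls until the
> containing 4K page has been read back from disk.)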
>
>         What further confused matters is that I could easily
> reproduce this in a test deployment using an rbd benchmark that wrote
> to a total disk size of only 256MB, which I would have expected to
> fit easily in the page cache:
>
>         rbd create -p rbd --size=256M bench2
>         rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256
> --io-total 256M --io-pattern rand
>
>         This is explained by the fact that on secondary OSDs (at
> least; there was some refactoring of fadvise which I have not yet
> fully understood), FileStore issues fadvise FADVISE_DONTNEED on the
> objects after writing, which causes the kernel to immediately discard
> them from the page cache without any regard to how recently or
> frequently they were used. The motivation for this appears to be that
> a secondary OSD services only writes, not reads, so we can optimize
> memory usage by throwing away such objects and in theory leave more
> room in the page cache for objects we are primary for and expect to
> actually service client reads from. Unfortunately this behavior does
> not take partial writes into account: we now pathologically throw
> away the cached copy instantly, such that a write even 1 second later
> has to fetch the page from disk again. I also found that this
> FADVISE_DONTNEED is issued not only during filestore sync but also by
> the WBThrottle - which, as this cluster was quite busy, was
> constantly flushing writes, leading to the cache being discarded
> almost instantly.
>
>         Changing filestore_fadvise to false on this cluster led to a
> significant performance increase, as pages could now be cached in
> memory in many cases. The number of reads from disk was reduced from
> around 40/second to 2/second, and the number of slow (>200ms) write
> operations was reduced by 75%.
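>
>         (For anyone wanting to try the same - a sketch - either set
> it persistently in ceph.conf:
>
>         [osd]
>         filestore fadvise = false
>
> or inject it at runtime:
>
>         ceph tell osd.* injectargs '--filestore_fadvise=false'
>
> Weigh this against the memory-usage rationale described above.)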
>
>         I wrote a script that parses ceph-osd logs with
> debug_filestore=10 or 15 to report the time spent inside write(), and
> to count and report the operations that are unaligned or slow. It's a
> bit rough, but you can find it here:
> https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
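>
>         (To produce logs the script can parse - a sketch - raise the
> subsystem log level at runtime:
>
>         ceph tell osd.* injectargs '--debug-filestore=10'
>
> and point the script at the resulting OSD logs, by default under
> /var/log/ceph/.)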
>
>         It does not solve the problem entirely, in that a filestore
> thread can still be blocked when an object is not cached - but at
> least the pathological case of never having it in the cache is
> removed.
>
>         Understanding this problem, I looked at the situation for
> BlueStore. BlueStore suffers from a similar issue in that performance
> is quite poor, due both to fadvise and to the fact that it checksums
> the data in 4k blocks and so needs to read the rest of the block in,
> despite not having the limitations of the Linux page cache to deal
> with. I have not yet fully investigated the BlueStore implementation,
> other than to note the following doc discussing how such writes are
> handled and a possible future improvement: submitting partial writes
> into the WAL before reading the rest of the block, which is
> apparently not done currently (and would be a great optimization):
> http://docs.ceph.com/docs/mimic/dev/bluestore/
>
>
>
>         Moving on to a full solution for this issue: we can tell
> Windows guests to send 4k-aligned I/O where possible by setting the
> physical_block_size hint on the disk. This support was added mainly
> for the incoming new series of hard drives which also have 4k blocks
> internally and need to do a similar 'read-modify-write' operation
> when a smaller write is done. With this hint set, Windows tries to
> align its I/O to 4k as much as possible; at the most basic level, for
> example, when a new file is created, it pads the write out to the
> nearest 4k block boundary. You can read more about that support here:
>
> https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
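>
>         (At the libvirt level this hint ultimately surfaces as a
> blockio element in the guest disk's XML - a sketch with the values
> discussed here:
>
>         <blockio logical_block_size='512' physical_block_size='4096'/>
>
> which QEMU then advertises to the guest.)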
>
>         In a basic test - booting a Windows 2016 instance and then
> installing several months of Windows Updates - the number of partial
> writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 /
> 2880217), and many of the remainder occurred during early boot and
> don't recur once the VM is running.
>
>         I have submitted a patch to the OpenStack Cinder RBD driver to
> support setting this parameter. You can find that here:
>         https://review.opendev.org/#/c/658283/
>
>
>         I did not have much luck finding information about any of
> this online, so this e-mail largely serves to document my findings
> for others. But I am also looking for input from anyone: anything I
> have missed, confirmation that my analysis is sound, review of my
> Cinder patch, etc.
>
>         There is also likely scope to apply this same change -
> reporting physical_block_size=4096 - to other Ceph consumers such as
> the new(ish) iSCSI gateway, etc.
>
>         Regards,
>         Trent
>
>
>         [1] fstore_op pwritev blocking stack trace - if anyone is
> interested in the perf data, flamegraph, etc - I'd be happy to share.
>
>         tp_fstore_op
>
>         ceph::buffer::list::write_fd
>         pwritev64
>         entry_SYSCALL_64_after_hwframe
>         do_syscall_64
>         sys_pwritev
>         do_pwritev
>         vfs_writev
>         do_iter_write
>         do_iter_readv_writev
>         xfs_file_write_iter
>         xfs_file_buffered_aio_write
>         iomap_file_buffered_write
>         iomap_apply
>         iomap_write_actor
>         iomap_write_begin.constprop.18
>         __block_write_begin_int
>         out_of_line_wait_on_bit
>         __wait_on_bit
>         bit_wait_io
>         io_schedule
>         schedule
>         __schedule
>         finish_task_switch
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
