Hello Trent,

Many thanks for the insights. We always suggest that our users choose 4Kn
over 512e HDDs.

As we recently found out, WD Support offers a tool called HUGO that can
reformat 512e drives to 4Kn with "hugo format -m <model_number> -n max
--fastformat -b 4096" in seconds.
Maybe that helps someone who has bought the wrong disk.
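
For anyone unsure what a given drive currently reports, a quick check
(replace sdX with the actual device; 512e drives report 512 logical /
4096 physical, 4Kn drives report 4096 / 4096):

lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdX
cat /sys/block/sdX/queue/logical_block_size /sys/block/sdX/queue/physical_block_size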

--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri, 10 May 2019 at 10:00, Trent Lloyd <trent.ll...@canonical.com> wrote:

> I was recently investigating a performance problem for a reasonably sized
> OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS HDDs) and
> NVMe journals. The primary workload is Windows guests backed by Cinder RBD
> volumes.
> This specific deployment is Ceph Jewel (FileStore + SimpleMessenger); while
> Jewel is EOL, the issue is reproducible on current versions and also on
> BlueStore, although for different reasons than FileStore.
>
> Generally the Ceph cluster was suffering from very poor outlier
> performance. The numbers change a little depending on the exact situation,
> but roughly 80% of I/O was completing in a "reasonable" time of 0-200ms,
> while 5-20% of I/O operations were taking excessively long, anywhere from
> 500ms through to 10-20+ seconds. However, the usual metrics for commit and
> apply latency looked normal, and in fact this latency was hard to spot in
> the performance metrics available in Jewel.
>
> Previously I had a simpler model of FileStore: a "commit" (to journal)
> stage, where the data is written to the journal and it is OK to return to
> the client, and then an "apply" (to disk) stage, where the data is flushed
> to disk and confirmed so that it can be purged from the journal. However,
> there is really a third stage in the middle, where FileStore submits the
> I/O to the operating system, and this is done before the lock on the
> object is released. Until that succeeds, another operation cannot write to
> the same object (generally a 4MB area of the disk).
>
> I found that the fstore_op threads would get stuck for hundreds of ms or
> more inside pwritev(), which was blocking inside the kernel. Normally we
> expect pwritev() to be buffered I/O into the page cache and to return
> quite fast; in this case, however, the kernel was blocking in a few percent
> of cases, with the stack trace included at the end of the e-mail [1]. My
> finding from that stack is that inside __block_write_begin_int there is a
> call to out_of_line_wait_on_bit, which is really an inlined call to
> wait_on_buffer and occurs in linux/fs/buffer.c in the section around lines
> 2000-2024 with the comment "If we issued read requests - let them
> complete." (
> https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002
> )
>
> My interpretation of that code is that for Linux to store a write in the
> page cache, it has to have the entire 4K page, as that is the granularity
> at which it tracks the dirty state, and it needs the entire 4K page to
> later submit back to the disk. Since we wrote only part of the page, and
> the page wasn't already in the cache, it has to fetch the remainder of the
> page from the disk. When this happens, it blocks waiting for this read to
> complete before returning from the pwritev() call - hence our normally
> buffered write blocks. This holds up the tp_fstore_op thread, of which
> there are (by default) only 2-4, trying to process several hundred
> operations per second. Additionally, the size of the osd_op_queue is
> bounded, and operations do not clear out of this queue until the
> tp_fstore_op thread is done. Ultimately this means that not only are these
> partial writes delayed, but they knock on to delay other writes behind
> them because of the constrained thread pools.
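>
> As a rough way to see this outside of Ceph (a sketch only - the mount
> point and file name are made up for illustration), do a sub-page buffered
> write to a file whose pages are not cached and watch the backing device:
>
> dd if=/dev/zero of=/mnt/test/obj bs=4M count=1 conv=fsync   # fully allocated test file
> echo 3 > /proc/sys/vm/drop_caches                           # evict it from the page cache
> dd if=/dev/zero of=/mnt/test/obj bs=512 count=1 seek=9 conv=notrunc   # 512-byte buffered write into an uncached page
>
> Running "iostat -x 1" in another terminal should show a read against the
> backing device during that last 512-byte write: the read-modify-write
> described above.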
>
> What was further confusing is that I could easily reproduce this in a test
> deployment using an rbd benchmark that was only writing to a total disk
> size of 256MB, which I would have expected to fit easily in the page cache:
> rbd create -p rbd --size=256M bench2
> rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M --io-pattern rand
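>
> (For comparison, a 4K-aligned run of the same benchmark should avoid the
> partial-page case entirely - just a suggested sketch reusing the same image:
> rbd bench-write -p rbd bench2 --io-size 4096 --io-threads 256 --io-total 256M --io-pattern rand)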
>
> This is explained by the fact that on secondary OSDs (at least; there was
> some refactoring of fadvise which I have not yet fully understood),
> FileStore uses fadvise FADVISE_DONTNEED on the objects after a write,
> which causes the kernel to immediately discard them from the page cache
> without any regard to how recently or frequently they were used. The
> motivation for this appears to be that on a secondary OSD we don't service
> reads (only writes), and therefore we can optimize memory usage by
> throwing away this object, in theory leaving more room in the page cache
> for objects we are primary for and expect to actually service client reads
> from. Unfortunately this behavior does not take into account partial
> writes, where we now pathologically throw away the cached copy instantly,
> such that a write even 1 second later will have to fetch the page from
> disk again. I also found that FADVISE_DONTNEED is issued not only during
> filestore sync but also by the WBThrottle, which, as this cluster was
> quite busy, was constantly flushing writes, leading to the cache being
> discarded almost instantly.
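>
> One way to see this effect directly (a sketch - the object path is
> illustrative, and fincore comes from util-linux) is to watch the
> page-cache residency of an object file on a secondary OSD while writes are
> flowing:
>
> watch -n1 fincore /var/lib/ceph/osd/ceph-N/current/<pgid>_head/<object-file>
>
> With filestore_fadvise enabled, the resident page count should drop back
> to (near) zero almost immediately after each write; with it disabled, the
> pages should stay cached.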
>
> Changing filestore_fadvise to false on this cluster led to a significant
> performance increase, as it could now cache the pages in memory in many
> cases. The number of reads from disk was reduced from around 40/second to
> 2/second, and the number of slow write operations (>200ms) was reduced by
> 75%.
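>
> For reference, this is the change I mean (a minimal ceph.conf sketch; set
> it on the OSD hosts and restart the OSDs):
>
> [osd]
> filestore fadvise = false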
>
> I wrote a script to parse ceph-osd logs with debug_filestore=10 or 15 to
> report the time spent inside write(), as well as to count and report the
> number of operations that are unaligned and/or slow. It's a bit rough, but
> you can find it here:
> https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
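>
> Roughly the workflow (a sketch - the exact script arguments may differ,
> see the script itself; adjust the OSD id and log path for your
> environment):
>
> ceph tell osd.0 injectargs '--debug_filestore 10'
> ./fstore_op_latency.rb /var/log/ceph/ceph-osd.0.log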
>
> It does not solve the problem entirely, in that a filestore thread can
> still be blocked when the page is not cached - but at least the
> pathological case of it never being in the cache is removed.
> Understanding this problem, I looked at the situation for BlueStore.
> BlueStore suffers from a similar issue in that performance for partial
> writes is quite poor, both due to fadvise and because it checksums the
> data in 4K blocks and so needs to read the rest of the block in, despite
> not having the limitations of the Linux page cache to deal with. I have
> not yet fully investigated the BlueStore implementation, other than to
> note the following doc, which describes how such writes are handled and a
> possible future improvement to submit partial writes into the WAL before
> reading the rest of the block, which apparently is not done currently (and
> would be a great optimization): http://docs.ceph.com/docs/mimic/dev/bluestore/
>
>
> Moving on to a full solution for this issue: we can tell Windows guests to
> send 4K-aligned I/O where possible by setting the physical_block_size hint
> on the disk. This support was added mainly for the newer generations of
> hard drives which also have 4K blocks internally and need to do a similar
> read-modify-write operation when a smaller write is done. In this case
> Windows tries to align I/O to 4K as much as possible; at the most basic
> level, for example, when a new file is created it will pad the write out
> to the nearest 4K. You can read more about support for that here:
>
> https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
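>
> For illustration, on the libvirt/KVM side this ends up as a blockio
> element on the guest disk (a sketch of what Nova would generate once the
> hint is set; the source and target details here are made up):
>
> <disk type='network' device='disk'>
>   <source protocol='rbd' name='volumes/volume-xxxx'/>
>   <target dev='vda' bus='virtio'/>
>   <blockio logical_block_size='512' physical_block_size='4096'/>
> </disk>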
>
> In a basic test (booting a Windows 2016 instance and then installing
> several months of Windows Updates), the number of partial writes was
> reduced from 23% (753090 / 3229597) to 1.8% (54535 / 2880217) - many of
> them during early boot, and they don't re-occur once the VM is running.
>
> I have submitted a patch to the OpenStack Cinder RBD driver to support
> setting this parameter. You can find that here:
> https://review.opendev.org/#/c/658283/
>
> I did not have much luck finding information about any of this online when
> I searched, so this e-mail largely serves to document my findings for
> others. I am also looking for input from anyone on anything I may have
> missed, confirmation that my analysis is sound, review of my Cinder patch,
> etc.
>
> There is also likely scope to make the same change to report
> physical_block_size=4096 in other Ceph consumers, such as the new(ish)
> iSCSI gateway, etc.
>
> Regards,
> Trent
>
>
> [1] fstore_op pwritev blocking stack trace - if anyone is interested in
> the perf data, flamegraph, etc - I'd be happy to share.
>
> tp_fstore_op
> ceph::buffer::list::write_fd
> pwritev64
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_pwritev
> do_pwritev
> vfs_writev
> do_iter_write
> do_iter_readv_writev
> xfs_file_write_iter
> xfs_file_buffered_aio_write
> iomap_file_buffered_write
> iomap_apply
> iomap_write_actor
> iomap_write_begin.constprop.18
> __block_write_begin_int
> out_of_line_wait_on_bit
> __wait_on_bit
> bit_wait_io
> io_schedule
> schedule
> __schedule
> finish_task_switch
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
