I was recently investigating a performance problem on a reasonably sized
OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS HDDs) with
NVMe journals. The primary workload is Windows guests backed by Cinder RBD
volumes.
This specific deployment runs Ceph Jewel (FileStore + SimpleMessenger).
While Jewel is EOL, the issue is reproducible on current versions, and
also on BlueStore, though for different reasons than on FileStore.

Generally, the Ceph cluster was suffering from very poor outlier
performance. The numbers change a little depending on the exact situation,
but roughly 80% of I/O completed in a "reasonable" time of 0-200ms, while
5-20% of I/O operations took excessively long, anywhere from 500ms up to
10-20+ seconds. However, the usual metrics for commit and apply latency
looked normal, and in fact this latency was hard to spot in the
performance metrics available in Jewel.
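
(For context, the commit and apply latency I refer to are the per-OSD
figures the cluster itself reports, e.g.:

ceph osd perf    # per-OSD fs_commit_latency / fs_apply_latency on Jewel

These looked healthy throughout.)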

Previously, I had a simpler model of FileStore: a "commit" (to journal)
stage, where the data is written to the journal and it is OK to return to
the client, and an "apply" (to disk) stage, where the data is flushed to
disk and confirmed so that it can be purged from the journal. However,
there is really a third stage in the middle, where FileStore submits the
I/O to the operating system, and this happens before the lock on the
object is released. Until that submission succeeds, no other operation can
write to the same object (an object generally being a 4MB area of the
disk).

I found that the tp_fstore_op threads would get stuck for hundreds of
milliseconds or more inside pwritev(), blocking inside the kernel.
Normally we expect pwritev() to be a buffered write into the page cache
that returns quite fast; however, in this case the kernel was, in a few
percent of cases, blocking with the stack trace included at the end of
this e-mail [1]. My finding from that stack is that inside
__block_write_begin_int there is a call to out_of_line_wait_on_bit, which
is really an inlined call to wait_on_buffer; this occurs in
linux/fs/buffer.c in the section around lines 2000-2024, with the comment
"If we issued read requests - let them complete." (
https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002
)
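
You can see this behaviour outside of Ceph too. A minimal sketch with fio
(file name and sizes are illustrative; fio's default invalidate=1 drops
the file from the page cache before the run):

fio --name=partial --filename=/srv/test/partial.dat --size=256M \
    --ioengine=psync --rw=randwrite --bs=512 --runtime=30 --time_based

Compare the completion latencies with --bs=4k: the aligned writes complete
from the page cache, while the 512-byte writes block on first reading the
surrounding 4K page from disk.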

My interpretation of that code is that for Linux to store a write in the
page cache, it has to have the entire 4K page, as that is the granularity
at which it tracks the dirty state, and it needs the entire 4K page to
later submit back to the disk. Since we wrote only part of the page, and
the page wasn't already in the cache, the kernel has to fetch the
remainder of the page from the disk. When this happens, it blocks waiting
for this read to complete before returning from the pwritev() call - hence
our normally buffered write blocks. This holds up the tp_fstore_op thread,
of which there are (by default) only 2-4, trying to process several
hundred operations per second. Additionally, the size of the osd_op_queue
is bounded, and operations do not clear out of this queue until the
tp_fstore_op thread is done. Ultimately this means that not only are these
partial writes delayed, but they knock on to delay other writes queued
behind them because of the constrained thread pools.
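
You can check the relevant limits on a running OSD (I believe
filestore_op_threads defaults to 2 on Jewel):

ceph daemon osd.0 config get filestore_op_threads
ceph daemon osd.0 config get filestore_queue_max_ops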

What was further confusing is that I could easily reproduce this in a test
deployment using an rbd benchmark that was writing to a total disk size of
only 256MB, which I would have expected to easily fit in the page cache:

rbd create -p rbd --size=256M bench2
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 \
    --io-total 256M --io-pattern rand

This is explained by the fact that on secondary OSDs (at least; there was
some refactoring of fadvise which I have not fully understood as of yet),
FileStore issues fadvise(POSIX_FADV_DONTNEED) on the objects after writing
them, which causes the kernel to immediately discard them from the page
cache without any regard to how recently or frequently they were used. The
motivation for this appears to be that a secondary OSD services only
writes, not reads, so we can optimize memory usage by throwing away this
object and, in theory, leave more room in the page cache for objects we
are primary for and expect to actually service client reads from.
Unfortunately this behaviour does not take partial writes into account: we
now pathologically throw away the cached copy instantly, such that a write
even 1 second later has to fetch the page from disk again. I also found
that this DONTNEED is issued not only during FileStore sync but also by
the WBThrottle - which, as this cluster was quite busy, was constantly
flushing writes, leading to the cache being discarded almost instantly.
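
You can observe the discard behaviour in isolation with GNU dd's nocache
flag (which issues POSIX_FADV_DONTNEED) and a cache-residency tool such as
vmtouch (a third-party utility; the file name is illustrative):

dd if=/dev/zero of=obj.dat bs=4M count=1 conv=fsync
vmtouch obj.dat     # pages resident after the write
dd of=obj.dat oflag=nocache conv=notrunc,fdatasync count=0
vmtouch obj.dat     # residency drops to ~0%; the next sub-4k write must
                    # read the page back from disk first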

Changing filestore_fadvise to false on this cluster led to a significant
performance increase, as the pages could now be cached in memory in many
cases. The number of reads from disk was reduced from around 40/second to
2/second, and the number of slow write operations (>200ms) was reduced by
75%.
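
For anyone wanting to try the same, the option can be set in ceph.conf
under [osd] (filestore fadvise = false) and injected at runtime, though a
restart is the safer way to be sure it applies:

ceph tell osd.* injectargs '--filestore_fadvise=false'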

I wrote a script to parse ceph-osd logs with debug_filestore=10 or 15, to
report the time spent inside write() and to count and report the number of
operations that are unaligned and/or slow. It's a bit rough but you can
find it here:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb
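
Roughly speaking, you raise the debug level and then feed it an OSD log
(this invocation is illustrative; check the script for exact usage):

ceph tell osd.0 injectargs '--debug_filestore=10'
ruby fstore_op_latency.rb /var/log/ceph/ceph-osd.0.log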

Disabling fadvise does not solve the problem entirely: a FileStore thread
can still be blocked when the page is not cached, but at least the
pathological case of the page never being in the cache is removed.

Understanding this problem, I looked at the situation for BlueStore.
BlueStore suffers from a similar issue in that performance is quite poor,
both due to fadvise and because it checksums the data in 4k blocks and so
needs to read the rest of the block in, despite not having the limitations
of the Linux page cache to deal with. I have not yet fully investigated
the BlueStore implementation, other than to note the following doc, which
talks about how such writes are handled and a possible future improvement
- submitting partial writes into the WAL before reading the rest of the
block - which is apparently not done currently (and would be a great
optimization): http://docs.ceph.com/docs/mimic/dev/bluestore/
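
For reference, the relevant BlueStore settings can be inspected on a
running OSD (option names as of Luminous and later; check your version):

ceph daemon osd.0 config get bluestore_csum_type    # crc32c by default
ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd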


Moving on to a fuller solution for this issue: we can tell Windows guests
to send 4k-aligned I/O where possible by setting the physical_block_size
hint on the disk. This support was added mainly for the new generation of
hard drives which also have 4k sectors internally and need to do a similar
read-modify-write operation when a smaller write is done. With this hint,
Windows tries to align its I/O to 4k as much as possible; at the most
basic level, for example, when a new file is created it will pad the write
out to the nearest 4k boundary. You can read more about Windows' support
for this here:
https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows
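
In libvirt terms, the hint is expressed as a <blockio> element on the
guest disk; a sketch of the element (the surrounding disk XML is elided):

<disk type='network' device='disk'>
  ...
  <blockio logical_block_size='512' physical_block_size='4096'/>
</disk>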

In a basic test - booting a Windows 2016 instance and then installing
several months of Windows Updates - the number of partial writes was
reduced from 23% (753090 / 3229597) to 1.8% (54535 / 2880217), and many of
the remaining ones occurred during early boot and don't re-occur once the
VM is running.

I have submitted a patch to the OpenStack Cinder RBD driver to support
setting this parameter. You can find that here:
https://review.opendev.org/#/c/658283/

I did not have much luck finding information about any of this online when
I searched, so this e-mail largely serves to document my findings for
others. But I am also looking for input from anyone: anything I have
missed, confirmation that my analysis is sound, review of my Cinder patch,
etc.

There is also likely scope to make the same change - reporting
physical_block_size=4096 - in other Ceph consumers, such as the new(ish)
iSCSI gateway.

Regards,
Trent


[1] fstore_op pwritev blocking stack trace - if anyone is interested in the
perf data, flamegraph, etc - I'd be happy to share.

tp_fstore_op
ceph::buffer::list::write_fd
pwritev64
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_pwritev
do_pwritev
vfs_writev
do_iter_write
do_iter_readv_writev
xfs_file_write_iter
xfs_file_buffered_aio_write
iomap_file_buffered_write
iomap_apply
iomap_write_actor
iomap_write_begin.constprop.18
__block_write_begin_int
out_of_line_wait_on_bit
__wait_on_bit
bit_wait_io
io_schedule
schedule
__schedule
finish_task_switch