Hello, I'm not yet sure if I'm allowed to share the files, but if you find one of them, you can verify it against these md5sums:

27d2223d66027d8e989fc07efb2df514  hugo-6.8.0.i386.deb.zip
b7db78c3927ef3d53eb2113a4e369906  hugo-6.8.0.i386.rpm.zip
9a53ed8e201298de6da7ac6a7fd9dba0  hugo-6.8.0.i386.tar.gz.zip
2deaa31186adb36b92016a252b996e70  HUGO-6.8.0.win32.zip
cd031ca8bf47b8976035d08125a2c591  HUGO-6.8.0.win64.zip
b9d90bb70415c4c5ec29dc04180c65a8  HUGO-6.8.0.winArm64.zip
6d4fc696de0b0f95b54fccdb096e634f  hugo-6.8.0.x86_64.deb.zip
12f8e39dc3cdd6c03e4eb3809a37ce65  hugo-6.8.0.x86_64.rpm.zip
545527fbb28af0c0ff4611fa20be0460  hugo-6.8.0.x86_64.tar.gz.zip
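To check a downloaded archive against the list, the sums can be fed straight to md5sum -c (a minimal sketch; it assumes the list above was saved as hugo-6.8.0.md5 next to the files, one "checksum  filename" pair per line):

# verify every listed archive; files not present show a "No such file" warning
md5sum -c hugo-6.8.0.md5
# or hash a single file by hand and compare it with the list above
md5sum HUGO-6.8.0.win64.zip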
--
Martin Verges
Managing director

Mobile: +49 174 9335695
E-Mail: martin.ver...@croit.io
Chat: https://t.me/MartinVerges

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web: https://croit.io
YouTube: https://goo.gl/PGE1Bx


On Fri., 10 May 2019 at 10:40, Marc Roos <m.r...@f1-outsourcing.eu> wrote:

Hmmm, so if I have (WD) drives that list this in the smartctl output, I should try to reformat them to 4k, which will give me better performance?

Sector Sizes: 512 bytes logical, 4096 bytes physical

Do you have a link to this download? I can only find some .cz site with the rpms.


-----Original Message-----
From: Martin Verges [mailto:martin.ver...@croit.io]
Sent: Friday, 10 May 2019 10:21
To: Trent Lloyd
Cc: ceph-users
Subject: Re: [ceph-users] Poor performance for 512b aligned "partial" writes from Windows guests in OpenStack + potential fix

Hello Trent,

many thanks for the insights. We always suggest 4kN over 512e HDDs to our users.

As we recently found out, WD support offers a tool called HUGO that can reformat 512e drives to 4kN in seconds with "hugo format -m <model_number> -n max --fastformat -b 4096". Maybe that helps someone who has bought the wrong disk.

--
Martin Verges
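For reference, whether a drive currently presents as 512e or 4kN can be checked from Linux before and after such a reformat (a minimal sketch, assuming the disk shows up as /dev/sda):

# logical and physical sector sizes as seen by the kernel
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sda
blockdev --getss --getpbsz /dev/sda
# the same information straight from the drive
smartctl -i /dev/sda | grep -i 'sector size'

A 512e drive reports 512 bytes logical / 4096 bytes physical, exactly as in the smartctl output above; after a successful reformat to 4kN both values should read 4096. The sector-size reformat should be treated as destructive to existing data on the drive.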
On Fri., 10 May 2019 at 10:00, Trent Lloyd <trent.ll...@canonical.com> wrote:

I was recently investigating a performance problem in a reasonably sized OpenStack deployment with around 220 OSDs (3.5" 7200 RPM SAS HDDs) with NVMe journals. The primary workload is Windows guests backed by Cinder RBD volumes. This specific deployment runs Ceph Jewel (FileStore + SimpleMessenger); while Jewel is EOL, the issue is reproducible on current versions and also on BlueStore, although for different reasons than on FileStore.

Generally the cluster was suffering from very poor outlier performance. The numbers change a little depending on the exact situation, but roughly 80% of I/O completed in a "reasonable" time of 0-200 ms, while 5-20% of operations took excessively long, anywhere from 500 ms up to 10-20+ seconds. The usual metrics for commit and apply latency were normal, though, and in fact this latency was hard to spot in the performance metrics available in Jewel.

Previously I had a simpler model of FileStore: a "commit" (to journal) stage, where the write lands in the journal and it is OK to return to the client, and an "apply" (to disk) stage, where the write is flushed to disk and confirmed so that the data can be purged from the journal. However, there is really a third stage in the middle, where FileStore submits the I/O to the operating system, and this happens before the lock on the object is released. Until that submission succeeds, another operation cannot write to the same object (generally a 4 MB area of the disk).

I found that the fstore_op threads would get stuck for hundreds of milliseconds or more inside pwritev(), blocking inside the kernel. Normally we expect pwritev() to be a buffered write into the page cache that returns quite fast; here, however, the kernel was blocking in a few percent of cases with the stack trace included at the end of this e-mail [1]. My finding from that stack is that inside __block_write_begin_int there is a call to out_of_line_wait_on_bit, which is really an inlined call to wait_on_buffer; it occurs in linux/fs/buffer.c around lines 2000-2024, next to the comment "If we issued read requests - let them complete." (https://github.com/torvalds/linux/blob/a2d635decbfa9c1e4ae15cb05b68b2559f7f827c/fs/buffer.c#L2002)

My interpretation of that code is that, for Linux to store a write in the page cache, it has to have the entire 4K page, because that is the granularity at which it tracks dirty state and the whole page is what it later submits back to the disk. Since we wrote only part of the page, and the page wasn't already in the cache, the kernel has to fetch the remainder of the page from disk. When this happens, it blocks waiting for that read to complete before returning from the pwritev() call - hence our normally buffered write blocks. This holds up the tp_fstore_op thread, of which there are by default only 2-4 threads trying to process several hundred operations per second. Additionally, the size of the osd_op_queue is bounded, and operations do not clear out of this queue until the tp_fstore_op thread is done. Ultimately this means that not only are these partial writes delayed, but they knock on to delay other writes queued behind them because of the constrained thread pools.

What was further confusing is that I could easily reproduce this in a test deployment using an rbd benchmark that was writing to a total disk size of only 256 MB, which I would have expected to fit easily in the page cache:

rbd create -p rbd --size=256M bench2
rbd bench-write -p rbd bench2 --io-size 512 --io-threads 256 --io-total 256M --io-pattern rand

This is explained by the fact that on secondary OSDs (at least - there was some refactoring of fadvise which I have not fully understood yet), FileStore uses fadvise FADVISE_DONTNEED on objects after writing them, which causes the kernel to immediately discard them from the page cache with no regard to how recently or frequently they were used. The motivation for this appears to be that on a secondary OSD we don't service reads (only writes), so we can optimize memory usage by throwing the object away and in theory leave more room in the page cache for objects we are primary for and expect to actually service client reads from. Unfortunately this behavior does not take partial writes into account: we now pathologically throw away the cached copy instantly, so that a write even one second later has to fetch the page from disk again.
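That "fetch the page from disk again" cost is easy to see even outside of Ceph. A minimal sketch, assuming a scratch file on a local filesystem and root access to drop the page cache (the file name and offsets are arbitrary):

# create a 1 GB scratch file and push it out of the page cache
dd if=/dev/zero of=/var/tmp/rmw-test bs=1M count=1024 conv=fsync
sync; echo 1 > /proc/sys/vm/drop_caches

# a 512-byte buffered write into an uncached page makes the kernel read the
# rest of the 4K page from disk before the write call can return
time dd if=/dev/zero of=/var/tmp/rmw-test bs=512 count=1 seek=1000 conv=notrunc

# a full, aligned 4K write to the same offset needs no read-modify-write
sync; echo 1 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/var/tmp/rmw-test bs=4096 count=1 seek=125 conv=notrunc

On an idle 7200 RPM disk the 512-byte case should show roughly the latency of one extra random read, which is the same stall the tp_fstore_op threads were hitting.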
I also found that this FADVISE_DONTNEED is issued not only during filestore sync but also by the WBThrottle, which, as this cluster was quite busy, was constantly flushing writes, leading to the cache being discarded almost instantly.

Setting filestore_fadvise to false on this cluster led to a significant performance increase, as the pages could now stay cached in memory in many cases. The number of reads from disk dropped from around 40/second to 2/second, and the number of slow (>200 ms) write operations was reduced by 75%.

I wrote a script that parses ceph-osd logs (with debug_filestore=10 or 15) and reports the time spent inside write(), as well as counting the operations that are unaligned and/or slow. It's a bit rough, but you can find it here:
https://github.com/lathiat/ceph-tools/blob/master/fstore_op_latency.rb

Disabling fadvise does not solve the problem entirely - a FileStore thread can still block when the page is not cached - but at least the pathological case of the page never being in the cache is removed.

With this problem understood, I looked at the situation for BlueStore. BlueStore suffers from a similar issue: performance for these partial writes is quite poor, both because of fadvise and because it checksums data in 4k blocks and so needs to read the rest of the block in, even though it does not have the limitations of the Linux page cache to deal with. I have not yet fully investigated the BlueStore implementation, other than to note the following doc, which describes how such writes are handled and a possible future improvement of submitting partial writes into the WAL before reading the rest of the block - apparently not done currently, and it would be a great optimization:
http://docs.ceph.com/docs/mimic/dev/bluestore/

Moving on to a fuller solution for this issue: we can tell Windows guests to send 4k-aligned I/O where possible by setting the physical_block_size hint on the disk. This support was added mainly for the newer generations of hard drives that also have 4k blocks internally and need to do a similar read-modify-write when a smaller write is done. With the hint set, Windows tries to align I/O to 4k as much as possible; at the most basic level, for example, when a new file is created it pads the write out to the nearest 4k. You can read more about that support here:
https://support.microsoft.com/en-au/help/2510009/microsoft-support-policy-for-4k-sector-hard-drives-in-windows

In a basic test - booting a Windows 2016 instance and then installing several months of Windows Updates - the share of partial writes was reduced from 23% (753090 / 3229597) to 1.8% (54535 / 2880217), and many of the remaining ones happened during early boot and don't recur once the VM is running.

I have submitted a patch to the OpenStack Cinder RBD driver to support setting this parameter. You can find it here:
https://review.opendev.org/#/c/658283/

I did not have much luck finding information about any of this online when I searched, so this e-mail largely serves to document my findings for others. But I am also looking for input from anyone on anything I have missed, confirmation that the analysis is sound, review of the Cinder patch, etc.
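For context (not part of the patch itself, just the plumbing underneath it): on a libvirt/KVM compute host, this kind of block-size hint is expressed as a blockio element in the domain XML, which maps to the logical_block_size/physical_block_size properties of the QEMU disk device. A quick way to check what a running guest was handed (a sketch; <instance-uuid> is a placeholder):

# on the compute host: what libvirt presents for the guest's disks
virsh dumpxml <instance-uuid> | grep blockio
#   <blockio logical_block_size='512' physical_block_size='4096'/>
# inside the Windows guest, confirm what the OS actually sees:
#   fsutil fsinfo ntfsinfo C:   ->   "Bytes Per Physical Sector : 4096"

If the grep returns nothing, no block-size hint is being set for that disk.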
There is also likely scope to make the same change - reporting physical_block_size=4096 - for other Ceph consumers such as the new(ish) iSCSI gateway, etc.

Regards,
Trent


[1] fstore_op pwritev blocking stack trace - if anyone is interested in the perf data, flamegraph, etc., I'd be happy to share.

tp_fstore_op
ceph::buffer::list::write_fd
pwritev64
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_pwritev
do_pwritev
vfs_writev
do_iter_write
do_iter_readv_writev
xfs_file_write_iter
xfs_file_buffered_aio_write
iomap_file_buffered_write
iomap_apply
iomap_write_actor
iomap_write_begin.constprop.18
__block_write_begin_int
out_of_line_wait_on_bit
__wait_on_bit
bit_wait_io
io_schedule
schedule
__schedule
finish_task_switch
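For anyone who wants to catch this on their own cluster, the kernel stack of a blocked tp_fstore_op thread can be read straight out of /proc (a minimal sketch; <osd-pid> and <tid> are placeholders, and reading the stack file requires root):

# list the OSD's threads and pick a tp_fstore_op thread id
ps -L -p <osd-pid> -o tid,comm | grep fstore
# dump that thread's current kernel stack; a stalled partial write looks like [1]
cat /proc/<osd-pid>/task/<tid>/stack

For an aggregated view over time, off-CPU profiling (for example with perf or bcc's offcputime) produces the kind of flame-graph data mentioned above.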
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com