On 04/22/15 17:57, Jeff Epstein wrote:
>
>
> On 04/10/2015 10:10 AM, Lionel Bouton wrote:
>> On 04/10/15 15:41, Jeff Epstein wrote:
>>> [...]
>>> This seems highly unlikely. We get very good performance without
>>> ceph. Requisitioning and manipulating block devices through LVM
>>> happens instantaneously. We expect that ceph will be a bit slower by
>>> its distributed nature, but we've seen operations block for up to an
>>> hour, which is clearly beyond the pale. Furthermore, as the
>>> performance measures I posted show, read/write speed is not the
>>> bottleneck: ceph is simply /waiting/.
>>>
>>> So, does anyone else have any ideas why mkfs (and other operations)
>>> takes so long?
>>
>> As your use case is pretty unique and clearly not something Ceph was
>> optimized for, if I were you I'd switch to a single pool with the
>> appropriate number of pgs based on your pool size (replication) and
>> the number of OSDs you use (you should target 100 pgs/OSD to be in
>> what seems to be the sweet spot) and create/delete rbds instead of
>> whole pools. You would be in "known territory" and any remaining
>> performance problem would be easier to debug.
>>
> I agree that this is a good suggestion. It took me a little while, but
> I've changed the configuration so that we now have only one pool,
> containing many rbds, and now all data is spread across all six of our
> OSD nodes. However, the performance has not perceptibly improved. We
> still have occasional long (>10 minute) wait periods during write
> operations, and the bottleneck still seems to be ceph rather than the
> hardware: the blocking process (most usually, but not always, mkfs) is
> stuck in a wait state ("D" in ps) but no I/O is actually being
> performed, so one can surmise that the physical limitations of the
> disk medium are not the bottleneck. This is similar to what is being
> reported in the thread titled "100% IO Wait with CEPH RBD and RSYNC".
>
> Do you have some idea how I can diagnose this problem?
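
For reference, a rough sketch of the PG sizing and per-image workflow
described above (assuming 6 OSDs and 3x replication; pool/image names
and the image size are placeholders, adjust pg_num to the real OSD
count):

  # ~100 PGs per OSD: 6 OSDs * 100 / 3 replicas = 200, rounded up to the
  # next power of two = 256 (assumes 6 OSDs and size=3).
  ceph osd pool create rbdpool 256 256
  ceph osd pool set rbdpool size 3

  # Create/delete rbd images inside this one pool instead of whole pools.
  # Image name and size (in MB) are placeholders.
  rbd create rbdpool/scratch01 --size 10240
  rbd map rbdpool/scratch01              # appears as /dev/rbd/rbdpool/scratch01
  mkfs.ext4 /dev/rbd/rbdpool/scratch01
  rbd unmap /dev/rbd/rbdpool/scratch01
  rbd rm rbdpool/scratch01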
I'd look at the ceph -s output while you have one of these stuck
processes, to see if there's any unusual activity (scrub/deep
scrub/recovery/backfills/...).

Is the problem correlated in any way with rbd removal (ie: do the write
blockings only appear if you removed at least one rbd in, say, the hour
before the write performance problems)?

Best regards,

Lionel Bouton
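
PS: a minimal capture sketch along these lines, run while a process is
stuck (the 10s interval and log name are arbitrary; ceph osd perf is
optional and just adds per-OSD commit/apply latencies):

  while true; do
      date
      ceph -s                                          # scrub/deep-scrub/recovery/backfill activity
      ceph osd perf                                    # per-OSD fs_commit/fs_apply latency (ms)
      ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'   # "D" (uninterruptible) processes and their wait channel
      sleep 10
  done | tee -a ceph-stuck.log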