I will definitely follow your steps and apply bluefs_buffered_io=true via ceph.conf and restart; my first attempt was to set it dynamically at runtime. I'll report back once it's done.
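For the record, this is roughly what I intend to do on each OSD host. A sketch only: the systemd target and the OSD id are assumptions for our non-containerized, systemd-based deployment, not something from your mail.

    # /etc/ceph/ceph.conf on every OSD host
    [osd]
    bluefs_buffered_io = true

    # restart the OSDs on the host, then confirm the running daemons picked up the value
    systemctl restart ceph-osd.target
    ceph daemon osd.0 config get bluefs_buffered_io

Afterwards I'll keep an eye on the buffer/cache numbers in free and top, as you suggested, to confirm the setting is really in effect.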
We monitor our clusters via Telegraf (Ceph input plugin) and InfluxDB, with a custom Grafana dashboard tailored to our needs; a rough sketch of the Telegraf side is at the very end of this mail, below the quoted thread.

Björn

> On 13.02.2021 at 09:23, Frank Schilder <fr...@dtu.dk> wrote:
>
> Ahh, OK. I'm not sure if it has that effect. What people observed was that RocksDB access became faster due to system buffer cache hits. This has an indirect influence on data access latency.
>
> The typical case is "high IOPS on WAL/DB device after upgrade", and setting bluefs_buffered_io=true got this back to normal, also improving client performance as a result.
>
> Your latency graphs actually look suspiciously like it should work for you. Are you sure the OSD is using the value? I had problems with setting some parameters; I needed to include them in the ceph.conf file and restart to force them through.
>
> A sign that bluefs_buffered_io=true is applied is rapidly increasing system buffer usage reported by top or free. If the values reported are similar for all hosts, bluefs_buffered_io is still disabled.
>
> If I may ask, what framework are you using to pull these graphs? Is there a Grafana dashboard one can download somewhere, or is it something you implemented yourself? I plan to enable Prometheus on our cluster, but don't know about a good data sink providing a pre-defined dashboard.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Björn Dolkemeier <b.dolkeme...@dbap.de>
> Sent: 13 February 2021 08:51:11
> To: Frank Schilder
> Cc: ceph-users@ceph.io
> Subject: Re: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16
>
> Thanks for the quick reply, Frank.
>
> Sorry, the graphs/attachment were filtered. Here is an example of one latency:
> https://drive.google.com/file/d/1qSWmSmZ6JXVweepcoY13ofhfWXrBi2uZ/view?usp=sharing
>
> I'm aware that the overall performance depends on the slowest OSD.
>
> What I expect is that bluefs_buffered_io=true set on one OSD is reflected in dropped latencies for that particular OSD.
>
> Best regards,
> Björn
>
> On 13.02.2021 at 07:39, Frank Schilder <fr...@dtu.dk> wrote:
>
> The graphs were forgotten or filtered out.
>
> Changing the buffered_io value on one host will not change client IO performance, as it's always the slowest OSD that's decisive. However, it should have an effect on the IOP/s load reported by iostat on the disks of that host.
>
> Does setting bluefs_buffered_io=true on all hosts have an effect on client IO? Note that it might need a restart even if the documentation says otherwise.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Björn Dolkemeier <b.dolkeme...@dbap.de>
> Sent: 13 February 2021 07:16:06
> To: ceph-users@ceph.io
> Subject: [ceph-users] Latency increase after upgrade 14.2.8 to 14.2.16
>
> Hi,
>
> after upgrading Ceph from 14.2.8 to 14.2.16 we experienced increased latencies. There were no changes in hardware, configuration, workload or networking, just a rolling update via ceph-ansible on a running production cluster. The cluster consists of 16 OSDs (all SSD) over 4 nodes. The VMs served via RBD from this cluster currently suffer from high I/O wait (CPU).
>
> These are some latencies that increased after the update:
> - op_r_latency
> - op_w_latency
> - kv_final_lat
> - state_kv_commiting_lat
> - submit_lat
> - subop_w_latency
>
> Do these latencies point to KV/RocksDB?
>
> These are some latencies which did NOT increase after the update:
> - kv_sync_lat
> - kv_flush_lat
> - kv_commit_lat
>
> I attached one graph showing the massive increase after the update.
>
> I tried setting bluefs_buffered_io=true (as its default value was changed and it was mentioned as performance-relevant) for all OSDs on one host, but this does not make a difference.
>
> The ceph.conf is fairly simple:
>
> [global]
> cluster network = xxx
> fsid = xxx
> mon host = xxx
> public network = xxx
>
> [osd]
> osd memory target = 10141014425
>
> Any ideas what to try? Help appreciated.
>
> Björn
>
> --
>
> dbap GmbH
> phone +49 251 609979-0 / fax +49 251 609979-99
> Heinr.-von-Kleist-Str. 47, 48161 Muenster, Germany
> http://www.dbap.de
>
> dbap GmbH, Sitz: Muenster
> HRB 5891, Amtsgericht Muenster
> Geschaeftsfuehrer: Bjoern Dolkemeier
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
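P.S. regarding the monitoring setup mentioned at the top: the latency counters quoted above come from the per-OSD admin sockets, which is also where Telegraf's Ceph input plugin reads them. A rough sketch of both sides, from memory (the OSD id is just an example, and the exact plugin option names should be double-checked against the Telegraf README):

    # manual spot check of a single OSD's counters
    ceph daemon osd.0 perf dump | jq '.osd.op_w_latency, .bluestore.kv_final_lat'

    # telegraf.conf, Ceph input reading the admin sockets
    [[inputs.ceph]]
      socket_dir = "/var/run/ceph"
      socket_suffix = "asok"
      ceph_user = "client.admin"
      ceph_config = "/etc/ceph/ceph.conf"
      gather_admin_socket_stats = true
      gather_cluster_stats = false

Grafana then simply queries the resulting InfluxDB series for the dashboard.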