On 09/11/2014 00:03, Gregory Farnum wrote:
> It's all about the disk accesses. What's the slow part when you dump
> historic and in-progress ops?
This is what I see on g1 (6% iowait):

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 0,
  "ops": []}

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 1,
  "ops": [
        { "description": "osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.000000001038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613)",
          "received_at": "2014-11-09 00:14:17.385256",
          "age": "0.538802",
          "duration": "0.011955",
          "type_data": [
                "waiting for sub ops",
                { "client": "client.4407100",
                  "tid": 11030174},
                [
                    { "time": "2014-11-09 00:14:17.385393",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-11-09 00:14:17.385563",
                      "event": "reached_pg"},
                    { "time": "2014-11-09 00:14:17.385793",
                      "event": "started"},
                    { "time": "2014-11-09 00:14:17.385807",
                      "event": "started"},
                    { "time": "2014-11-09 00:14:17.385875",
                      "event": "waiting for subops from 1,10"},
                    { "time": "2014-11-09 00:14:17.386201",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-11-09 00:14:17.386336",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-11-09 00:14:17.396293",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-11-09 00:14:17.396332",
                      "event": "op_commit"},
                    { "time": "2014-11-09 00:14:17.396678",
                      "event": "op_applied"},
                    { "time": "2014-11-09 00:14:17.397211",
                      "event": "sub_op_commit_rec"}]]}]}

and it looks OK. When I go to n7, which has 20% iowait, I see a much larger
output (http://pastebin.com/DPxsaf6z) that includes a number of
"event": "waiting_for_osdmap" entries. I'm not sure what to make of this, and
it would certainly be better if n7 had a lower iowait.

Also, when I run ceph -w I see a new pgmap being created every second, which
is not a good sign either:

2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s
2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s
2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s
2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s
2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s
2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd, 2762 kB/s wr, 206 op/s
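To pin down the slow part, I suppose the next step is to compare the slowest completed ops on g1 and n7 through the admin sockets, and to look at per-disk utilization on n7. A rough sketch only: the OSD ids are placeholders for whatever runs on each host, and it assumes dump_historic_ops reports a "duration" field the same way as the in-flight dump above.

# Largest "duration" values among the recently completed ops tracked by
# each local OSD (dump_historic_ops keeps the slowest recent ops).
for osd in 0 1 2 3; do        # placeholder ids: use the OSDs of this host
    echo "== osd.$osd =="
    ceph daemon osd.$osd dump_historic_ops \
        | grep '"duration"' \
        | tr -d '":,' \
        | awk '{ print $2 }' \
        | sort -rn | head -5
done

# Per-device view of the iowait on n7, to see whether a single disk (data
# or journal) is saturated; assumes sysstat is installed.
iostat -x -d 5 3

Comparing the top durations on g1 and n7, and where the time sits between events (journal vs. sub ops), should show whether the extra iowait on n7 translates into slow ops.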
Cheers

> On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary <l...@dachary.org> wrote:
>
> > Hi Greg,
> >
> > On 08/11/2014 20:19, Gregory Farnum wrote:
> > > When acting as a cache pool it needs to go do a lookup on the base
> > > pool for every object it hasn't encountered before. I assume that's
> > > why it's slower.
> > >
> > > (The penalty should not be nearly as high as you're seeing here, but
> > > based on the low numbers I imagine you're running everything on an
> > > overloaded laptop or something.)
> >
> > It's running on a small cluster that is busy, but not to a point where
> > I would expect such a difference:
> >
> > # dsh --concurrent-shell --show-machine-names --remoteshellopt=-p2222 -m g1 -m g2 -m g3 -m n7 -m stri dstat -c 10 3
> > g1: ----total-cpu-usage----
> > g1: usr sys idl wai hiq siq
> > g1:   6   1  88   6   0   0
> > g2: ----total-cpu-usage----
> > g2: usr sys idl wai hiq siq
> > g2:   4   1  88   7   0   0
> > n7: ----total-cpu-usage----
> > n7: usr sys idl wai hiq siq
> > n7:  18   3  58  20   0   1
> > stri: ----total-cpu-usage----
> > stri: usr sys idl wai hiq siq
> > stri:  6   1  86   6   0   0
> > g3: ----total-cpu-usage----
> > g3: usr sys idl wai hiq siq
> > g3:  37   2  55   5   0   1
> > g1:   2   0  93   4   0   0
> > g2:   2   0  92   6   0   0
> > n7:  13   2  65  20   0   1
> > stri:  4   1  92   3   0   0
> > g3:  32   2  62   4   0   1
> > g1:   3   0  94   3   0   0
> > g2:   3   1  94   3   0   0
> > n7:  13   3  61  22   0   1
> > stri:  4   1  90   5   0   0
> > g3:  31   2  61   4   0   1
> > g1:   3   0  89   7   0   0
> > g2:   2   1  89   8   0   0
> > n7:  20   3  50  25   0   1
> > stri:  6   1  87   5   0   0
> > g3:  57   2  36   3   0   1
> >
> > # ceph tell osd.\* version
> > osd.0: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.1: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.2: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.3: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.4: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.5: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.6: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.7: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.8: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.9: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.10: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.11: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.12: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.13: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.14: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.15: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> >
> > Cheers
> >
> > > -Greg
> > >
> > > On Sat, Nov 8, 2014 at 11:14 AM Loic Dachary <l...@dachary.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is a first attempt; it is entirely possible that the solution
> > > > is simple or RTFM ;-)
> > > >
> > > > Here is the problem observed:
> > > >
> > > > rados --pool ec4p1 bench 120 write # the erasure coded pool
> > > > Total time run:     147.207804
> > > > Total writes made:  458
> > > > Write size:         4194304
> > > > Bandwidth (MB/sec): 12.445
> > > >
> > > > rados --pool disks bench 120 write # same crush ruleset as the cache tier
> > > > Total time run:     126.312601
> > > > Total writes made:  1092
> > > > Write size:         4194304
> > > > Bandwidth (MB/sec): 34.581
> > > >
> > > > There must be something wrong in how the cache tier is set up: one
> > > > would expect the same write speed, since the total size written (a
> > > > few GB) is lower than the size of the cache pool. Instead, writes
> > > > through the cache tier are consistently more than twice as slow
> > > > (12.445 * 2 < 34.581).
> > > >
> > > > root@g1:~# ceph osd dump | grep disks
> > > > pool 58 'disks' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 15110 lfor 12228 flags hashpspool stripe_width 0
> > > >
> > > > root@g1:~# ceph osd dump | grep ec4
> > > > pool 74 'ec4p1' erasure size 5 min_size 4 crush_ruleset 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 15604 lfor 15604 flags hashpspool tiers 75 read_tier 75 write_tier 75 stripe_width 4096
> > > > pool 75 'ec4p1c' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 12 pgp_num 12 last_change 15613 flags hashpspool,incomplete_clones tier_of 74 cache_mode writeback target_bytes 1000000000 target_objects 1000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
> > > >
> > > > root@g1:~# ceph df
> > > > GLOBAL:
> > > >     SIZE       AVAIL      RAW USED     %RAW USED
> > > >     26955G     18850G     6735G        24.99
> > > > POOLS:
> > > >     NAME       ID     USED       %USED     MAX AVAIL     OBJECTS
> > > >     ..
> > > >     disks      58     1823G      6.76      5305G         471080
> > > >     ..
> > > >     ec4p1      74     589G       2.19      12732G        153623
> > > >     ec4p1c     75     57501k     0         5305G         491
> > > >
> > > > Cheers
> > > > --
> > > > Loïc Dachary, Artisan Logiciel Libre
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
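PS: since the cache pool sizing shows up in the osd dump above (target_bytes / target_objects on ec4p1c), the knobs that drive flushing and eviction can be adjusted with the usual pool commands. A sketch only; the values below are placeholders, not recommendations for this cluster:

# Cache-tier sizing and flushing thresholds on the cache pool (ec4p1c).
# Example values only: pick numbers that match the capacity actually
# reserved for the cache.
ceph osd pool set ec4p1c target_max_bytes 107374182400   # e.g. 100 GB
ceph osd pool set ec4p1c target_max_objects 1000000
ceph osd pool set ec4p1c cache_target_dirty_ratio 0.4    # start flushing at 40% of the target
ceph osd pool set ec4p1c cache_target_full_ratio 0.8     # start evicting at 80% of the target

If the configured targets are much smaller than the data written during the bench, the tiering agent will be flushing to the erasure coded pool for the whole run, which seems worth ruling out.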