On 09/11/2014 00:03, Gregory Farnum wrote:
> It's all about the disk accesses. What's the slow part when you dump
> historic and in-progress ops?
This is what I see on g1 (6% iowait):

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 0,
  "ops": []}

root@g1:~# ceph daemon osd.0 dump_ops_in_flight
{ "num_ops": 1,
  "ops": [
        { "description": "osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.000000001038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613)",
          "received_at": "2014-11-09 00:14:17.385256",
          "age": "0.538802",
          "duration": "0.011955",
          "type_data": [
                "waiting for sub ops",
                { "client": "client.4407100",
                  "tid": 11030174},
                [
                    { "time": "2014-11-09 00:14:17.385393",
                      "event": "waiting_for_osdmap"},
                    { "time": "2014-11-09 00:14:17.385563",
                      "event": "reached_pg"},
                    { "time": "2014-11-09 00:14:17.385793",
                      "event": "started"},
                    { "time": "2014-11-09 00:14:17.385807",
                      "event": "started"},
                    { "time": "2014-11-09 00:14:17.385875",
                      "event": "waiting for subops from 1,10"},
                    { "time": "2014-11-09 00:14:17.386201",
                      "event": "commit_queued_for_journal_write"},
                    { "time": "2014-11-09 00:14:17.386336",
                      "event": "write_thread_in_journal_buffer"},
                    { "time": "2014-11-09 00:14:17.396293",
                      "event": "journaled_completion_queued"},
                    { "time": "2014-11-09 00:14:17.396332",
                      "event": "op_commit"},
                    { "time": "2014-11-09 00:14:17.396678",
                      "event": "op_applied"},
                    { "time": "2014-11-09 00:14:17.397211",
                      "event": "sub_op_commit_rec"}]]}]}

and it looks OK. When I go to n7, which has 20% iowait, I see a much larger
output (http://pastebin.com/DPxsaf6z) that includes a number of
"event": "waiting_for_osdmap" entries. I'm not sure what to make of this, and
it would certainly be better if n7 had a lower iowait.

Also, when I run ceph -w I see a new pgmap being created every second, which
is not a good sign either:

2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 3889 B/s rd, 2125 kB/s wr, 237 op/s
2014-11-09 00:22:48.143412 mon.0 [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 op/s
2014-11-09 00:22:49.172794 mon.0 [INF] pgmap v4389615: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 343 kB/s wr, 88 op/s
2014-11-09 00:22:50.222958 mon.0 [INF] pgmap v4389616: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 412 kB/s wr, 130 op/s
2014-11-09 00:22:51.281294 mon.0 [INF] pgmap v4389617: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 1195 kB/s wr, 167 op/s
2014-11-09 00:22:52.318895 mon.0 [INF] pgmap v4389618: 460 pgs: 460 active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail; 5864 B/s rd, 2762 kB/s wr, 206 op/s
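To pin down the slow part, I suppose the next step is to compare the slowest completed ops on g1 and n7 through the admin sockets, and to look at per-disk utilization on n7. A rough sketch only: the OSD ids are placeholders for whatever runs on each host, and it assumes dump_historic_ops reports a "duration" field the same way as the in-flight dump above.

# Largest "duration" values among the recently completed ops tracked by
# each local OSD (dump_historic_ops keeps the slowest recent ops).
for osd in 0 1 2 3; do        # placeholder ids: use the OSDs of this host
    echo "== osd.$osd =="
    ceph daemon osd.$osd dump_historic_ops \
        | grep '"duration"' \
        | tr -d '":,' \
        | awk '{ print $2 }' \
        | sort -rn | head -5
done

# Per-device view of the iowait on n7, to see whether a single disk (data
# or journal) is saturated; assumes sysstat is installed.
iostat -x -d 5 3

Comparing the top durations on g1 and n7, and where the time sits between events (journal vs. sub ops), should show whether the extra iowait on n7 translates into slow ops.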
Cheers

> On Sat, Nov 8, 2014 at 2:30 PM Loic Dachary <l...@dachary.org> wrote:
>
> > Hi Greg,
> >
> > On 08/11/2014 20:19, Gregory Farnum wrote:
> > > When acting as a cache pool it needs to go do a lookup on the base
> > > pool for every object it hasn't encountered before. I assume that's
> > > why it's slower.
> > >
> > > (The penalty should not be nearly as high as you're seeing here, but
> > > based on the low numbers I imagine you're running everything on an
> > > overloaded laptop or something.)
> >
> > It's running on a small cluster that is busy, but not to a point where
> > I would expect such a difference:
> >
> > # dsh --concurrent-shell --show-machine-names --remoteshellopt=-p2222 -m g1 -m g2 -m g3 -m n7 -m stri dstat -c 10 3
> > g1: ----total-cpu-usage----
> > g1: usr sys idl wai hiq siq
> > g1:   6   1  88   6   0   0
> > g2: ----total-cpu-usage----
> > g2: usr sys idl wai hiq siq
> > g2:   4   1  88   7   0   0
> > n7: ----total-cpu-usage----
> > n7: usr sys idl wai hiq siq
> > n7:  18   3  58  20   0   1
> > stri: ----total-cpu-usage----
> > stri: usr sys idl wai hiq siq
> > stri:  6   1  86   6   0   0
> > g3: ----total-cpu-usage----
> > g3: usr sys idl wai hiq siq
> > g3:  37   2  55   5   0   1
> > g1:   2   0  93   4   0   0
> > g2:   2   0  92   6   0   0
> > n7:  13   2  65  20   0   1
> > stri:  4   1  92   3   0   0
> > g3:  32   2  62   4   0   1
> > g1:   3   0  94   3   0   0
> > g2:   3   1  94   3   0   0
> > n7:  13   3  61  22   0   1
> > stri:  4   1  90   5   0   0
> > g3:  31   2  61   4   0   1
> > g1:   3   0  89   7   0   0
> > g2:   2   1  89   8   0   0
> > n7:  20   3  50  25   0   1
> > stri:  6   1  87   5   0   0
> > g3:  57   2  36   3   0   1
> >
> > # ceph tell osd.\* version
> > osd.0: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.1: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.2: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.3: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.4: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.5: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.6: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.7: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.8: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.9: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.10: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.11: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.12: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.13: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.14: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> > osd.15: { "version": "ceph version 0.80.6 (f93610a4421cb670b08e974c6550ee715ac528ae)"}
> >
> > Cheers
> >
> > > -Greg
> > >
> > > On Sat, Nov 8, 2014 at 11:14 AM Loic Dachary <l...@dachary.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > This is a first attempt; it is entirely possible that the solution
> > > > is simple or RTFM ;-)
> > > >
> > > > Here is the problem observed:
> > > >
> > > > rados --pool ec4p1 bench 120 write # the erasure coded pool
> > > > Total time run:     147.207804
> > > > Total writes made:  458
> > > > Write size:         4194304
> > > > Bandwidth (MB/sec): 12.445
> > > >
> > > > rados --pool disks bench 120 write # same crush ruleset as the cache tier
> > > > Total time run:     126.312601
> > > > Total writes made:  1092
> > > > Write size:         4194304
> > > > Bandwidth (MB/sec): 34.581
> > > >
> > > > There must be something wrong in how the cache tier is set up: one
> > > > would expect the same write speed, since the total size written (a
> > > > few GB) is lower than the size of the cache pool. Instead, writes
> > > > through the cache tier are consistently more than twice as slow
> > > > (12.445 * 2 < 34.581).
> > > >
> > > > root@g1:~# ceph osd dump | grep disks
> > > > pool 58 'disks' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 15110 lfor 12228 flags hashpspool stripe_width 0
> > > >
> > > > root@g1:~# ceph osd dump | grep ec4
> > > > pool 74 'ec4p1' erasure size 5 min_size 4 crush_ruleset 2 object_hash rjenkins pg_num 32 pgp_num 32 last_change 15604 lfor 15604 flags hashpspool tiers 75 read_tier 75 write_tier 75 stripe_width 4096
> > > > pool 75 'ec4p1c' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 12 pgp_num 12 last_change 15613 flags hashpspool,incomplete_clones tier_of 74 cache_mode writeback target_bytes 1000000000 target_objects 1000000000 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
> > > >
> > > > root@g1:~# ceph df
> > > > GLOBAL:
> > > >     SIZE       AVAIL      RAW USED     %RAW USED
> > > >     26955G     18850G     6735G        24.99
> > > > POOLS:
> > > >     NAME       ID     USED       %USED     MAX AVAIL     OBJECTS
> > > >     ..
> > > >     disks      58     1823G      6.76      5305G         471080
> > > >     ..
> > > >     ec4p1      74     589G       2.19      12732G        153623
> > > >     ec4p1c     75     57501k     0         5305G         491
> > > >
> > > > Cheers
> > > > --
> > > > Loïc Dachary, Artisan Logiciel Libre
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre

--
Loïc Dachary, Artisan Logiciel Libre
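PS: since the cache pool sizing shows up in the osd dump above (target_bytes / target_objects on ec4p1c), the knobs that drive flushing and eviction can be adjusted with the usual pool commands. A sketch only; the values below are placeholders, not recommendations for this cluster:

# Cache-tier sizing and flushing thresholds on the cache pool (ec4p1c).
# Example values only: pick numbers that match the capacity actually
# reserved for the cache.
ceph osd pool set ec4p1c target_max_bytes 107374182400   # e.g. 100 GB
ceph osd pool set ec4p1c target_max_objects 1000000
ceph osd pool set ec4p1c cache_target_dirty_ratio 0.4    # start flushing at 40% of the target
ceph osd pool set ec4p1c cache_target_full_ratio 0.8     # start evicting at 80% of the target

If the configured targets are much smaller than the data written during the bench, the tiering agent will be flushing to the erasure coded pool for the whole run, which seems worth ruling out.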