I've just checked osd.1072 and osd.872; they both look the same: a single op for the object in question, in retry+read state, that appears to be retrying forever.
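In case it's useful, here is roughly how I've been sampling the admin socket to confirm it really is the same op each time rather than a stream of fresh requests (just a sketch; the osd number and the rbd_id prefix come from the osdc output quoted below, and the sample count/interval are arbitrary):

    # run on the host that holds osd.1072's admin socket; watch whether the
    # stuck op keeps the same tid while initiated_at keeps moving forward
    # (loop count and 30s interval are arbitrary choices)
    for i in 1 2 3 4 5; do
        date
        ceph daemon osd.1072 ops | grep -A3 'rbd_id.volume-44c74bb5'
        sleep 30
    done

If the tid stays the same while initiated_at keeps advancing, the client looks to be resubmitting the same request over and over, which matches the "retrying forever" behaviour above.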
On Thu, Dec 17, 2015 at 10:05 AM, Tom Christensen <pav...@gmail.com> wrote:
> I had already nuked the previous hang, but we have another one:
>
> osdc output:
>
> 70385     osd853   5.fb666328  rbd_data.36f804a163d632a.00000000000370ff           read
> 11024940  osd1072  5.f438406c  rbd_id.volume-44c74bb5-14f8-4279-b44f-8e867248531b  call
> 11241684  osd872   5.175f624d  rbd_id.volume-3e068bc7-75eb-4504-b109-df851a787f89  call
> 11689088  osd685   5.1fc9acd5  rbd_header.36f804a163d632a  442390'5605610926112768  watch
>
> # ceph osd map rbd rbd_data.36f804a163d632a.00000000000370ff
> osdmap e1309560 pool 'rbd' (5) object 'rbd_data.36f804a163d632a.00000000000370ff' -> pg 5.fb666328 (5.6328) -> up ([853,247,265], p853) acting ([853,247,265], p853)
>
> As to the output of osd.853 ops, objecter_requests, dump_ops_in_flight,
> dump_historic_ops... I see a single request in the output of osd.853 ops:
>
> {
>     "ops": [
>         {
>             "description": "osd_op(client.58016244.1:70385 rbd_data.36f804a163d632a.00000000000370ff [read 7274496~4096] 5.fb666328 RETRY=1 retry+read e1309006)",
>             "initiated_at": "2015-12-17 10:03:35.360401",
>             "age": 0.000503,
>             "duration": 0.000233,
>             "type_data": [
>                 "reached pg",
>                 {
>                     "client": "client.58016244",
>                     "tid": 70385
>                 },
>                 [
>                     {
>                         "time": "2015-12-17 10:03:35.360401",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2015-12-17 10:03:35.360635",
>                         "event": "reached_pg"
>                     }
>                 ]
>             ]
>         }
>     ],
>     "num_ops": 1
> }
>
> The other commands either return nothing (ops_in_flight, objecter_requests)
> or, in the case of historic ops, return 20 ops (that's what it's set to keep),
> but none of them are this request or reference this object. It seems this
> read is just retrying forever?
>
> On Sat, Dec 12, 2015 at 12:10 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>> On Sat, Dec 12, 2015 at 6:37 PM, Tom Christensen <pav...@gmail.com> wrote:
>> > We had a kernel map get hung up again last night/this morning. The rbd is
>> > mapped but unresponsive; if I try to unmap it I get the following error:
>> > rbd: sysfs write failed
>> > rbd: unmap failed: (16) Device or resource busy
>> >
>> > Now that this has happened, attempting to map another RBD fails, and using
>> > lsblk fails as well; both of these tasks just hang forever.
>> >
>> > We have 1480 OSDs in the cluster so posting the osdmap seems excessive,
>> > however, here is the beginning (it didn't change in 5 runs):
>> > root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdmap
>> > epoch 1284256
>> > flags
>> > pool 0 pg_num 2048 (2047) read_tier -1 write_tier -1
>> > pool 1 pg_num 512 (511) read_tier -1 write_tier -1
>> > pool 3 pg_num 2048 (2047) read_tier -1 write_tier -1
>> > pool 4 pg_num 512 (511) read_tier -1 write_tier -1
>> > pool 5 pg_num 32768 (32767) read_tier -1 write_tier -1
>> >
>> > Here is the osdc output; it has not changed after 5 runs:
>> >
>> > root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
>> > 93835    osd1206  5.6841959c  rbd_data.34df3ac703ced61.0000000000001dff  read
>> > 9065810  osd1382  5.a50fa0ea  rbd_header.34df3ac703ced61  474103'5506530325561344  watch
>> > root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
>> > 93835    osd1206  5.6841959c  rbd_data.34df3ac703ced61.0000000000001dff  read
>> > 9067286  osd1382  5.a50fa0ea  rbd_header.34df3ac703ced61  474103'5506530325561344  watch
>> > root@wrk-slc-01-02:~# cat /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdc
>> > 93835    osd1206  5.6841959c  rbd_data.34df3ac703ced61.0000000000001dff  read
>> > 9067831  osd1382  5.a50fa0ea  rbd_header.34df3ac703ced61  474103'5506530325561344  watch
>> > root@wrk-slc-01-02:~# ls /dev/rbd/rbd
>> > none  volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d  volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d-part1
>> > root@wrk-slc-01-02:~# rbd info volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d
>> > rbd image 'volume-daac5f12-e39b-4d64-a4fa-86c810aeb72d':
>> >         size 61439 MB in 7680 objects
>> >         order 23 (8192 kB objects)
>> >         block_name_prefix: rbd_data.34df3ac703ced61
>> >         format: 2
>> >         features: layering
>> >         flags:
>> >         parent: rbd/volume-93d9a102-260e-4500-b87d-9696c7fc2b67@snapshot-9ba998b6-ca57-40dd-8895-265023132e99
>> >         overlap: 61439 MB
>> >
>> > ceph status indicates the current osdmap epoch:
>> >      osdmap e1284866: 1480 osds: 1480 up, 1480 in
>> >       pgmap v10231386: 37888 pgs, 5 pools, 745 TB data, 293 Mobjects
>> >
>> > root@wrk-slc-01-02:~# uname -r
>> > 3.19.0-25-generic
>> >
>> > So, the kernel driver is some 600 epochs behind current. This does seem to
>> > be load related, as we've been running 4 different kernels in our clients in
>> > our test environment and have not been able to recreate it there in a little
>> > over a week; however, our production environment has had 2 of these hangs in
>> > the last 4 days. Unfortunately I wasn't able to grab data from the first one.
>>
>> If you haven't already nuked it, what's the output of:
>>
>> $ ceph osd map <pool name of pool with id 5> rbd_data.34df3ac703ced61.0000000000001dff
>> $ ceph osd map <pool name of pool with id 5> rbd_header.34df3ac703ced61
>>
>> $ ceph daemon osd.1206 ops
>> $ ceph daemon osd.1206 objecter_requests
>> $ ceph daemon osd.1206 dump_ops_in_flight
>> $ ceph daemon osd.1206 dump_historic_ops
>>
>> and repeat for osd.1382.
>>
>> Thanks,
>>
>>                 Ilya
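For reference, a quick way to see how far a kernel client's map has fallen behind, as in the earlier hang, is to compare the epoch in debugfs with the cluster's current epoch. A rough sketch; the client id in the path is specific to the affected host and is taken from the output quoted above:

    # kernel client's view of the osdmap
    grep '^epoch' /sys/kernel/debug/ceph/f3b7f409-e061-4e39-b4d0-ae380e29ae7e.client55440310/osdmap
    # cluster's current epoch, printed like "osdmap e1284866: 1480 osds: 1480 up, 1480 in"
    ceph osd stat

In the hang above that was 1284256 on the client against e1284866 on the cluster, i.e. roughly 600 epochs behind.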