Hello Jason,
I'm happy to tell you that I currently have one VM where I can
reproduce the problem.
> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures to
> hopefully find the issue.
I've saved the dump, but it will contain sensitive information, so I
won't upload it to a public server. I'll send you a private email
pointing to a private server where you can download the core dump.
Thanks!
> Enabling debug logs after the IO has stuck will most likely be of
> little value since it won't include the details of which IOs are
> outstanding. You could attempt to use "ceph --admin-daemon
> /path/to/stuck/vm/asok objecter_requests" to see if any IOs are just
> stuck waiting on an OSD to respond.
This is the output:
# ceph --admin-daemon /var/run/ceph/ceph-client.admin.5295.140214539927552.asok objecter_requests
{
    "ops": [
        {
            "tid": 384632,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "object_locator": "@5",
            "target_object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "last_sent": "2.28554e+06s",
            "attempts": 1,
            "snapid": "head",
            "snap_context": "a07c2=[]",
            "mtime": "2017-05-16 21:03:22.0.196102s",
            "osd_ops": [
                "delete"
            ]
        }
    ],
    "linger_ops": [
        {
            "linger_id": 1,
            "pg": "5.5f3bd635",
            "osd": 17,
            "object_id": "rbd_header.e10ca56b8b4567",
            "object_locator": "@5",
            "target_object_id": "rbd_header.e10ca56b8b4567",
            "target_object_locator": "@5",
            "paused": 0,
            "used_replica": 0,
            "precalc_pgid": 0,
            "snapid": "head",
            "registered": "1"
        }
    ],
    "pool_ops": [],
    "pool_stat_ops": [],
    "statfs_ops": [],
    "command_ops": []
}
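For dumps with many entries, a small script can make the same check easier
to read. The following is only a minimal sketch, not part of the Ceph
tooling: the function name summarize_objecter_requests and the SAMPLE
string are hypothetical, and it assumes nothing beyond the keys visible in
the output above ("ops", "osd", "tid", "object_id", "osd_ops", "attempts").
It groups the outstanding ops by the OSD they are waiting on.

```python
import json
from collections import defaultdict

# Trimmed copy of the output above, used only for illustration.
SAMPLE = """
{
    "ops": [
        {
            "tid": 384632,
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.000000000000311c",
            "attempts": 1,
            "osd_ops": ["delete"]
        }
    ],
    "linger_ops": []
}
"""


def summarize_objecter_requests(raw_json):
    """Group outstanding ops from an objecter_requests dump by target OSD.

    Hypothetical helper: it only reads fields that appear in the dump
    above and ignores everything else (including "linger_ops").
    """
    data = json.loads(raw_json)
    by_osd = defaultdict(list)
    for op in data.get("ops", []):
        by_osd[op["osd"]].append({
            "tid": op["tid"],
            "object_id": op["object_id"],
            "osd_ops": op.get("osd_ops", []),
            "attempts": op.get("attempts"),
        })
    return dict(by_osd)


if __name__ == "__main__":
    for osd, ops in summarize_objecter_requests(SAMPLE).items():
        for op in ops:
            print("osd.%d tid=%d %s %s" % (
                osd, op["tid"], op["object_id"], ",".join(op["osd_ops"])))
```

Applied to the output above, this yields a single entry: one "delete" op
on osd 46 for the rbd_data object, with one send attempt.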
Greets,
Stefan
On 16.05.2017 at 15:44, Jason Dillaman wrote:
> On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG
> <s.pri...@profihost.ag> wrote:
>> 3.) It still happens on pre-Jewel images even when they got restarted /
>> killed and reinitialized. In that case they have the asok socket available
>> for now. Should I issue any command to the socket to get a log out of the
>> hanging VM? Qemu is still responding; only Ceph / disk I/O gets stalled.