FYI -- I opened a tracker ticket [1] against ceph-osd for this issue
so that it doesn't get dropped.
[1] http://tracker.ceph.com/issues/20041
On Mon, May 22, 2017 at 9:57 AM, Stefan Priebe - Profihost AG wrote:
Hello Jason,
attached is the "thread apply all bt" output while a pg is deadlocked in
scrubbing.
The core dump is 5.5GB / 500MB compressed.
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_primary
If you have the debug symbols installed, I'd say "thread apply all bt"
in addition to a "generate-core-file". The backtrace would only be
useful if it showed a thread deadlocked on something.
On Mon, May 22, 2017 at 9:29 AM, Stefan Priebe - Profihost AG wrote:
Hello Jason,
Should I do a core dump or a "thread apply all bt"?
I don't know which is better.
Greets,
Stefan
On 22.05.2017 at 15:19, Jason Dillaman wrote:
If you cannot recreate with debug logging enabled, that might be the
next best option.
On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe - Profihost AG wrote:
Hello Jason,
I had another 8 cases where a scrub was running for hours. Sadly I
couldn't get it to hang again after an OSD restart. Any further ideas?
A core dump of the OSD with the hanging scrub?
Greets,
Stefan
On 18.05.2017 at 17:26, Jason Dillaman wrote:
I'm unfortunately out of ideas at the moment. I think the best chance
of figuring out what is wrong is to repeat it while logs are enabled.
On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG wrote:
No, I can't reproduce it with active logs. Any further ideas?
Greets,
Stefan
On 17.05.2017 at 21:26, Stefan Priebe - Profihost AG wrote:
Currently it does not issue a scrub again ;-(
Stefan
On 17.05.2017 at 21:21, Jason Dillaman wrote:
> Any chance you still have debug logs enabled on OSD 23 after you
> restarted it and the scrub froze again?
No, but I can do that ;-) Hopefully it freezes again.
Stefan
Any chance you still have debug logs enabled on OSD 23 after you
restarted it and the scrub froze again?
On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG wrote:
Hello,
now it shows again:
4095 active+clean
1 active+clean+scrubbing
and:
# ceph pg dump | grep -i scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up
Can you share your current OSD configuration? It's very curious that
your scrub is getting randomly stuck on a few objects for hours at a
time until an OSD is reset.
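For reference, a rough sketch of collecting that from the admin socket (run
on the host that carries the OSD; osd.23 is just an example id):

ceph daemon osd.23 config show > /tmp/osd.23-config.txt        # full running config
ceph daemon osd.23 config diff > /tmp/osd.23-config-diff.txt   # only non-default settings
grep -E 'scrub|op_tracker' /tmp/osd.23-config.txt              # the knobs most relevant here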
On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG wrote:
Hello Jason,
Minutes ago I had another case where I restarted the OSD that was shown
in the objecter_requests output.
It seems other scrubs and deep scrubs were hanging as well.
Output before:
4095 active+clean
1 active+clean+scrubbing
Output after restart:
Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I
can quickly scan the new log if you directly send it to me.
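For reference, a quick sketch of checking that (pg 2.cebed0aa as mentioned
above; output details vary slightly between releases):

ceph status | grep -i scrub
ceph pg dump pgs 2>/dev/null | grep -i scrubbing     # list PGs currently scrubbing
ceph pg 2.cebed0aa query > /tmp/pg-2.cebed0aa.json   # detailed per-PG state incl. scrub stamps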
On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG wrote:
I can send the OSD log - if you want?
Stefan
On 17.05.2017 at 20:13, Stefan Priebe - Profihost AG wrote:
Hello Jason,
the command
# rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0aaa
hangs as well. Doing absolutely nothing... waiting forever.
Greets,
Stefan
On 17.05.2017 at 17:05, Jason Dillaman wrote:
Ah no, wrong thread. Will test your suggestion.
Stefan
Excuse my typo sent from my mobile phone.
Can test in 2 hours, but it sounds like
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014773.html
Stefan
Excuse my typo sent from my mobile phone.
On 17.05.2017 at 17:05, Jason Dillaman wrote:
OSD 23 notes that object rbd_data.21aafa6b8b4567.0aaa is
waiting for a scrub. What happens if you run "rados -p <pool> rm
rbd_data.21aafa6b8b4567.0aaa" (capturing the OSD 23 logs
during this)? If that succeeds while your VM remains blocked on that
remove op, it looks like there is
On Wed, May 17, 2017 at 10:25 AM, Stefan Priebe - Profihost AG
wrote:
> issue the delete request and send you the log?
Yes, please.
--
Jason
On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG
wrote:
> You mean the request no matter if it is successful or not? Which log
> level should be set to 20?
I'm hoping you can re-create the hung remove op when OSD logging is
increased -- "debug osd = 20" would be nice if you can tur
I can set the following debug levels for osd 46:
ceph --admin-daemon /var/run/ceph/ceph-osd.$1.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph/ceph-osd.$1.asok config set debug_filestore 20
ceph --admin-daemon /var/run/ceph/ceph-osd.$1.asok config set debug_ms 1
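The same can also be done remotely with "ceph tell" instead of the per-host
admin socket -- a sketch using the values above; remember to turn the levels
back down afterwards, otherwise the OSD log grows very quickly (the restore
values shown are the usual defaults, check yours with "config show" first):

ceph tell osd.46 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
# ... reproduce the hang, collect the log, then restore:
ceph tell osd.46 injectargs '--debug-osd 0/5 --debug-filestore 1/3 --debug-ms 0/5'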
On 17.05.2017 at 16:07, Jason Dillaman wrote:
The internals of the OSD are not exactly my area of expertise. Since
you have the op tracker disabled and I'm assuming your cluster health
is OK, perhaps you could run "gcore <pid>" to preserve its
current state at a bare minimum. Then, assuming you can restart vm191
and re-issue the fstrim that blocks,
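A rough sketch of that, assuming gdb/gcore is available on the OSD host (the
pid and paths are placeholders):

gcore -o /var/tmp/ceph-osd.46 <pid-of-osd.46>   # briefly stops the process while dumping
xz /var/tmp/ceph-osd.46.<pid>                   # gcore appends the pid to the file name
ceph-post-file /var/tmp/ceph-osd.46.<pid>.xz    # upload for the developers to inspect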
Hello Jason,
Any idea how to debug this further? dmesg does not show any disk
failures. SMART values are also OK. There's also no xfs BUG or WARNING
from the kernel side.
I'm sure that it will work after restarting osd.46 - but I'm losing the
ability to reproduce this in that case.
Should I ins
Perfect librbd log capture. I can see that a remove request to object
rbd_data.e10ca56b8b4567.311c was issued but it never
completed. This results in a hung discard and flush request.
Assuming that object is still mapped to OSD 46, I think there is
either something happening with that
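For reference, a small sketch of double-checking which PG and OSDs an object
maps to (the pool name is a placeholder; the object id is as quoted above):

ceph osd map <pool> rbd_data.e10ca56b8b4567.311c   # prints the pg id and the up/acting OSD set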
Is there something I could do/test to find the bug?
Stefan
Excuse my typo sent from my mobile phone.
No, I did not. I don't want to risk not being able to reproduce it any longer.
Stefan
Excuse my typo sent from my mobile phone.
On 16.05.2017 at 22:54, Jason Dillaman wrote:
It looks like it's just a ping message in that capture.
Are you saying that you restarted OSD 46 and the problem persisted?
On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG wrote:
Hello,
while reproducing the problem, objecter_requests looks like this:
{
    "ops": [
        {
            "tid": 42029,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.311c",
            "object_locator": "@5",
            "
On 16.05.2017 at 21:45, Jason Dillaman wrote:
> On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
> wrote:
>> We've enabled the op tracker for performance reasons while using SSD
>> only storage ;-(
>
> Disabled you mean?
Sorry, yes.
>> Can I enable the op tracker using ceph osd tell?
On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG
wrote:
> We've enabled the op tracker for performance reasons while using SSD
> only storage ;-(
Disabled you mean?
> Can I enable the op tracker using ceph osd tell? Then reproduce the
> problem and check what got stuck again? Or should
Hello Jason,
On 16.05.2017 at 21:32, Jason Dillaman wrote:
Thanks for the update. In the ops dump provided, the objecter is
saying that OSD 46 hasn't responded to the deletion request of object
rbd_data.e10ca56b8b4567.311c.
Perhaps run "ceph daemon osd.46 dump_ops_in_flight" or "...
dump_historic_ops" to see if that op is in the list? You can
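A short sketch of those calls (run on the host carrying osd.46); note that
they only return useful data while the op tracker is enabled:

ceph daemon osd.46 dump_ops_in_flight   # ops currently being processed
ceph daemon osd.46 dump_historic_ops    # recently completed slow ops kept in memory
# with osd_enable_op_tracker = false both come back essentially empty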
Hello Jason,
I'm happy to tell you that I currently have one VM where I can reproduce
the problem.
> The best option would be to run "gcore" against the running VM whose
> IO is stuck, compress the dump, and use the "ceph-post-file" to
> provide the dump. I could then look at all the Ceph data structures
On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG wrote:
> 3.) It still happens on pre-jewel images even when they got restarted /
> killed and reinitialized. In that case they have the asok socket
> available for now. Should I issue any command to the socket to get a log
> out of the hanging VM? Qemu is still responding, just ceph / disk I/O
> gets stalled.
Hello Jason,
I got some further hints. Please see below.
On 15.05.2017 at 22:25, Jason Dillaman wrote:
On Mon, May 15, 2017 at 3:54 PM, Stefan Priebe - Profihost AG
wrote:
> Would it be possible that the problem is the same you fixed?
No, I would not expect it to be related to the other issues you are
seeing. The issue I just posted a fix against only occurs when a
client requests the lock from th
Hi,
Great, thanks.
I'm still trying, but it's difficult for me as well. As it happens only
sometimes, there must be an unknown additional factor. For the future
I've enabled client sockets for all VMs as well. But this does not help
in this case - as it seems to be fixed after migration.
Would it be
I was able to re-create the issue where "rbd feature disable" hangs if
the client experienced a long comms failure with the OSDs, and I have
a proposed fix posted [1]. Unfortunately, I haven't been successful in
repeating any stalled IO, discard issues, or export-diff logged
errors. I'll keep trying.
Hello Jason,
> Just so I can attempt to repeat this:
Thanks.
> (1) you had an image that was built using Hammer clients and OSDs with
> exclusive lock disabled
Yes. It was created with the hammer rbd defaults.
> (2) you updated your clients and OSDs to Jewel
> (3) you restarted your OSDs and liv
Just so I can attempt to repeat this:
(1) you had an image that was built using Hammer clients and OSDs with
exclusive lock disabled
(2) you updated your clients and OSDs to Jewel
(3) you restarted your OSDs and live-migrated your VMs to pick up the
Jewel changes
(4) you enabled exclusive-lock, ob
I verified it. After a live migration of the VM I'm able to successfully
disable fast-diff, exclusive-lock, and object-map.
The problem only seems to occur at all if a client connected under
hammer without exclusive-lock, then got upgraded to jewel and
exclusive-lock got enabled.
Greets,
Stefan
Hello Jason,
On 14.05.2017 at 14:04, Jason Dillaman wrote:
> It appears as though there is client.27994090 at 10.255.0.13 that
> currently owns the exclusive lock on that image. I am assuming the log
> is from "rbd feature disable"?
Yes.
> If so, I can see that it attempts to
> acquire the lock
It appears as though there is client.27994090 at 10.255.0.13 that
currently owns the exclusive lock on that image. I am assuming the log
is from "rbd feature disable"? If so, I can see that it attempts to
acquire the lock and the other side is not appropriately responding to
the request.
Assuming
Hello Jason,
As it still happens and VMs are crashing, I wanted to disable
exclusive-lock and fast-diff again. But I detected that there are images
where the rbd command runs in an endless loop.
I canceled the command after 60s and used --debug-rbd=20. Will send the
log off-list.
Thanks!
Greets,
Stefan
Hello Jason,
It seems to be related to fstrim and discard. I cannot reproduce it for
images where we don't use trim - but it's still the case that it works
fine for images created with jewel and not for images created pre-jewel.
The only difference I can find is that the images created with jewel
also
Assuming the only log messages you are seeing are the following:
2017-05-06 03:20:50.830626 7f7876a64700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
object map in-memory
2017-05-06 03:20:50.830634 7f7876a64700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 inval
Hi Jason,
It seems I can at least circumvent the crashes. Since I restarted ALL
OSDs after enabling exclusive-lock and rebuilding the object maps, there
have been no new crashes.
What still makes me wonder are those
librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0
messages.
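For reference, a minimal sketch of that enable-plus-rebuild sequence on a
single image (pool/image names are placeholders):

rbd feature enable <pool>/<image> exclusive-lock
rbd feature enable <pool>/<image> object-map fast-diff   # both depend on exclusive-lock
rbd object-map rebuild <pool>/<image>                    # rebuilds the map and clears the invalid flag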
Hi Peter,
On 08.05.2017 at 15:23, Peter Maloney wrote:
On 05/08/17 14:50, Stefan Priebe - Profihost AG wrote:
Hi,
On 08.05.2017 at 14:40, Jason Dillaman wrote:
You are saying that you had v2 RBD images created against Hammer OSDs
and client libraries where exclusive lock, object map, etc were never
enabled. You then upgraded the OSDs and clients to Jewel and at some
point enabled exclusive lock (and I'd assume object map) on these
images -- or were the ex
Hi,
Also, I'm getting these errors only for pre-jewel images:
2017-05-06 03:20:50.830626 7f7876a64700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating
object map in-memory
2017-05-06 03:20:50.830634 7f7876a64700 -1
librbd::object_map::InvalidateRequest: 0x7f7860004410 invalida
Hello Jason,
while doing further testing: it happens only with images that were
created with hammer, got upgraded to jewel, AND got exclusive-lock
enabled.
Greets,
Stefan
On 04.05.2017 at 14:20, Jason Dillaman wrote:
Hi Stefan - we simply disabled exclusive-lock on all older (pre-jewel)
images. We still allow the default jewel featuresets for newly created
images because as you mention - the issue does not seem to affect them.
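For reference, a sketch of the per-image disable (pool/image are placeholders;
the dependent features have to be disabled before exclusive-lock itself):

rbd feature disable <pool>/<image> fast-diff
rbd feature disable <pool>/<image> object-map
rbd feature disable <pool>/<image> exclusive-lock
rbd info <pool>/<image>   # verify the remaining feature set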
On Thu, May 4, 2017 at 10:19 AM, Stefan Priebe - Profihost AG wrote:
And yes, I also see hung tasks in those VMs until they crash.
Stefan
On 04.05.2017 at 19:11, Brian Andrus wrote:
Hello Brian,
this really sounds the same. I don't see this on a cluster with only
images created AFTER jewel. And it seems to have started happening after
I enabled exclusive-lock on all images.
Did you just use "feature disable exclusive-lock,fast-diff,object-map",
or did you also restart all those VMs?
Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..."
We have not had this issue since reverting exclusive-lock, but it was
suggested this was not the issue. So far it's held up for us with not a
single corrupt filesystem since then.
On some images (ones created post-Jewel up
Hi Jason,
> Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
> command and post the resulting log to a new ticket at [1]?
Will do so next time. I was able to solve this by restarting all OSDs.
After that I was able to successfully delete the image.
> I'd also be interested if you c
Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the
command and post the resulting log to a new ticket at [1]? I'd also be
interested if you could re-create that
"librbd::object_map::InvalidateRequest" issue repeatably.
[1] http://tracker.ceph.com/projects/rbd/issues
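A rough sketch of such a run, using the image named earlier in the thread (the
log path is arbitrary, and --debug-ms=1 is optional but usually helpful):

rbd rm cephstor2/vm-136-disk-1 --debug-rbd=20 --debug-ms=1 \
    --log-file=/tmp/rbd-rm-debug.log
# then attach /tmp/rbd-rm-debug.log to the tracker ticket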
There are no watchers involved:
# rbd status cephstor2/vm-136-disk-1
Watchers: none
Greets,
Stefan
On 04.05.2017 at 09:45, Stefan Priebe - Profihost AG wrote:
Example:
# rbd rm cephstor2/vm-136-disk-1
Removing image: 99% complete...
Stuck at 99% and never completes. This is an image which got corrupted
for an unknown reason.
Greets,
Stefan
On 04.05.2017 at 08:32, Stefan Priebe - Profihost AG wrote:
I'm not sure whether this is related, but our backup system uses rbd
snapshots and sometimes reports messages like these:
2017-05-04 02:42:47.661263 7f3316ffd700 -1
librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
Stefan
On 04.05.2017 at 07:49, Stefan Priebe - Profihost AG wrote:
Hello,
since we've upgraded from hammer to jewel 10.2.7 and enabled
exclusive-lock, object-map, and fast-diff, we've had problems with
corrupted VM filesystems.
Sometimes the VMs are just crashing with FS errors and a restart can
solve the problem. Sometimes the whole VM is not even bootable and we
need to