Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-22 Thread Jason Dillaman
FYI -- I opened a tracker ticket [1] against ceph-osd for this issue so that it doesn't get dropped. [1] http://tracker.ceph.com/issues/20041 On Mon, May 22, 2017 at 9:57 AM, Stefan Priebe - Profihost AG wrote: > Hello Jason, > > attached is the thread apply all bt output while a pg is deadlocke

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-22 Thread Stefan Priebe - Profihost AG
Hello Jason, attached is the thread apply all bt output while a pg is deadlocked in scrubbing. The core dump is 5.5GB / 500MB compressed. pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up up_primary acting acting_prim

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-22 Thread Jason Dillaman
If you have the debug symbols installed, I'd say "thread apply all bt" in addition to a "generate-core-file". The backtrace would only be useful if it showed a thread deadlocked on something. On Mon, May 22, 2017 at 9:29 AM, Stefan Priebe - Profihost AG wrote: > Hello Jason, > > should i do a cor
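
For anyone following along, a rough gdb session for this kind of capture might look as follows (the PID selection is illustrative; attach to the ceph-osd process that owns the stuck PG):

    gdb -p $(pidof ceph-osd)     # attach to the suspect OSD (pick the right PID if several run)
    (gdb) thread apply all bt    # backtrace every thread, looking for one blocked on a lock
    (gdb) generate-core-file     # optionally write a core for offline analysis
    (gdb) detach                 # release the process so the OSD keeps running
    (gdb) quit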

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-22 Thread Stefan Priebe - Profihost AG
Hello Jason, should I do a coredump or a thread apply all bt? I don't know which is better. Greets, Stefan Am 22.05.2017 um 15:19 schrieb Jason Dillaman: > If you cannot recreate with debug logging enabled, that might be the > next best option. > > On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe -

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-22 Thread Jason Dillaman
If you cannot recreate with debug logging enabled, that might be the next best option. On Mon, May 22, 2017 at 2:30 AM, Stefan Priebe - Profihost AG wrote: > Hello Jason, > > I had another 8 cases where scrub was running for hours. Sadly I > couldn't get it to hang again after an osd restart. Any

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-21 Thread Stefan Priebe - Profihost AG
Hello Jason, I had another 8 cases where scrub was running for hours. Sadly I couldn't get it to hang again after an OSD restart. Any further ideas? A coredump of the OSD with the hanging scrub? Greets, Stefan Am 18.05.2017 um 17:26 schrieb Jason Dillaman: > I'm unfortunately out of ideas at the mome

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-18 Thread Jason Dillaman
I'm unfortunately out of ideas at the moment. I think the best chance of figuring out what is wrong is to repeat it while logs are enabled. On Wed, May 17, 2017 at 4:51 PM, Stefan Priebe - Profihost AG wrote: > No, I can't reproduce it with active logs. Any further ideas? > > Greets, > Stefan > > A

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
No, I can't reproduce it with active logs. Any further ideas? Greets, Stefan Am 17.05.2017 um 21:26 schrieb Stefan Priebe - Profihost AG: > Am 17.05.2017 um 21:21 schrieb Jason Dillaman: >> Any chance you still have debug logs enabled on OSD 23 after you >> restarted it and the scrub froze again?

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Currently it does not issue a scrub again ;-( Stefan Am 17.05.2017 um 21:21 schrieb Jason Dillaman: > Any chance you still have debug logs enabled on OSD 23 after you > restarted it and the scrub froze again? > > On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG > wrote: >> Hello, >
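
If a scrub needs to be kicked off by hand, something along these lines should work (the pgid is the one mentioned elsewhere in this thread; substitute the stuck PG):

    ceph pg scrub 2.cebed0aa        # ask the primary OSD to queue a regular scrub
    ceph pg deep-scrub 2.cebed0aa   # or queue a deep scrub instead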

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Am 17.05.2017 um 21:21 schrieb Jason Dillaman: > Any chance you still have debug logs enabled on OSD 23 after you > restarted it and the scrub froze again? No, but I can do that ;-) Hopefully it freezes again. Stefan > > On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG > wrote: >>

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
Any chance you still have debug logs enabled on OSD 23 after you restarted it and the scrub froze again? On Wed, May 17, 2017 at 3:19 PM, Stefan Priebe - Profihost AG wrote: > Hello, > > now it shows again: >>> 4095 active+clean >>>1 active+clean+scrubbing > >

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Hello, now it shows again:
>> 4095 active+clean
>> 1 active+clean+scrubbing
and:
# ceph pg dump | grep -i scrub
dumped all in format plain
pg_stat objects mip degr misp unf bytes log disklog state state_stamp v reported up

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
Can you share your current OSD configuration? It's very curious that your scrub is getting randomly stuck on a few objects for hours at a time until an OSD is reset. On Wed, May 17, 2017 at 2:55 PM, Stefan Priebe - Profihost AG wrote: > Hello Jason, > > minutes ago i had another case where i rest
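
A quick way to capture the running configuration Jason asked about, assuming access to the OSD's admin socket (OSD id illustrative):

    ceph daemon osd.23 config show                # dump the complete live configuration
    ceph daemon osd.23 config show | grep scrub   # or just the scrub-related options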

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Hello Jason, minutes ago I had another case where I restarted the OSD which was shown in the objecter_requests output. It seems other scrubs and deep scrubs were hanging as well. Output before: 4095 active+clean 1 active+clean+scrubbing Output after restart:

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
Does your ceph status show pg 2.cebed0aa (still) scrubbing? Sure -- I can quickly scan the new log if you directly send it to me. On Wed, May 17, 2017 at 2:18 PM, Stefan Priebe - Profihost AG wrote: > can send the osd log - if you want? > > Stefan > > Am 17.05.2017 um 20:13 schrieb Stefan Priebe

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
I can send the osd log - if you want? Stefan Am 17.05.2017 um 20:13 schrieb Stefan Priebe - Profihost AG: > Hello Jason, > > the command > # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0aaa > > hangs as well. Doing absolutely nothing... waiting forever. > > Greets, > Stefan > > Am

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Hello Jason, the command # rados -p cephstor6 rm rbd_data.21aafa6b8b4567.0aaa hangs as well. Doing absolutely nothing... waiting forever. Greets, Stefan Am 17.05.2017 um 17:05 schrieb Jason Dillaman: > OSD 23 notes that object rbd_data.21aafa6b8b4567.0aaa is > waiting fo

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Ah no, wrong thread. Will test your suggestion. Stefan Excuse my typo sent from my mobile phone. > Am 17.05.2017 um 17:05 schrieb Jason Dillaman : > > OSD 23 notes that object rbd_data.21aafa6b8b4567.0aaa is > waiting for a scrub. What happens if you run "rados -p <pool> rm > rbd_data.21aa

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Can test in 2 hours, but it sounds like http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-December/014773.html Stefan Excuse my typo sent from my mobile phone. > Am 17.05.2017 um 17:05 schrieb Jason Dillaman : > > OSD 23 notes that object rbd_data.21aafa6b8b4567.0aaa is > wai

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
OSD 23 notes that object rbd_data.21aafa6b8b4567.0aaa is waiting for a scrub. What happens if you run "rados -p <pool> rm rbd_data.21aafa6b8b4567.0aaa" (capturing the OSD 23 logs during this)? If that succeeds while your VM remains blocked on that remove op, it looks like there is
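
Capturing the OSD log while re-issuing the delete could look roughly like this (OSD id and placeholder pool follow the thread; the object name is abbreviated as in the archive):

    ceph tell osd.23 injectargs '--debug-osd 20 --debug-ms 1'   # raise logging on the primary OSD
    rados -p <pool> rm rbd_data.21aafa6b8b4567.0aaa             # re-issue the hung delete
    ceph tell osd.23 injectargs '--debug-osd 0/5 --debug-ms 0'  # lower logging again afterwards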

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
On Wed, May 17, 2017 at 10:25 AM, Stefan Priebe - Profihost AG wrote: > issue the delete request and send you the log? Yes, please. -- Jason

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
On Wed, May 17, 2017 at 10:21 AM, Stefan Priebe - Profihost AG wrote: > You mean the request no matter if it is successful or not? Which log > level should be set to 20? I'm hoping you can re-create the hung remove op when OSD logging is increased -- "debug osd = 20" would be nice if you can tur

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
I can set the following debug levels for osd 46:
ceph --admin-daemon /var/run/ceph/ceph-osd.$1.asok config set debug_osd 20
ceph --admin-daemon /var/run/ceph/ceph-osd.$1.asok config set debug_filestore 20
ceph --admin-daemon /var/run/ceph/ceph-osd.$1.asok config set debug_ms 1
ceph --admin-daemon

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Am 17.05.2017 um 16:07 schrieb Jason Dillaman: > The internals of the OSD are not exactly my area of expertise. Since > you have the op tracker disabled and I'm assuming your cluster health > is OK, perhaps you could run "gcore <pid>" to preserve its > current state at a bare minimum. Then, assuming your

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
The internals of the OSD are not exactly my area of expertise. Since you have the op tracker disabled and I'm assuming your cluster health is OK, perhaps you could run "gcore <pid>" to preserve its current state at a bare minimum. Then, assuming you can restart vm191 and re-issue the fstrim that blocks,
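
A sketch of that capture, assuming debug symbols are installed and there is enough disk space for the dump (PID selection illustrative):

    gcore -o /var/tmp/osd.46.core $(pidof ceph-osd)   # snapshot the process without stopping it
    gzip /var/tmp/osd.46.core.*                       # cores compress well before uploading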

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Stefan Priebe - Profihost AG
Hello Jason, any idea how to debug this further? dmesg does not show any disk failures. Smart values are also OK. There's also no xfs BUG or WARNING from the kernel side. I'm sure that it will work after restarting osd.46 - but I'm losing the ability to reproduce this in that case. Should I ins

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-17 Thread Jason Dillaman
Perfect librbd log capture. I can see that a remove request to object rbd_data.e10ca56b8b4567.311c was issued but it never completed. This results in a hung discard and flush request. Assuming that object is still mapped to OSD 46, I think there is either something happening with that

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Something I could do/test to find the bug? Stefan Excuse my typo sent from my mobile phone. > Am 16.05.2017 um 22:54 schrieb Jason Dillaman : > > It looks like it's just a ping message in that capture. > > Are you saying that you restarted OSD 46 and the problem persisted? > > On Tue, May 16,

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
No, I did not. I don't want to lose the ability to reproduce it. Stefan Excuse my typo sent from my mobile phone. > Am 16.05.2017 um 22:54 schrieb Jason Dillaman : > > It looks like it's just a ping message in that capture. > > Are you saying that you restarted OSD 46 and the problem persisted

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
It looks like it's just a ping message in that capture. Are you saying that you restarted OSD 46 and the problem persisted? On Tue, May 16, 2017 at 4:02 PM, Stefan Priebe - Profihost AG wrote: > Hello, > > while reproducing the problem, objecter_requests looks like this: > > { > "ops": [ >

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Hello, while reproducing the problem, objecter_requests looks like this:
{
    "ops": [
        {
            "tid": 42029,
            "pg": "5.bd9616ad",
            "osd": 46,
            "object_id": "rbd_data.e10ca56b8b4567.311c",
            "object_locator": "@5",
            "

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Am 16.05.2017 um 21:45 schrieb Jason Dillaman: > On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG > wrote: >> We've enabled the op tracker for performance reasons while using SSD >> only storage ;-( > > Disabled you mean? Sorry yes. >> Can enable the op tracker using ceph osd tell?

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
On Tue, May 16, 2017 at 3:37 PM, Stefan Priebe - Profihost AG wrote: > We've enabled the op tracker for performance reasons while using SSD > only storage ;-( Disabled you mean? > Can enable the op tracker using ceph osd tell? Then reproduce the > problem. Check what got stuck again? Or should
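
Re-enabling the op tracker on a live OSD should be possible without a restart; assuming the usual option name osd_enable_op_tracker, roughly:

    ceph tell osd.46 injectargs '--osd-enable-op-tracker=true'   # turn the tracker back on
    ceph daemon osd.46 dump_ops_in_flight                        # then inspect in-flight ops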

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Hello Jason, Am 16.05.2017 um 21:32 schrieb Jason Dillaman: > Thanks for the update. In the ops dump provided, the objecter is > saying that OSD 46 hasn't responded to the deletion request of object > rbd_data.e10ca56b8b4567.311c. > > Perhaps run "ceph daemon osd.46 dump_ops_in_flight

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
Thanks for the update. In the ops dump provided, the objecter is saying that OSD 46 hasn't responded to the deletion request of object rbd_data.e10ca56b8b4567.311c. Perhaps run "ceph daemon osd.46 dump_ops_in_flight" or "... dump_historic_ops" to see if that op is in the list? You can

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Stefan Priebe - Profihost AG
Hello Jason, I'm happy to tell you that I currently have one VM where I can reproduce the problem. > The best option would be to run "gcore" against the running VM whose > IO is stuck, compress the dump, and use the "ceph-post-file" to > provide the dump. I could then look at all the Ceph data stru

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-16 Thread Jason Dillaman
On Tue, May 16, 2017 at 2:12 AM, Stefan Priebe - Profihost AG wrote: > 3.) it still happens on pre jewel images even when they got restarted / > killed and reinitialized. In that case they've the asok socket available > for now. Should i issue any command to the socket to get log out of the > hang

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Stefan Priebe - Profihost AG
> 3.) it still happens on pre-jewel images even when they got restarted > / killed and reinitialized. In that case they have the asok socket > available > for now. Should I issue any command to the socket to get a log out of > the hanging VM? Qemu is still responding, just ceph / disk I/O gets > stall
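
The commands used elsewhere in this thread against a VM's client admin socket would apply here too (the socket path is illustrative):

    ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok objecter_requests  # ops stuck on OSDs
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.12345.asok perf dump          # client-side counters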

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Stefan Priebe - Profihost AG
Hello Jason, I got some further hints. Please see below. Am 15.05.2017 um 22:25 schrieb Jason Dillaman: > On Mon, May 15, 2017 at 3:54 PM, Stefan Priebe - Profihost AG > wrote: >> Would it be possible that the problem is the same you fixed? > > No, I would not expect it to be related to the ot

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Jason Dillaman
On Mon, May 15, 2017 at 3:54 PM, Stefan Priebe - Profihost AG wrote: > Would it be possible that the problem is the same you fixed? No, I would not expect it to be related to the other issues you are seeing. The issue I just posted a fix against only occurs when a client requests the lock from th

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Stefan Priebe - Profihost AG
Hi, great, thanks. I'm still trying, but it's difficult for me as well. As it happens only sometimes, there must be an unknown additional factor. For the future I've enabled client sockets for all VMs as well. But this does not help in this case - as it seems to be fixed after migration. Would it be
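
Enabling client sockets for all VMs is typically done in ceph.conf on the hypervisors; a sketch using the conventional paths (not taken from this thread):

    [client]
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
        log file = /var/log/ceph/$cluster-$type.$id.$pid.log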

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Jason Dillaman
I was able to re-create the issue where "rbd feature disable" hangs if the client experienced a long comms failure with the OSDs, and I have a proposed fix posted [1]. Unfortunately, I haven't been successful in repeating any stalled IO, discard issues, nor export-diff logged errors. I'll keep tryi

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Stefan Priebe - Profihost AG
Hello Jason, > Just so I can attempt to repeat this: Thanks. > (1) you had an image that was built using Hammer clients and OSDs with > exclusive lock disabled Yes. It was created with the hammer rbd defaults. > (2) you updated your clients and OSDs to Jewel > (3) you restarted your OSDs and liv

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-15 Thread Jason Dillaman
Just so I can attempt to repeat this: (1) you had an image that was built using Hammer clients and OSDs with exclusive lock disabled (2) you updated your clients and OSDs to Jewel (3) you restarted your OSDs and live-migrated your VMs to pick up the Jewel changes (4) you enabled exclusive-lock, ob

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-14 Thread Stefan Priebe - Profihost AG
I verified it. After a live migration of the VM I'm able to successfully disable fast-diff,exclusive-lock,object-map. The problem only seems to occur if a client connected under hammer without exclusive lock, then got upgraded to jewel and had exclusive lock enabled. Greets, Stefan Am 1
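
For reference, the features have to come off in dependency order (fast-diff depends on object-map, which depends on exclusive-lock), so listing all three in one command is the simplest form; the image spec here is one mentioned elsewhere in this thread:

    rbd feature disable cephstor2/vm-136-disk-1 fast-diff object-map exclusive-lock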

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-14 Thread Stefan Priebe - Profihost AG
Hello Jason, Am 14.05.2017 um 14:04 schrieb Jason Dillaman: > It appears as though there is client.27994090 at 10.255.0.13 that > currently owns the exclusive lock on that image. I am assuming the log > is from "rbd feature disable"? Yes. > If so, I can see that it attempts to > acquire the lock

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-14 Thread Jason Dillaman
It appears as though there is client.27994090 at 10.255.0.13 that currently owns the exclusive lock on that image. I am assuming the log is from "rbd feature disable"? If so, I can see that it attempts to acquire the lock and the other side is not appropriately responding to the request. Assuming

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-13 Thread Stefan Priebe - Profihost AG
Hello Jason, as it still happens and VMs are crashing, I wanted to disable exclusive-lock,fast-diff again. But I detected that there are images where the rbd command runs in an endless loop. I canceled the command after 60s and used --debug-rbd=20. Will send the log off-list. Thanks! Greets, S

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-13 Thread Stefan Priebe - Profihost AG
Hello Jason, it seems to be related to fstrim and discard. I cannot reproduce it for images where we don't use trim - but it's still the case that it works fine for images created with jewel and not for images from pre-jewel. The only difference I can find is that the images created with jewel also
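
To reproduce from inside an affected guest, the trigger would simply be an fstrim run, e.g.:

    fstrim -v /    # issue discards for the root filesystem, verbosely
    fstrim -av     # or for all mounted filesystems that support discard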

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-11 Thread Jason Dillaman
Assuming the only log messages you are seeing are the following:
2017-05-06 03:20:50.830626 7f7876a64700 -1 librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating object map in-memory
2017-05-06 03:20:50.830634 7f7876a64700 -1 librbd::object_map::InvalidateRequest: 0x7f7860004410 inval

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-11 Thread Stefan Priebe - Profihost AG
Hi Jason, it seems I can at least circumvent the crashes. Since I restarted ALL OSDs after enabling exclusive lock and rebuilding the object maps, there have been no new crashes. What still makes me wonder are those librbd::object_map::InvalidateRequest: 0x7f7860004410 should_complete: r=0 messages. Gree
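
The rebuild step mentioned here would have been along these lines (image spec is one from elsewhere in the thread):

    rbd object-map rebuild cephstor2/vm-136-disk-1   # re-derive the object map from the data objects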

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-08 Thread Stefan Priebe - Profihost AG
Hi Peter, Am 08.05.2017 um 15:23 schrieb Peter Maloney: > On 05/08/17 14:50, Stefan Priebe - Profihost AG wrote: >> Hi, >> Am 08.05.2017 um 14:40 schrieb Jason Dillaman: >>> You are saying that you had v2 RBD images created against Hammer OSDs >>> and client libraries where exclusive lock, objec

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-08 Thread Peter Maloney
On 05/08/17 14:50, Stefan Priebe - Profihost AG wrote: > Hi, > Am 08.05.2017 um 14:40 schrieb Jason Dillaman: >> You are saying that you had v2 RBD images created against Hammer OSDs >> and client libraries where exclusive lock, object map, etc were never >> enabled. You then upgraded the OSDs and

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-08 Thread Stefan Priebe - Profihost AG
Hi, Am 08.05.2017 um 14:40 schrieb Jason Dillaman: > You are saying that you had v2 RBD images created against Hammer OSDs > and client libraries where exclusive lock, object map, etc were never > enabled. You then upgraded the OSDs and clients to Jewel and at some > point enabled exclusive lock (a

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-08 Thread Jason Dillaman
You are saying that you had v2 RBD images created against Hammer OSDs and client libraries where exclusive lock, object map, etc were never enabled. You then upgraded the OSDs and clients to Jewel and at some point enabled exclusive lock (and I'd assume object map) on these images -- or were the ex

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-06 Thread Stefan Priebe - Profihost AG
Hi, also I'm getting these errors only for pre-jewel images:
2017-05-06 03:20:50.830626 7f7876a64700 -1 librbd::object_map::InvalidateRequest: 0x7f7860004410 invalidating object map in-memory
2017-05-06 03:20:50.830634 7f7876a64700 -1 librbd::object_map::InvalidateRequest: 0x7f7860004410 invalida

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-05 Thread Stefan Priebe - Profihost AG
Hello Jason, while doing further testing it happens only with images that were created with hammer, got upgraded to jewel, AND got exclusive lock enabled. Greets, Stefan Am 04.05.2017 um 14:20 schrieb Jason Dillaman: > Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the > command and pos

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Brian Andrus
Hi Stefan - we simply disabled exclusive-lock on all older (pre-jewel) images. We still allow the default jewel featuresets for newly created images because as you mention - the issue does not seem to affect them. On Thu, May 4, 2017 at 10:19 AM, Stefan Priebe - Profihost AG < s.pri...@profihost.a

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
And yes, I also see hung tasks in those VMs until they crash. Stefan Am 04.05.2017 um 19:11 schrieb Brian Andrus: > Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..." > > We have not had this issue since reverting exclusive-lock, but it was > suggested this was not the iss

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
Hello Brian, this really sounds the same. I don't see this on a cluster with only images created AFTER jewel. And it seems to have started happening after I enabled exclusive lock on all images. Did you just use feature disable exclusive-lock,fast-diff,object-map, or did you also restart all those VMs? Gre

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Brian Andrus
Sounds familiar... and discussed in "disk timeouts in libvirt/qemu VMs..." We have not had this issue since reverting exclusive-lock, but it was suggested this was not the issue. So far it's held up for us with not a single corrupt filesystem since then. On some images (ones created post-Jewel up

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
Hi Jason, > Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the > command and post the resulting log to a new ticket at [1]? Will do so next time. I was able to solve this by restarting all OSDs. After that I was able to successfully delete the image. > I'd also be interested if you c

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Jason Dillaman
Odd. Can you re-run "rbd rm" with "--debug-rbd=20" added to the command and post the resulting log to a new ticket at [1]? I'd also be interested if you could re-create that "librbd::object_map::InvalidateRequest" issue repeatably. [1] http://tracker.ceph.com/projects/rbd/issues On Thu, May 4, 20
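
A way to capture that log without a configured client log file (image spec from this thread):

    rbd rm cephstor2/vm-136-disk-1 --debug-rbd=20 --log-to-stderr=true 2> /tmp/rbd-rm.log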

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
There are no watchers involved: # rbd status cephstor2/vm-136-disk-1 Watchers: none Greets, Stefan Am 04.05.2017 um 09:45 schrieb Stefan Priebe - Profihost AG: > Example: > # rbd rm cephstor2/vm-136-disk-1 > Removing image: 99% complete... > > Stuck at 99% and never completes. This is an image w

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-04 Thread Stefan Priebe - Profihost AG
Example: # rbd rm cephstor2/vm-136-disk-1 Removing image: 99% complete... Stuck at 99% and never completes. This is an image which got corrupted for an unknown reason. Greets, Stefan Am 04.05.2017 um 08:32 schrieb Stefan Priebe - Profihost AG: > I'm not sure whether this is related but our backu

Re: [ceph-users] corrupted rbd filesystems since jewel

2017-05-03 Thread Stefan Priebe - Profihost AG
I'm not sure whether this is related, but our backup system uses rbd snapshots and sometimes reports messages like these:
2017-05-04 02:42:47.661263 7f3316ffd700 -1 librbd::object_map::InvalidateRequest: 0x7f3310002570 should_complete: r=0
Stefan Am 04.05.2017 um 07:49 schrieb Stefan Priebe - Pro

[ceph-users] corrupted rbd filesystems since jewel

2017-05-03 Thread Stefan Priebe - Profihost AG
Hello, since we upgraded from hammer to jewel 10.2.7 and enabled exclusive-lock,object-map,fast-diff we have had problems with corrupted VM filesystems. Sometimes the VMs just crash with FS errors and a restart solves the problem. Sometimes the whole VM is not even bootable and we need to