Hi Ashley
The command to reset the flag for ALL OSDs is
ceph config set osd bluefs_preextend_wal_files false
And for just an individual OSD:
ceph config set osd.5 bluefs_preextend_wal_files false
And to remove it from an individual one (so you just have the global one
left):
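A hedged sketch of that removal, assuming the standard ceph config subcommands and reusing osd.5 from the example above:
ceph config rm osd.5 bluefs_preextend_wal_files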
Hello,
Great news. Can you confirm the exact command you used to inject the value
so I can replicate your exact steps? I will do that and then leave it a good
couple of days before trying a reboot to make sure the WAL is completely
flushed.
Thanks
Ashley
On Sat, 23 May 2020 23:20:45 +0800
chri
Status update:
We seem to have success. I followed the steps below. Only one more OSD
(on node3) failed to restart, showing the same WAL corruption messages.
After replacing that & backfilling I could then restart it. So we have a
healthy cluster with restartable OSDs again, with
bluefs_preexte
Hi Ashley
Setting bluefs_preextend_wal_files to false should stop any further
corruption of the WAL (subject to the small risk of doing this while the
OSD is active). Over time WAL blocks will be recycled and overwritten
with new good blocks, so the extent of the corruption may decrease or
ev
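For the OSDs that are already running, one plausible way to apply the change without a restart is the usual runtime injection path (a sketch; the persistent value still comes from ceph config set as shown earlier):
# runtime only, not persisted across restarts
ceph tell osd.* injectargs '--bluefs_preextend_wal_files=false'
# persist the same value in the cluster config database
ceph config set osd bluefs_preextend_wal_files false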
Hello Chris,
Great to hear, a few questions.
Once you have injected bluefs_preextend_wal_files to false, are you just
rebuilding the OSDs that failed? Or are you going through and rebuilding every
OSD, even the working ones?
Or does setting the bluefs_preextend_wal_files value to fals
Hi Ashley
Igor has done a great job of tracking down the problem, and we have
finally shown evidence of the type of corruption it would produce in one
of my WALs. Our feeling at the moment is that the problem can be
worked around by setting bluefs_preextend_wal_files to false on affected OSDs
whil
Thanks Igor,
Do you have any idea of an ETA or plan for people that are running 15.2.2 to
be able to patch / fix the issue?
I had a read of the ticket and it seems the corruption is happening but the WAL
is not read until OSD restart, so I imagine we will need some form of fix / patch
we can
Status update:
Finally we have the first patch to fix the issue in master:
https://github.com/ceph/ceph/pull/35201
And the ticket has been updated with root cause analysis:
https://tracker.ceph.com/issues/45613
On 5/21/2020 2:07 PM, Igor Fedotov wrote:
@Chris - unfortunately it looks like the co
Short update on the issue:
Finally we're able to reproduce the issue in master (not Octopus),
investigating further...
@Chris - to make sure you're facing the same issue, could you please
check the content of the broken file? To do so:
1) run "ceph-bluestore-tool --path --out-dir
--command
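For reference, a typical full invocation would look something like this (a sketch; the OSD path, output directory and OSD id are assumed placeholders, and the OSD should be stopped first):
# export the OSD's BlueFS files (including the RocksDB WAL) for inspection
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-5 \
    --out-dir /tmp/osd.5-bluefs --command bluefs-export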
Hi Igor
I've sent you these directly as they're a bit chunky. Let me know if you
haven't got them.
Thx, Chris
On 20/05/2020 14:43, Igor Fedotov wrote:
Hi Chris,
could you please share the full log prior to the first failure?
Also if possible please set debug-bluestore/debug-bluefs to 20 and
I'm getting similar errors after rebooting a node. Cluster was upgraded
15.2.1 -> 15.2.2 yesterday. No problems after rebooting during upgrade.
On the node I just rebooted, 2/4 OSDs won't restart. Similar logs from
both. Logs from one below.
Neither OSD has compression enabled, although there
Chris,
got them, thanks!
Investigating
Thanks,
Igor
On 5/20/2020 5:23 PM, Chris Palmer wrote:
Hi Igor
I've sent you these directly as they're a bit chunky. Let me know if
you haven't got them.
Thx, Chris
On 20/05/2020 14:43, Igor Fedotov wrote:
Hi Chris,
could you please share the f
Hi Chris,
could you please share the full log prior to the first failure?
Also if possible please set debug-bluestore/debug-bluefs to 20 and
collect another one for failed OSD startup.
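One simple way to get those levels for a failed startup (a sketch with an assumed OSD id of 5 and default log paths): add the options to ceph.conf on that host, restart the OSD, and grab its log.
# in ceph.conf on the OSD host:
#   [osd]
#   debug bluestore = 20
#   debug bluefs = 20
systemctl restart ceph-osd@5
less /var/log/ceph/ceph-osd.5.log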
Thanks,
Igor
On 5/20/2020 4:39 PM, Chris Palmer wrote:
I'm getting similar errors after rebooting a node.
Hey Igor,
The OSDs only back two metadata pools, so only hold a couple of MB of data
(hence they were easy and quick to rebuild). They are actually NVMe LVM devices
passed through QEMU into a VM (hence only 10GB and showing as rotational).
I have large 10TB disks that back the EC (RBD/FS) them se
Thanks!
So for now I can see the following similarities between your case and the
ticket:
1) Single main spinner as an OSD backing device.
2) Corruption happens to RocksDB WAL file
3) OSD has user data compression enabled.
And one more question. From the following line:
May 20 06:05:14 sn-m
I attached the log but it was too big and got moderated.
Here it is in a pastebin: https://pastebin.pl/view/69b2beb9
I have cut the log to start from the point of the original upgrade.
Thanks
On Wed, 20 May 2020 20:55:51 +0800 Igor Fedotov wrote
Dan, thanks for the info. Go
Dan, thanks for the info. Good to know.
Failed QA run in the ticket uses snappy though.
And in fact any stuff writing to process memory can introduce data
corruption in a similar manner.
So I will keep that in mind, but IMO the relation to compression is still not
evident...
Kind regards,
Ig
Do you still have any original failure logs?
On 5/20/2020 3:45 PM, Ashley Merrick wrote:
It is a single shared main device.
Sadly I had already rebuilt the failed OSDs to bring me back in the
green after a while.
I have just tried a few restarts and none are failing (seems after a
rebuild usin
It is a single shared main device.
Sadly I had already rebuilt the failed OSDs to bring me back in the green
after a while.
I have just tried a few restarts and none are failing (it seems after a rebuild
using 15.2.2 they are stable?)
I don't have any other servers/OSDs I am willing to risk no
lz4? It's not obviously related, but I've seen it involved in really
non-obvious ways: https://tracker.ceph.com/issues/39525
-- dan
On Wed, May 20, 2020 at 2:27 PM Ashley Merrick wrote:
>
> Thanks, FYI the OSDs that went down back two pools, an Erasure code Meta
> (RBD) and cephFS Meta. The c
I don't believe compression is related to be honest.
Wondering if these OSDs have standalone WAL and/or DB devices or just a
single shared main device.
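One quick way to check that, sketched with an assumed OSD id of 5; the OSD metadata reported by the cluster shows whether BlueFS sits on a single shared device or has separate WAL/DB devices:
# look for bluefs_single_shared_device and the bluefs_db_* / bluefs_wal_* entries
ceph osd metadata 5 | grep -E 'bluefs|devices'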
Also could you please set debug-bluefs/debug-bluestore to 20 and collect a
startup log for the broken OSD.
Kind regards,
Igor
On 5/20/2020 3:2
Thanks, FYI the OSDs that went down back two pools, an Erasure code Meta (RBD)
and cephFS Meta. The cephFS pool does have compression enabled (I noticed it
mentioned in the ceph tracker)
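For anyone wanting to verify the same thing on their own pools, something like this should show the compression settings (the pool name is an assumed placeholder, and the calls may return an error if the option was never set on the pool):
ceph osd pool get cephfs_metadata compression_mode
ceph osd pool get cephfs_metadata compression_algorithm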
Thanks
On Wed, 20 May 2020 20:17:33 +0800 Igor Fedotov wrote
Hi Ashley,
looks like
Hi Ashley,
looks like this is a regression. Neha observed similar error(s) during
her QA run, see https://tracker.ceph.com/issues/45613
Please preserve broken OSDs for a while if possible, likely I'll come
back to you for more information to troubleshoot.
Thanks,
Igor
On 5/20/2020 1:26
So reading online it looked like a dead-end error, so I recreated the 3 OSDs on
that node and they are now working fine after a reboot.
However I restarted the next server with 3 OSDs and one of them is now facing
the same issue.
Let me know if you need any more logs.
Thanks
On Wed, 20 May 2