[ceph-users] Ceph failover cluster

2021-04-12 Thread Várkonyi János
Hi All,

Does anybody use a Windows file server with Ceph storage? I finally got the 
gateways working. We have a Ceph cluster with 3 nodes and we can present it to 
Windows via ceph-iscsi. I'd like to use it with two Windows Server 2019 servers 
in a failover cluster. I can connect to the storage from both sides. But when I 
check the MPIO device details, all nodes are connected and active; I have no 
"standby" node. I'm not sure whether that is correct or a problem. I set up the 
details from the Ceph documentation: TimeOutValue = 65; LinkDownTime = 25; 
SRBTimeoutDelta = 15.
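For reference, a hedged sketch of where those three values are usually set on the Windows 
side; the registry paths below are recalled from the ceph-iscsi Windows tuning guidance, not 
confirmed in this thread, so please verify them against the documentation before applying:

rem system-wide disk I/O timeout
reg add "HKLM\SYSTEM\CurrentControlSet\Services\Disk" /v TimeOutValue /t REG_DWORD /d 65 /f
rem per-adapter iSCSI/MPIO parameters; <instance> is the adapter instance number, e.g. 0000
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<instance>\Parameters" /v LinkDownTime /t REG_DWORD /d 25 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\<instance>\Parameters" /v SRBTimeoutDelta /t REG_DWORD /d 15 /f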
When I try to validate the failover cluster configuration I get the error: "Failure 
issuing call to Persistent Reservation REGISTER AND IGNORE EXISTING on Test 
Disk 0 from node FS102.trafficom.hu when the disk has no existing registration. 
It is expected to succeed. The device is not ready."
Has anybody seen this error?

jansz0
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph failover cluster

2021-04-12 Thread Maged Mokhtar

Hello Varkonyi,

Windows clustering requires the use of SCSI-3 clustered persistent 
reservations. To support this with Ceph you could use our distribution, 
PetaSAN:


www.petasan.org

which supports this and passes the Windows clustering tests.

/Maged


On 12/04/2021 10:28, Várkonyi János wrote:

Hi All,

Does anybody use a Windows file server with Ceph storage? I finally got the gateways working. 
We have a Ceph cluster with 3 nodes and we can present it to Windows via ceph-iscsi. I'd like 
to use it with two Windows Server 2019 servers in a failover cluster. I can connect to the 
storage from both sides. But when I check the MPIO device details, all nodes are connected 
and active; I have no "standby" node. I'm not sure whether that is correct or a problem. I 
set up the details from the Ceph documentation: TimeOutValue = 65; LinkDownTime = 25; 
SRBTimeoutDelta = 15.
When I try to validate the failover cluster configuration I get the error: "Failure 
issuing call to Persistent Reservation REGISTER AND IGNORE EXISTING on Test Disk 0 from 
node FS102.trafficom.hu when the disk has no existing registration. It is expected to 
succeed. The device is not ready."
Has anybody seen this error?

jansz0
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW failed to start after upgrade to pacific

2021-04-12 Thread Robert Sander
Am 06.04.21 um 18:53 schrieb Casey Bodley:
> thanks for the details. this is a regression from changes to the
> datalog storage for multisite - this -5 error is coming from the new
> 'fifo' backend. as a workaround, you can set the new
> 'rgw_data_log_backing' config variable back to 'omap'
> 
> Adam has fixes already merged to the pacific branch; be aware that the
> first pacific point release will change the name of
> 'rgw_data_log_backing' to 'rgw_default_data_log_backing' and default
> back to 'fifo'

So if you have a Ceph cluster with RADOS Gateways, you should not
upgrade to Pacific for now.
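For anyone who does run into it on Pacific, a hedged sketch of the workaround Casey 
describes above (the client.rgw scope is an assumption, and the RGW daemons likely need a 
restart afterwards):

# revert the datalog backend to omap until the fixed point release is out
ceph config set client.rgw rgw_data_log_backing omap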

Regards
-- 
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein -- Sitz: Berlin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd info error opening image

2021-04-12 Thread Eugen Block

Hi,

have you checked whether the rbd_header object still exists for that 
volume? If it's indeed missing you could rebuild it as described in 
[1]; I haven't done that myself though.
It would help if you knew the block_name_prefix of that volume. If not, 
you could figure it out by matching all existing rbd_header objects to 
their rbd_data objects and seeing which rbd_data objects don't have a 
matching header object.
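A hedged sketch of that matching, assuming the pool from the original post (listing all 
objects can be slow on a large pool, and the exact object naming can differ between image 
formats, so treat the grep/sed patterns as illustrative):

# one rbd_header object per image
rados -p openstack-volumes ls | grep '^rbd_header\.' | sort > headers.txt
# strip the per-object suffix from rbd_data names to get one prefix per image
rados -p openstack-volumes ls | grep '^rbd_data\.' | sed 's/\.[0-9a-f]*$//' | sort -u > data_prefixes.txt
# an image id that shows up in data_prefixes.txt but has no rbd_header entry
# is the one whose header needs to be rebuilt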


Regards,
Eugen

[1] https://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/


Zitat von Marcel Kuiper :


I hope someone can help out. I cannot run 'rbd info' on any image.

# rbd ls openstack-volumes

volume-628efc47-fc57-4630-8661-a13210a4e02c
volume-e4fe1e24-fb26-4abc-a458-f936a4e75715
volume-1ce1439d-767b-4b1d-8217-51464a11c5cc
volume-0a01d7e3-2c8f-4fab-9f9f-d84bbc7fa3c7
volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac

# rbd info openstack-volumes/volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac

rbd: error opening image  
volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac: (2) No such file or  
directory


We're running nautilus 14.2.16 on ubuntu bionic

Marcel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] cephadm custom mgr modules

2021-04-12 Thread Rob Haverkamp
Hi there,

I'm developing a custom ceph-mgr module and have issues deploying this on a 
cluster deployed with cephadm.
With a cluster deployed with ceph-deploy, I can just put my code under 
/usr/share/ceph/mgr/ and load the module. This works fine.
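(For comparison, the non-containerized flow described here boils down to something like the 
sketch below; the module name is a placeholder.)

# copy the module onto every mgr host, then enable it
cp -r my_module /usr/share/ceph/mgr/
ceph mgr module enable my_module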

I think I found 2 options to do this with cephadm:

1. build a custom container image: 
https://docs.ceph.com/en/octopus/cephadm/install/#deploying-custom-containers
2. use the --shared_ceph_folder during cephadm bootstrap: 'Development mode. 
Several folders in containers are volumes mapped to different sub-folders in 
the ceph source folder'


The shared folder method is only meant for development. So that is not an 
option in a production environment.
Building a custom container image should be possible, but I don't think I want 
to go there.

Are there more options?

It would be nice if it were possible to deploy the managers with a custom 
service specification that, for example, mounts a folder from the host system 
to /usr/share/ceph/mgr/ in the container.


Thanks!

Rob Haverkamp
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus, Ceph-Ansible, existing OSDs, and ceph.conf updates [EXT]

2021-04-12 Thread Matthew Vernon

On 10/04/2021 13:03, Dave Hall wrote:

Hello,

A while back I asked about the troubles I was having with Ceph-Ansible when
I kept existing OSDs in my inventory file when managing my Nautilus cluster.

At the time it was suggested that once the OSDs have been configured they
should be excluded from the inventory file.

However, when processing certain configuration changes Ceph-Ansible updates
ceph.conf on all cluster nodes and clients in the inventory file.

Is there an alternative way to keep OSD nodes in the inventory file without
listing them as OSD nodes, so they get other updates, but also so
Ceph-Ansible doesn't try to do any of the ceph-volume stuff that seems to
be failing after the OSDs are configured?


Are you using LVM or LVM-batch? If the former, you might find that
--skip-tags prepare_osd
does what you want. I use that because otherwise ceph-ansible gets sad 
if your device names aren't exactly what it's expecting.
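For reference, a hedged example of passing that flag (inventory and playbook names are 
placeholders; use whatever you normally run):

# run the usual ceph-ansible play but skip the OSD preparation tasks
ansible-playbook -i hosts site.yml --skip-tags prepare_osd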


Regards,

Matthew


--
The Wellcome Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd info error opening image

2021-04-12 Thread Marcel Kuiper

I hope someone can help out. I cannot run 'rbd info' on any image.

# rbd ls openstack-volumes

volume-628efc47-fc57-4630-8661-a13210a4e02c
volume-e4fe1e24-fb26-4abc-a458-f936a4e75715
volume-1ce1439d-767b-4b1d-8217-51464a11c5cc
volume-0a01d7e3-2c8f-4fab-9f9f-d84bbc7fa3c7
volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac

# rbd info openstack-volumes/volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac

rbd: error opening image volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac: 
(2) No such file or directory


We're running Nautilus 14.2.16 on Ubuntu Bionic

Marcel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm custom mgr modules

2021-04-12 Thread Sebastian Wagner
You indeed want to build a custom container for that use case.
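A minimal sketch of such an image, assuming a Podman/Docker build host; the base image tag, 
registry and module name are placeholders rather than anything from this thread:

# build a Ceph image that carries the custom mgr module
cat > Containerfile <<'EOF'
FROM docker.io/ceph/ceph:v15
COPY my_module /usr/share/ceph/mgr/my_module
EOF
podman build -t registry.example.com/ceph-custom:v15 .
# then point cephadm at the custom image, e.g.:
ceph config set global container_image registry.example.com/ceph-custom:v15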

On Mon, Apr 12, 2021 at 2:18 PM Rob Haverkamp  wrote:

> Hi there,
>
> I'm developing a custom ceph-mgr module and have issues deploying this on
> a cluster deployed with cephadm.
> With a cluster deployed with ceph-deploy, I can just put my code under
> /usr/share/ceph/mgr/ and load the module. This works fine.
>
> I think I found 2 options to do this with cephadm:
>
> 1. build a custom container image:
> https://docs.ceph.com/en/octopus/cephadm/install/#deploying-custom-containers
> 2. use the --shared_ceph_folder during cephadm bootstrap: 'Development
> mode. Several folders in containers are volumes mapped to different
> sub-folders in the ceph source folder'
>
>
> The shared folder method is only meant for development. So that is not an
> option in a production environment.
> Building a custom container image should be possible, but I don't think I
> want to go there.
>
> Are there more options?
>
> It would be nice if it was possible to deploy the managers with a custom
> service specification that for example mounts a folder from the host system
> to /usr/share/ceph/mgr/ in the container.
>
>
> Thanks!
>
> Rob Haverkamp
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread Dan van der Ster
Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)

Have you excluded any possible hardware issues?

15.2.10 has a new option to check for all zero reads; maybe try it with true?

Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
.set_default(false)
.set_flag(Option::FLAG_RUNTIME)
.set_description("Check data read for suspicious pages")
.set_long_description("Looks into data read to check if there is a
4K block entirely filled with zeros. "
"If this happens, we re-read data. If there is
difference, we print error to log.")
.add_see_also("bluestore_retry_disk_reads"),

The "fix zombie spanning blobs" feature was added in 15.2.9. Does
15.2.8 work for you?
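If someone does want to try it, a hedged way to turn it on (it is a dev-level option, so 
treat it as experimental; whether the runtime change takes effect immediately or needs an 
OSD restart is something to verify):

ceph config set osd bluefs_check_for_zeros true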

Cheers, Dan

On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten  wrote:
>
> Thanks for the idea, I've tried it with 1 thread, and it shredded another OSD.
> I've updated the tracker ticket :)
>
> At least non-racecondition bugs are hopefully easier to spot...
>
> I wouldn't just disable the fsck and upgrade anyway until the cause is rooted 
> out.
>
> -- Jonas
>
>
> On 29/03/2021 14.34, Dan van der Ster wrote:
> > Hi,
> >
> > Saw that, looks scary!
> >
> > I have no experience with that particular crash, but I was thinking
> > that if you have already backfilled the degraded PGs, and can afford
> > to try another OSD, you could try:
> >
> > "bluestore_fsck_quick_fix_threads": "1",  # because
> > https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
> > and the dev said it occurs because WriteBatch is not thread safe.
> >
> > "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
> > fsck during upgrade. See https://github.com/ceph/ceph/pull/40198
> >
> > -- Dan
> >
> > On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:
> >>
> >> Hi!
> >>
> >> After upgrading MONs and MGRs successfully, the first OSD host I upgraded 
> >> on Ubuntu Bionic from 14.2.16 to 15.2.10
> >> shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
> >> RocksDB complains "Corruption: unknown WriteBatch tag".
> >>
> >> The initial crash/corruption occurred when the automatic fsck ran, and 
> >> when it committed the changes for a lot of "zombie spanning blobs".
> >>
> >> Tracker issue with logs: https://tracker.ceph.com/issues/50017
> >>
> >>
> >> Anyone else encountered this error? I've "suspended" the upgrade for now :)
> >>
> >> -- Jonas
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread Igor Fedotov

Sorry for being too late to the party...

I think the root cause is related to the high number of repairs made 
during the first post-upgrade fsck run.


The check (and fix) for zombie spanning blobs has been backported to 
v15.2.9 (here is the PR: https://github.com/ceph/ceph/pull/39256), and I 
presume it's the one that causes the BlueFS data corruption, due to the 
huge transaction happening during such a repair.


I haven't seen this exact issue (having that many zombie blobs is a 
rarely met bug by itself), but we had a somewhat similar issue with 
upgrading omap names, see: https://github.com/ceph/ceph/pull/39377


A huge resulting transaction could cause too big a write to the WAL, which 
in turn caused data corruption (see https://github.com/ceph/ceph/pull/39701).


Although the fix for the latter has been merged for 15.2.10, some 
additional issues with huge transactions might still exist...


If someone can afford another OSD loss it would be interesting to get an 
OSD log for such a repair with debug-bluefs set to 20...


I'm planning to make a fix to cap the transaction size for repair in the 
near future anyway, though...



Thanks,

Igor


On 4/12/2021 5:15 PM, Dan van der Ster wrote:

Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)

Have you excluded any possible hardware issues?

15.2.10 has a new option to check for all zero reads; maybe try it with true?

 Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
 .set_default(false)
 .set_flag(Option::FLAG_RUNTIME)
 .set_description("Check data read for suspicious pages")
 .set_long_description("Looks into data read to check if there is a
4K block entirely filled with zeros. "
 "If this happens, we re-read data. If there is
difference, we print error to log.")
 .add_see_also("bluestore_retry_disk_reads"),

The "fix zombie spanning blobs" feature was added in 15.2.9. Does
15.2.8 work for you?

Cheers, Dan

On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten  wrote:

Thanks for the idea, I've tried it with 1 thread, and it shredded another OSD.
I've updated the tracker ticket :)

At least non-racecondition bugs are hopefully easier to spot...

I wouldn't just disable the fsck and upgrade anyway until the cause is rooted 
out.

-- Jonas


On 29/03/2021 14.34, Dan van der Ster wrote:

Hi,

Saw that, looks scary!

I have no experience with that particular crash, but I was thinking
that if you have already backfilled the degraded PGs, and can afford
to try another OSD, you could try:

 "bluestore_fsck_quick_fix_threads": "1",  # because
https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
and the dev said it occurs because WriteBatch is not thread safe.

 "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
fsck during upgrade. See https://github.com/ceph/ceph/pull/40198

-- Dan

On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:

Hi!

After upgrading MONs and MGRs successfully, the first OSD host I upgraded on 
Ubuntu Bionic from 14.2.16 to 15.2.10
shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
RocksDB complains "Corruption: unknown WriteBatch tag".

The initial crash/corruption occurred when the automatic fsck ran, and when it 
committed the changes for a lot of "zombie spanning blobs".

Tracker issue with logs: https://tracker.ceph.com/issues/50017


Anyone else encountered this error? I've "suspended" the upgrade for now :)

-- Jonas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread DHilsbos
Is there a way to check for these zombie blobs, and other issues needing 
repair, prior to the upgrade?  That would allow us to know that issues might be 
coming, and perhaps address them before they result in corrupt OSDs.

I'm considering upgrading our clusters from 14 to 15, and would really like to 
avoid these kinds of issues.

Thank you,

Dominic L. Hilsbos, MBA 
Director - Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com

-Original Message-
From: Igor Fedotov [mailto:ifedo...@suse.de] 
Sent: Monday, April 12, 2021 7:55 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSDs RocksDB corrupted when upgrading 
nautilus->octopus: unknown WriteBatch tag

Sorry for being too late to the party...

I think the root cause is related to the high amount of repairs made 
during the first post-upgrade fsck run.

The check (and fix) for zombie spanning blobs was been backported to 
v15.2.9 (here is the PR https://github.com/ceph/ceph/pull/39256). And I 
presumt it's the one which causes BlueFS data corruption due to huge 
transaction happening during such a repair.

I haven't seen this exact issue (as having that many zombie blobs is a 
rarely met bug by itself) but we had to some degree similar issue with 
upgrading omap names, see: https://github.com/ceph/ceph/pull/39377

Huge resulting transaction could cause too big write to WAL which in 
turn caused data corruption (see https://github.com/ceph/ceph/pull/39701)

Although the fix for the latter has been merged for 15.2.10 some 
additional issues with huge transactions might still exist...


If someone can afford another OSD loss it could be interesting to get an 
OSD log for such a repair with debug-bluefs set to 20...

I'm planning to make a fix to cap transaction size for repair in the 
nearest future anyway though..


Thanks,

Igor


On 4/12/2021 5:15 PM, Dan van der Ster wrote:
> Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)
>
> Have you excluded any possible hardware issues?
>
> 15.2.10 has a new option to check for all zero reads; maybe try it with true?
>
>  Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
>  .set_default(false)
>  .set_flag(Option::FLAG_RUNTIME)
>  .set_description("Check data read for suspicious pages")
>  .set_long_description("Looks into data read to check if there is a
> 4K block entirely filled with zeros. "
>  "If this happens, we re-read data. If there is
> difference, we print error to log.")
>  .add_see_also("bluestore_retry_disk_reads"),
>
> The "fix zombie spanning blobs" feature was added in 15.2.9. Does
> 15.2.8 work for you?
>
> Cheers, Dan
>
> On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten  wrote:
>> Thanks for the idea, I've tried it with 1 thread, and it shredded another 
>> OSD.
>> I've updated the tracker ticket :)
>>
>> At least non-racecondition bugs are hopefully easier to spot...
>>
>> I wouldn't just disable the fsck and upgrade anyway until the cause is 
>> rooted out.
>>
>> -- Jonas
>>
>>
>> On 29/03/2021 14.34, Dan van der Ster wrote:
>>> Hi,
>>>
>>> Saw that, looks scary!
>>>
>>> I have no experience with that particular crash, but I was thinking
>>> that if you have already backfilled the degraded PGs, and can afford
>>> to try another OSD, you could try:
>>>
>>>  "bluestore_fsck_quick_fix_threads": "1",  # because
>>> https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
>>> and the dev said it occurs because WriteBatch is not thread safe.
>>>
>>>  "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
>>> fsck during upgrade. See https://github.com/ceph/ceph/pull/40198
>>>
>>> -- Dan
>>>
>>> On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:
 Hi!

 After upgrading MONs and MGRs successfully, the first OSD host I upgraded 
 on Ubuntu Bionic from 14.2.16 to 15.2.10
 shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
 RocksDB complains "Corruption: unknown WriteBatch tag".

 The initial crash/corruption occurred when the automatic fsck ran, and 
 when it committed the changes for a lot of "zombie spanning blobs".

 Tracker issue with logs: https://tracker.ceph.com/issues/50017


 Anyone else encountered this error? I've "suspended" the upgrade for now :)

 -- Jonas
 ___
 ceph-users mailing list -- ceph-users@ceph.io
 To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] has anyone enabled bdev_enable_discard?

2021-04-12 Thread Dan van der Ster
Hi all,

bdev_enable_discard has been in ceph for several major releases now
but it is still off by default.
Did anyone try it recently -- is it safe to use? And do you have perf
numbers before and after enabling?
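For anyone wanting to test it, a hedged sketch for a single OSD first (osd.12 is a 
placeholder; the option is read when the device is opened, so I'd assume the OSD needs a 
restart, and on containerized clusters the restart command will differ):

ceph config set osd.12 bdev_enable_discard true
systemctl restart ceph-osd@12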

Cheers, Dan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph osd Reweight command in octopus

2021-04-12 Thread Brent Kennedy
Yes, I ended up doing that and you are right, it was just being stubborn.  I
had to drop all the way down to 0.9 to get those moving.  In Nautilus, I
don't have to tick that down so low before things start moving.  I've been on
Ceph since Firefly, so I try not to go too low.

Based on what I was reading, I thought Octopus would be better about
balancing, but then again, we might need more disks/hosts in that particular
cluster, as it only has 25 disks across 5 hosts.  Perhaps things will get
better once we have the planned 100 disks.

-Brent

-Original Message-
From: Reed Dier  
Sent: Monday, March 15, 2021 3:48 PM
To: Brent Kennedy 
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Ceph osd Reweight command in octopus

Have you tried a more aggressive reweight value?

I've seen some stubborn crush maps that don't start moving data until 0.9 or
lower in some cases.

Reed

> On Mar 11, 2021, at 10:29 AM, Brent Kennedy  wrote:
> 
> We have a Ceph Octopus cluster running 15.2.6; it's indicating a near 
> full OSD which I can see is not weighted equally with the rest of the 
> OSDs.  I tried to do the usual "ceph osd reweight osd.0 0.95" to force 
> it down a little bit, but unlike the Nautilus clusters, I see no data 
> movement when issuing the command.  If I run a ceph osd tree, it shows 
> the reweight setting, but no data movement appears to be occurring.
> 
> 
> 
> Is there some new thing in Octopus I am missing?  I looked through 
> the release notes for .7, .8 and .9 and didn't see any fixes that 
> jumped out as resolving a bug related to this.  The Octopus cluster 
> was deployed using ceph-ansible and upgraded to 15.2.6.  I plan to 
> upgrade to 15.2.9 in the coming month.
> 
> 
> 
> Any thoughts?
> 
> 
> 
> Regards,
> 
> -Brent
> 
> 
> 
> Existing Clusters:
> 
> Test: Ocotpus 15.2.5 ( all virtual on nvme )
> 
> US Production(HDD): Nautilus 14.2.11 with 11 osd servers, 3 mons, 4 
> gateways, 2 iscsi gateways
> 
> UK Production(HDD): Nautilus 14.2.11 with 18 osd servers, 3 mons, 4 
> gateways, 2 iscsi gateways
> 
> US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 4 
> gateways,
> 2 iscsi gateways
> 
> UK Production(SSD): Octopus 15.2.6 with 5 osd servers, 3 mons, 4 
> gateways
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an 
> email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread Igor Fedotov
The workaround would be to disable bluestore_fsck_quick_fix_on_mount, do 
the upgrade and then run a regular fsck.


Depending on the fsck results, either proceed with a repair or not.
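A hedged sketch of that sequence for a single OSD (the data path and the config scope are 
assumptions; the OSD has to be stopped before running the offline fsck/repair):

# 1) keep the upgrade from auto-repairing on first start
ceph config set osd bluestore_fsck_quick_fix_on_mount false
# 2) after the upgrade, with the OSD stopped, run a read-only check
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 fsck
# 3) only if the fsck output looks sane, run the repair
ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 repair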


Thanks,

Igor


On 4/12/2021 6:35 PM, dhils...@performair.com wrote:

Is there a way to check for these zombie blobs, and other issues needing 
repair, prior to the upgrade?  That would allow us to know that issues might be 
coming, and perhaps address them before they result in corrupt OSDs.

I'm considering upgrading our clusters from 14 to 15, and would really like to 
avoid these kinds of issues.

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com

-Original Message-
From: Igor Fedotov [mailto:ifedo...@suse.de]
Sent: Monday, April 12, 2021 7:55 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSDs RocksDB corrupted when upgrading 
nautilus->octopus: unknown WriteBatch tag

Sorry for being too late to the party...

I think the root cause is related to the high amount of repairs made
during the first post-upgrade fsck run.

The check (and fix) for zombie spanning blobs was been backported to
v15.2.9 (here is the PR https://github.com/ceph/ceph/pull/39256). And I
presumt it's the one which causes BlueFS data corruption due to huge
transaction happening during such a repair.

I haven't seen this exact issue (as having that many zombie blobs is a
rarely met bug by itself) but we had to some degree similar issue with
upgrading omap names, see: https://github.com/ceph/ceph/pull/39377

Huge resulting transaction could cause too big write to WAL which in
turn caused data corruption (see https://github.com/ceph/ceph/pull/39701)

Although the fix for the latter has been merged for 15.2.10 some
additional issues with huge transactions might still exist...


If someone can afford another OSD loss it could be interesting to get an
OSD log for such a repair with debug-bluefs set to 20...

I'm planning to make a fix to cap transaction size for repair in the
nearest future anyway though..


Thanks,

Igor


On 4/12/2021 5:15 PM, Dan van der Ster wrote:

Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)

Have you excluded any possible hardware issues?

15.2.10 has a new option to check for all zero reads; maybe try it with true?

  Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
  .set_default(false)
  .set_flag(Option::FLAG_RUNTIME)
  .set_description("Check data read for suspicious pages")
  .set_long_description("Looks into data read to check if there is a
4K block entirely filled with zeros. "
  "If this happens, we re-read data. If there is
difference, we print error to log.")
  .add_see_also("bluestore_retry_disk_reads"),

The "fix zombie spanning blobs" feature was added in 15.2.9. Does
15.2.8 work for you?

Cheers, Dan

On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten  wrote:

Thanks for the idea, I've tried it with 1 thread, and it shredded another OSD.
I've updated the tracker ticket :)

At least non-racecondition bugs are hopefully easier to spot...

I wouldn't just disable the fsck and upgrade anyway until the cause is rooted 
out.

-- Jonas


On 29/03/2021 14.34, Dan van der Ster wrote:

Hi,

Saw that, looks scary!

I have no experience with that particular crash, but I was thinking
that if you have already backfilled the degraded PGs, and can afford
to try another OSD, you could try:

  "bluestore_fsck_quick_fix_threads": "1",  # because
https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
and the dev said it occurs because WriteBatch is not thread safe.

  "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
fsck during upgrade. See https://github.com/ceph/ceph/pull/40198

-- Dan

On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:

Hi!

After upgrading MONs and MGRs successfully, the first OSD host I upgraded on 
Ubuntu Bionic from 14.2.16 to 15.2.10
shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
RocksDB complains "Corruption: unknown WriteBatch tag".

The initial crash/corruption occurred when the automatic fsck ran, and when it 
committed the changes for a lot of "zombie spanning blobs".

Tracker issue with logs: https://tracker.ceph.com/issues/50017


Anyone else encountered this error? I've "suspended" the upgrade for now :)

-- Jonas
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

[ceph-users] Re: cephadm custom mgr modules

2021-04-12 Thread Robert Sander
Hi,

this is one of the use cases mentioned in Tim Serong's talk: 
https://youtu.be/pPZsN_urpqw

Containers are great for deploying a fixed state of a software project (a 
release), but not so much for the development of plugins etc.

Regards
-- 
Robert Sander
Heinlein Support GmbH
Schwedter Str. 8/9b, 10119 Berlin

https://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Amtsgericht Berlin-Charlottenburg - HRB 93818 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread DHilsbos
Igor;

Does this only impact CephFS then?

Thank you,

Dominic L. Hilsbos, MBA 
Director – Information Technology 
Perform Air International Inc.
dhils...@performair.com 
www.PerformAir.com


-Original Message-
From: Igor Fedotov [mailto:ifedo...@suse.de] 
Sent: Monday, April 12, 2021 9:16 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSDs RocksDB corrupted when upgrading 
nautilus->octopus: unknown WriteBatch tag

The workaround would be to disable bluestore_fsck_quick_fix_on_mount, do 
an upgrade and then do a regular fsck.

Depending on fsck  results either proceed with a repair or not.


Thanks,

Igor


On 4/12/2021 6:35 PM, dhils...@performair.com wrote:
> Is there a way to check for these zombie blobs, and other issues needing 
> repair, prior to the upgrade?  That would allow us to know that issues might 
> be coming, and perhaps address them before they result in corrupt OSDs.
>
> I'm considering upgrading our clusters from 14 to 15, and would really like 
> to avoid these kinds of issues.
>
> Thank you,
>
> Dominic L. Hilsbos, MBA
> Director - Information Technology
> Perform Air International Inc.
> dhils...@performair.com
> www.PerformAir.com
>
> -Original Message-
> From: Igor Fedotov [mailto:ifedo...@suse.de]
> Sent: Monday, April 12, 2021 7:55 AM
> To: ceph-users@ceph.io
> Subject: [ceph-users] Re: OSDs RocksDB corrupted when upgrading 
> nautilus->octopus: unknown WriteBatch tag
>
> Sorry for being too late to the party...
>
> I think the root cause is related to the high amount of repairs made
> during the first post-upgrade fsck run.
>
> The check (and fix) for zombie spanning blobs was been backported to
> v15.2.9 (here is the PR https://github.com/ceph/ceph/pull/39256). And I
> presumt it's the one which causes BlueFS data corruption due to huge
> transaction happening during such a repair.
>
> I haven't seen this exact issue (as having that many zombie blobs is a
> rarely met bug by itself) but we had to some degree similar issue with
> upgrading omap names, see: https://github.com/ceph/ceph/pull/39377
>
> Huge resulting transaction could cause too big write to WAL which in
> turn caused data corruption (see https://github.com/ceph/ceph/pull/39701)
>
> Although the fix for the latter has been merged for 15.2.10 some
> additional issues with huge transactions might still exist...
>
>
> If someone can afford another OSD loss it could be interesting to get an
> OSD log for such a repair with debug-bluefs set to 20...
>
> I'm planning to make a fix to cap transaction size for repair in the
> nearest future anyway though..
>
>
> Thanks,
>
> Igor
>
>
> On 4/12/2021 5:15 PM, Dan van der Ster wrote:
>> Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)
>>
>> Have you excluded any possible hardware issues?
>>
>> 15.2.10 has a new option to check for all zero reads; maybe try it with true?
>>
>>   Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
>>   .set_default(false)
>>   .set_flag(Option::FLAG_RUNTIME)
>>   .set_description("Check data read for suspicious pages")
>>   .set_long_description("Looks into data read to check if there is a
>> 4K block entirely filled with zeros. "
>>   "If this happens, we re-read data. If there is
>> difference, we print error to log.")
>>   .add_see_also("bluestore_retry_disk_reads"),
>>
>> The "fix zombie spanning blobs" feature was added in 15.2.9. Does
>> 15.2.8 work for you?
>>
>> Cheers, Dan
>>
>> On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten  wrote:
>>> Thanks for the idea, I've tried it with 1 thread, and it shredded another 
>>> OSD.
>>> I've updated the tracker ticket :)
>>>
>>> At least non-racecondition bugs are hopefully easier to spot...
>>>
>>> I wouldn't just disable the fsck and upgrade anyway until the cause is 
>>> rooted out.
>>>
>>> -- Jonas
>>>
>>>
>>> On 29/03/2021 14.34, Dan van der Ster wrote:
 Hi,

 Saw that, looks scary!

 I have no experience with that particular crash, but I was thinking
 that if you have already backfilled the degraded PGs, and can afford
 to try another OSD, you could try:

   "bluestore_fsck_quick_fix_threads": "1",  # because
 https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
 and the dev said it occurs because WriteBatch is not thread safe.

   "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
 fsck during upgrade. See https://github.com/ceph/ceph/pull/40198

 -- Dan

 On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:
> Hi!
>
> After upgrading MONs and MGRs successfully, the first OSD host I upgraded 
> on Ubuntu Bionic from 14.2.16 to 15.2.10
> shredded all OSDs on it by corrupting RocksDB, and they now refuse to 
> boot.
> RocksDB complains "Corruption: unknown WriteBatch tag".
>
> The initial crash/corruption occurred when the automatic fsck ran, and when it 
> committed the changes for a lot of "zombie spanning blobs".

[ceph-users] Re: HEALTH_WARN - Recovery Stuck?

2021-04-12 Thread Marc


You know you can play a bit with the ratios?

ceph tell osd.* injectargs '--mon_osd_full_ratio=0.95'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.90'
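The same ratios can also be changed cluster-wide through the OSDMap, which may be a simpler 
alternative to injectargs (the values here are only examples; don't raise them further than 
you are comfortable with):

ceph osd set-nearfull-ratio 0.88
ceph osd set-backfillfull-ratio 0.92
ceph osd set-full-ratio 0.96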


> -Original Message-
> From: Ml Ml 
> Sent: 12 April 2021 19:31
> To: ceph-users 
> Subject: [ceph-users] HEALTH_WARN - Recovery Stuck?
> 
> Hello,
> 
> i kind of ran out of disk space, so i added another host with osd.37.
> But it does not seem to move much data on it. (85MB in 2h)
> 
> Any idea why the recovery process seems to be stuck? Should i fix the
> 4 backfillfull osds first? (by changing the weight)?
> 
> root@ceph01:~# ceph -s
>   cluster:
> id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
> health: HEALTH_WARN
> 4 backfillfull osd(s)
> 9 nearfull osd(s)
> Low space hindering backfill (add storage if this doesn't
> resolve itself): 1 pg backfill_toofull
> 4 pool(s) backfillfull
> 
>   services:
> mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 12d)
> mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
> mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
> osd: 53 osds: 53 up (since 2h), 53 in (since 2h); 235 remapped pgs
> 
>   task status:
> scrub status:
> mds.backup.ceph06.hdjehi: idle
> 
>   data:
> pools:   4 pools, 1185 pgs
> objects: 24.69M objects, 45 TiB
> usage:   149 TiB used, 42 TiB / 191 TiB avail
> pgs: 5388809/74059569 objects misplaced (7.276%)
>  950 active+clean
>  232 active+remapped+backfill_wait
>  2   active+remapped+backfilling
>  1   active+remapped+backfill_wait+backfill_toofull
> 
>   io:
> recovery: 0 B/s, 171 keys/s, 16 objects/s
> 
>   progress:
> Rebalancing after osd.37 marked in (2h)
>   [] (remaining: 6d)
> 
> 
> 
> root@ceph01:~# ceph health detail
> HEALTH_WARN 4 backfillfull osd(s); 9 nearfull osd(s); Low space
> hindering backfill (add storage if this doesn't resolve itself): 1 pg
> backfill_toofull; 4 pool(s) backfillfull
> [WRN] OSD_BACKFILLFULL: 4 backfillfull osd(s)
> osd.28 is backfill full
> osd.32 is backfill full
> osd.66 is backfill full
> osd.68 is backfill full
> [WRN] OSD_NEARFULL: 9 nearfull osd(s)
> osd.11 is near full
> osd.24 is near full
> osd.27 is near full
> osd.39 is near full
> osd.40 is near full
> osd.42 is near full
> osd.43 is near full
> osd.45 is near full
> osd.69 is near full
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if
> this doesn't resolve itself): 1 pg backfill_toofull
> pg 23.295 is active+remapped+backfill_wait+backfill_toofull,
> acting [8,67,32]
> [WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
> pool 'backurne-rbd' is backfillfull
> pool 'device_health_metrics' is backfillfull
> pool 'cephfs.backup.meta' is backfillfull
> pool 'cephfs.backup.data' is backfillfull
> 
> 
> root@ceph01:~# ceph osd df tree
> ID   CLASS  WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP
> META AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
>  -1 182.59897 -  191 TiB  149 TiB  149 TiB35 GiB
> 503 GiB   42 TiB  77.96  1.00-  root default
>  -2  24.62473 -   29 TiB   22 TiB   22 TiB   5.0 GiB
> 80 GiB  7.1 TiB  75.23  0.96-  host ceph01
>   0hdd2.3   1.0  2.7 TiB  2.2 TiB  2.2 TiB   665 MiB
> 8.0 GiB  480 GiB  82.43  1.06   53  up  osd.0
>   1hdd2.2   1.0  2.7 TiB  2.1 TiB  2.1 TiB   446 MiB
> 7.5 GiB  590 GiB  78.44  1.01   49  up  osd.1
>   4hdd2.67029   0.91066  2.7 TiB  2.2 TiB  2.2 TiB   484 MiB
> 7.9 GiB  440 GiB  83.90  1.08   53  up  osd.4
>   8hdd2.3   1.0  2.7 TiB  2.1 TiB  2.1 TiB   490 MiB
> 7.9 GiB  533 GiB  80.49  1.03   51  up  osd.8
>  11hdd1.71660   1.0  1.7 TiB  1.5 TiB  1.5 TiB   406 MiB
> 5.5 GiB  200 GiB  88.60  1.14   36  up  osd.11
>  12hdd1.2   1.0  2.7 TiB  1.2 TiB  1.2 TiB   366 MiB
> 4.9 GiB  1.5 TiB  43.89  0.56   28  up  osd.12
>  14hdd2.2   1.0  2.7 TiB  2.0 TiB  2.0 TiB   418 MiB
> 7.1 GiB  693 GiB  74.66  0.96   47  up  osd.14
>  18hdd2.2   1.0  2.7 TiB  2.0 TiB  1.9 TiB   434 MiB
> 7.3 GiB  737 GiB  73.05  0.94   47  up  osd.18
>  22hdd1.0   1.0  1.7 TiB  890 GiB  886 GiB   110 MiB
> 3.6 GiB  868 GiB  50.62  0.65   20  up  osd.22
>  30hdd1.5   1.0  1.7 TiB  1.4 TiB  1.3 TiB   361 MiB
> 4.9 GiB  370 GiB  78.93  1.01   32  up  osd.30
>  33hdd1.5   0.97437  1.6 TiB  1.4 TiB  1.4 TiB   397 MiB
> 5.4 GiB  213 GiB  87.20  1.12   34  up  osd.33
>  64hdd3.33789   0.89752  3.3 TiB  2.7 TiB  2.7 TiB   573 MiB
> 9.9 GiB  647 GiB  81.07  1.04   64  up  osd.64
>  -3  2

[ceph-users] Re: HEALTH_WARN - Recovery Stuck?

2021-04-12 Thread Andrew Walker-Brown
If you increase the number of PGs, effectively each one is smaller, so the 
backfill process may be able to 'squeeze' them onto the nearly full OSDs while 
it sorts things out.

I've had something similar before and this definitely helped.

Sent from my iPhone

On 12 Apr 2021, at 19:11, Marc  wrote:


You know you can play a bit with the ratios?

ceph tell osd.* injectargs '--mon_osd_full_ratio=0.95'
ceph tell osd.* injectargs '--mon_osd_backfillfull_ratio=0.90'


> -Original Message-
> From: Ml Ml 
> Sent: 12 April 2021 19:31
> To: ceph-users 
> Subject: [ceph-users] HEALTH_WARN - Recovery Stuck?
> 
> Hello,
> 
> i kind of ran out of disk space, so i added another host with osd.37.
> But it does not seem to move much data on it. (85MB in 2h)
> 
> Any idea why the recovery process seems to be stuck? Should i fix the
> 4 backfillfull osds first? (by changing the weight)?
> 
> root@ceph01:~# ceph -s
>  cluster:
>id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
>health: HEALTH_WARN
>4 backfillfull osd(s)
>9 nearfull osd(s)
>Low space hindering backfill (add storage if this doesn't
> resolve itself): 1 pg backfill_toofull
>4 pool(s) backfillfull
> 
>  services:
>mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 12d)
>mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
>mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
>osd: 53 osds: 53 up (since 2h), 53 in (since 2h); 235 remapped pgs
> 
>  task status:
>scrub status:
>mds.backup.ceph06.hdjehi: idle
> 
>  data:
>pools:   4 pools, 1185 pgs
>objects: 24.69M objects, 45 TiB
>usage:   149 TiB used, 42 TiB / 191 TiB avail
>pgs: 5388809/74059569 objects misplaced (7.276%)
> 950 active+clean
> 232 active+remapped+backfill_wait
> 2   active+remapped+backfilling
> 1   active+remapped+backfill_wait+backfill_toofull
> 
>  io:
>recovery: 0 B/s, 171 keys/s, 16 objects/s
> 
>  progress:
>Rebalancing after osd.37 marked in (2h)
>  [] (remaining: 6d)
> 
> 
> 
> root@ceph01:~# ceph health detail
> HEALTH_WARN 4 backfillfull osd(s); 9 nearfull osd(s); Low space
> hindering backfill (add storage if this doesn't resolve itself): 1 pg
> backfill_toofull; 4 pool(s) backfillfull
> [WRN] OSD_BACKFILLFULL: 4 backfillfull osd(s)
>osd.28 is backfill full
>osd.32 is backfill full
>osd.66 is backfill full
>osd.68 is backfill full
> [WRN] OSD_NEARFULL: 9 nearfull osd(s)
>osd.11 is near full
>osd.24 is near full
>osd.27 is near full
>osd.39 is near full
>osd.40 is near full
>osd.42 is near full
>osd.43 is near full
>osd.45 is near full
>osd.69 is near full
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if
> this doesn't resolve itself): 1 pg backfill_toofull
>pg 23.295 is active+remapped+backfill_wait+backfill_toofull,
> acting [8,67,32]
> [WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
>pool 'backurne-rbd' is backfillfull
>pool 'device_health_metrics' is backfillfull
>pool 'cephfs.backup.meta' is backfillfull
>pool 'cephfs.backup.data' is backfillfull
> 
> 
> root@ceph01:~# ceph osd df tree
> ID   CLASS  WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP
> META AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
> -1 182.59897 -  191 TiB  149 TiB  149 TiB35 GiB
> 503 GiB   42 TiB  77.96  1.00-  root default
> -2  24.62473 -   29 TiB   22 TiB   22 TiB   5.0 GiB
> 80 GiB  7.1 TiB  75.23  0.96-  host ceph01
>  0hdd2.3   1.0  2.7 TiB  2.2 TiB  2.2 TiB   665 MiB
> 8.0 GiB  480 GiB  82.43  1.06   53  up  osd.0
>  1hdd2.2   1.0  2.7 TiB  2.1 TiB  2.1 TiB   446 MiB
> 7.5 GiB  590 GiB  78.44  1.01   49  up  osd.1
>  4hdd2.67029   0.91066  2.7 TiB  2.2 TiB  2.2 TiB   484 MiB
> 7.9 GiB  440 GiB  83.90  1.08   53  up  osd.4
>  8hdd2.3   1.0  2.7 TiB  2.1 TiB  2.1 TiB   490 MiB
> 7.9 GiB  533 GiB  80.49  1.03   51  up  osd.8
> 11hdd1.71660   1.0  1.7 TiB  1.5 TiB  1.5 TiB   406 MiB
> 5.5 GiB  200 GiB  88.60  1.14   36  up  osd.11
> 12hdd1.2   1.0  2.7 TiB  1.2 TiB  1.2 TiB   366 MiB
> 4.9 GiB  1.5 TiB  43.89  0.56   28  up  osd.12
> 14hdd2.2   1.0  2.7 TiB  2.0 TiB  2.0 TiB   418 MiB
> 7.1 GiB  693 GiB  74.66  0.96   47  up  osd.14
> 18hdd2.2   1.0  2.7 TiB  2.0 TiB  1.9 TiB   434 MiB
> 7.3 GiB  737 GiB  73.05  0.94   47  up  osd.18
> 22hdd1.0   1.0  1.7 TiB  890 GiB  886 GiB   110 MiB
> 3.6 GiB  868 GiB  50.62  0.65   20  up  osd.22
> 30hdd1.5   1.0  1.7 TiB  1.4 TiB  1.3 TiB   361 MiB
> 4.9 GiB  370 GiB  78.93  1.01   32  up  osd.30
> 33hdd1.5   0.97437  1.6 TiB  

[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread Jonas Jelten
Hi Igor!

I have plenty of OSDs to lose, as long as the recovery works well afterward, 
so I can go ahead with it :D

What debug flags should I activate? osd=10, bluefs=20, bluestore=20, 
rocksdb=10, ...?

I'm not sure it's really the transaction size, since the broken WriteBatch is 
dumped, and the command index is out of range (that's the WriteBatch tag).
I don't see why the transaction size would result in such a corruption; from my 
naive look at the RocksDB sources it seems 14851 repairs shouldn't overflow 
the 32-bit WriteBatch entry counter, but who knows.

Are RocksDB keys like this normal? If so, what's the construction logic? The 
pool is called 'dumpsite'.

0x80800a194027'Rdumpsite!rbd_data.6.28423ad8f48ca1.01b366ff!='0xfffe'o'
0x80800a1940f69264756d'psite!rbd_data.6.28423ad8f48ca1.011bdd0c!='0xfffe'o'


-- Jonas





On 12/04/2021 16.54, Igor Fedotov wrote:
> Sorry for being too late to the party...
> 
> I think the root cause is related to the high amount of repairs made during 
> the first post-upgrade fsck run.
> 
> The check (and fix) for zombie spanning blobs was been backported to v15.2.9 
> (here is the PR https://github.com/ceph/ceph/pull/39256). And I presumt it's 
> the one which causes BlueFS data corruption due to huge transaction happening 
> during such a repair.
> 
> I haven't seen this exact issue (as having that many zombie blobs is a rarely 
> met bug by itself) but we had to some degree similar issue with upgrading 
> omap names, see: https://github.com/ceph/ceph/pull/39377
> 
> Huge resulting transaction could cause too big write to WAL which in turn 
> caused data corruption (see https://github.com/ceph/ceph/pull/39701)
> 
> Although the fix for the latter has been merged for 15.2.10 some additional 
> issues with huge transactions might still exist...
> 
> 
> If someone can afford another OSD loss it could be interesting to get an OSD 
> log for such a repair with debug-bluefs set to 20...
> 
> I'm planning to make a fix to cap transaction size for repair in the nearest 
> future anyway though..
> 
> 
> Thanks,
> 
> Igor
> 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd info error opening image

2021-04-12 Thread Marcel Kuiper

Hi Eugen,

Thanks for your response.
Apparently we ran into network trouble where traffic was sometimes 
delivered to the wrong firewall over L2.


Regards

Marcel

Eugen Block schreef op 2021-04-12 12:09:

Hi,

have you checked if the rbd_header object still exists for that
volume? If it's indeed missing you could rebuild it as described in
[1], I haven't done that myself though.
It would help if you knew the block_name_prefix of that volume, if not
 you could figure that out by matching all existing rbd_headers to
their rbd_data objects and see which rbd_data don't have a matching
header object.

Regards,
Eugen

[1] https://fnordahl.com/2017/04/17/ceph-rbd-volume-header-recovery/


Zitat von Marcel Kuiper :


I hope someone can help out. I cannot run 'rbd info' on any image.

# rbd ls openstack-volumes

volume-628efc47-fc57-4630-8661-a13210a4e02c
volume-e4fe1e24-fb26-4abc-a458-f936a4e75715
volume-1ce1439d-767b-4b1d-8217-51464a11c5cc
volume-0a01d7e3-2c8f-4fab-9f9f-d84bbc7fa3c7
volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac

# rbd info 
openstack-volumes/volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac


rbd: error opening image  volume-a4aeb848-7283-4cd0-b5e6-ac2fc7f06dac: 
(2) No such file or  directory


We're running nautilus 14.2.16 on ubuntu bionic

Marcel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs RocksDB corrupted when upgrading nautilus->octopus: unknown WriteBatch tag

2021-04-12 Thread Igor Fedotov
No, it has absolutely no relation to CephFS. I presume it's a generic 
Bluestore/BlueFS issue.



On 4/12/2021 9:07 PM, dhils...@performair.com wrote:

Igor;

Does this only impact CephFS then?

Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com


-Original Message-
From: Igor Fedotov [mailto:ifedo...@suse.de]
Sent: Monday, April 12, 2021 9:16 AM
To: Dominic Hilsbos; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: OSDs RocksDB corrupted when upgrading 
nautilus->octopus: unknown WriteBatch tag

The workaround would be to disable bluestore_fsck_quick_fix_on_mount, do
an upgrade and then do a regular fsck.

Depending on fsck  results either proceed with a repair or not.


Thanks,

Igor


On 4/12/2021 6:35 PM, dhils...@performair.com wrote:

Is there a way to check for these zombie blobs, and other issues needing 
repair, prior to the upgrade?  That would allow us to know that issues might be 
coming, and perhaps address them before they result in corrupt OSDs.

I'm considering upgrading our clusters from 14 to 15, and would really like to 
avoid these kinds of issues.

Thank you,

Dominic L. Hilsbos, MBA
Director - Information Technology
Perform Air International Inc.
dhils...@performair.com
www.PerformAir.com

-Original Message-
From: Igor Fedotov [mailto:ifedo...@suse.de]
Sent: Monday, April 12, 2021 7:55 AM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: OSDs RocksDB corrupted when upgrading 
nautilus->octopus: unknown WriteBatch tag

Sorry for being too late to the party...

I think the root cause is related to the high amount of repairs made
during the first post-upgrade fsck run.

The check (and fix) for zombie spanning blobs was been backported to
v15.2.9 (here is the PR https://github.com/ceph/ceph/pull/39256). And I
presumt it's the one which causes BlueFS data corruption due to huge
transaction happening during such a repair.

I haven't seen this exact issue (as having that many zombie blobs is a
rarely met bug by itself) but we had to some degree similar issue with
upgrading omap names, see: https://github.com/ceph/ceph/pull/39377

Huge resulting transaction could cause too big write to WAL which in
turn caused data corruption (see https://github.com/ceph/ceph/pull/39701)

Although the fix for the latter has been merged for 15.2.10 some
additional issues with huge transactions might still exist...


If someone can afford another OSD loss it could be interesting to get an
OSD log for such a repair with debug-bluefs set to 20...

I'm planning to make a fix to cap transaction size for repair in the
nearest future anyway though..


Thanks,

Igor


On 4/12/2021 5:15 PM, Dan van der Ster wrote:

Too bad. Let me continue trying to invoke Cunningham's Law for you ... ;)

Have you excluded any possible hardware issues?

15.2.10 has a new option to check for all zero reads; maybe try it with true?

   Option("bluefs_check_for_zeros", Option::TYPE_BOOL, Option::LEVEL_DEV)
   .set_default(false)
   .set_flag(Option::FLAG_RUNTIME)
   .set_description("Check data read for suspicious pages")
   .set_long_description("Looks into data read to check if there is a
4K block entirely filled with zeros. "
   "If this happens, we re-read data. If there is
difference, we print error to log.")
   .add_see_also("bluestore_retry_disk_reads"),

The "fix zombie spanning blobs" feature was added in 15.2.9. Does
15.2.8 work for you?

Cheers, Dan

On Sun, Apr 11, 2021 at 10:17 PM Jonas Jelten  wrote:

Thanks for the idea, I've tried it with 1 thread, and it shredded another OSD.
I've updated the tracker ticket :)

At least non-racecondition bugs are hopefully easier to spot...

I wouldn't just disable the fsck and upgrade anyway until the cause is rooted 
out.

-- Jonas


On 29/03/2021 14.34, Dan van der Ster wrote:

Hi,

Saw that, looks scary!

I have no experience with that particular crash, but I was thinking
that if you have already backfilled the degraded PGs, and can afford
to try another OSD, you could try:

   "bluestore_fsck_quick_fix_threads": "1",  # because
https://github.com/facebook/rocksdb/issues/5068 showed a similar crash
and the dev said it occurs because WriteBatch is not thread safe.

   "bluestore_fsck_quick_fix_on_mount": "false", # should disable the
fsck during upgrade. See https://github.com/ceph/ceph/pull/40198

-- Dan

On Mon, Mar 29, 2021 at 2:23 PM Jonas Jelten  wrote:

Hi!

After upgrading MONs and MGRs successfully, the first OSD host I upgraded on 
Ubuntu Bionic from 14.2.16 to 15.2.10
shredded all OSDs on it by corrupting RocksDB, and they now refuse to boot.
RocksDB complains "Corruption: unknown WriteBatch tag".

The initial crash/corruption occurred when the automatic fsck ran, and when it 
committed the changes for a lot of "zombie spanning blobs".

Tracker issue with logs: https://tracker.ceph.com/issues/50017


Anyone else 

[ceph-users] Re: HEALTH_WARN - Recovery Stuck?

2021-04-12 Thread Michael Thomas
I recently had a similar issue when reducing the number of PGs on a 
pool.  A few OSDs became backfillfull even though there was enough space; 
the OSDs were just not balanced well.


To fix, I reweighted the most-full OSDs:

ceph osd reweight-by-utilization 120

After it finished (~1 hour), I had fewer backfillfull OSDs.  I repeated 
this 2 more times, after which the OSDs were no longer backfillfull and 
recovery data movement resumed.


Once the recovery was complete, I reweighted all OSDs back to 1.0, and 
all was fine.
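For completeness, a hedged one-liner for that last step (the OSD ids are placeholders; only 
touch the ones that were previously reweighted):

for id in 11 24 27; do ceph osd reweight osd.$id 1.0; done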


--Mike

On 4/12/21 12:30 PM, Ml Ml wrote:

Hello,

i kind of ran out of disk space, so i added another host with osd.37.
But it does not seem to move much data on it. (85MB in 2h)

Any idea why the recovery process seems to be stuck? Should i fix the
4 backfillfull osds first? (by changing the weight)?

root@ceph01:~# ceph -s
   cluster:
 id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
 health: HEALTH_WARN
 4 backfillfull osd(s)
 9 nearfull osd(s)
 Low space hindering backfill (add storage if this doesn't
resolve itself): 1 pg backfill_toofull
 4 pool(s) backfillfull

   services:
 mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 12d)
 mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
 mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
 osd: 53 osds: 53 up (since 2h), 53 in (since 2h); 235 remapped pgs

   task status:
 scrub status:
 mds.backup.ceph06.hdjehi: idle

   data:
 pools:   4 pools, 1185 pgs
 objects: 24.69M objects, 45 TiB
 usage:   149 TiB used, 42 TiB / 191 TiB avail
 pgs: 5388809/74059569 objects misplaced (7.276%)
  950 active+clean
  232 active+remapped+backfill_wait
  2   active+remapped+backfilling
  1   active+remapped+backfill_wait+backfill_toofull

   io:
 recovery: 0 B/s, 171 keys/s, 16 objects/s

   progress:
 Rebalancing after osd.37 marked in (2h)
   [] (remaining: 6d)



root@ceph01:~# ceph health detail
HEALTH_WARN 4 backfillfull osd(s); 9 nearfull osd(s); Low space
hindering backfill (add storage if this doesn't resolve itself): 1 pg
backfill_toofull; 4 pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 4 backfillfull osd(s)
 osd.28 is backfill full
 osd.32 is backfill full
 osd.66 is backfill full
 osd.68 is backfill full
[WRN] OSD_NEARFULL: 9 nearfull osd(s)
 osd.11 is near full
 osd.24 is near full
 osd.27 is near full
 osd.39 is near full
 osd.40 is near full
 osd.42 is near full
 osd.43 is near full
 osd.45 is near full
 osd.69 is near full
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if
this doesn't resolve itself): 1 pg backfill_toofull
 pg 23.295 is active+remapped+backfill_wait+backfill_toofull,
acting [8,67,32]
[WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
 pool 'backurne-rbd' is backfillfull
 pool 'device_health_metrics' is backfillfull
 pool 'cephfs.backup.meta' is backfillfull
 pool 'cephfs.backup.data' is backfillfull


root@ceph01:~# ceph osd df tree
ID   CLASS  WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP
META AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
  -1 182.59897 -  191 TiB  149 TiB  149 TiB35 GiB
503 GiB   42 TiB  77.96  1.00-  root default
  -2  24.62473 -   29 TiB   22 TiB   22 TiB   5.0 GiB
80 GiB  7.1 TiB  75.23  0.96-  host ceph01
   0hdd2.3   1.0  2.7 TiB  2.2 TiB  2.2 TiB   665 MiB
8.0 GiB  480 GiB  82.43  1.06   53  up  osd.0
   1hdd2.2   1.0  2.7 TiB  2.1 TiB  2.1 TiB   446 MiB
7.5 GiB  590 GiB  78.44  1.01   49  up  osd.1
   4hdd2.67029   0.91066  2.7 TiB  2.2 TiB  2.2 TiB   484 MiB
7.9 GiB  440 GiB  83.90  1.08   53  up  osd.4
   8hdd2.3   1.0  2.7 TiB  2.1 TiB  2.1 TiB   490 MiB
7.9 GiB  533 GiB  80.49  1.03   51  up  osd.8
  11hdd1.71660   1.0  1.7 TiB  1.5 TiB  1.5 TiB   406 MiB
5.5 GiB  200 GiB  88.60  1.14   36  up  osd.11
  12hdd1.2   1.0  2.7 TiB  1.2 TiB  1.2 TiB   366 MiB
4.9 GiB  1.5 TiB  43.89  0.56   28  up  osd.12
  14hdd2.2   1.0  2.7 TiB  2.0 TiB  2.0 TiB   418 MiB
7.1 GiB  693 GiB  74.66  0.96   47  up  osd.14
  18hdd2.2   1.0  2.7 TiB  2.0 TiB  1.9 TiB   434 MiB
7.3 GiB  737 GiB  73.05  0.94   47  up  osd.18
  22hdd1.0   1.0  1.7 TiB  890 GiB  886 GiB   110 MiB
3.6 GiB  868 GiB  50.62  0.65   20  up  osd.22
  30hdd1.5   1.0  1.7 TiB  1.4 TiB  1.3 TiB   361 MiB
4.9 GiB  370 GiB  78.93  1.01   32  up  osd.30
  33hdd1.5   0.97437  1.6 TiB  1.4 TiB  1.4 TiB   397 MiB
5.4 GiB  213 GiB  87.20  1.12   34  up  osd.33
  64hdd3.33789   0.89752  3.3 TiB  2.7 TiB 

[ceph-users] HEALTH_WARN - Recovery Stuck?

2021-04-12 Thread Ml Ml
Hello,

I kind of ran out of disk space, so I added another host with osd.37.
But it does not seem to move much data onto it (85 MB in 2h).

Any idea why the recovery process seems to be stuck? Should I fix the
4 backfillfull OSDs first (by changing their weight)?
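
(For reference, a minimal sketch of the two usual knobs in this situation;
the values below are only examples, and the backfillfull ratio must stay
below the full ratio.)

# option 1: lower the override weight of a too-full OSD so PGs move off it
ceph osd reweight 28 0.90
# option 2: temporarily raise the backfillfull threshold (example value)
ceph osd set-backfillfull-ratio 0.92
# show the currently configured ratios
ceph osd dump | grep ratio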

root@ceph01:~# ceph -s
  cluster:
id: 5436dd5d-83d4-4dc8-a93b-60ab5db145df
health: HEALTH_WARN
4 backfillfull osd(s)
9 nearfull osd(s)
Low space hindering backfill (add storage if this doesn't
resolve itself): 1 pg backfill_toofull
4 pool(s) backfillfull

  services:
mon: 3 daemons, quorum ceph03,ceph01,ceph02 (age 12d)
mgr: ceph03(active, since 4M), standbys: ceph02.jwvivm
mds: backup:1 {0=backup.ceph06.hdjehi=up:active} 3 up:standby
osd: 53 osds: 53 up (since 2h), 53 in (since 2h); 235 remapped pgs

  task status:
scrub status:
mds.backup.ceph06.hdjehi: idle

  data:
pools:   4 pools, 1185 pgs
objects: 24.69M objects, 45 TiB
usage:   149 TiB used, 42 TiB / 191 TiB avail
pgs: 5388809/74059569 objects misplaced (7.276%)
 950 active+clean
 232 active+remapped+backfill_wait
 2   active+remapped+backfilling
 1   active+remapped+backfill_wait+backfill_toofull

  io:
recovery: 0 B/s, 171 keys/s, 16 objects/s

  progress:
Rebalancing after osd.37 marked in (2h)
  [] (remaining: 6d)



root@ceph01:~# ceph health detail
HEALTH_WARN 4 backfillfull osd(s); 9 nearfull osd(s); Low space
hindering backfill (add storage if this doesn't resolve itself): 1 pg
backfill_toofull; 4 pool(s) backfillfull
[WRN] OSD_BACKFILLFULL: 4 backfillfull osd(s)
osd.28 is backfill full
osd.32 is backfill full
osd.66 is backfill full
osd.68 is backfill full
[WRN] OSD_NEARFULL: 9 nearfull osd(s)
osd.11 is near full
osd.24 is near full
osd.27 is near full
osd.39 is near full
osd.40 is near full
osd.42 is near full
osd.43 is near full
osd.45 is near full
osd.69 is near full
[WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if
this doesn't resolve itself): 1 pg backfill_toofull
pg 23.295 is active+remapped+backfill_wait+backfill_toofull,
acting [8,67,32]
[WRN] POOL_BACKFILLFULL: 4 pool(s) backfillfull
pool 'backurne-rbd' is backfillfull
pool 'device_health_metrics' is backfillfull
pool 'cephfs.backup.meta' is backfillfull
pool 'cephfs.backup.data' is backfillfull


root@ceph01:~# ceph osd df tree
ID   CLASS  WEIGHT REWEIGHT  SIZE RAW USE  DATA OMAP
META AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
 -1 182.59897 -  191 TiB  149 TiB  149 TiB35 GiB
503 GiB   42 TiB  77.96  1.00-  root default
 -2  24.62473 -   29 TiB   22 TiB   22 TiB   5.0 GiB
80 GiB  7.1 TiB  75.23  0.96-  host ceph01
  0hdd2.3   1.0  2.7 TiB  2.2 TiB  2.2 TiB   665 MiB
8.0 GiB  480 GiB  82.43  1.06   53  up  osd.0
  1hdd2.2   1.0  2.7 TiB  2.1 TiB  2.1 TiB   446 MiB
7.5 GiB  590 GiB  78.44  1.01   49  up  osd.1
  4hdd2.67029   0.91066  2.7 TiB  2.2 TiB  2.2 TiB   484 MiB
7.9 GiB  440 GiB  83.90  1.08   53  up  osd.4
  8hdd2.3   1.0  2.7 TiB  2.1 TiB  2.1 TiB   490 MiB
7.9 GiB  533 GiB  80.49  1.03   51  up  osd.8
 11hdd1.71660   1.0  1.7 TiB  1.5 TiB  1.5 TiB   406 MiB
5.5 GiB  200 GiB  88.60  1.14   36  up  osd.11
 12hdd1.2   1.0  2.7 TiB  1.2 TiB  1.2 TiB   366 MiB
4.9 GiB  1.5 TiB  43.89  0.56   28  up  osd.12
 14hdd2.2   1.0  2.7 TiB  2.0 TiB  2.0 TiB   418 MiB
7.1 GiB  693 GiB  74.66  0.96   47  up  osd.14
 18hdd2.2   1.0  2.7 TiB  2.0 TiB  1.9 TiB   434 MiB
7.3 GiB  737 GiB  73.05  0.94   47  up  osd.18
 22hdd1.0   1.0  1.7 TiB  890 GiB  886 GiB   110 MiB
3.6 GiB  868 GiB  50.62  0.65   20  up  osd.22
 30hdd1.5   1.0  1.7 TiB  1.4 TiB  1.3 TiB   361 MiB
4.9 GiB  370 GiB  78.93  1.01   32  up  osd.30
 33hdd1.5   0.97437  1.6 TiB  1.4 TiB  1.4 TiB   397 MiB
5.4 GiB  213 GiB  87.20  1.12   34  up  osd.33
 64hdd3.33789   0.89752  3.3 TiB  2.7 TiB  2.7 TiB   573 MiB
9.9 GiB  647 GiB  81.07  1.04   64  up  osd.64
 -3  26.79504 -   30 TiB   24 TiB   24 TiB   6.2 GiB
89 GiB  5.4 TiB  81.80  1.05-  host ceph02
  2hdd1.5   1.0  1.7 TiB  1.4 TiB  1.4 TiB   363 MiB
5.3 GiB  359 GiB  79.58  1.02   32  up  osd.2
  3hdd2.5   1.0  2.7 TiB  2.2 TiB  2.2 TiB   647 MiB
7.8 GiB  469 GiB  82.85  1.06   53  up  osd.3
  7hdd2.0   1.0  2.7 TiB  1.8 TiB  1.8 TiB   453 MiB
7.0 GiB  848 GiB  69.00  0.89   43  up  osd.7
  9hdd2.67029   0.98323  2.7 TiB  2.4 

[ceph-users] Re: Nautilus 14.2.19 mon 100% CPU

2021-04-12 Thread Brad Hubbard
On Mon, Apr 12, 2021 at 11:35 AM Robert LeBlanc  wrote:
>
> On Sun, Apr 11, 2021 at 4:19 PM Brad Hubbard  wrote:
> >
> > PSA.
> >
> > https://docs.ceph.com/en/latest/releases/general/#lifetime-of-stable-releases
> >
> > https://docs.ceph.com/en/latest/releases/#ceph-releases-index
>
> I'm very well aware that we are living on the dying edge (well, past
> dead), but a good chunk of machines are Ubuntu 14.04 not by choice.
> Getting this upgrade done was sorely needed, but very risky at the
> same time.

Sure Robert,

I understand the realities of maintaining large installations which
may have many reasons holding them back from upgrading any of the
interdependent software they run. The other side of the page however
is that we can not support releases indefinitely as each additional
supported release places a huge burden on limited dev, support, and QA
resources. We try to strike a balance but it's not "one size fits all"
unfortunately.

-- 
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus 14.2.19 mon 100% CPU

2021-04-12 Thread Robert LeBlanc
On Mon, Apr 12, 2021 at 3:41 PM Brad Hubbard  wrote:
>
> Sure Robert,
>
> I understand the realities of maintaining large installations which
> may have many reasons holding them back from upgrading any of the
> interdependent software they run. The other side of the page however
> is that we can not support releases indefinitely as each additional
> supported release places a huge burden on limited dev, support, and QA
> resources. We try to strike a balance but it's not "one size fits all"
> unfortunately.

I do appreciate the considerable effort that allows for such massive
backwards compatibility, so I want to make sure I'm not coming off
ungrateful or anything like that. I know it takes a lot of effort to
maintain software for such old systems. We are spending far too much
time maintaining old machines, so we understand. At least there is the
FUSE client from Luminous that we can deploy on 14.04, which is still better
than the Jewel or Hammer compatibility of the 14.04 kernel. Life
and safety systems are always so difficult to make changes to.

Do you think it would be possible to build Nautilus FUSE or newer on
14.04, or do you think the toolchain has evolved too much since then?

Thanks,
Robert LeBlanc
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph rgw why are reads faster for larger than 64kb object size

2021-04-12 Thread Ronnie Puthukkeril
Environment: Ceph Nautilus 14.2.8 Object Storage
Data nodes: 12 HDD OSD drives, each with 12 TB capacity, plus 2 SSD OSD
drives for the RGW bucket index pool and the RGW meta pool.

Custom configs (since we are dealing with mostly smaller-sized objects):

bluestore_min_alloc_size_ssd 4096

bluestore_min_alloc_size_hdd 4096

Observations from cosbench performance tests:

Stage               Op-Type  Op-Count  Byte-Count      Avg-ResTime
s7-read1KB 48W      read     2004202   2004202000      43.11
s13-read2KB 48W     read     2013906   4027812000      42.9
s19-read4KB 48W     read     2014701   8058804000      42.88
s25-read8KB 48W     read     2002337   16018696000     43.15
s31-read16KB 48W    read     1987785   3180456         43.46
s37-read32KB 48W    read     1976190   6323808         43.7
s43-read64KB 48W    read     1929183   123467712000    44.78
s49-read128KB 48W   read     9965032   1275524096000   8.67
s55-read256KB 48W   read     6505554   1665421824000   13.28

The response time improves drastically when the object size is greater than 
64KB. What could be the reason?

Thanks,
Ronnie



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph rgw why are reads faster for larger than 64kb object size

2021-04-12 Thread Ronnie Puthukkeril
Sorry about the formatting in the earlier email. Hope this one works.

Below are the read response times from cosbench

Stage               Op-Name  Op-Type  Op-Count  Byte-Count      Avg-ResTime
s7-read1KB 48W      read     read     2004202   2004202000      43.11
s13-read2KB 48W     read     read     2013906   4027812000      42.9
s19-read4KB 48W     read     read     2014701   8058804000      42.88
s25-read8KB 48W     read     read     2002337   16018696000     43.15
s31-read16KB 48W    read     read     1987785   3180456         43.46
s37-read32KB 48W    read     read     1976190   6323808         43.7
s43-read64KB 48W    read     read     1929183   123467712000    44.78
s49-read128KB 48W   read     read     9965032   1275524096000   8.67
s55-read256KB 48W   read     read     6505554   1665421824000   13.28

Thanks,
Ronnie

From: Ronnie Puthukkeril 
Date: Monday, April 12, 2021 at 5:24 PM
To: ceph-users@ceph.io 
Subject: ceph rgw why are reads faster for larger than 64kb object size
Environment: Ceph Nautilus 14.2.8 Object Storage
Data nodes: 12 HDD OSD drives, each with 12 TB capacity, plus 2 SSD OSD
drives for the RGW bucket index pool and the RGW meta pool.

Custom configs (since we are dealing with mostly smaller-sized objects):

bluestore_min_alloc_size_ssd 4096

bluestore_min_alloc_size_hdd 4096

Observations from cosbench performance tests:

Stage               Op-Type  Op-Count  Byte-Count      Avg-ResTime
s7-read1KB 48W      read     2004202   2004202000      43.11
s13-read2KB 48W     read     2013906   4027812000      42.9
s19-read4KB 48W     read     2014701   8058804000      42.88
s25-read8KB 48W     read     2002337   16018696000     43.15
s31-read16KB 48W    read     1987785   3180456         43.46
s37-read32KB 48W    read     1976190   6323808         43.7
s43-read64KB 48W    read     1929183   123467712000    44.78
s49-read128KB 48W   read     9965032   1275524096000   8.67
s55-read256KB 48W   read     6505554   1665421824000   13.28

The response time improves drastically when the object size is greater than 
64KB. What could be the reason?

Thanks,
Ronnie



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: ceph rgw why are reads faster for larger than 64kb object size

2021-04-12 Thread Ronnie Puthukkeril
Environment: Ceph Nautilus 14.2.8 Object Storage
Data nodes: 12 HDD OSD drives, each with 12 TB capacity, plus 2 SSD OSD
drives for the RGW bucket index pool and the RGW meta pool.

Custom configs (since we are dealing with mostly smaller-sized objects):
bluestore_min_alloc_size_ssd 4096
bluestore_min_alloc_size_hdd 4096

Stage               Avg-ResTime
s7-read1KB 48W      43.11
s13-read2KB 48W     42.9
s19-read4KB 48W     42.88
s25-read8KB 48W     43.15
s31-read16KB 48W    43.46
s37-read32KB 48W    43.7
s43-read64KB 48W    44.78
s49-read128KB 48W   8.67
s55-read256KB 48W   13.28

The read latency for 128KB objects is about 5x lower than for 64KB objects.
Why?
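
(Not a diagnosis, but a sketch of layout-related settings worth ruling out
when behaviour changes at a specific object size; the admin-socket path
below is an example and depends on the deployment.)

# RGW settings that control how an object of a given size maps to RADOS
ceph daemon /var/run/ceph/ceph-client.rgw.gateway1.asok config show \
  | egrep 'rgw_max_chunk_size|rgw_obj_stripe_size'
# BlueStore allocation and read-buffering settings on a data OSD
ceph daemon osd.0 config show | egrep 'bluestore_min_alloc_size|bluestore_default_buffered_read'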

From: Ronnie Puthukkeril 
Date: Monday, April 12, 2021 at 5:34 PM
To: ceph-users@ceph.io 
Subject: Re: ceph rgw why are reads faster for larger than 64kb object size
Sorry about the formatting in the earlier email. Hope this one works.

Below are the read response times from cosbench

Stage               Op-Name  Op-Type  Op-Count  Byte-Count      Avg-ResTime
s7-read1KB 48W      read     read     2004202   2004202000      43.11
s13-read2KB 48W     read     read     2013906   4027812000      42.9
s19-read4KB 48W     read     read     2014701   8058804000      42.88
s25-read8KB 48W     read     read     2002337   16018696000     43.15
s31-read16KB 48W    read     read     1987785   3180456         43.46
s37-read32KB 48W    read     read     1976190   6323808         43.7
s43-read64KB 48W    read     read     1929183   123467712000    44.78
s49-read128KB 48W   read     read     9965032   1275524096000   8.67
s55-read256KB 48W   read     read     6505554   1665421824000   13.28

Thanks,
Ronnie

From: Ronnie Puthukkeril 
Date: Monday, April 12, 2021 at 5:24 PM
To: ceph-users@ceph.io 
Subject: ceph rgw why are reads faster for larger than 64kb object size
Environment: Ceph Nautilus 14.2.8 Object Storage
Data nodes: 12 HDD OSD drives, each with 12 TB capacity, plus 2 SSD OSD
drives for the RGW bucket index pool and the RGW meta pool.

Custom configs (since we are dealing with mostly smaller-sized objects):

bluestore_min_alloc_size_ssd 4096

bluestore_min_alloc_size_hdd 4096

Observations from cosbench performance tests:

Stage               Op-Type  Op-Count  Byte-Count      Avg-ResTime
s7-read1KB 48W      read     2004202   2004202000      43.11
s13-read2KB 48W     read     2013906   4027812000      42.9
s19-read4KB 48W     read     2014701   8058804000      42.88
s25-read8KB 48W     read     2002337   16018696000     43.15
s31-read16KB 48W    read     1987785   3180456         43.46
s37-read32KB 48W    read     1976190   6323808         43.7
s43-read64KB 48W    read     1929183   123467712000    44.78
s49-read128KB 48W   read     9965032   1275524096000   8.67
s55-read256KB 48W   read     6505554   1665421824000   13.28

The response time improves drastically when the object size is greater than 
64KB. What could be the reason?

Thanks,
Ronnie



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Nautilus 14.2.19 mon 100% CPU

2021-04-12 Thread Brad Hubbard
On Tue, Apr 13, 2021 at 8:40 AM Robert LeBlanc  wrote:
>
> Do you think it would be possible to build Nautilus FUSE or newer on
> 14.04, or do you think the toolchain has evolved too much since then?
>

An interesting question.

# cat /etc/os-release
NAME="Ubuntu"
VERSION="14.04.6 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.6 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/";
SUPPORT_URL="http://help.ubuntu.com/";
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/";

Had to tell cmake not to look for lz4 because the version on Trusty is too old.

# ./do_cmake.sh -DWITH_LZ4=off
# cd build/
# make -j8 ceph-fuse
# make -j8 rbd-fuse
# ./bin/rbd-fuse --version
ceph version 14.2.19-83-g53aefaa
(53aefaa1443c3a9bbd4e6448aa69e3d88b58cd51) nautilus (stable)
# ./bin/ceph-fuse --version
ceph version 14.2.19-83-g53aefaa
(53aefaa1443c3a9bbd4e6448aa69e3d88b58cd51) nautilus (stable)

I don't think Octopus would build on 14.04.
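
For completeness, a rough sketch of exercising the locally built client; the
monitor address, client name, and paths are placeholders:

# mount CephFS with the freshly built binary
mkdir -p /mnt/cephfs
./bin/ceph-fuse -m 10.0.0.1:6789 --id admin -k /etc/ceph/ceph.client.admin.keyring /mnt/cephfs
# unmount when finished
fusermount -u /mnt/cephfs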

--
Cheers,
Brad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] BADAUTHORIZER in Nautilus, unknown PGs, slow peering, very slow client I/O

2021-04-12 Thread Nico Schottelius


Good morning,

I've looked somewhat intensively through the list and it seems we are
rather hard hit by this. The problem originally started yesterday on a mixed
14.2.9 and 14.2.16 cluster (OSDs and mons were all 14.2.16).

We started phasing in 7 new osds, 6 of them throttled by reweighting to
0.1.
Symptoms are many unknown PGs, PGs stuck in peering for a very long time
(hours), slow activating, and the infamous BADAUTHORIZER message. Client I/O
is almost 0, not only on the pool with the new OSDs but also on other pools.
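
(A minimal sketch of how the stuck and unknown PGs can be enumerated while
debugging; none of these commands change cluster state.)

# summary of current warnings
ceph health detail
# PGs that have been inactive for a while
ceph pg dump_stuck inactive
# list PGs whose state is currently "unknown"
ceph pg dump pgs_brief 2>/dev/null | awk '$2 == "unknown"'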

We tried restarting all OSDs one by one, which seemed to clear the
unknown PGs, however after around 1h they came back to unknown state.

Tuning --osd-max-backfills and --osd-recovery-max-active down to 1 does not
improve the situation, so we are currently running with 7 for both of them.
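
(For anyone following along, a minimal sketch of how those two settings can
be changed at runtime; the values are just examples.)

# apply to all running OSDs immediately
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# or persist them in the monitor config database
ceph config set osd osd_max_backfills 1
ceph config set osd osd_recovery_max_active 1
# verify on one daemon (run on the host where osd.0 lives)
ceph daemon osd.0 config get osd_max_backfills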

An excerpt from one of the osds which heavily logs this issue is below.

Any pointer on how this problem was solved on nautilus before is much
appreciated, as the issue started late last night.

Best regards,

Nico


2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: (Original Log Time 
2021/04/13-07:38:37.275255) EVENT_LOG_v1 {"time_micros": 1618292317275255, 
"job": 5, "event": "compaction_finished", "compaction_time_micros": 5288403, 
"compaction_time_cpu_micros": 2940370, "output_level": 2, "num_output_files": 
6, "total_output_size": 370673432, "num_input_records": 983055, 
"num_output_records": 641964, "num_subcompactions": 1, "output_compression": 
"NoCompression", "num_single_delete_mismatches": 0, 
"num_single_delete_fallthrough": 0, "lsm_state": [0, 5, 35, 0, 0, 0, 0]}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
421257}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401289}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401288}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401278}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401277}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401258}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401257}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
401256}
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1645] 
[default] [JOB 6] Compacting 1@1 + 5@2 files to L2, score 1.17
2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1649] 
[default] Compaction start summary: Base version 5 Base level 1, inputs: 
[421255(65MB)], [401318(65MB) 401319(65MB) 401320(65MB) 401321(65MB) 
401322(65MB)]

2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
1618292317275255, "job": 6, "event": "compaction_started", "compaction_reason": 
"LevelMaxLevelSize", "files_L1": [421255], "files_L2": [401318, 401319, 401320, 
401321, 401322], "score": 1.17224, "input_data_size": 413630461}
2021-04-13 07:38:37.307 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 
0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:37.315 7f15a372c700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 
0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:37.707 7f15a0704700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6862/8751 conn(0x55d520fbcd80 
0x55d520ffa000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:37.715 7f15a372c700  0 --1- 
[v2:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6816/5534,v1:[2a0a:e5c0:2:1:21b:21ff:febc:7cb6]:6817/5534]
 >> v1:[2a0a:e5c0:2:1:21b:21ff:febc:5060]:6835/13651 conn(0x55d521b1bf80 
0x55d521b19800 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=0).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 07:38:38.119 7f159a469700  4 rocksdb: [db/compaction_job.cc:1332] 
[default] [JOB 6] Gener

[ceph-users] Re: BADAUTHORIZER in Nautilus, unknown PGs, slow peering, very slow client I/O

2021-04-12 Thread Nico Schottelius


Update, posting information from other posts before:

[08:09:40] server3.place6:~# ceph config-key dump | grep config/
"config/global/auth_client_required": "cephx",
"config/global/auth_cluster_required": "cephx",
"config/global/auth_service_required": "cephx",
"config/global/cluster_network": "2a0a:e5c0:2:1::/64",
"config/global/ms_bind_ipv4": "false",
"config/global/ms_bind_ipv6": "true",
"config/global/osd_class_update_on_start": "false",
"config/global/osd_pool_default_size": "3",
"config/global/public_network": "2a0a:e5c0:2:1::/64",
"config/mgr/mgr/balancer/active": "1",
"config/mgr/mgr/balancer/max_misplaced": ".01",
"config/mgr/mgr/balancer/mode": "upmap",
"config/mgr/mgr/prometheus/rbd_stats_pools": "hdd,ssd,xruk-ssd-pool",
"config/osd/osd_max_backfills": "1",
"config/osd/osd_recovery_max_active": "1",
"config/osd/osd_recovery_op_priority": "1",
[08:09:56] server3.place6:~#

To some degree, finding OSDs with huge log sizes that usually contain

2021-04-13 08:31:25.388 7fa4896a4700  0 --1- 
[2a0a:e5c0:2:1:21b:21ff:febb:68d8]:0/19492 >> 
v1:[2a0a:e5c0:2:1:21b:21ff:fe85:a3a2]:6881/29606 conn(0x55a04deadf80 
0x55a0273ad000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=1).handle_connect_reply_2 connect got BADAUTHORIZER
2021-04-13 08:31:25.388 7fa48a6b7700  0 --1- 
[2a0a:e5c0:2:1:21b:21ff:febb:68d8]:0/19492 >> 
v1:[2a0a:e5c0:2:1:21b:21ff:fe85:a3a2]:6879/29606 conn(0x55a04eb16d80 
0x55a02742c000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 
l=1).handle_connect_reply_2 connect got BADAUTHORIZER

(with varying OSD IPs) and restarting them seems to improve the
situation for a bit, but even newly restarted OSDs go back into the
BADAUTHORIZER state.
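
(A minimal sketch of one way to locate and bounce those OSDs, plus a clock
check, since cephx authorizer errors are commonly related to clock skew; log
paths and unit names follow the default packaging.)

# find the OSD logs that are growing the most
du -sh /var/log/ceph/ceph-osd.*.log | sort -h | tail -n 5
# restart one affected OSD
systemctl restart ceph-osd@12
# watch whether the BADAUTHORIZER messages come back
tail -f /var/log/ceph/ceph-osd.12.log | grep BADAUTHORIZER
# rule out clock skew between the monitors
ceph time-sync-status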

Nico Schottelius  writes:

> Good morning,
>
> I've looked somewhat intensively through the list and it seems we are
> rather hard hit by this. The problem originally started yesterday on a mixed
> 14.2.9 and 14.2.16 cluster (OSDs and mons were all 14.2.16).
>
> We started phasing in 7 new osds, 6 of them throttled by reweighting to
> 0.1.
> Symptoms are many unknown PGs, PGs stuck in peering for a very long time
> (hours), slow activating, and the infamous BADAUTHORIZER message. Client I/O
> is almost 0, not only on the pool with the new OSDs but also on other pools.
>
> We tried restarting all OSDs one by one, which seemed to clear the
> unknown PGs, however after around 1h they came back to unknown state.
>
> Tuning --osd-max-backfills and --osd-recovery-max-active down to 1 does not
> improve the situation, so we are currently running with 7 for both of them.
>
> An excerpt from one of the osds which heavily logs this issue is below.
>
> Any pointer on how this problem was solved on nautilus before is much
> appreciated, as the issue started late last night.
>
> Best regards,
>
> Nico
>
>
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: (Original Log Time 
> 2021/04/13-07:38:37.275255) EVENT_LOG_v1 {"time_micros": 1618292317275255, 
> "job": 5, "event": "compaction_finished", "compaction_time_micros": 5288403, 
> "compaction_time_cpu_micros": 2940370, "output_level": 2, "num_output_files": 
> 6, "total_output_size": 370673432, "num_input_records": 983055, 
> "num_output_records": 641964, "num_subcompactions": 1, "output_compression": 
> "NoCompression", "num_single_delete_mismatches": 0, 
> "num_single_delete_fallthrough": 0, "lsm_state": [0, 5, 35, 0, 0, 0, 0]}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 421257}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401289}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401288}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401278}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401277}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401258}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401257}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 
> 1618292317275255, "job": 5, "event": "table_file_deletion", "file_number": 
> 401256}
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1645] 
> [default] [JOB 6] Compacting 1@1 + 5@2 files to L2, score 1.17
> 2021-04-13 07:38:37.275 7f159a469700  4 rocksdb: [db/compaction_job.cc:1649] 
> [defau