[ceph-users] Re: alerts in dashboard

2023-06-21 Thread Nizamudeen A
Hi Ben,

It looks like you forgot to attach the screenshots.

Regards,
Nizam

On Wed, Jun 21, 2023, 12:23 Ben  wrote:

> Hi,
>
> I got many critical alerts in ceph dashboard. Meanwhile the cluster shows
> health ok status.
>
> See attached screenshot for detail. My questions are, are they real alerts?
> How to get rid of them?
>
> Thanks
> Ben
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Eugen Block

Hi,
can you share more details about what exactly you did? How did you remove
the nodes? Hopefully you waited for the draining to finish? But if
the remaining OSDs wait for removed OSDs, it sounds like the draining
was not finished.


Quoting Malte Stroem:


Hello,

we removed some nodes from our cluster. This worked without problems.

Now, lots of OSDs do not want to join the cluster anymore if we  
reboot one of the still available nodes.


It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine.

Network is working; netcat shows the MONs' ports are open.

Setting a higher debug level has no effect even if we add it to the  
ceph.conf file.


The PGs are pretty unhappy, e. g.:

7.143  87771  0  0  0  0
314744902235  0  0  10081  10081  down
2023-06-20T09:16:03.546158+  961275'1395646  961300:9605547
[209,NONE,NONE]  209  [209,NONE,NONE]  209  961231'1395512
2023-06-19T23:46:40.101791+  961231'1395512
2023-06-19T23:46:40.101791+


PG query wants us to mark an OSD as lost; however, I do not want to do this.

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152   38
244   41
144   54
...

We added the removed hosts again and tried to start the OSDs on this
node, and they also ran into the timeout mentioned above.


This is a containerized cluster running version 16.2.10.

Replication is 3, some pools use an erasure coded profile.

Best regards,
Malte


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Malte Stroem

Hello Eugen,

thank you. Yesterday I thought: Well, Eugen can help!

Yes, we drained the nodes. It took two weeks to finish the process,
and yes, I think this is the root cause.


So we still have the nodes but when I try to restart one of those OSDs 
it still cannot join:


Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown 
-h ceph:ceph /var/lib/ceph/osd/ceph-66/block
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown 
-R ceph:ceph /dev/dm-19
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown 
-R ceph:ceph /var/lib/ceph/osd/ceph-66
Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate 
successful for osd ID: 66 

Jun 21 09:51:04 ceph-node bash[2323668]: debug 
2023-06-21T07:51:04.176+ 7fabef5a1200  0 monclient(hunting): 
authenticate timed out after 300
Jun 21 09:56:04 ceph-node bash[2323668]: debug 
2023-06-21T07:56:04.179+ 7fabef5a1200  0 monclient(hunting): 
authenticate timed out after 300
Jun 21 10:01:04 ceph-node bash[2323668]: debug 
2023-06-21T08:01:04.177+ 7fabef5a1200  0 monclient(hunting): 
authenticate timed out after 300
Jun 21 10:06:04 ceph-node bash[2323668]: debug 
2023-06-21T08:06:04.179+ 7fabef5a1200  0 monclient(hunting): 
authenticate timed out after 300
Jun 21 10:11:04 ceph-node bash[2323668]: debug 
2023-06-21T08:11:04.174+ 7fabef5a1200  0 monclient(hunting): 
authenticate timed out after 300


Same messages on all OSDs.

We still have some nodes running and did not restart those OSDs.

Best,
Malte

On 21.06.23 at 09:50, Eugen Block wrote:

Hi,
can you share more details about what exactly you did? How did you remove the
nodes? Hopefully you waited for the draining to finish? But if the
remaining OSDs wait for removed OSDs, it sounds like the draining was not
finished.


Quoting Malte Stroem:


Hello,

we removed some nodes from our cluster. This worked without problems.

Now, lots of OSDs do not want to join the cluster anymore if we reboot 
one of the still available nodes.


It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine.

Network is working; netcat shows the MONs' ports are open.

Setting a higher debug level has no effect even if we add it to the 
ceph.conf file.


The PGs are pretty unhappy, e. g.:

7.143  87771   0 0  0    0 
314744902235    0   0  10081 10081  down  
2023-06-20T09:16:03.546158+    961275'1395646 961300:9605547  
[209,NONE,NONE] 209  [209,NONE,NONE] 209    961231'1395512  
2023-06-19T23:46:40.101791+    961231'1395512  
2023-06-19T23:46:40.101791+


PG query wants us to mark an OSD as lost; however, I do not want to do this.

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152   38
244   41
144   54
...

We added the removed hosts again and tried to start the OSDs on this
node, and they also ran into the timeout mentioned above.


This is a containerized cluster running version 16.2.10.

Replication is 3, some pools use an erasure coded profile.

Best regards,
Malte


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Eugen Block

Hi,

Yes, we drained the nodes. It took two weeks to finish the
process, and yes, I think this is the root cause.
So we still have the nodes but when I try to restart one of those  
OSDs it still cannot join:


if the nodes were drained successfully (can you confirm that all PGs  
were active+clean after draining before you removed the nodes?) then  
the disks on the removed nodes wouldn't have any data to bring back.  
The question would be: why do the remaining OSDs still reference
removed OSDs? Or am I misunderstanding something? I think it would
help to know the whole story, can you provide more details? Also some  
more general cluster info would be helpful:

$ ceph -s
$ ceph osd tree
$ ceph health detail


Quoting Malte Stroem:


Hello Eugen,

thank you. Yesterday I thought: Well, Eugen can help!

Yes, we drained the nodes. It took two weeks to finish the
process, and yes, I think this is the root cause.


So we still have the nodes but when I try to restart one of those  
OSDs it still cannot join:


Jun 21 09:46:03 ceph-node bash[2323668]: Running command:  
/usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
Jun 21 09:46:03 ceph-node bash[2323668]: Running command:  
/usr/bin/chown -R ceph:ceph /dev/dm-19
Jun 21 09:46:03 ceph-node bash[2323668]: Running command:  
/usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66
Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+ 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 09:56:04 ceph-node bash[2323668]: debug  
2023-06-21T07:56:04.179+ 7fabef5a1200  0 monclient(hunting):  
authenticate timed out after 300
Jun 21 10:01:04 ceph-node bash[2323668]: debug  
2023-06-21T08:01:04.177+ 7fabef5a1200  0 monclient(hunting):  
authenticate timed out after 300
Jun 21 10:06:04 ceph-node bash[2323668]: debug  
2023-06-21T08:06:04.179+ 7fabef5a1200  0 monclient(hunting):  
authenticate timed out after 300
Jun 21 10:11:04 ceph-node bash[2323668]: debug  
2023-06-21T08:11:04.174+ 7fabef5a1200  0 monclient(hunting):  
authenticate timed out after 300


Same messages on all OSDs.

We still have some nodes running and did not restart those OSDs.

Best,
Malte

On 21.06.23 at 09:50, Eugen Block wrote:

Hi,
can you share more details about what exactly you did? How did you remove
the nodes? Hopefully you waited for the draining to finish? But if
the remaining OSDs wait for removed OSDs, it sounds like the
draining was not finished.


Quoting Malte Stroem:


Hello,

we removed some nodes from our cluster. This worked without problems.

Now, lots of OSDs do not want to join the cluster anymore if we  
reboot one of the still available nodes.


It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine.

Network is working; netcat shows the MONs' ports are open.

Setting a higher debug level has no effect even if we add it to  
the ceph.conf file.


The PGs are pretty unhappy, e. g.:

7.143  87771   0 0  0    0  
314744902235    0   0  10081 10081  down   
2023-06-20T09:16:03.546158+    961275'1395646 961300:9605547   
[209,NONE,NONE] 209  [209,NONE,NONE] 209    961231'1395512  
 2023-06-19T23:46:40.101791+    961231'1395512   
2023-06-19T23:46:40.101791+


PG query wants us to mark an OSD as lost; however, I do not want to do this.

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152   38
244   41
144   54
...

We added the removed hosts again and tried to start the OSDs on
this node, and they also ran into the timeout mentioned above.


This is a containerized cluster running version 16.2.10.

Replication is 3, some pools use an erasure coded profile.

Best regards,
Malte


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs

2023-06-21 Thread Carsten Grommel
Hi Igor,

thank you for your answer!

>first of all Quincy does have a fix for the issue, see
>https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
>https://tracker.ceph.com/issues/58588)

Thank you I somehow missed that release, good to know!

>SSD or HDD? Standalone or shared DB volume? I presume the latter... What
>is disk size and current utilization?
>
>Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
>possible

We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell and 
Samsung in this cluster:
Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1  osd.5

All disks are at ~88% utilization. I noticed that at around 92% our disks tend to
run into this bug.

Here are some bluefs-bdev-sizes from different OSDs on different hosts in this 
cluster:

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/
inferring bluefs devices from bluestore path
1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/
inferring bluefs devices from bluestore path
1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/
inferring bluefs devices from bluestore path
1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB)

>Generally, given my assumption that DB volume is currently collocated
>and you still want to stay on Pacific, you might want to consider
>redeploying OSDs with a standalone DB volume setup.
>
>Just create large enough (2x of the current DB size seems to be pretty
>conservative estimation for that volume's size) additional LV on top of
>the same physical disk. And put DB there...
>
>Separating DB from main disk would result in much less fragmentation at
>DB volume and hence work around the problem. The cost would be having
>some extra spare space at DB volume unavailable for user data .

I guess that makes sense, so the suggestion would be to deploy the OSD and DB on the
same NVMe but with different logical volumes, or to update to Quincy.

Thank you!

Carsten

From: Igor Fedotov
Date: Tuesday, 20 June 2023 at 12:48
To: Carsten Grommel, ceph-users@ceph.io

Subject: Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly created OSDs
Hi Carsten,

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible


Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create large enough (2x of the current DB size seems to be pretty
conservative estimation for that volume's size) additional LV on top of
the same physical disk. And put DB there...

Separating DB from main disk would result in much less fragmentation at
DB volume and hence work around the problem. The cost would be having
some extra spare space at DB volume unavailable for user data .
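
For illustration, a minimal sketch of such a redeployment with LVM; the VG/LV
names, the 60G DB size and the OSD id are placeholders, not values from this thread:

# carve two LVs out of the same NVMe: one for block, one for a standalone DB
vgcreate ceph-nvme5 /dev/nvme5n1
lvcreate -L 60G -n osd5-db ceph-nvme5           # ~2x the current DB size
lvcreate -l 100%FREE -n osd5-block ceph-nvme5   # rest goes to the data/block LV

# recreate the OSD with the separate DB volume
ceph-volume lvm create --osd-id 5 --data ceph-nvme5/osd5-block --block.db ceph-nvme5/osd5-db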


Hope this helps,

Igor


On 20/06/2023 10:29, Carsten Grommel wrote:
> Hi all,
>
> we are experiencing the “bluefs enospc bug” again after redeploying all OSDs 
> of our Pacific Cluster.
> I know that our cluster is a bit too utilized at the moment with 87.26 % raw 
> usage but still this should not happen afaik.
> We never had this problem with previous Ceph versions and right now I am kind 
> of out of ideas on how to tackle these crashes.
> Compacting the database did not help in the past either.
> Redeploying does not seem to help in the long run either. For documentation I used 
> these commands to redeploy the OSDs:
>
> systemctl stop ceph-osd@${OSDNUM}
> ceph osd destroy --yes-i-really-mean-it ${OSDNUM}
> blkdiscard ${DEVICE}
> sgdisk -Z ${DEVICE}
> dmsetup remove ${DMDEVICE}
> ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE}
>
> Any ideas or possible solutions for this? I am not yet ready to upgrade our 
> clusters to Quincy, and I presume that this bug is still present in 
> Quincy as well?
>
> Follow our cluster information:
>
> Crash Info:
> ceph crash info 
> 2023-06-19T21:23:51.285180Z_ac4105d7-cb09-45c8-a6e3-8a6bb6727b25
> {
>  "assert_condition": "abort",
>  "assert_file": "/build/ceph/src/os/bluestore/BlueFS.cc",
>  "assert_func": "int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, 
> uint64_t)",
>  "assert_line": 2810,
>  "assert_msg": "/build/ceph/src/os/bluestore/BlueFS.cc: In function 'int 
> BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 
> 7fd561810100 time 
> 2023-06-19T23:23:51.261617+0200\n/build/ceph/src/os/bluestore/BlueFS.cc: 
> 2810: ceph_abort_msg(\"bluefs enospc\")\n",
> 

[ceph-users] Re: radosgw new zonegroup hammers master with metadata sync

2023-06-21 Thread Boris Behrens
I've updated the dc3 site from Octopus to Pacific and the problem is still
there.
I find it very weird that it only happens from one single zonegroup to the
master and not from the other two.

On Wed, 21 June 2023 at 01:59, Boris Behrens wrote:

> I recreated the site and the problem still persists.
>
> I've upped the logging and saw this for a lot of buckets (i've stopped the
> debug log after some seconds).
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 20 get_system_obj_state:
> rctx=0x7fcaab7f9320 obj=dc3.rgw.meta:root:s3bucket-fra2
> state=0x7fcba05ac0a0 s->prefetch_data=0
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get:
> name=dc3.rgw.meta+root+s3bucket-fra2 : miss
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put:
> name=dc3.rgw.meta+root+s3bucket-fra2 info.flags=0x6
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 adding
> dc3.rgw.meta+root+s3bucket-fra2 to cache LRU end
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get:
> name=dc3.rgw.meta+root+s3bucket-fra2 : type miss (requested=0x1, cached=0x6)
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put:
> name=dc3.rgw.meta+root+s3bucket-fra2 info.flags=0x1
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 moving
> dc3.rgw.meta+root+s3bucket-fra2 to cache LRU end
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 20 get_system_obj_state:
> rctx=0x7fcaab7f9320
> obj=dc3.rgw.meta:root:.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> state=0x7fcba43ce0a0 s->prefetch_data=0
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get:
> name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> : miss
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put:
> name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> info.flags=0x16
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 adding
> dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> to cache LRU end
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache get:
> name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> : type miss (requested=0x13, cached=0x16)
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 cache put:
> name=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> info.flags=0x13
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 moving
> dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
> to cache LRU end
> 2023-06-20T23:32:29.365+ 7fcaab7fe700 10 chain_cache_entry:
> cache_locator=dc3.rgw.meta+root+.bucket.meta.s3bucket-fra2:ff7a8b0c-07e6-463a-861b-78f0adeba8ad.2297866866.29
>
> On Tue, 20 June 2023 at 19:29, Boris wrote:
>
>> Hi Casey,
>> I already restarted all RGW instances. It only helped for 2 minutes. We have now
>> stopped the new site.
>>
>> I will remove and recreate it later.
>> As the two other sites don't have the problem, I currently think I made a
>> mistake in the process.
>>
>> Kind regards
>>  - Boris Behrens
>>
>> > On 20.06.2023 at 18:30, Casey Bodley wrote:
>> >
>> > hi Boris,
>> >
>> > we've been investigating reports of excessive polling from metadata
>> > sync. i just opened https://tracker.ceph.com/issues/61743 to track
>> > this. restarting the secondary zone radosgws should help as a
>> > temporary workaround
>> >
>> >> On Tue, Jun 20, 2023 at 5:57 AM Boris Behrens  wrote:
>> >>
>> >> Hi,
>> >> yesterday I added a new zonegroup and it seems to cycle over
>> >> the same requests over and over again.
>> >>
>> >> In the log of the main zone I see these requests:
>> >> 2023-06-20T09:48:37.979+ 7f8941fb3700  1 beast: 0x7f8a602f3700:
>> >> fd00:2380:0:24::136 - - [2023-06-20T09:48:37.979941+] "GET
>> >>
>> /admin/log?type=metadata&id=62&period=e8fc96f1-ae86-4dc1-b432-470b0772fded&max-entries=100&&rgwx-zonegroup=b39392eb-75f8-47f0-b4f3-7d3882930b26
>> >> HTTP/1.1" 200 44 - - -
>> >>
>> >> Only thing that changes is the &id.
>> >>
>> >> We have two other zonegroups that are configured identically (ceph.conf
>> >> and period) and these don't seem to spam the main rgw.
>> >>
>> >> root@host:~# radosgw-admin sync status
>> >>  realm 5d6f2ea4-b84a-459b-bce2-bccac338b3ef (main)
>> >>  zonegroup b39392eb-75f8-47f0-b4f3-7d3882930b26 (dc3)
>> >>   zone 96f5eca9-425b-4194-a152-86e310e91ddb (dc3)
>> >>  metadata sync syncing
>> >>full sync: 0/64 shards
>> >>incremental sync: 64/64 shards
>> >>metadata is caught up with master
>> >>
>> >> root@host:~# radosgw-admin period get
>> >> {
>> >>"id": "e8fc96f1-ae86-4dc1-b432-470b0772fded",
>> >>"epoch": 92,
>> >>"predecessor_uuid": "5349ac85-3d6d-4088-993f-7a1d4be3835a",
>> >>"sync_status": [
>> >>"",
>> >> ...
>> >>""
>> >>],
>> >>"period_map": {
>> >>"id": "e8fc96f1-ae86-4dc1-b4

[ceph-users] Re: Ceph iSCSI GW not working with VMware VMFS and Windows Clustered Storage Volumes (CSV)

2023-06-21 Thread Maged Mokhtar


On 20/06/2023 01:16, Work Ceph wrote:

I see, thanks for the feedback guys!

It is interesting that Ceph Manager does not allow us to export iSCSI 
blocks without selecting 2 or more iSCSI portals. Therefore, we will 
always use at least two, and as a consequence that feature is not 
going to be supported. Can I export an RBD image via iSCSI gateway 
using only one portal via GwCli?


@Maged Mokhtar, I am not sure I follow. Do you guys have an iSCSI 
implementation that we can use to somehow replace the default iSCSI 
server in the default Ceph iSCSI Gateway? I didn't quite understand 
what the petasan project is, and if it is an OpenSource solution that 
we can somehow just pick/select/use one of its modules (e.g. just the 
iSCSI implementation) that you guys have.




For sure PetaSAN is open source.. you should see this from the home page :)
We use Consul
https://www.consul.io/use-cases/multi-platform-service-mesh
to scale out the service/protocol layers above Ceph in a scale-out
active/active fashion.
Most of our target use cases are non linux, such as VMWare and Windows, 
we provide easy to use deployment and management.


For iSCSI, we use the kernel/LIO rbd backstore originally developed by SUSE 
Enterprise Storage. We have made some changes to send persistent 
reservations using Ceph watch/notify, and we also added changes to 
coordinate pre-snapshot quiescing/flushing across the different gateways. We 
ported the rbd backstore to the 5.14 kernel.


You should be able to use the iSCSI gateway by itself on existing non-PetaSAN 
clusters, but it is not a setup we support. You would use the LIO 
targetcli to script the setup. There are some things to take care of, 
such as setting the disk serial WWN to be the same across the different 
gateways serving the same image, and setting up the multiple TPGs (target 
portal groups) for an image but only enabling the TPG for the local node. 
This setup uses multipath (MPIO) to provide HA. Again, it is not 
a setup we support; you could try it yourself in a test environment. You 
can also build a test PetaSAN cluster and examine the LIO configuration 
using targetcli. You can send me an email if you need any clarification.


Cheers /Maged

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How does a "ceph orch restart SERVICE" affect availability?

2023-06-21 Thread Eugen Block

Hi,


Will that try to be smart and just restart a few at a time to keep things
up and available. Or will it just trigger a restart everywhere
simultaneously.


basically, that's what happens for example during an upgrade if  
services are restarted. It's designed to be a rolling upgrade  
procedure so restarting all daemons of a specific service at the same  
time would cause an interruption. So the daemons are scheduled to  
restart and the mgr decides when it's safe to restart the next (this  
is a test cluster started on Nautilus, but it's on Quincy now):


nautilus:~ # ceph orch restart osd.osd-hdd-ssd
Scheduled to restart osd.5 on host 'nautilus'
Scheduled to restart osd.0 on host 'nautilus'
Scheduled to restart osd.2 on host 'nautilus'
Scheduled to restart osd.1 on host 'nautilus2'
Scheduled to restart osd.4 on host 'nautilus2'
Scheduled to restart osd.7 on host 'nautilus2'
Scheduled to restart osd.3 on host 'nautilus3'
Scheduled to restart osd.8 on host 'nautilus3'
Scheduled to restart osd.6 on host 'nautilus3'

When it comes to OSDs it's possible (or even likely) that multiple  
OSDs are restarted at the same time, depending on the pools (and their  
replication size) they are part of. But ceph tries to avoid "inactive  
PGs" which is critical, of course. An edge case would be a pool with  
size 1 where restarting an OSD would cause an inactive PG until the  
OSD is up again. But since size 1 would be a bad idea anyway (except  
for testing purposes) you'd have to live with that.
If you have the option, I'd recommend creating a test cluster and playing
around with these things to get a better understanding, especially
when it comes to upgrade tests etc.



I guess in my current scenario, restarting one host at a time makes the most
sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next


Yes, if your crush-failure-domain is host that should be safe, too.
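
A rough sketch of that host-by-host approach (hostnames and the fsid are placeholders):

for host in ceph-host1 ceph-host2 ceph-host3; do
    ssh "$host" "systemctl restart ceph-<fsid>.target"
    # wait until the cluster reports HEALTH_OK again before the next host
    until ceph -s | grep -q HEALTH_OK; do
        sleep 30
    done
done

If you set flags like noout for the duration, adapt the health check accordingly,
since the flag itself already triggers a HEALTH_WARN.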

Regards,
Eugen

Quoting Mikael Öhman:


The documentation very briefly explains a few core commands for restarting
things;
https://docs.ceph.com/en/quincy/cephadm/operations/#starting-and-stopping-daemons
but I feel I'm lacking quite some details of what is safe to do.

I have a system in production, clusters connected via CephFS and some
shared block devices.
We would like to restart some things due to some new network
configurations. Going daemon by daemon would take forever, so I'm curious
as to what happens if one tries the command;

ceph orch restart osd

Will that try to be smart and just restart a few at a time to keep things
up and available. Or will it just trigger a restart everywhere
simultaneously.

I guess in my current scenario, restarting one host at a time makes the most
sense, with a
systemctl restart ceph-{fsid}.target
and then checking that "ceph -s" says OK before proceeding to the next
host, but I'm still curious as to what the "ceph orch restart xxx" command
would do (but not enough to try it out in production)

Best regards, Mikael
Chalmers University of Technology
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Malte Stroem

Hello Eugen,

recovery and rebalancing were finished; however, now all PGs show missing OSDs.

Everything looks like the PGs are missing OSDs although it finished 
correctly.


As if we shut down the servers immediately.

But we removed the nodes the way it is described in the documentation.

We just added new disks and they join the cluster immediately.

So the old OSDs removed from the cluster are still available; I restored
OSD.40, but it does not want to join the cluster.


Following are the outputs of the mentioned commands:

ceph -s

  cluster:
id: X
health: HEALTH_WARN
1 failed cephadm daemon(s)
1 filesystem is degraded
1 MDSs report slow metadata IOs
19 osds down
4 hosts (50 osds) down
Reduced data availability: 1220 pgs inactive
Degraded data redundancy: 132 pgs undersized

  services:
mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
mgr: cephx02.xx(active, since 92s), standbys: cephx04.yy, 
cephx06.zz 


mds: 2/2 daemons up, 2 standby
osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)

  data:
volumes: 1/2 healthy, 1 recovering
pools:   12 pools, 1345 pgs
objects: 11.02k objects, 1.9 GiB
usage:   145 TiB used, 669 TiB / 814 TiB avail
pgs: 86.617% pgs unknown
 4.089% pgs not active
 39053/33069 objects misplaced (118.095%)
 1165 unknown
 77   active+undersized+remapped
 55   undersized+remapped+peered
 38   active+clean+remapped
 10   active+clean

ceph osd tree

ID   CLASS  WEIGHT      TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-21            4.36646  root ssds
-61            0.87329      host cephx01-ssd
186   ssd      0.87329          osd.186         down   1.0      1.0
-76            0.87329      host cephx02-ssd
263   ssd      0.87329          osd.263           up   1.0      1.0
-85            0.87329      host cephx04-ssd
237   ssd      0.87329          osd.237           up   1.0      1.0
-88            0.87329      host cephx06-ssd
236   ssd      0.87329          osd.236           up   1.0      1.0
-94            0.87329      host cephx08-ssd
262   ssd      0.87329          osd.262         down   1.0      1.0
 -1         1347.07397  root default
-62          261.93823      host cephx01
139   hdd     10.91409          osd.139         down   0        1.0
140   hdd     10.91409          osd.140         down   0        1.0
142   hdd     10.91409          osd.142         down   0        1.0
144   hdd     10.91409          osd.144         down   0        1.0
146   hdd     10.91409          osd.146         down   0        1.0
148   hdd     10.91409          osd.148         down   0        1.0
150   hdd     10.91409          osd.150         down   0        1.0
152   hdd     10.91409          osd.152         down   0        1.0
154   hdd     10.91409          osd.154         down   1.0      1.0
156   hdd     10.91409          osd.156         down   1.0      1.0
158   hdd     10.91409          osd.158         down   1.0      1.0
160   hdd     10.91409          osd.160         down   1.0      1.0
162   hdd     10.91409          osd.162         down   1.0      1.0
164   hdd     10.91409          osd.164         down   1.0      1.0
166   hdd     10.91409          osd.166         down   1.0      1.0
168   hdd     10.91409          osd.168         down   1.0      1.0
170   hdd     10.91409          osd.170         down   1.0      1.0
172   hdd     10.91409          osd.172         down   1.0      1.0
174   hdd     10.91409          osd.174         down   1.0      1.0
176   hdd     10.91409          osd.176         down   1.0      1.0
178   hdd     10.91409          osd.178         down   1.0      1.0
180   hdd     10.91409          osd.180         down   1.0      1.0
182   hdd     10.91409          osd.182         down   1.0      1.0
184   hdd     10.91409          osd.184         down   1.0      1.0
-67          261.93823      host cephx02
138   hdd     10.91409          osd.138           up   1.0      1.0
141   hdd     10.91409          osd.141           up   1.0      1.0
143   hdd     10.91409          osd.143           up   1.0      1.0
145   hdd     10.91409          osd.145           up   1.0      1.0
147   hdd     10.91409          osd.147           up   1.0      1.0
149   hdd     10.91409          osd.149           up   1.0      1.0
151   hdd     10.91409          osd.151           up   1.0      1.0
153   hdd     10.91409          osd.153           up   1.0      1.0
155   hdd     10.91409          osd.155           up   1.0      1.0
157

[ceph-users] Re: alerts in dashboard

2023-06-21 Thread Ankush Behl
Hi Ben, also, if some alerts are noisy, we have an option in the dashboard to
silence those alerts.

Also, can you provide the list of critical alerts that you see?

On Wed, 21 Jun 2023 at 12:48, Nizamudeen A  wrote:

> Hi Ben,
>
> It looks like you forgot to attach the screenshots.
>
> Regards,
> Nizam
>
> On Wed, Jun 21, 2023, 12:23 Ben  wrote:
>
> > Hi,
> >
> > I got many critical alerts in ceph dashboard. Meanwhile the cluster shows
> > health ok status.
> >
> > See attached screenshot for detail. My questions are, are they real
> alerts?
> > How to get rid of them?
> >
> > Thanks
> > Ben
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Recover OSDs from folder /var/lib/ceph/uuid/removed

2023-06-21 Thread Malte Stroem

Yes, I was missing create:

ceph osd create uuid id

This works!
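
Putting the pieces from this thread together, a minimal sketch of the whole
sequence for such a removed-but-still-present OSD (id, fsid, weight and hostname
are placeholders):

# re-register the OSD id with its original UUID
ceph osd create <osd-uuid> <id>

# re-import the cephx key (caps as used below in this thread)
ceph auth add osd.<id> osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/<fsid>/osd.<id>/keyring

# put the OSD back into the CRUSH map
ceph osd crush set osd.<id> 1.0 root=default host=<hostname>

# re-enable and start the containerized systemd unit
systemctl enable --now ceph-<fsid>@osd.<id>.service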

Best,
Malte

On 20.06.23 at 18:42, Malte Stroem wrote:

Well, things I would do:

- add the keyring to ceph auth

ceph auth add osd.XX osd 'allow *' mon 'allow rwx' -i 
/var/lib/ceph/uuid/osd.XX/keyring


- add OSD to crush

ceph osd crush set osd.XX 1.0 root=default ...

- create systemd service

systemctl enable ceph-u...@osd.xx.service

Is there something I am missing?

Best,
Malte

On 20.06.23 at 18:04, Malte Stroem wrote:

Hello,

is it possible to recover an OSD if it was removed?

The systemd service was removed but the block device is still listed 
under


lsblk

and the config files are still available under

/var/lib/ceph/uuid/removed

It is a containerized cluster. So I think we need to add the cephx 
entries, use ceph-volume, crush, and so on.


Best regards,
Malte

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs

2023-06-21 Thread Igor Fedotov

Hi Carsten,

please also note a workaround to bring the osds back for e.g. data 
recovery - set bluefs_shared_alloc_size to 32768.


This will hopefully allow the OSD to start up and pull data out of it. But I 
would discourage you from using such OSDs long term, as fragmentation 
might evolve and this workaround will become ineffective as well.


Please do not apply this change to healthy OSDs as it's irreversible.
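
A minimal sketch of applying that workaround to one affected OSD (the id is a
placeholder; the option is read at startup, so the daemon needs a restart afterwards):

ceph config set osd.<id> bluefs_shared_alloc_size 32768
ceph orch daemon restart osd.<id>   # or restart the OSD's systemd unit directly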


BTW, having two namespaces on an NVMe drive is a good alternative to logical 
volumes if for some reason one needs two "physical" disks for an OSD setup...


Thanks,

Igor

On 21/06/2023 11:41, Carsten Grommel wrote:


Hi Igor,

thank you for your answer!

>first of all Quincy does have a fix for the issue, see
>https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
>https://tracker.ceph.com/issues/58588)

Thank you I somehow missed that release, good to know!

>SSD or HDD? Standalone or shared DB volume? I presume the latter... What
>is disk size and current utilization?
>
>Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
>possible

We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell 
and Samsung in this cluster:


Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1  osd.5

All disks are at ~88% utilization. I noticed that at around 92% our 
disks tend to run into this bug.


Here are some bluefs-bdev-sizes from different OSDs on different hosts 
in this cluster:


ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB)

ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/

inferring bluefs devices from bluestore path

1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB)

>Generally, given my assumption that DB volume is currently collocated
>and you still want to stay on Pacific, you might want to consider
>redeploying OSDs with a standalone DB volume setup.
>
>Just create large enough (2x of the current DB size seems to be pretty
>conservative estimation for that volume's size) additional LV on top of
>the same physical disk. And put DB there...
>
>Separating DB from main disk would result in much less fragmentation at
>DB volume and hence work around the problem. The cost would be having
>some extra spare space at DB volume unavailable for user data .

I guess that makes sense, so the suggestion would be to deploy the OSD and 
DB on the same NVMe

but with different logical volumes, or to update to Quincy.

Thank you!

Carsten

From: Igor Fedotov
Date: Tuesday, 20 June 2023 at 12:48
To: Carsten Grommel, ceph-users@ceph.io

Subject: Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly 
created OSDs


Hi Carsten,

first of all Quincy does have a fix for the issue, see
https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
https://tracker.ceph.com/issues/58588)

Could you please share a bit more info on OSD disk layout?

SSD or HDD? Standalone or shared DB volume? I presume the latter... What
is disk size and current utilization?

Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
possible


Generally, given my assumption that DB volume is currently collocated
and you still want to stay on Pacific, you might want to consider
redeploying OSDs with a standalone DB volume setup.

Just create large enough (2x of the current DB size seems to be pretty
conservative estimation for that volume's size) additional LV on top of
the same physical disk. And put DB there...

Separating DB from main disk would result in much less fragmentation at
DB volume and hence work around the problem. The cost would be having
some extra spare space at DB volume unavailable for user data .


Hope this helps,

Igor


On 20/06/2023 10:29, Carsten Grommel wrote:
> Hi all,
>
> we are experiencing the “bluefs enospc bug” again after redeploying 
all OSDs of our Pacific Cluster.
> I know that our cluster is a bit too utilized at the moment with 
87.26 % raw usage but still this should not happen afaik.
> We never had this problem with previous Ceph versions and right now 
I am kind of out of ideas on how to tackle these crashes.

> Compacting the database did not help in the past either.
> Redeploying does not seem to help in the long run either. For documentation 
I used these commands to redeploy the OSDs:

>
> systemctl stop ceph-osd@${OSDNUM}
> ceph osd destroy --yes-i-really-mean-it ${OSDNUM}
> blkdiscard ${DEVICE}
> sgdisk -Z ${DEVICE}
> dmsetup remove ${DMDEVICE}
> ceph-volume lvm create --osd-id ${OSDNUM} --data ${DEVICE}
>
> Any ideas or possible solutions for this? I am not yet ready to 
upgrade our clusters to Quincy, and I presume that this bug is 
still present in Quincy as well?

>
> Follow our clus

[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-21 Thread Christian Theune
Aaaand another dead end: there is too much metadata involved (bucket and 
object ACLs, lifecycle, policy, …) for it to be possible to migrate perfectly. 
Also, lifecycles _might_ be affected if mtimes change.

So, I’m going to try and go back to a single-cluster multi-zone setup. For that 
I’m going to change all buckets with explicit placements to remove the explicit 
placement markers (those were created from old versions of Ceph and weren’t 
intentional by us, they perfectly reflect the default placement configuration).

Here’s the patch I’m going to try on top of our Nautilus branch now:
https://github.com/flyingcircusio/ceph/commit/b3a317987e50f089efc4e9694cf6e3d5d9c23bd5

All our buckets with explicit placements conform perfectly to the default 
placement, so this seems safe.

Otherwise Zone migration was perfect until I noticed the objects with explicit 
placements in our staging and production clusters. (The dev cluster seems to 
have been purged intermediately, so this wasn’t noticed).

I’m actually wondering whether explicit placements are really a sensible thing 
to have, even in multi-cluster multi-zone setups. AFAICT due to realms you 
might end up with different zonegroups referring to the same pools and this 
should only run through proper abstractions … o_O

Cheers,
Christian

> On 14. Jun 2023, at 17:42, Christian Theune  wrote:
> 
> Hi,
> 
> further note to self and for posterity … ;)
> 
> This turned out to be a no-go as well, because you can’t silently switch the 
> pools to a different storage class: the objects will be found, but the index 
> still refers to the old storage class and lifecycle migrations won’t work.
> 
> I’ve brainstormed for further options and it appears that the last resort is 
> to use placement targets and copy the buckets explicitly - twice, because on 
> Nautilus I don’t have renames available, yet. :( 
> 
> This will require temporary downtimes prohibiting users to access their 
> bucket. Fortunately we only have a few very large buckets (200T+) that will 
> take a while to copy. We can pre-sync them of course, so the downtime will 
> only be during the second copy.
> 
> Christian
> 
>> On 13. Jun 2023, at 14:52, Christian Theune  wrote:
>> 
>> Following up to myself and for posterity:
>> 
>> I’m going to try to perform a switch here using (temporary) storage classes 
>> and renaming of the pools to ensure that I can quickly change the STANDARD 
>> class to a better EC pool and have new objects located there. After that 
>> we’ll add (temporary) lifecycle rules to all buckets to ensure their objects 
>> will be migrated to the STANDARD class.
>> 
>> Once that is finished we should be able to delete the old pool and the 
>> temporary storage class.
>> 
>> First tests appear successful, but I’m a bit struggling to get the bucket 
>> rules working (apparently 0 days isn’t a real rule … and the debug interval 
>> setting causes highly frequent LC runs but doesn’t seem to move objects just yet). 
>> I’ll play around with that setting a bit more, though; I think I might have 
>> tripped something that only wants to process objects every so often, and on 
>> an interval of 10 a day is still 2.4 hours … 
>> 
>> Cheers,
>> Christian
>> 
>>> On 9. Jun 2023, at 11:16, Christian Theune  wrote:
>>> 
>>> Hi,
>>> 
>>> we are running a cluster that has been alive for a long time and we tread 
>>> carefully regarding updates. We are still a bit lagging and our cluster 
>>> (that started around Firefly) is currently at Nautilus. We’re updating and 
>>> we know we’re still behind, but we do keep running into challenges along 
>>> the way that typically are still unfixed on main and - as I started with - 
>>> have to tread carefully.
>>> 
>>> Nevertheless, mistakes happen, and we found ourselves in this situation: we 
>>> converted our RGW data pool from replicated (n=3) to erasure coded (k=10, 
>>> m=3, with 17 hosts) but when doing the EC profile selection we missed that 
>>> our hosts are not evenly balanced (this is a growing cluster and some 
>>> machines have around 20TiB capacity for the RGW data pool, wheres newer 
>>> machines have around 160TiB and we rather should have gone with k=4, m=3.  
>>> In any case, having 13 chunks causes too many hosts to participate in each 
>>> object. Going for k+m=7 will allow distribution to be more effective as we 
>>> have 7 hosts that have the 160TiB sizing.
>>> 
>>> Our original migration used the “cache tiering” approach, but that only 
>>> works once when moving from replicated to EC and can not be used for 
>>> further migrations.
>>> 
>>> The amount of data is, at 215TiB, somewhat significant, so we need an approach 
>>> that scales when copying data[1] to avoid ending up with months of 
>>> migration.
>>> 
>>> I’ve run out of ideas doing this on a low-level (i.e. trying to fix it on a 
>>> rados/pool level) and I guess we can only fix this on an application level 
>>> using multi-zone replication.
>>> 
>>> I have the setup nailed in gen

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Eugen Block
I still can’t really grasp what might have happened here. But could  
you please clarify which of the down OSDs (or Hosts) are supposed to  
be down and which you’re trying to bring back online? Obviously osd.40  
is one of your attempts. But what about the hosts cephx01 and cephx08?  
Are those the ones refusing to start their OSDs? And the remaining up  
OSDs you haven’t touched yet, correct?
And regarding debug logs, you should set it with ceph config set  
because the local ceph.conf won’t have an effect. It could help to  
have the startup debug logs from one of the OSDs.
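
For example, something along these lines (osd.40 is just the example id from
above; pick the debug levels you prefer):

ceph config set osd.40 debug_osd 10
ceph config set osd.40 debug_monc 20
ceph config set osd.40 debug_auth 20
# then restart the daemon and collect its startup log, e.g.
cephadm logs --name osd.40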


Quoting Malte Stroem:


Hello Eugen,

recovery and rebalancing were finished; however, now all PGs show missing OSDs.

Everything looks like the PGs are missing OSDs although it finished  
correctly.


As if we shut down the servers immediately.

But we removed the nodes the way it is described in the documentation.

We just added new disks and they join the cluster immediately.

So the old OSDs removed from the cluster are still available; I restored
OSD.40, but it does not want to join the cluster.


Following are the outputs of the mentioned commands:

ceph -s

  cluster:
id: X
health: HEALTH_WARN
1 failed cephadm daemon(s)
1 filesystem is degraded
1 MDSs report slow metadata IOs
19 osds down
4 hosts (50 osds) down
Reduced data availability: 1220 pgs inactive
Degraded data redundancy: 132 pgs undersized

  services:
mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
mgr: cephx02.xx(active, since 92s), standbys: cephx04.yy, cephx06.zz
mds: 2/2 daemons up, 2 standby

osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)

  data:
volumes: 1/2 healthy, 1 recovering
pools:   12 pools, 1345 pgs
objects: 11.02k objects, 1.9 GiB
usage:   145 TiB used, 669 TiB / 814 TiB avail
pgs: 86.617% pgs unknown
 4.089% pgs not active
 39053/33069 objects misplaced (118.095%)
 1165 unknown
 77   active+undersized+remapped
 55   undersized+remapped+peered
 38   active+clean+remapped
 10   active+clean

ceph osd tree

ID   CLASS  WEIGHT      TYPE NAME             STATUS  REWEIGHT  PRI-AFF
-21            4.36646  root ssds
-61            0.87329      host cephx01-ssd
186   ssd      0.87329          osd.186         down   1.0      1.0
-76            0.87329      host cephx02-ssd
263   ssd      0.87329          osd.263           up   1.0      1.0
-85            0.87329      host cephx04-ssd
237   ssd      0.87329          osd.237           up   1.0      1.0
-88            0.87329      host cephx06-ssd
236   ssd      0.87329          osd.236           up   1.0      1.0
-94            0.87329      host cephx08-ssd
262   ssd      0.87329          osd.262         down   1.0      1.0
 -1         1347.07397  root default
-62          261.93823      host cephx01
139   hdd     10.91409          osd.139         down   0        1.0
140   hdd     10.91409          osd.140         down   0        1.0
142   hdd     10.91409          osd.142         down   0        1.0
144   hdd     10.91409          osd.144         down   0        1.0
146   hdd     10.91409          osd.146         down   0        1.0
148   hdd     10.91409          osd.148         down   0        1.0
150   hdd     10.91409          osd.150         down   0        1.0
152   hdd     10.91409          osd.152         down   0        1.0
154   hdd     10.91409          osd.154         down   1.0      1.0
156   hdd     10.91409          osd.156         down   1.0      1.0
158   hdd     10.91409          osd.158         down   1.0      1.0
160   hdd     10.91409          osd.160         down   1.0      1.0
162   hdd     10.91409          osd.162         down   1.0      1.0
164   hdd     10.91409          osd.164         down   1.0      1.0
166   hdd     10.91409          osd.166         down   1.0      1.0
168   hdd     10.91409          osd.168         down   1.0      1.0
170   hdd     10.91409          osd.170         down   1.0      1.0
172   hdd     10.91409          osd.172         down   1.0      1.0
174   hdd     10.91409          osd.174         down   1.0      1.0
176   hdd     10.91409          osd.176         down   1.0      1.0
178   hdd     10.91409          osd.178         down   1.0      1.0
180   hdd     10.91409          osd.180         down   1.0      1.0
182   hdd     10.91409          osd.182         down   1.0      1.0
184   hdd     10.91409          osd.184         down   1.0      1.0
-67          261.93823      host cephx02
138   hdd     10.91409          osd.138           up   1.0      1.

[ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs

2023-06-21 Thread Fox, Kevin M
Does Quincy automatically switch existing things to 4k, or do you need to deploy a 
new OSD to get the 4k size?

Thanks,
Kevin


From: Igor Fedotov 
Sent: Wednesday, June 21, 2023 5:56 AM
To: Carsten Grommel; ceph-users@ceph.io
Subject: [ceph-users] Re: Ceph Pacific bluefs enospc bug with newly created OSDs



Hi Carsten,

please also note a workaround to bring the osds back for e.g. data
recovery - set bluefs_shared_alloc_size to 32768.

This will hopefully allow the OSD to start up and pull data out of it. But I
would discourage you from using such OSDs long term, as fragmentation
might evolve and this workaround will become ineffective as well.

Please do not apply this change to healthy OSDs as it's irreversible.


BTW, having two namespaces on an NVMe drive is a good alternative to logical
volumes if for some reason one needs two "physical" disks for an OSD setup...

Thanks,

Igor

On 21/06/2023 11:41, Carsten Grommel wrote:
>
> Hi Igor,
>
> thank you for your answer!
>
> >first of all Quincy does have a fix for the issue, see
> >https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
> >https://tracker.ceph.com/issues/58588)
>
> Thank you I somehow missed that release, good to know!
>
> >SSD or HDD? Standalone or shared DB volume? I presume the latter... What
> >is disk size and current utilization?
> >
> >Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
> >possible
>
> We use 4 TB NVMe SSDs, shared db yes and mainly Micron with some Dell
> and Samsung in this cluster:
>
> Micron_7400_MTFDKCB3T8TDZ_214733D291B1 cloud5-1561:nvme5n1  osd.5
>
> All disks are at ~88% utilization. I noticed that at around 92% our
> disks tend to run into this bug.
>
> Here are some bluefs-bdev-sizes from different OSDs on different hosts
> in this cluster:
>
> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-36/
>
> inferring bluefs devices from bluestore path
>
> 1 : device size 0x37e3ec0 : using 0x2e1b390(2.9 TiB)
>
> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-24/
>
> inferring bluefs devices from bluestore path
>
> 1 : device size 0x37e3ec0 : using 0x2d4e318d000(2.8 TiB)
>
> ceph-bluestore-tool bluefs-bdev-sizes --path /var/lib/ceph/osd/ceph-5/
>
> inferring bluefs devices from bluestore path
>
> 1 : device size 0x37e3ec0 : using 0x2f2da93d000(2.9 TiB)
>
> >Generally, given my assumption that DB volume is currently collocated
> >and you still want to stay on Pacific, you might want to consider
> >redeploying OSDs with a standalone DB volume setup.
> >
> >Just create large enough (2x of the current DB size seems to be pretty
> >conservative estimation for that volume's size) additional LV on top of
> >the same physical disk. And put DB there...
> >
> >Separating DB from main disk would result in much less fragmentation at
> >DB volume and hence work around the problem. The cost would be having
> >some extra spare space at DB volume unavailable for user data .
>
> I guess that makes sense, so the suggestion would be to deploy the OSD and
> DB on the same NVMe
>
> but with different logical volumes, or to update to Quincy.
>
> Thank you!
>
> Carsten
>
> From: Igor Fedotov
> Date: Tuesday, 20 June 2023 at 12:48
> To: Carsten Grommel, ceph-users@ceph.io
>
> Subject: Re: [ceph-users] Ceph Pacific bluefs enospc bug with newly
> created OSDs
>
> Hi Carsten,
>
> first of all Quincy does have a fix for the issue, see
> https://tracker.ceph.com/issues/53466 (and its Quincy counterpart
> https://tracker.ceph.com/issues/58588)
>
> Could you please share a bit more info on OSD disk layout?
>
> SSD or HDD? Standalone or shared DB volume? I presume the latter... What
> is disk size and current utilization?
>
> Please share ceph-bluestore-tool's bluefs-bdev-sizes command output if
> possible
>
>
> Generally, given my assumption that DB volume is currently collocated
> and you still want to stay on Pacific, you might want to consider
> redeploying OSDs with a standalone DB volume setup.
>
> Just create large enough (2x of the current DB size seems to be pretty
> conservative estimation for that volume's size) additional LV on top of
> the same physical disk. And put DB there...
>
> Separating DB from main disk would result in much less fragmentation at
> DB volume and hence work around the problem. The cost would be having
> some extra spare space at DB volume unavailable for user data .
>
>
> Hope this helps,
>
> Igor
>
>
> On 20/06/2023 10:29, Carsten Grommel wrote:
> > Hi all,
> >
> > we are experiencing the “bluefs enospc bug” again after redeploying
> all OSDs of our Pacific Cluster.
> > I know that our cluster is a bit too utilized at the moment with
> 87.26 % raw usage but still this should not happen afaik.
> > We never hat this problem with previous ceph versions and right now
> I am kind of out of ideas at how to tackle these crashes.
> >

[ceph-users] How to repair pg in failed_repair state?

2023-06-21 Thread 이 강우
A lot of PGs ended up in an inconsistent state.

Most of them were repaired with ceph pg repair all, but in the case of the 3 PGs 
shown below, the repair does not proceed any further and they stay in failed_repair status.

[root@cephvm1 ~]# ceph health detail
HEALTH_ERR 30 scrub errors; Too many repaired reads on 7 OSDs; Possible data 
damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 30 scrub errors
OSD_TOO_MANY_REPAIRS Too many repaired reads on 7 OSDs
 osd.29 had 315 reads repaired
 osd.23 had 530 reads repaired
 osd.18 had 69 reads repaired
 osd.2 had 267 reads repaired
 osd.0 had 179 reads repaired
 osd.12 had 513 reads repaired
 osd.13 had 404 reads repaired
PG_DAMAGED Possible data damage: 3 pgs inconsistent
 pg 2.2f is active+clean+inconsistent+failed_repair, acting [29,13,18]
 pg 2.46 is active+clean+inconsistent+failed_repair, acting [12,0,29]
 pg 2.5c is active+clean+inconsistent+failed_repair, acting [12,23,0]
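
For reference, the per-object inconsistency details of such a PG can be listed with
something like the following (a generic sketch, not output from this cluster):

rados list-inconsistent-obj 2.2f --format=json-pretty
# if the report is stale, trigger a fresh deep scrub first
ceph pg deep-scrub 2.2f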

The query result of pg 2.2f is as follows, and the problem seems to be that the 
three peer versions are different.

[root@cephvm1 ~]# ceph pg 2.2f query
{
"state": "active+clean+inconsistent+failed_repair",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 426,
"up": [
29,
13,
18
],
"acting": [
29,
13,
18
],
"acting_recovery_backfill": [
"13",
"18",
"29"
],
"info": {
"pgid": "2.2f",
"last_update": "426'128436680",
"last_complete": "426'128436680",
"log_tail": "390'128433627",
"last_user_version": 128436529,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 111,
"epoch_pool_created": 67,
"last_epoch_started": 426,
"last_interval_started": 425,
"last_epoch_clean": 426,
"last_interval_clean": 425,
"last_epoch_split": 111,
"last_epoch_marked_full": 0,
"same_up_since": 425,
"same_interval_since": 425,
"same_primary_since": 425,
"last_scrub": "426'128436680",
"last_scrub_stamp": "2023-06-21 15:57:53.645395",
"last_deep_scrub": "426'128436680",
"last_deep_scrub_stamp": "2023-06-21 15:57:53.645395",
"last_clean_scrub_stamp": "2023-03-28 09:11:29.298557"
},
"stats": {
"version": "426'128436680",
"reported_seq": "128628939",
"reported_epoch": "426",
"state": "active+clean+inconsistent+failed_repair",
"last_fresh": "2023-06-21 15:57:53.645450",
"last_change": "2023-06-21 15:57:53.645450",
"last_active": "2023-06-21 15:57:53.645450",
"last_peered": "2023-06-21 15:57:53.645450",
"last_clean": "2023-06-21 15:57:53.645450",
"last_became_active": "2023-06-21 14:03:02.233710",
"last_became_peered": "2023-06-21 14:03:02.233710",
"last_unstale": "2023-06-21 15:57:53.645450",
"last_undegraded": "2023-06-21 15:57:53.645450",
"last_fullsized": "2023-06-21 15:57:53.645450",
"mapping_epoch": 425,
"log_start": "390'128433627",
"ondisk_log_start": "390'128433627",
"created": 111,
"last_epoch_clean": 426,
"parent": "0.0",
"parent_split_bits": 7,
"last_scrub": "426'128436680",
"last_scrub_stamp": "2023-06-21 15:57:53.645395",
"last_deep_scrub": "426'128436680",
"last_deep_scrub_stamp": "2023-06-21 15:57:53.645395",
"last_clean_scrub_stamp": "2023-03-28 09:11:29.298557",
"log_size": 3053,
"ondisk_log_size": 3053,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 10888387166,
"num_objects": 2610,
"num_object_clones": 0,
"num_object_copies": 7830,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 2610,
"num_whiteouts": 0,
"num_read": 191976,
"num_read_kb": 10314827,
"num_write": 128429383,
"num_write_kb": 741542291,
"num_scrub_errors": 3,
"num_shallow_scrub_errors": 0,
"num_deep_scrub_errors": 3,
"num_objects_recovered": 

[ceph-users] Re: OSDs cannot join cluster anymore

2023-06-21 Thread Stefan Kooman

On 6/21/23 11:20, Malte Stroem wrote:

Hello Eugen,

recovery and rebalancing were finished; however, now all PGs show missing 
OSDs.


Everything looks like the PGs are missing OSDs although it finished 
correctly.


As if we shut down the servers immediately.

But we removed the nodes the way it is described in the documentation.

We just added new disks and they join the cluster immediately.

So the old OSDs removed from the cluster are still available; I restored 
OSD.40, but it does not want to join the cluster.



Are the osd.$id keys of the removed OSDs still there (check with ceph 
auth list)? Otherwise you might need to import the keyring into the 
cluster (/var/lib/ceph/osd/ceph-$id/keyring) and give it the proper caps.
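
A minimal sketch of that check and re-import, assuming the keyring file still
exists on the OSD host (the caps shown are the usual OSD defaults):

# does the key still exist in the cluster?
ceph auth get osd.$id

# if not, re-add it from the local keyring with standard OSD caps
ceph auth add osd.$id mon 'allow profile osd' mgr 'allow profile osd' osd 'allow *' \
    -i /var/lib/ceph/osd/ceph-$id/keyring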


Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io