[ceph-users] Re: bidirectional rbd-mirroring

2023-01-18 Thread Aielli, Elia
That's an idea,

Moreover, I've discovered that my two clusters can't share the same name (i.e.
"ceph"); I have to change the cluster name with an environment variable in
/etc/default/ceph (I've deployed Ceph via Proxmox 6.x and that's the name it
gives my cluster by default).

This is kind of an issue, because in Red Hat Ceph it's a simple three-step
procedure: add the CLUSTER variable to the said file, create a symlink between
ceph.conf and .conf, and you are good to go. Unfortunately, in Proxmox Ceph
this doesn't work; instead it breaks every daemon...
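
For reference, the Red Hat style steps I mean look roughly like this (a sketch only; the cluster name "backup" is just an example, and as said this breaks on Proxmox):

```
# sketch of the Red Hat / Debian style custom cluster name setup
echo "CLUSTER=backup" >> /etc/default/ceph
ln -s /etc/ceph/ceph.conf /etc/ceph/backup.conf
# then restart the ceph daemons
```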

If someone has already experienced this scenario and would like to share
his/her experience, I'd be very grateful
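
As for Eugen's suggestion below, my understanding is the sequence would be something like this (a sketch; pool name and peer UUID are placeholders):

```
rbd mirror pool info <pool>                           # note the UUID of the existing peer
rbd mirror pool peer remove <pool> <peer-uuid>
rbd mirror pool peer add <pool> client.rbd-mirror.backup@backup
```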

BR
Elia

On Tue, 17 Jan 2023 at 15:18, Eugen Block wrote:

> Hi,
>
> maybe you need to remove the peer first before readding it with a
> different config? At least that's how I interpret the code [1]. I
> haven't tried it myself though, so be careful and maybe test it first
> in a test environment.
>
> [1]
>
> https://github.com/ceph/ceph/blob/main/src/tools/rbd/action/MirrorPool.cc#L1029
>
> Zitat von "Aielli, Elia" :
>
> > Hi all,
> >
> > I've a working couple of clusters configured with rbd mirror: the Master
> > cluster is production, the Backup cluster is DR. Right now everything is
> > working well with Master configured in "tx-only" and Backup in "rx-tx".
> > I'd like to modify the Master direction to rx-tx so I'm already prepared
> > for a failover after a disaster has happened, but while doing so I face
> > this error and I'm stuck:
> >
> > ceph version 15.2.17 (694d03a6f6c6e9f814446223549caf9a9f60dba0) octopus
> > (stable)
> >
> > The Ceph user able to operate on Master is rbd-mirror.master, while on
> > Backup it is rbd-mirror.backup.
> > On the Master cluster I have ceph.conf and backup.conf, and on the Backup
> > cluster I have ceph.conf and master.conf.
> > Keyrings have been copied correctly.
> > I've changed the direction without any problem, but when I try to
> > configure the peer with this command, I receive the following error:
> >
> > root@master# rbd mirror pool peer add  client.rbd-mirror.backup@backup
> > rbd: multiple RX peers are not currently supported
> >
> > And when I check my pool info, I have the "Client:" section empty (while
> > the one on my DR is populated with client.rbd-mirror.master).
> >
> > Can someone lend me a hand?
> > Is this something I can't do or simply I'm using the wrong commands?
> >
> > Thanks in advance!
> > Elia
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] 17.2.5 ceph fs status: AssertionError

2023-01-18 Thread Robert Sander

Hi,

I have a healthy (test) cluster running 17.2.5:

root@cephtest20:~# ceph status
  cluster:
    id:     ba37db20-2b13-11eb-b8a9-871ba11409f6
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum cephtest31,cephtest41,cephtest21 (age 2d)
    mgr: cephtest22.lqzdnk(active, since 4d), standbys: cephtest32.ybltym, cephtest42.hnnfaf
    mds: 1/1 daemons up, 1 standby, 1 hot standby
    osd: 48 osds: 48 up (since 4d), 48 in (since 4M)
    rgw: 2 daemons active (2 hosts, 1 zones)
    tcmu-runner: 6 portals active (3 hosts)

  data:
    volumes: 1/1 healthy
    pools:   17 pools, 513 pgs
    objects: 28.25k objects, 4.7 GiB
    usage:   26 GiB used, 4.7 TiB / 4.7 TiB avail
    pgs:     513 active+clean

  io:
    client:   4.3 KiB/s rd, 170 B/s wr, 5 op/s rd, 0 op/s wr

CephFS is mounted and can be used without any issue.

But I get an error when querying its status:

root@cephtest20:~# ceph fs status
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1757, in _handle_command
return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call
return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/status/module.py", line 159, in handle_fs_status
assert metadata
AssertionError


The dashboard's filesystem page shows no error and displays
all information about cephfs.

Where does this AssertionError come from?
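
My guess is that the mgr has no metadata cached for one of the MDS daemons; a few standard commands to check that theory (the interpretation is only an assumption on my part):

```
# does the mgr know metadata for every MDS daemon in the FSMap?
ceph mds metadata
ceph fs dump
# failing over to a standby mgr sometimes refreshes stale daemon metadata
ceph mgr fail
```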

Regards
--
Robert Sander
Heinlein Support GmbH
Linux: Akademie - Support - Hosting
http://www.heinlein-support.de

Tel: 030-405051-43
Fax: 030-405051-19

Zwangsangaben lt. §35a GmbHG:
HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stable erasure coding CRUSH rule for multiple hosts?

2023-01-18 Thread Eugen Block

Hi,

I only have one remark on your assumption regarding maintenance with
your current setup. With your profile k=4 m=2 you'd have a min_size of 5
(k + 1, which is recommended), so taking one host down would still result
in an IO pause because min_size is not met. To allow IO you'd need to
reduce min_size to 4, which is only recommended in disaster scenarios.
With three nodes you'd be better off with replication size 3, although
it requires more storage, of course.
Adding (or removing) OSDs always results in remapping, so I don't think
what you're describing is unexpected.
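
Just for completeness, checking and (only temporarily, in a disaster) lowering min_size would be done like this (pool name is a placeholder):

```
ceph osd pool get <pool> min_size
# disaster recovery only, revert afterwards:
ceph osd pool set <pool> min_size 4
```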


Regards,
Eugen

Quoting aschmitz:


Hi folks,

I have a small cluster of three Ceph hosts running on Pacific. I'm  
trying to balance resilience and disk usage, so I've set up a k=4  
m=2 pool for some bulk storage on HDD devices.


With the correct placement of PGs this should allow me to take any  
one host offline for maintenance. I've written this CRUSH rule for  
that purpose:


rule erasure_k4_m2_hdd_rule {
  id 3
  type erasure
  step take default class hdd
  step choose indep 3 type host
  step chooseleaf indep 2 type osd
  step emit
}

This should pick three hosts, and then two OSDs from each, which at  
least ensures that no host has more than two OSDs.


This appears to work correctly, but I'm running into an odd  
situation when adding additional OSDs to the cluster: sometimes the  
hosts flip order in a PG's set, resulting in unnecessary remapping  
work.


For example, I have one PG that changed from OSDs [0,13,7,9,3,5] to  
[0,13,3,5,7,9]. (Note that the middle two and last two sets of OSDs  
have swapped with one another.) From a quick perusal of other PGs  
that are being moved, the two OSDs within a host never appear to be  
rearranged, but the set of hosts that are chosen may be shuffled.


Is there something I'm missing that would make this rule more stable  
in the face of OSD addition? (I'm wondering if the host choosing  
component should be "firstn" rather than "indep", even though the  
discussion at  
https://docs.ceph.com/en/latest/rados/operations/crush-map-edits/#crushmaprules implies indep is preferable in EC  
pools.)
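
(One way to compare rule variants offline, as far as I can tell, is crushtool's test mode against a dump of the CRUSH map; a sketch, with an arbitrary file name:)

```
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 3 --num-rep 6 --show-mappings | head
```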


I don't have current plans to expand beyond a three-host cluster,  
but if there's an alternative way to express "not more than two OSDs  
per host", that could be helpful as well.


Any insights or suggestions would be appreciated.

Thanks,
aschmitz
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph Community Infrastructure Outage

2023-01-18 Thread Marc
> As services grew, we relied
> more and more on its legacy storage solution, which was never migrated to
> Ceph. Over the last few months, this legacy storage solution had several
> instances of silent data corruption, rendering the VMs unbootable, taking
> down various services, and requiring restoration from backups in many cases.

the shoemaker's children go barefoot ;)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] [RFC] Detail view of OSD network I/O

2023-01-18 Thread Nico Schottelius


Good morning ceph community,

for quite some time I have been wondering whether it would make sense to add an
iftop-like interface to Ceph that shows network traffic / IOPS on a per-IP
basis?

I am aware of "rbd perf image iotop"; however, I am much more interested
in a combined metric featuring 1) which clients read/write to where
and 2) inter-OSD traffic, to see the total load on the cluster and be
able to drill down.

For example, the metric could look like this:


FROM  TOBytes/s Packets/s
osd.0 [IP] -> [IP] osd.10   ..  ..
osd.0 [IP] -> [IP] client   ..  ..


Given that this table would be sortable by from/to/min-or-max
bytes/min-or-max packets, this would allow spotting the heaviest flows.

And maybe a summarised view such as:


FROMIN Bytes/s OUT Bytes/s   IN Packets/s OUT Packets/s
osd.0  [IP]
osd.10 [IP]


This way it would be nicely possible to identify high load.

If it was combined with average/current latency, it would potentially
also be able to find the bottlenecks in the cluster.

From my perspective, easily combining client and intra-cluster traffic
would be very helpful.

What do you think, does that make sense, does it already exist or how do
you approach this?
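
(A very rough per-host approximation that exists today is plain iftop filtered to the OSD port range, assuming the default 6800-7300 range, but that of course knows nothing about OSD IDs:)

```
# per-peer traffic of all OSDs on this host; default OSD port range assumed
iftop -nNP -f 'portrange 6800-7300'
```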

Best regards,

Nico

--
Sustainable and modern Infrastructures by ungleich.ch
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-ansible: add a new HDD to an already provisioned WAL device

2023-01-18 Thread Guillaume Abrioux
Hi Len,

Indeed, this is not possible with ceph-ansible.
One option would be to do it manually with `ceph-volume lvm migrate`:

(Note that it can be tedious given that it requires a lot of manual
operations, especially for clusters with a large number of OSDs.)

Initial setup:
```
# cat group_vars/all
---
devices:
  - /dev/sdb
dedicated_devices:
  - /dev/sda
```

```
[root@osd0 ~]# lsblk
NAME
   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda
8:00  50G  0 disk
`-ceph--8d085f45--939c--4a65--a577--d21fa146d7d6-osd--db--cd34400d--daf2--450f--97d9--d561e7a43d1a
   252:10  50G  0 lvm
sdb
8:16   0  50G  0 disk
`-ceph--4c77295c--28a5--440a--9561--b9dc4c814e36-osd--block--70fd3b96--7bb2--4ae3--a0f8--4d18748186f9
252:00  50G  0 lvm
sdc
8:32   0  50G  0 disk
sdd
8:48   0  50G  0 disk
vda
  253:00  11G  0 disk
`-vda1
   253:10  10G  0 part /
```

```
[root@osd0 ~]# lvs
  LV VG
   Attr   LSize   Pool Origin Data%  Meta%  Move Log
Cpy%Sync Convert
  osd-block-70fd3b96-7bb2-4ae3-a0f8-4d18748186f9
ceph-4c77295c-28a5-440a-9561-b9dc4c814e36 -wi-ao <50.00g
  osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
 ceph-8d085f45-939c-4a65-a577-d21fa146d7d6 -wi-ao <50.00g
[root@osd0 ~]# vgs
  VG#PV #LV #SN Attr   VSize   VFree
  ceph-4c77295c-28a5-440a-9561-b9dc4c814e36   1   1   0 wz--n- <50.00g0
  ceph-8d085f45-939c-4a65-a577-d21fa146d7d6   1   1   0 wz--n- <50.00g0
```

Create a tmp LV on your new device:
```
[root@osd0 ~]# pvcreate /dev/sdd
  Physical volume "/dev/sdd" successfully created.
[root@osd0 ~]# vgcreate vg_db_tmp /dev/sdd
  Volume group "vg_db_tmp" successfully created
[root@osd0 ~]# lvcreate -n db-sdb -l 100%FREE vg_db_tmp
  Logical volume "db-sdb" created.
```

stop your osd:
```
[root@osd0 ~]# systemctl stop ceph-osd@0
```
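
(Optionally, to keep the cluster from rebalancing while the OSD is down, one can set noout around the whole migration; not strictly required, just a precaution:)

```
ceph osd set noout
# ... perform the migration below, then:
ceph osd unset noout
```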

Migrate the db to the tmp lv:
```
[root@osd0 ~]# ceph-volume lvm migrate --osd-id 0 --osd-fsid
70fd3b96-7bb2-4ae3-a0f8-4d18748186f9 --from db --target vg_db_tmp/db-sdb
--> Migrate to new, Source: ['--devs-source',
'/var/lib/ceph/osd/ceph-0/block.db'] Target: /dev/vg_db_tmp/db-sdb
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-2
--> Migration successful.
```

remove the old lv:
```
[root@osd0 ~]# lvremove
/dev/ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
Do you really want to remove active logical volume
ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a?
[y/n]: y
  Logical volume "osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a" successfully
removed.
```

recreate a smaller LV.
In my simplified case, I want to go from 1 to 2 db devices, which means that my
old LV has to be resized down to 1/2:
```
[root@osd0 ~]# lvcreate -n osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a -l
50%FREE ceph-8d085f45-939c-4a65-a577-d21fa146d7d6
  Logical volume "osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a" created.
```

Migrate the db to the new LV:
```
[root@osd0 ~]# ceph-volume lvm migrate --osd-id 0 --osd-fsid
70fd3b96-7bb2-4ae3-a0f8-4d18748186f9 --from db --target
ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
--> Migrate to new, Source: ['--devs-source',
'/var/lib/ceph/osd/ceph-0/block.db'] Target:
/dev/ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
Running command: /bin/chown -R ceph:ceph /dev/dm-1
--> Migration successful.
```

restart the osd:
```
[root@osd0 ~]# systemctl start ceph-osd@0
```

remove tmp lv/vg/pv:
```
[root@osd0 ~]# lvremove /dev/vg_db_tmp/db-sdb
Do you really want to remove active logical volume vg_db_tmp/db-sdb? [y/n]:
y
[root@osd0 ~]# vgremove vg_db_tmp
  Volume group "vg_db_tmp" successfully removed
[root@osd0 ~]# pvremove /dev/sdd
  Labels on physical volume "/dev/sdd" successfully wiped.
```

add the new osd (should be done by re-running the playbook):
```
[root@osd0 ~]# ceph-volume lvm batch --bluestore --yes /dev/sdb /dev/sdc
--db-devices /dev/sda
--> passed data devices: 2 physical, 0 LVM
--> relative data size: 1.0
--> passed block_db devices: 1 physical, 0 LVM

... omitted output ...

--> ceph-volume lvm activate successful for osd ID: 1
--> ceph-volume lvm create successful for: /dev/sdc
[root@osd0 ~]#
```

new lsblk output:
```
[root@osd0 ~]# lsblk
NAME
   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda
8:00  50G  0 disk
|-ceph--8d085f45--939c--4a65--a577--d21fa146d7d6-osd--db--cd34400d--daf2--450f--97d9--d561e7a43d1a
   252:00  25G  0 lvm
`-ceph--8d085f45--939c--4a65--a577--d21fa146d7d6-osd--db--bb30e5aa--a634--4c52--8b99--a222c03c18e3
   2

[ceph-users] Re: MDS stuck in "up:replay"

2023-01-18 Thread Kotresh Hiremath Ravishankar
Hi Thomas,

This looks like it requires more investigation than I expected. What's the
current status ?
Did the crashed mds come back and become active ?

Increase the debug log level to 20 and share the mds logs. I will create a
tracker and share it here.
You can upload the mds logs there.
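
Something like the following should do it (adjust the daemon name; debug_ms 1 is optional but usually helpful):

```
ceph config set mds debug_mds 20
ceph config set mds debug_ms 1
# or at runtime for a single daemon:
ceph tell mds.<name> config set debug_mds 20
```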

Thanks,
Kotresh H R

On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm 
wrote:

> Another new thing that just happened:
>
> One of the MDS just crashed out of nowhere.
>
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
> In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
> MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
> 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
>
>   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
> (stable)
>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x135) [0x7fccd759943f]
>   2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
>   3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
> [0x55fb2b98e89c]
>   4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
>   5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
>   6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
>   7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
>   8: clone()
>
>
> and
>
>
>
> *** Caught signal (Aborted) **
>   in thread 7fccc7153700 thread_name:md_log_replay
>
>   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
> (stable)
>   1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
>   2: gsignal()
>   3: abort()
>   4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x18f) [0x7fccd7599499]
>   5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
>   6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
> [0x55fb2b98e89c]
>   7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
>   8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
>   9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
>   10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
>   11: clone()
>   NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
>
> Is what I found in the logs. Since it's referring to log replaying,
> could this be related to my issue?
>
> On 17.01.23 10:54, Thomas Widhalm wrote:
> > Hi again,
> >
> > Another thing I found: Out of pure desperation, I started MDS on all
> > nodes. I had them configured in the past so I was hoping, they could
> > help with bringing in missing data even when they were down for quite a
> > while now. I didn't see any changes in the logs but the CPU on the hosts
> > that usually don't run MDS just spiked. So high I had to kill the MDS
> > again because otherwise they kept killing OSD containers. So I don't
> > really have any new information, but maybe that could be a hint of some
> > kind?
> >
> > Cheers,
> > Thomas
> >
> > On 17.01.23 10:13, Thomas Widhalm wrote:
> >> Hi,
> >>
> >> Thanks again. :-)
> >>
> >> Ok, that seems like an error to me. I never configured an extra rank for
> >> MDS. Maybe that's where my knowledge failed me but I guess, MDS is
> >> waiting for something that was never there.
> >>
> >> Yes, there are two filesystems. Due to "budget restrictions" (it's my
> >> personal system at home, I configured a second CephFS with only one
> >> replica for data that could be easily restored.
> >>
> >> Here's what I got when turning up the debug level:
> >>
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.0s
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> Sending beacon up:replay seq 11107
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> sender thread waiting interval 4s
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> received beacon reply up:replay seq 11107 rtt 0.0022
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.0s
> >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 

[ceph-users] Re: MDS stuck in "up:replay"

2023-01-18 Thread Kotresh Hiremath Ravishankar
Hi Thomas,

I have created the tracker https://tracker.ceph.com/issues/58489 to track
this. Please upload the debug mds logs here.

Thanks,
Kotresh H R

On Wed, Jan 18, 2023 at 4:56 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Hi Thomas,
>
> This looks like it requires more investigation than I expected. What's the
> current status ?
> Did the crashed mds come back and become active ?
>
> Increase the debug log level to 20 and share the mds logs. I will create a
> tracker and share it here.
> You can upload the mds logs there.
>
> Thanks,
> Kotresh H R
>
> On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm 
> wrote:
>
>> Another new thing that just happened:
>>
>> One of the MDS just crashed out of nowhere.
>>
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
>> In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
>> MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+
>>
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
>> 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
>>
>>   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
>> (stable)
>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x135) [0x7fccd759943f]
>>   2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
>>   3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
>> [0x55fb2b98e89c]
>>   4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
>>   5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
>>   6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
>>   7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
>>   8: clone()
>>
>>
>> and
>>
>>
>>
>> *** Caught signal (Aborted) **
>>   in thread 7fccc7153700 thread_name:md_log_replay
>>
>>   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
>> (stable)
>>   1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
>>   2: gsignal()
>>   3: abort()
>>   4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x18f) [0x7fccd7599499]
>>   5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
>>   6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
>> [0x55fb2b98e89c]
>>   7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
>>   8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
>>   9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
>>   10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
>>   11: clone()
>>   NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>>
>> Is what I found in the logs. Since it's referring to log replaying,
>> could this be related to my issue?
>>
>> On 17.01.23 10:54, Thomas Widhalm wrote:
>> > Hi again,
>> >
>> > Another thing I found: Out of pure desperation, I started MDS on all
>> > nodes. I had them configured in the past so I was hoping, they could
>> > help with bringing in missing data even when they were down for quite a
>> > while now. I didn't see any changes in the logs but the CPU on the hosts
>> > that usually don't run MDS just spiked. So high I had to kill the MDS
>> > again because otherwise they kept killing OSD containers. So I don't
>> > really have any new information, but maybe that could be a hint of some
>> > kind?
>> >
>> > Cheers,
>> > Thomas
>> >
>> > On 17.01.23 10:13, Thomas Widhalm wrote:
>> >> Hi,
>> >>
>> >> Thanks again. :-)
>> >>
>> >> Ok, that seems like an error to me. I never configured an extra rank
>> for
>> >> MDS. Maybe that's where my knowledge failed me but I guess, MDS is
>> >> waiting for something that was never there.
>> >>
>> >> Yes, there are two filesystems. Due to "budget restrictions" (it's my
>> >> personal system at home, I configured a second CephFS with only one
>> >> replica for data that could be easily restored.
>> >>
>> >> Here's what I got when turning up the debug level:
>> >>
>> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread
>> waiting
>> >> interval 1.0s
>> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
>> >> Sending beacon up:replay seq 11107
>> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
>> >> sender thread waiting interval 4s
>> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
>> >> received beacon reply up:replay seq 11107 rtt 0.0022
>> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
>> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167
>> >> schedule_update_timer_task
>> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage:  total
>> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have
>> caps,
>> >> 0 c

[ceph-users] Re: MDS stuck in "up:replay"

2023-01-18 Thread Thomas Widhalm

Thank you. I'm setting the debug level and await authorization for Tracker.

I'll upload the logs as soon as I can collect them.

Thank you so much for your help

On 18.01.23 12:26, Kotresh Hiremath Ravishankar wrote:

Hi Thomas,

This looks like it requires more investigation than I expected. What's
the current status ?
Did the crashed mds come back and become active ?

Increase the debug log level to 20 and share the mds logs. I will create
a tracker and share it here.
You can upload the mds logs there.

Thanks,
Kotresh H R

On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm wrote:

Another new thing that just happened:

One of the MDS just crashed out of nowhere.


/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)

   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
(stable)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x135) [0x7fccd759943f]
   2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
   3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
[0x55fb2b98e89c]
   4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
   5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
   6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
   7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
   8: clone()


and



*** Caught signal (Aborted) **
   in thread 7fccc7153700 thread_name:md_log_replay

   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
(stable)
   1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
   2: gsignal()
   3: abort()
   4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x18f) [0x7fccd7599499]
   5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
   6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
[0x55fb2b98e89c]
   7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
   8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
   9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
   10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
   11: clone()
   NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Is what I found in the logs. Since it's referring to log replaying,
could this be related to my issue?

On 17.01.23 10:54, Thomas Widhalm wrote:
 > Hi again,
 >
 > Another thing I found: Out of pure desperation, I started MDS on all
 > nodes. I had them configured in the past so I was hoping, they could
 > help with bringing in missing data even when they were down for
quite a
 > while now. I didn't see any changes in the logs but the CPU on
the hosts
 > that usually don't run MDS just spiked. So high I had to kill the MDS
 > again because otherwise they kept killing OSD containers. So I don't
 > really have any new information, but maybe that could be a hint
of some
 > kind?
 >
 > Cheers,
 > Thomas
 >
 > On 17.01.23 10:13, Thomas Widhalm wrote:
 >> Hi,
 >>
 >> Thanks again. :-)
 >>
 >> Ok, that seems like an error to me. I never configured an extra
rank for
 >> MDS. Maybe that's where my knowledge failed me but I guess, MDS is
 >> waiting for something that was never there.
 >>
 >> Yes, there are two filesystems. Due to "budget restrictions"
(it's my
 >> personal system at home, I configured a second CephFS with only one
 >> replica for data that could be easily restored.
 >>
 >> Here's what I got when turning up the debug level:
 >>
 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread
waiting
 >> interval 1.0s
 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]:
mds.beacon.mds01.ceph05.pqxmvt
 >> Sending beacon up:replay seq 11107
 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]:
mds.beacon.mds01.ceph05.pqxmvt
 >> sender thread waiting interval 4s
 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]:
mds.beacon.mds01.ceph05.pqxmvt
 >> received beacon reply up:replay seq 11107 rtt 0.0022
 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
 >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167
 >> schedule_update_timer_task
 >> Jan 17 10:08:18 ceph05 ceph-md

[ceph-users] Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Frank Schilder
Hi all,

we are observing a problem on a libvirt virtualisation cluster that might come 
from ceph rbd clients. Something went wrong during execution of a 
live-migration operation and as a result we have two instances of the same VM 
running on 2 different hosts, the source- and the destination host. What we 
observe now is that the exclusive lock of the RBD disk image moves between these
two clients periodically (every few minutes the owner flips).

We are pretty sure that no virsh commands possibly having that effect are 
executed during this time. The client connections are not lost and the OSD 
blacklist is empty. I don't understand why a ceph rbd client would surrender an 
exclusive lock in such a split-brain situation; it's exactly when it needs to
hold on to it. As a result, the affected virtual drives are corrupted.

The questions we have in this context are:

Under what conditions does a ceph rbd client surrender an exclusive lock?
Could this be a bug in the client or a ceph config error?
Is this a known problem with libceph and libvirtd?
Anyone else making the same observation and having some guidance?

The VM hosts are on alma8 and we use the advanced virtualisation repo providing 
very recent versions of qemu and libvirtd. We have seen this floating exclusive 
lock before on mimic. Now we are on octopus and I can't really blame it on the 
old ceph version any more. We use opennebula as a KVM front-end.

Thanks for any pointers!
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot

Hi,

We have a full SSD production cluster running on Pacific 16.2.10 and 
deployed with cephadm that is experiencing OSD flapping issues. 
Essentially, random OSDs will get kicked out of the cluster and then 
automatically brought back in a few times a day. As an example, let's 
take the case of OSD.184 :


-It flapped 9 times between January 15th and 17th with the following log 
message each time :  2023-01-15T16:33:19.903+ prepare_failure 
osd.184 from osd.49 is reporting failure:1


-On January 17th, it complains that there are slow ops and spam its logs 
with the following line : heartbeat_map is_healthy 'OSD::osd_op_tp 
thread 0x7f346aa64700' had timed out after 15.00954s


The storage node itself has over 30 GB of ram still available in cache 
and the drives themselves only seldom peak at 100% usage and that never 
lasts more than a few seconds. CPU usage is also constantly around 5%. 
Considering there are no other error messages in any of the regular logs,
including the systemd logs, why would this OSD not reply to heartbeats?
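
For reference, I can also provide output from the affected OSD's admin socket if that helps (osd.184 as in the example above):

```
# run on the OSD's host
ceph daemon osd.184 dump_ops_in_flight
ceph daemon osd.184 dump_historic_slow_ops
ceph daemon osd.184 dump_osd_network
```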


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Ilya Dryomov
On Wed, Jan 18, 2023 at 1:19 PM Frank Schilder  wrote:
>
> Hi all,
>
> we are observing a problem on a libvirt virtualisation cluster that might 
> come from ceph rbd clients. Something went wrong during execution of a 
> live-migration operation and as a result we have two instances of the same VM 
> running on 2 different hosts, the source- and the destination host. What we 
> observe now is the the exclusive lock of the RBD disk image moves between 
> these two clients periodically (every few minutes the owner flips).

Hi Frank,

If you are talking about RBD exclusive lock feature ("exclusive-lock"
under "features" in "rbd info" output) then this is expected.  This
feature provides automatic cooperative lock transitions between clients
to ensure that only a single client is writing to the image at any
given time.  It's there to protect internal per-image data structures
such as the object map, the journal or the client-side PWL (persistent
write log) cache from concurrent modifications in case the image is
opened by two or more clients.  The name is confusing but it's NOT
about preventing other clients from opening and writing to the image.
Rather it's about serializing those writes.
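
(If you want to observe this, the current lock owner is visible from the command line; pool and image names are placeholders:)

```
rbd status <pool>/<image>    # watchers of the image
rbd lock ls <pool>/<image>   # the exclusive-lock owner shows up as an "auto ..." lock
```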

>
> We are pretty sure that no virsh commands possibly having that effect are 
> executed during this time. The client connections are not lost and the OSD 
> blacklist is empty. I don't understand why a ceph rbd client would surrender 
> an exclusive lock in such a split brain situation, its exactly when it needs 
> to hold on to it. As a result, the affected virtual drives are corrupted.

There is no split-brain from the Ceph POV here.  RBD has always
supported the multiple clients use case.

>
> The questions we have in this context are:
>
> Under what conditions does a ceph rbd client surrender an exclusive lock?

Exclusive lock transitions are cooperative so any time another client
asks for it (not immediately though -- the current lock owner finishes
processing in-flight I/O and flushes its caches first).

> Could this be a bug in the client or a ceph config error?

Very unlikely.

There is a way to disable automatic lock transitions but I don't think
it's wired up in QEMU.

> Is this a known problem with libceph and libvirtd?

Not sure what you mean by libceph.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Danny Webb
Do you have any network congestion or packet loss on the replication network?
Are you sharing NICs between public / replication? That is another metric that
needs looking into.

From: J-P Methot 
Sent: 18 January 2023 12:42
To: ceph-users 
Subject: [ceph-users] Flapping OSDs on pacific 16.2.10


Hi,

We have a full SSD production cluster running on Pacific 16.2.10 and
deployed with cephadm that is experiencing OSD flapping issues.
Essentially, random OSDs will get kicked out of the cluster and then
automatically brought back in a few times a day. As an example, let's
take the case of OSD.184 :

-It flapped 9 times between January 15th and 17th with the following log
message each time :  2023-01-15T16:33:19.903+ prepare_failure
osd.184 from osd.49 is reporting failure:1

-On January 17th, it complains that there are slow ops and spam its logs
with the following line : heartbeat_map is_healthy 'OSD::osd_op_tp
thread 0x7f346aa64700' had timed out after 15.00954s

The storage node itself has over 30 GB of ram still available in cache
and the drives themselves only seldom peak at 100% usage and that never
lasts more than a few seconds. CPU usage is also constantly around 5%.
Considering there is no other error messages in any of the regular logs,
including the systemd logs, why would this OSD not reply to heartbeats?

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Danny Webb
Principal OpenStack Engineer
danny.w...@thehutgroup.com
www.thg.com
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Anthony D'Atri
This was my first thought as well, especially if the OSDs log something like 
“wrongly marked down”.  It’s one of the reasons why I favor not having a 
replication network.

> On Jan 18, 2023, at 8:28 AM, Danny Webb  wrote:
> 
> Do you have any network congestion or packet loss on the replication network? 
>  are you sharing nics between public / replication?  That is another metric 
> that needs looking into.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
At the network level we're using bonds (802.3ad). There are 2 NICs, each
with two 25 Gbps ports. One port per NIC is used for the public network, the
other for the replication network. That gives a theoretical bandwidth of
50 Gbps for each network. The network graph shows loads of around 100 MB/s
on the public network interface, less on the replication network. No dropped
packets or network errors are reported. AFAIK, this is not getting overloaded.
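
(Checked with the usual interface counters; bond0 and the NIC names are of course specific to our setup:)

```
ip -s link show bond0
ethtool -S <nic> | grep -Ei 'drop|discard|err'
```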



On 1/18/23 08:28, Danny Webb wrote:
Do you have any network congestion or packet loss on the replication 
network? are you sharing nics between public / replication?  That is 
another metric that needs looking into.


*From:* J-P Methot 
*Sent:* 18 January 2023 12:42
*To:* ceph-users 
*Subject:* [ceph-users] Flapping OSDs on pacific 16.2.10

Hi,

We have a full SSD production cluster running on Pacific 16.2.10 and
deployed with cephadm that is experiencing OSD flapping issues.
Essentially, random OSDs will get kicked out of the cluster and then
automatically brought back in a few times a day. As an example, let's
take the case of OSD.184 :

-It flapped 9 times between January 15th and 17th with the following log
message each time :  2023-01-15T16:33:19.903+ prepare_failure
osd.184 from osd.49 is reporting failure:1

-On January 17th, it complains that there are slow ops and spam its logs
with the following line : heartbeat_map is_healthy 'OSD::osd_op_tp
thread 0x7f346aa64700' had timed out after 15.00954s

The storage node itself has over 30 GB of ram still available in cache
and the drives themselves only seldom peak at 100% usage and that never
lasts more than a few seconds. CPU usage is also constantly around 5%.
Considering there is no other error messages in any of the regular logs,
including the systemd logs, why would this OSD not reply to heartbeats?

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

*Danny Webb*
Principal OpenStack Engineer
danny.w...@thehutgroup.com

www.thg.com




--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Frank Schilder
Hi Ilya,

thanks a lot for the information. Yes, I was talking about the exclusive-lock
feature and was under the impression that only one rbd client can get write
access on connect and will keep it until disconnect. The problem we are facing
with multi-VM write access is that this will inevitably corrupt the file
system created on the RBD if two instances can get write access. It's not a
shared file system, it's just an XFS-formatted virtual disk.

> There is a way to disable automatic lock transitions but I don't think
> it's wired up in QEMU.

Can you point me to some documentation about that? It sounds like this is what 
would be needed to avoid the file system corruption in our use case. The lock 
transition should be initiated from the outside and the lock should then stay 
fixed on the client holding it until it is instructed to give up the lock or it 
disconnects.

>> Is this a known problem with libceph and libvirtd?
> Not sure what you mean by libceph.

I simply meant that it's not a krbd client. Libvirt uses libceph (or was it
librbd?) to emulate virtual drives, not krbd.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Ilya Dryomov 
Sent: 18 January 2023 14:26:54
To: Frank Schilder
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Ceph rbd clients surrender exclusive lock in critical 
situation

On Wed, Jan 18, 2023 at 1:19 PM Frank Schilder  wrote:
>
> Hi all,
>
> we are observing a problem on a libvirt virtualisation cluster that might 
> come from ceph rbd clients. Something went wrong during execution of a 
> live-migration operation and as a result we have two instances of the same VM 
> running on 2 different hosts, the source- and the destination host. What we 
> observe now is the the exclusive lock of the RBD disk image moves between 
> these two clients periodically (every few minutes the owner flips).

Hi Frank,

If you are talking about RBD exclusive lock feature ("exclusive-lock"
under "features" in "rbd info" output) then this is expected.  This
feature provides automatic cooperative lock transitions between clients
to ensure that only a single client is writing to the image at any
given time.  It's there to protect internal per-image data structures
such as the object map, the journal or the client-side PWL (persistent
write log) cache from concurrent modifications in case the image is
opened by two or more clients.  The name is confusing but it's NOT
about preventing other clients from opening and writing to the image.
Rather it's about serializing those writes.

>
> We are pretty sure that no virsh commands possibly having that effect are 
> executed during this time. The client connections are not lost and the OSD 
> blacklist is empty. I don't understand why a ceph rbd client would surrender 
> an exclusive lock in such a split brain situation, its exactly when it needs 
> to hold on to it. As a result, the affected virtual drives are corrupted.

There is no split-brain from the Ceph POV here.  RBD has always
supported the multiple clients use case.

>
> The questions we have in this context are:
>
> Under what conditions does a ceph rbd client surrender an exclusive lock?

Exclusive lock transitions are cooperative so any time another client
asks for it (not immediately though -- the current lock owner finishes
processing in-flight I/O and flushes its caches first).

> Could this be a bug in the client or a ceph config error?

Very unlikely.

There is a way to disable automatic lock transitions but I don't think
it's wired up in QEMU.

> Is this a known problem with libceph and libvirtd?

Not sure what you mean by libceph.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph-ansible: add a new HDD to an already provisioned WAL device

2023-01-18 Thread Len Kimms
Hi Guillaume,

thank you very much for the quick clarification and elaborate workaround.

We’ll check if manual migration is feasible with our setup with respect to the 
time needed. Alternatively, we’re looking into completely redeploying all 
affected OSDs (i.e. shrinking the cluster with ceph-ansible and newly 
provisioning all the devices).
Thanks as well for giving us the hint with the flags. In both cases it makes 
sense to prevent unnecessary data migration (by setting noout, norecovery, 
etc.) during the procedure.

Cheers, Len


Guillaume Abrioux wrote on 2023-01-18:
> Hi Len,

> Indeed, this is not possible with ceph-ansible.
> One option would be to do it manually with `ceph-volume lvm migrate`:

> (Note that it can be tedious given that it requires a lot of manual
> operations, especially for clusters with a large number of OSDs.)

> Initial setup:
> ```
> # cat group_vars/all
> ---
> devices:
>   - /dev/sdb
> dedicated_devices:
>   - /dev/sda
> ```

> ```
> [root@osd0 ~]# lsblk
> NAME
>MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda
> 8:00  50G  0 disk
> `-ceph--8d085f45--939c--4a65--a577--d21fa146d7d6-osd--db--cd34400d--daf2--450f--97d9--d561e7a43d1a
>252:10  50G  0 lvm
> sdb
> 8:16   0  50G  0 disk
> `-ceph--4c77295c--28a5--440a--9561--b9dc4c814e36-osd--block--70fd3b96--7bb2--4ae3--a0f8--4d18748186f9
> 252:00  50G  0 lvm
> sdc
> 8:32   0  50G  0 disk
> sdd
> 8:48   0  50G  0 disk
> vda
>   253:00  11G  0 disk
> `-vda1
>253:10  10G  0 part /
> ```

> ```
> [root@osd0 ~]# lvs
>   LV VG
>Attr   LSize   Pool Origin Data%  Meta%  Move Log
> Cpy%Sync Convert
>   osd-block-70fd3b96-7bb2-4ae3-a0f8-4d18748186f9
> ceph-4c77295c-28a5-440a-9561-b9dc4c814e36 -wi-ao <50.00g
>   osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
>  ceph-8d085f45-939c-4a65-a577-d21fa146d7d6 -wi-ao <50.00g
> [root@osd0 ~]# vgs
>   VG#PV #LV #SN Attr   VSize   VFree
>   ceph-4c77295c-28a5-440a-9561-b9dc4c814e36   1   1   0 wz--n- <50.00g0
>   ceph-8d085f45-939c-4a65-a577-d21fa146d7d6   1   1   0 wz--n- <50.00g0
> ```

> Create a tmp LV on your new device:
> ```
> [root@osd0 ~]# pvcreate /dev/sdd
>   Physical volume "/dev/sdd" successfully created.
> [root@osd0 ~]# vgcreate vg_db_tmp /dev/sdd
>   Volume group "vg_db_tmp" successfully created
> [root@osd0 ~]# lvcreate -n db-sdb -l 100%FREE vg_db_tmp
>   Logical volume "db-sdb" created.
> ```

> stop your osd:
> ```
> [root@osd0 ~]# systemctl stop ceph-osd@0
> ```

> Migrate the db to the tmp lv:
> ```
> [root@osd0 ~]# ceph-volume lvm migrate --osd-id 0 --osd-fsid
> 70fd3b96-7bb2-4ae3-a0f8-4d18748186f9 --from db --target vg_db_tmp/db-sdb
> --> Migrate to new, Source: ['--devs-source',
> '/var/lib/ceph/osd/ceph-0/block.db'] Target: /dev/vg_db_tmp/db-sdb
> Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
> Running command: /bin/chown -R ceph:ceph /dev/dm-2
> --> Migration successful.
> ```

> remove the old lv:
> ```
> [root@osd0 ~]# lvremove
> /dev/ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
> Do you really want to remove active logical volume
> ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a?
> [y/n]: y
>   Logical volume "osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a" successfully
> removed.
> ```

> recreate a smaller LV.
> in my simplified case, I want to go from 1 to 2 db device. it means that my
> old LV has to be resized down to 1/2:
> ```
> [root@osd0 ~]# lvcreate -n osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a -l
> 50%FREE ceph-8d085f45-939c-4a65-a577-d21fa146d7d6
>   Logical volume "osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a" created.
> ```

> Migrate the db to the new LV:
> ```
> [root@osd0 ~]# ceph-volume lvm migrate --osd-id 0 --osd-fsid
> 70fd3b96-7bb2-4ae3-a0f8-4d18748186f9 --from db --target
> ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
> --> Migrate to new, Source: ['--devs-source',
> '/var/lib/ceph/osd/ceph-0/block.db'] Target:
> /dev/ceph-8d085f45-939c-4a65-a577-d21fa146d7d6/osd-db-cd34400d-daf2-450f-97d9-d561e7a43d1a
> Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block.db
> Running command: /bin/chown -R ceph:ceph /dev/dm-1
> --> Migration successful.
> ```

> restart the osd:
> ```
> [root@osd0 ~]# systemctl start ceph-osd@0
> ```

> remove tmp lv/vg/pv:
> ```
> [root@osd0 ~]# lvremove /dev/vg_db_tmp/db-sdb
> Do you really want to remove active logical volume vg_db_tmp/db-sdb? [y/n]:
> y
> [root@osd0 ~]# vgremove vg_db_tmp
>   Volume group "vg_db_tmp" successfully removed
> [root@osd0 ~]# pvremove /dev/sdd
>   Labels on physical volume "/dev/sdd" successfully wiped.
> ```

> ad

[ceph-users] Re: ceph orch osd spec questions

2023-01-18 Thread Wyll Ingersoll


In case anyone was wondering, I figured out the problem...

This nasty bug in Pacific 16.2.10   https://tracker.ceph.com/issues/56031  - I 
think it is fixed in the upcoming .11 release and in Quincy.

This bug causes the computed maximum size of the bluestore DB partition to be much
smaller than it should be, so if you request a reasonable size that is larger than
the incorrectly computed maximum size, the DB creation will fail.

Our problem was that we added 3 new SSDs that were considered "unused" by the 
system, giving us a total of 8 (5 used, 3 unused).   When the orchestrator 
issues a "ceph-volume lvm batch" command, it passes 40 data devices and 8 db 
devices.  Normally, you would expect it to divide them into 5 slots per DB 
device (40/8).   But when it computes the size of the slots, that is where the 
problem occurs.

ceph-volume first sees the 3 unused devices in a group and incorrectly decides
that the number of slots needed is 3 * 5 = 15, then divides the size of a single
DB device by 15, thus making the maximum DB size 3x smaller than it should be.
If the code had also used the size of all of the devices in the group before
computing the max size, it would have been fine, but it only accounts for the
size of the 1st DB device in the group, resulting in a size 3x smaller than it
should be.

The workaround is to trick ceph into grouping all of the DB devices into unique 
groups of 1 by putting a minimal VG with a unique name on each of the unused 
SSDs so that when ceph-volume computes the sizing, it sees groups of 1 and thus 
doesn't multiply the number of slots incorrectly.   I used "vgcreate bug1 -s 1M 
/dev/xyz" to create a bogus VG on each of the unused SSDs, now I have properly 
sized DB devices on the new SSDs (the "bugX" VGs can then be removed once there 
are legitimate DB VGs on the device).
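
Spelled out for our three unused SSDs, the workaround was essentially (device names are placeholders):

```
# one uniquely named minimal VG per unused DB SSD, so ceph-volume groups them individually
vgcreate bug1 -s 1M /dev/sdx
vgcreate bug2 -s 1M /dev/sdy
vgcreate bug3 -s 1M /dev/sdz
# the bugX VGs can be removed again once legitimate DB VGs exist on the devices
```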

Question - Because our cluster was initially laid out using the buggy
ceph-volume (16.2.10), we now have hundreds of DB devices that are far smaller
than they should be (far less than the recommended 1-4% of the data devices 
size).  Is it possible to resize the DB devices without destroying and 
recreating the OSD itself?

What are the implications of having bluestore DB devices that are far smaller 
than they should be?
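
(My understanding, which may well be wrong, is that the main symptom of undersized DB devices would be RocksDB spilling over onto the slow device, which can be watched per OSD with something like:)

```
ceph health detail | grep -i spill
ceph daemon osd.<id> perf dump bluefs | grep -E 'db_used_bytes|slow_used_bytes'
```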


thanks,
  Wyllys Ingersoll



From: Wyll Ingersoll 
Sent: Friday, January 13, 2023 4:35 PM
To: ceph-users@ceph.io 
Subject: [ceph-users] ceph orch osd spec questions


Ceph Pacific 16.2.9

We have a storage server with multiple 1.7TB SSDs dedicated to the bluestore DB 
usage.  The osd spec originally was misconfigured slightly and had set the 
"limit" parameter on the db_devices to 5 (there are 8 SSDs available) and did 
not specify a block_db_size.  Ceph laid out the original 40 OSDs and put 8 DBs
across 5 of the SSDs (because of limit param).  Ceph seems to have auto-sized 
the bluestore DB partitions to be about 45GB, which is far less than the 
recommended 1-4% (using 10TB drives).  How does ceph-volume determine the size 
of the bluestore DB/WAL partitions when it is not specified in the spec?

We updated the spec and specified a block_db_size of 300G and removed the 
"limit" value.  Now we can see in the cephadm.log that the ceph-volume command 
being issued is using the correct list of SSD devices (all 8) as options to the 
lvm batch (--db-devices ...), but it keeps failing to create the new OSD 
because we are asking for 300G and it thinks there is only 44G available even 
though the last 3 SSDs in the list are empty (1.7T).  So, it appears that 
somehow the orchestrator is ignoring the last 3 SSDs.  I have verified that 
these SSDs are wiped clean, have no partitions or LVM, and no label (sgdisk -Z, 
wipefs -a). They appear as available in the inventory and not locked or 
otherwise in use.

Also, the "db_slots" spec parameter is ignored in pacific due to a bug so there 
is no way to tell the orchestrator to use "block_db_slots". Adding it to the 
spec like "block_db_size" fails since it is not recognized.

Any help figuring out why these SSDs are being ignored would be much 
appreciated.

Our spec for this host looks like this:
---

spec:
  data_devices:
    rotational: 1
    size: '3TB:'
  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
  block_db_size: 300G

---

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Frank Schilder
Do you have CPU soft lock-ups around these times? We had these timeouts due to 
using the cfq/bfq disk schedulers with SSDs. The osd_op_tp thread timeout is 
typical when CPU lockups happen. Could be a sporadic problem with the disk IO 
path.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: J-P Methot 
Sent: 18 January 2023 14:49:54
To: Danny Webb; ceph-users
Subject: [ceph-users] Re: Flapping OSDs on pacific 16.2.10

At the network level we're using bonds (802.3ad). There are 2 nics, each
with two 25gbps port. 1 port per nic is used for the public network, the
other for the replication network. That suggests a network bandwidth of
50gbps (in theory) for each network load. The network graph is showing
me loads of around 100MB/sec on the public network interface, less on
the replication network. No dropped packets or network errors reported.
AFAIK, this is not getting overloaded.


On 1/18/23 08:28, Danny Webb wrote:
> Do you have any network congestion or packet loss on the replication
> network? are you sharing nics between public / replication?  That is
> another metric that needs looking into.
> 
> *From:* J-P Methot 
> *Sent:* 18 January 2023 12:42
> *To:* ceph-users 
> *Subject:* [ceph-users] Flapping OSDs on pacific 16.2.10
>
> Hi,
>
> We have a full SSD production cluster running on Pacific 16.2.10 and
> deployed with cephadm that is experiencing OSD flapping issues.
> Essentially, random OSDs will get kicked out of the cluster and then
> automatically brought back in a few times a day. As an example, let's
> take the case of OSD.184 :
>
> -It flapped 9 times between January 15th and 17th with the following log
> message each time :  2023-01-15T16:33:19.903+ prepare_failure
> osd.184 from osd.49 is reporting failure:1
>
> -On January 17th, it complains that there are slow ops and spam its logs
> with the following line : heartbeat_map is_healthy 'OSD::osd_op_tp
> thread 0x7f346aa64700' had timed out after 15.00954s
>
> The storage node itself has over 30 GB of ram still available in cache
> and the drives themselves only seldom peak at 100% usage and that never
> lasts more than a few seconds. CPU usage is also constantly around 5%.
> Considering there is no other error messages in any of the regular logs,
> including the systemd logs, why would this OSD not reply to heartbeats?
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
> *Danny Webb*
> Principal OpenStack Engineer
> danny.w...@thehutgroup.com
>
> www.thg.com
>
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread J-P Methot
There's nothing in the CPU graph that suggests soft lock-ups at these
times. However, thank you for pointing out that the disk IO scheduler
could have an impact. Ubuntu seems to be on mq-deadline by default, so
we just switched to none, as I believe that fits our workload best. I
don't know if this will fix our issue, but I think it's worth testing.
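
(For the record, the change itself is just the usual sysfs knob; sdX is a placeholder and the setting does not persist across reboots without a udev rule:)

```
cat /sys/block/sdX/queue/scheduler
echo none > /sys/block/sdX/queue/scheduler
```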


On 1/18/23 11:17, Frank Schilder wrote:

Do you have CPU soft lock-ups around these times? We had these timeouts due to 
using the cfq/bfq disk schedulers with SSDs. The osd_op_tp thread timeout is 
typical when CPU lockups happen. Could be a sporadic problem with the disk IO 
path.


--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Flapping OSDs on pacific 16.2.10

2023-01-18 Thread Frank Schilder
I'm not sure what you are looking for in the CPU graph. If it's load or a
similar metric, you will not see these lock-ups. You need to look into the
syslog and search for them. If these warnings are there, it might give a clue
as to which hardware component is causing them. They look something like
"BUG: soft lockup - CPU#X stuck for ..."
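
Something along these lines should turn them up if they happened (the exact
log location depends on the distro's logging setup):

journalctl -k --since "2023-01-15" | grep -i "soft lockup"
grep -i "soft lockup" /var/log/syslog*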

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: J-P Methot 
Sent: 18 January 2023 17:38:28
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: Flapping OSDs on pacific 16.2.10

There's nothing in the CPU graph that suggests soft lock-ups at these
times. However, thank you for pointing out that the disk I/O scheduler
could have an impact. Ubuntu seems to use mq-deadline by default, so we
just switched to none, as I believe that fits our workload best. I don't
know if this will fix our issue, but I think it's worth testing.

On 1/18/23 11:17, Frank Schilder wrote:
> Do you have CPU soft lock-ups around these times? We had these timeouts due 
> to using the cfq/bfq disk schedulers with SSDs. The osd_op_tp thread timeout 
> is typical when CPU lockups happen. Could be a sporadic problem with the disk 
> IO path.

--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph rbd clients surrender exclusive lock in critical situation

2023-01-18 Thread Ilya Dryomov
On Wed, Jan 18, 2023 at 3:25 PM Frank Schilder  wrote:
>
> Hi Ilya,
>
> thanks a lot for the information. Yes, I was talking about the exclusive lock
> feature and was under the impression that only one rbd client can get write
> access on connect and will keep it until disconnect. The problem we are
> facing with multi-VM write access is that this will inevitably corrupt the
> file system created on the rbd image if two instances can get write access.
> It's not a shared file system, it's just an XFS-formatted virtual disk.
>
> > There is a way to disable automatic lock transitions but I don't think
> > it's wired up in QEMU.
>
> Can you point me to some documentation about that? It sounds like this is 
> what would be needed to avoid the file system corruption in our use case. The 
> lock transition should be initiated from the outside and the lock should then 
> stay fixed on the client holding it until it is instructed to give up the 
> lock or it disconnects.

It looks like there is not much documentation on this specific aspect
beyond a few scattered notes which I'm pasting below:

> To disable transparent lock transitions between multiple clients, the
> client must acquire the lock by using the RBD_LOCK_MODE_EXCLUSIVE flag.

> Per mapping (block device) rbd device map options:
> [...]
> - exclusive - Disable automatic exclusive lock transitions.
>   Equivalent to --exclusive.

(Yes, both the flag and the option are also named "exclusive".  Don't
ask why...)

However note that for krbd, --exclusive comes with some strings
attached.  For QEMU, there is no such option at all -- as already
mentioned, RBD_LOCK_MODE_EXCLUSIVE flag is not wired up there.
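
For krbd, the option is passed at map time, for example (pool/image names
are just placeholders):

rbd device map --exclusive mypool/myimage
# or, equivalently, as a map option:
rbd device map -o exclusive mypool/myimage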

Ultimately, it's the responsibility of the orchestration layer to
prevent situations like this from happening.  Ceph just provides
storage, it can't really be involved in managing one's VMs or deciding
whether multi-VM access is OK.  The orchestration layer may choose to
use some of the RBD primitives for this (whether exclusive locks or
advisory locks -- see "rbd lock add", "rbd lock ls" and "rbd lock rm"
commands), use something else or do nothing at all...
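
As a rough sketch, the advisory-lock flow looks like this (names are
placeholders; nothing stops a client that ignores the convention from
writing anyway, so the orchestration layer has to enforce it):

rbd lock add mypool/myimage my-lock-id
rbd lock ls mypool/myimage          # shows the lock id and the locker, e.g. client.12345
rbd lock rm mypool/myimage my-lock-id client.12345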

>
> >> Is this a known problem with libceph and libvirtd?
> > Not sure what you mean by libceph.
>
> I simply meant that it's not a krbd client. Libvirt uses libceph (or was it
> librbd?) to emulate virtual drives, not krbd.

libceph is actually one of the kernel modules.  libvirt/QEMU usually
use librbd but it's completely up to the user.  Nothing prevents you
from feeding some krbd devices to libvirt/QEMU, for example.

Thanks,

Ilya
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] MDS crash in "inotablev == mds->inotable->get_version()"

2023-01-18 Thread Kenny Van Alstyne
Hey all!  I’ve run into an MDS crash on a cluster recently upgraded from Ceph 
16.2.7 to 16.2.10.  I’m hitting an assert nearly identical to this one gathered 
by the telemetry module:
https://tracker.ceph.com/issues/54747

I have a new build compiling to test whether
https://github.com/ceph/ceph/pull/43184/ makes a difference when setting
mds_inject_skip_replaying_inotable.
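
If that option lands in the build, I expect to toggle it like any other MDS
config option, something along the lines of (not verified yet):

ceph config set mds mds_inject_skip_replaying_inotable true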

Relevant logs are below, but I’m wondering if anyone has hit anything like 
this?  Thanks in advance!


=== BEGIN LOG SNIPPET ===

-2> 2023-01-18T20:16:29.789+ 7f6190243700 -1 log_channel(cluster) log 
[ERR] : journal replay alloc 0x110 not in free 
[0x111~0x3dc,0x10003fb~0x1e8,0x10005e5~0x2,0x10009d4~0x2,0x105cc6d~0x4,0x10001c6b44e~0x4,0x10001cb91f4~0x1f4,0x10001cb93f4~0x3dd,0x10007582c15~0x279,0x10007582e90~0x1f4,0x10007583094~0xfff8a7cf6c]
-1> 2023-01-18T20:16:29.789+ 7f6190243700 -1 
/builds/66321/e7c73776/ceph/-build//WORKDIR/ceph-16.2.10/src/mds/journal.cc: In 
function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 
7f6190243700 time 2023-01-18T20:16:29.794189+
/WORKDIR/ceph-16.2.10/src/mds/journal.cc: 1577: FAILED ceph_assert(inotablev == 
mds->inotable->get_version())

 ceph version 16.2.10 (e7c73776b3136f6d18a35febeb38f5fdd41be364) pacific 
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14c) [0x7f619d548645]
 2: /usr/lib/ceph/libceph-common.so.2(+0x27182f) [0x7f619d54882f]
 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5815) 
[0x560bfd1c6935]
 4: (EUpdate::replay(MDSRank*)+0x3c) [0x560bfd1c7ecc]
 5: (MDLog::_replay_thread()+0xca9) [0x560bfd153de9]
 6: (MDLog::ReplayThread::entry()+0xd) [0x560bfce78fdd]
 7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f619cf29fa3]
 8: clone()

 0> 2023-01-18T20:16:29.793+ 7f6190243700 -1 *** Caught signal 
(Aborted) **
 in thread 7f6190243700 thread_name:md_log_replay

 ceph version 16.2.10 (e7c73776b3136f6d18a35febeb38f5fdd41be364) pacific 
(stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f619cf34730]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x19d) [0x7f619d548696]
 5: /usr/lib/ceph/libceph-common.so.2(+0x27182f) [0x7f619d54882f]
 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5815) 
[0x560bfd1c6935]
 7: (EUpdate::replay(MDSRank*)+0x3c) [0x560bfd1c7ecc]
 8: (MDLog::_replay_thread()+0xca9) [0x560bfd153de9]
 9: (MDLog::ReplayThread::entry()+0xd) [0x560bfce78fdd]
 10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f619cf29fa3]
 11: clone()
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.

=== END LOG SNIPPET ===

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] ceph quincy rgw openstack howto

2023-01-18 Thread Shashi Dahal
Hi,

How do I set values for rgw_keystone_url and other related fields that
cannot be changed via the GUI under cluster configuration?

Ceph Quincy is deployed using cephadm.



-- 
Cheers,
Shashi
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 17.2.5 ceph fs status: AssertionError

2023-01-18 Thread Robert Sander

On 18.01.23 at 10:12, Robert Sander wrote:


root@cephtest20:~# ceph fs status
Error EINVAL: Traceback (most recent call last):
   File "/usr/share/ceph/mgr/mgr_module.py", line 1757, in _handle_command
     return CLICommand.COMMANDS[cmd['prefix']].call(self, cmd, inbuf)
   File "/usr/share/ceph/mgr/mgr_module.py", line 462, in call
     return self.func(mgr, **kwargs)
   File "/usr/share/ceph/mgr/status/module.py", line 159, in 
handle_fs_status

     assert metadata
AssertionError


After restarting all MDS daemons the AssertionError is gone and ceph fs
status shows the filesystem status again. Strange.
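
For the record, in a cephadm cluster that was just a restart of the MDS
service, i.e. something like the following (assuming the service is named
mds.cephfs):

ceph orch restart mds.cephfs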


Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin

http://www.heinlein-support.de

Tel: 030 / 405051-43
Fax: 030 / 405051-19

Mandatory information per §35a GmbHG:
HRB 220009 B / Amtsgericht Berlin-Charlottenburg,
Managing director: Peer Heinlein -- Registered office: Berlin
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io