[ceph-users] Re: [SPAM] Re: Ceph crash :-(

2024-06-13 Thread David C.
Debian unstable

The situation is absolutely not dramatic, but if this is a large production
cluster, you should get professional support.
Based on the geographical area of your email domain, perhaps ask Robert for
local service?


Le jeu. 13 juin 2024 à 20:35, Ranjan Ghosh  a écrit :

> I'm still in doubt whether any reinstall will fix this issue, because
> the packages seem to be buggy and there are no better packages right now
> for Ubuntu 24.04, it seems.
>
> Canonical is really crazy if you ask me. That would be bad even for a
> non-LTS version, but especially for an LTS version. What were they thinking?
> Just take a preliminary, obviously buggy git snapshot and shove it out with
> a release to unsuspecting users.
> Just imagine they did something like this with Apache etc.
>
> Thanks for all the tips, y'all. Still need to figure out what the *best*
> option right now would be to fix this. Sigh.
>
>
>
> Am 13.06.24 um 20:00 schrieb Sebastian:
> > If this is one node out of many, it's not a problem, because you can
> > reinstall the system and Ceph and rebalance the cluster.
> > BTW, read the release notes beforehand :)  I also don't read them for my
> > personal desktop, but on servers where I keep data I do.
> > But what Canonical did in this case is… this is an LTS version :/
> >
> >
> > BR,
> > Sebastian
> >
> >
> >> On 13 Jun 2024, at 19:47, David C.  wrote:
> >>
> >> In addition to Robert's recommendations,
> >>
> >> Remember to respect the update order (mgr->mon->(crash->)osd->mds->...)
> >>
> >> Before everything was containerized, it was not recommended to have
> >> different services on the same machine.
> >>
> >>
> >>
> >> Le jeu. 13 juin 2024 à 19:37, Robert Sander <
> r.san...@heinlein-support.de>
> >> a écrit :
> >>
> >>> On 13.06.24 18:18, Ranjan Ghosh wrote:
> >>>
> >>>> What's more APT says I now got a Ceph Version
> >>>> (19.2.0~git20240301.4c76c50-0ubuntu6) which doesn't even have any
> >>>> official release notes:
> >>> Ubuntu 24.04 ships with that version from a git snapshot.
> >>>
> >>> You have to ask Canonical why they did this.
> >>>
> >>> I would not use Ceph packages shipped from a distribution but always
> the
> >>> ones from download.ceph.com or even better the container images that
> >>> come with the orchestrator.
> >>>
> >>> What version do your other Ceph nodes run on?
> >>>
> >>> Regards
> >>> --
> >>> Robert Sander
> >>> Heinlein Support GmbH
> >>> Linux: Akademie - Support - Hosting
> >>> http://www.heinlein-support.de
> >>>
> >>> Tel: 030-405051-43
> >>> Fax: 030-405051-19
> >>>
> >>> Zwangsangaben lt. §35a GmbHG:
> >>> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> >>> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> >>> ___
> >>> ceph-users mailing list -- ceph-users@ceph.io
> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph crash :-(

2024-06-13 Thread David C.
In addition to Robert's recommendations,

Remember to respect the update order (mgr->mon->(crash->)osd->mds->...)

Before everything was containerized, it was not recommended to have
different services on the same machine.
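
With packaged systemd units, a rough sketch of that sequence (restart each
daemon type across the cluster before moving on to the next; adjust to your
hosts):

  systemctl restart ceph-mgr.target
  systemctl restart ceph-mon.target
  systemctl restart ceph-crash.service
  systemctl restart ceph-osd.target
  systemctl restart ceph-mds.target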



Le jeu. 13 juin 2024 à 19:37, Robert Sander 
a écrit :

> On 13.06.24 18:18, Ranjan Ghosh wrote:
>
> > What's more APT says I now got a Ceph Version
> > (19.2.0~git20240301.4c76c50-0ubuntu6) which doesn't even have any
> > official release notes:
>
> Ubuntu 24.04 ships with that version from a git snapshot.
>
> You have to ask Canonical why they did this.
>
> I would not use Ceph packages shipped from a distribution but always the
> ones from download.ceph.com or even better the container images that
> come with the orchestrator.
>
> What version do your other Ceph nodes run on?
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Linux: Akademie - Support - Hosting
> http://www.heinlein-support.de
>
> Tel: 030-405051-43
> Fax: 030-405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein  -- Sitz: Berlin
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread David C.
Hi Pablo,

Could you tell us a little more about how that happened?

Do you have a min_size >= 2 (or E/C equivalent) ?
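
(For reference, "ceph osd pool ls detail" shows the size/min_size of each
pool.)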


Cordialement,

*David CASIER*





Le lun. 17 juin 2024 à 16:26, cellosof...@gmail.com 
a écrit :

> Hi community!
>
> Recently we had a major outage in production and after running the
> automated ceph recovery, some PGs remain in "incomplete" state, and IO
> operations are blocked.
>
> Searching in documentation, forums, and this mailing list archive, I
> haven't found yet if this means this data is recoverable or not. We don't
> have any "unknown" objects or PGs, so I believe this is somehow an
> intermediate stage where we have to tell ceph which version of the objects
> to recover from.
>
> We are willing to work with a Ceph Consultant Specialist, because the data
> at stage is very critical, so if you're interested please let me know
> off-list, to discuss the details.
>
> Thanks in advance
>
> Best Regards
> Pablo
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread David C.
Pablo,

Since some PGs are empty and all OSDs are up, I'm not at all optimistic about
the outcome.

Was the command "ceph osd force-create-pg" executed while some OSDs were
missing?
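
(If useful, "ceph pg ls incomplete" lists the affected PGs and their acting
sets, and "ceph pg <pgid> query" shows the peering details for one of them.)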


Le lun. 17 juin 2024 à 17:26, cellosof...@gmail.com 
a écrit :

> Hi everyone,
>
> Thanks for your kind responses
>
> I know the following is not the best scenario, but sadly I didn't have the
> opportunity of installing this cluster
>
> More information about the problem:
>
> * We use replicated pools
> * Replica 2, min replicas 1.
> * Ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy
> (stable)
> * Virtual Machines setup, 2 MGR Nodes, 2 OSD Nodes, 4 VMs in total.
> * 27 OSDs right now
> * Rook environment: rook: v1.9.5
> * Kubernetes Server Version: v1.24.1
>
> I attach a .txt with the result of some diagnostic commands for reference
>
> What do you think?
>
> Regards
> Pablo
>
> On Mon, Jun 17, 2024 at 11:01 AM Matthias Grandl 
> wrote:
>
>> Ah scratch that, my first paragraph about replicated pools is actually
>> incorrect. If it’s a replicated pool and it shows incomplete, it means the
>> most recent copy of the PG is missing. So ideal would be to recover the PG
>> from dead OSDs in any case if possible.
>>
>> Matthias Grandl
>> Head Storage Engineer
>> matthias.gra...@croit.io
>>
>> > On 17. Jun 2024, at 16:56, Matthias Grandl 
>> wrote:
>> >
>> > Hi Pablo,
>> >
>> > It depends. If it’s a replicated setup, it might be as easy as marking
>> dead OSDs as lost to get the PGs to recover. In that case it basically just
>> means that you are below the pools min_size.
>> >
>> > If it is an EC setup, it might be quite a bit more painful, depending
>> on what happened to the dead OSDs and whether they are at all recoverable.
>> >
>> >
>> > Matthias Grandl
>> > Head Storage Engineer
>> > matthias.gra...@croit.io
>> >
>> >> On 17. Jun 2024, at 16:46, David C.  wrote:
>> >>
>> >> Hi Pablo,
>> >>
>> >> Could you tell us a little more about how that happened?
>> >>
>> >> Do you have a min_size >= 2 (or E/C equivalent) ?
>> >> 
>> >>
>> >> Cordialement,
>> >>
>> >> *David CASIER*
>> >>
>> >> 
>> >>
>> >>
>> >>
>> >> Le lun. 17 juin 2024 à 16:26, cellosof...@gmail.com <
>> cellosof...@gmail.com>
>> >> a écrit :
>> >>
>> >>> Hi community!
>> >>>
>> >>> Recently we had a major outage in production and after running the
>> >>> automated ceph recovery, some PGs remain in "incomplete" state, and IO
>> >>> operations are blocked.
>> >>>
>> >>> Searching in documentation, forums, and this mailing list archive, I
>> >>> haven't found yet if this means this data is recoverable or not. We
>> don't
>> >>> have any "unknown" objects or PGs, so I believe this is somehow an
>> >>> intermediate stage where we have to tell ceph which version of the
>> objects
>> >>> to recover from.
>> >>>
>> >>> We are willing to work with a Ceph Consultant Specialist, because the
>> data
>> >>> at stage is very critical, so if you're interested please let me know
>> >>> off-list, to discuss the details.
>> >>>
>> >>> Thanks in advance
>> >>>
>> >>> Best Regards
>> >>> Pablo
>> >>> ___
>> >>> ceph-users mailing list -- ceph-users@ceph.io
>> >>> To unsubscribe send an email to ceph-users-le...@ceph.io
>> >>>
>> >> ___
>> >> ceph-users mailing list -- ceph-users@ceph.io
>> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread David C.
1 PG out of 16 is missing in the metadata pool; that alone is enough to make
browsing the FS very difficult.
Your challenge is to locate the important objects in the data pool.

Perhaps try to target the important objects by retrieving the layout/parent
attributes on the objects in the cephfs-replicated pool.

Example (sketch):
  rados -p cephfs-replicated ls | while read inode_chunk; do
    rados -p cephfs-replicated getxattr "$inode_chunk" layout
    rados -p cephfs-replicated getxattr "$inode_chunk" parent
  done
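
If needed, the parent xattr can usually be decoded with ceph-dencoder to get
the path information back (sketch, <object> is a placeholder):

  rados -p cephfs-replicated getxattr <object> parent > /tmp/parent
  ceph-dencoder type inode_backtrace_t import /tmp/parent decode dump_json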



Le lun. 17 juin 2024 à 17:59, cellosof...@gmail.com 
a écrit :

> Command for trying the export was:
>
> [rook@rook-ceph-tools-recovery-77495958d9-plfch ~]$ rados export -p
> cephfs-replicated /mnt/recovery/backup-rados-cephfs-replicated
>
> We made sure we had enough space for this operation, and mounted the
> /mnt/recovery path using hostPath in the modified rook "toolbox" deployment.
>
> Regards
>
> On Mon, Jun 17, 2024 at 11:56 AM cellosof...@gmail.com <
> cellosof...@gmail.com> wrote:
>
>> Hi,
>>
>> I understand,
>>
>> We had to re-create the OSDs because of backing storage hardware failure,
>> so recovering from old OSDs is not possible.
>>
>> From your current understanding, is there a possibility to at least
>> recover some of the information, at least the fragments that are not
>> missing.
>>
>> I ask this because I tried to export the pool contents, but it gets stuck
>> (I/O blocked) because of "incomplete" PGs, maybe marking the PGs as
>> complete with ceph-objectstore-tool would be an option? Or using the
>> *dangerous* osd_find_best_info_ignore_history_les option for the affected
>> OSDs?
>>
>> Regards
>> Pablo
>>
>> On Mon, Jun 17, 2024 at 11:46 AM Matthias Grandl <
>> matthias.gra...@croit.io> wrote:
>>
>>> We are missing info here. Ceph status claims all OSDs are up. Did an OSD
>>> die and was it already removed from the CRUSH map? If so the only chance I
>>> see at preventing data loss is exporting the PGs off of that OSD and
>>> importing to another OSD. But yeah as David I am not too optimistic.
>>>
>>> Matthias Grandl
>>> Head Storage Engineer
>>> matthias.gra...@croit.io
>>>
>>> On 17. Jun 2024, at 17:26, cellosof...@gmail.com wrote:
>>>
>>> 
>>> Hi everyone,
>>>
>>> Thanks for your kind responses
>>>
>>> I know the following is not the best scenario, but sadly I didn't have
>>> the opportunity of installing this cluster
>>>
>>> More information about the problem:
>>>
>>> * We use replicated pools
>>> * Replica 2, min replicas 1.
>>> * Ceph version 17.2.0 (43e2e60a7559d3f46c9d53f1ca875fd499a1e35e) quincy
>>> (stable)
>>> * Virtual Machines setup, 2 MGR Nodes, 2 OSD Nodes, 4 VMs in total.
>>> * 27 OSDs right now
>>> * Rook environment: rook: v1.9.5
>>> * Kubernetes Server Version: v1.24.1
>>>
>>> I attach a .txt with the result of some diagnostic commands for reference
>>>
>>> What do you think?
>>>
>>> Regards
>>> Pablo
>>>
>>> On Mon, Jun 17, 2024 at 11:01 AM Matthias Grandl <
>>> matthias.gra...@croit.io> wrote:
>>>
>>>> Ah scratch that, my first paragraph about replicated pools is actually
>>>> incorrect. If it’s a replicated pool and it shows incomplete, it means the
>>>> most recent copy of the PG is missing. So ideal would be to recover the PG
>>>> from dead OSDs in any case if possible.
>>>>
>>>> Matthias Grandl
>>>> Head Storage Engineer
>>>> matthias.gra...@croit.io
>>>>
>>>> > On 17. Jun 2024, at 16:56, Matthias Grandl 
>>>> wrote:
>>>> >
>>>> > Hi Pablo,
>>>> >
>>>> > It depends. If it’s a replicated setup, it might be as easy as
>>>> marking dead OSDs as lost to get the PGs to recover. In that case it
>>>> basically just means that you are below the pools min_size.
>>>> >
>>>> > If it is an EC setup, it might be quite a bit more painful, depending
>>>> on what happened to the dead OSDs and whether they are at all recoverable.
>>>> >
>>>> >
>>>> > Matthias Grandl
>>>> > Head Storage Engineer
>>>> > matthias.gra...@croit.io
>>>> >
>>>> >> On 17. Jun 2024, at 16:46, David C.  wrote:
>>>> >>
>>>> >> Hi Pablo,
>>>> >>
>>>

[ceph-users] Re: Incomplete PGs. Ceph Consultant Wanted

2024-06-17 Thread David C.
In Pablo's unfortunate case it was a SAN incident, so it's possible that
replica 3 wouldn't have saved him either.
In this scenario, the architecture is more the origin of the incident than
the number of replicas.

It seems to me that replica 3 has been the default since Firefly => going to
replica 2 is an intentional choice.
So I'm not sure adding a warning (again) is necessary.

For HDD, apart from special cases (buffer volume, etc.), it is difficult to
justify Replica 2 (especially on platforms several years old).
However, I'd rather see a full flash Replica 2 platform with solid backups
than Replica 3 without backups (well obviously, Replica 3, or E/C + backup
are much better).

Le lun. 17 juin 2024 à 19:14, Wesley Dillingham  a
écrit :

> Perhaps Ceph itself should also have a warning pop up (in "ceph -s", "ceph
> health detail" etc) when replica and min_size=1 or in an EC if min_size <
> k+1. Of course it could be muted but it would give an operator pause
> initially when setting that. I think a lot of people assume replica size=2
> is safe enough. I imagine this must have been proposed before.
>
> Respectfully,
>
> *Wes Dillingham*
> LinkedIn 
> w...@wesdillingham.com
>
>
>
>
> On Mon, Jun 17, 2024 at 1:07 PM Anthony D'Atri 
> wrote:
>
>>
>> >>
>> >> * We use replicated pools
>> >> * Replica 2, min replicas 1.
>>
>> Note to self:   Change the docs and default to discourage this.  This is
>> rarely appropriate in production.
>>
>> You had multiple overlapping drive failures?
>>
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: wrong public_ip after blackout / poweroutage

2024-06-21 Thread David C.
Hi,

This type of incident is often resolved by setting the public_network
option to the "global" scope, in the configuration:

ceph config set global public_network a:b:c:d::/64
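
To verify what was applied afterwards, something along these lines:

  ceph config get osd public_network
  ceph osd dump | grep "^osd\."   # the addrvecs should now show public addresses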


Le ven. 21 juin 2024 à 03:36, Eugen Block  a écrit :

> Hi,
>
> this only a theory, not a proven answer or something. But the
> orchestrator does automatically reconfigure daemons depending on the
> circumstances. So my theory is, some of the OSD nodes didn't respond
> via public network anymore, so ceph tried to use the cluster network
> as a fallback. The other way around is more common: if you don't have
> a cluster network configured at all, you see logs stating "falling
> back to public interface" (or similar). If the orchestrator did
> reconfigure the daemons, it would have been logged in the active mgr.
> And the result would be a different ceph.conf for the daemons in
> /var/lib/ceph/{FSID}/osd.{OSD_ID}/config. If you still have the mgr
> logs from after the outage you might find some clues.
>
> Regards,
> Eugen
>
> Zitat von mailing-lists :
>
> > Dear Cephers,
> >
> > after a succession of unfortunate events, we have suffered a
> > complete datacenter blackout today.
> >
> >
> > Ceph _nearly_ perfectly came back up. The Health was OK and all
> > services were online, but we were having weird problems. Weird as
> > in, we could sometimes map rbds and sometimes not, and sometimes we
> > could use the cephfs and sometimes we could not...
> >
> > Turns out, some OSDs (I'd say 5%) came back with the cluster_ip
> > address as their public_ip and thus were not reachable.
> >
> > I do not see any pattern, why some osds are faulty and others are
> > not. The fault is spread over nearly all nodes. This is an example:
> >
> > osd.45 up   in  weight 1 up_from 184143 up_thru 184164 down_at
> > 184142 last_clean_interval [182655,184103)
> > [v2:192.168.222.20:6834/1536394698,v1:192.168.222.20:6842/1536394698]
> > [v2:192.168.222.20:6848/1536394698,v1:192.168.222.20:6853/1536394698]
> > exists,up 002326c9
> >
> > This should have a public_ip in the first brackets []. Our
> > cluster-network is 192.168.222.0/24, which is of course only
> > available on the ceph internal switch.
> >
> > Simply restarting the osds that were affected solved this problem...
> > So I am not really asking for your help troubleshooting this; I
> > would just like to understand if there is a reasonable explanation.
> >
> > My guess would be some kind of race-condition when the interfaces
> > came up, but then again, why on ~5% of all OSDs? ... Anyway, I'm
> > tired; I hope this mail is somewhat understandable.
> >
> >
> > We are running Ceph 17.2.7 with cephadm docker deployment.
> >
> >
> > If you have any ideas for the cause of this, please let me know. I
> > have not seen this issue when I'm gracefully rebooting the nodes.
> >
> >
> > Best
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to mount with 18.2.2

2024-07-16 Thread David C.
Hi Albert,

I think it's related to your network change.

Can you send me the return of "ceph report" ?


Le mar. 16 juil. 2024 à 14:34, Albert Shih  a écrit :

> Hi everyone
>
> My cluster ceph run currently 18.2.2 and ceph -s say everything are OK
>
> root@cthulhu1:/var/lib/ceph/crash# ceph -s
>   cluster:
> id: 9c5bb196-c212-11ee-84f3-c3f2beae892d
> health: HEALTH_OK
>
>   services:
> mon: 5 daemons, quorum cthulhu1,cthulhu5,cthulhu3,cthulhu4,cthulhu2
> (age 4d)
> mgr: cthulhu1.yhgean(active, since 4d), standbys: cthulhu3.ylmosn,
> cthulhu5.hqiarz, cthulhu4.odtqjw, cthulhu2.ynvnob
> mds: 1/1 daemons up, 4 standby
> osd: 370 osds: 370 up (since 4d), 370 in (since 3M)
>
>   data:
> volumes: 1/1 healthy
> pools:   4 pools, 259 pgs
> objects: 333.68M objects, 279 TiB
> usage:   423 TiB used, 5.3 PiB / 5.8 PiB avail
> pgs: 226 active+clean
>  19  active+clean+scrubbing+deep
>  14  active+clean+scrubbing
>
> I got 3 clients cephfs :
>
>   2 with debian 12 + 18.2.2
>   1 with Debian 11 + 17.2.7
>
> The Debian 11 client works fine; I tried to umount the cephfs and remount
> it, and it works.
>
> The first Debian 12 + 18.2.2 client is an upgrade from Debian 11 + 17.2.7;
> before the upgrade the mount was working, after the upgrade I'm unable to
> mount the cephfs.
>
> The second Debian 12 client is a fresh install and I'm also unable to mount
> the cephfs.
>
> I checked the network and don't see any firewall problem.
>
> On the client, when I try to mount, it takes a few minutes to answer with:
>
>   mount error: no mds server is up or the cluster is laggy
>
> on the client I can see :
>
> Jul 16 14:10:43 Debian12-1 kernel: [  860.636012] ceph: corrupt mdsmap
> Jul 16 14:23:37 Debian12-2 kernel: [11497.406652] ceph: corrupt mdsmap
>
> I tried to google this error, but all I can find are situations where ceph
> -s says the cluster is in big trouble. In my case, not only does ceph -s say
> everything is OK, but the Debian 11 client is able to
> umount/mount/umount/mount/write.
>
> Any clue or debugging method ?
>
> Regards
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> mar. 16 juil. 2024 14:26:47 CEST
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to mount with 18.2.2

2024-07-16 Thread David C.
Albert,

The network is ok.

However, strangely, the OSDs and MDS did not activate msgr v2 (msgr v2 was
activated on the mons).

It is possible to work around this by adding the "ms_mode=legacy" mount
option, but you need to find out why msgr v2 is not activated.
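
For example, for a kernel-client mount (mon address and credentials are
placeholders):

  mount -t ceph <mon_ip>:6789:/ /mnt/cephfs -o name=admin,secret=<key>,ms_mode=legacy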


Le mar. 16 juil. 2024 à 15:18, Albert Shih  a écrit :

> Le 16/07/2024 à 15:04:05+0200, David C. a écrit
> Hi,
>
> >
> > I think it's related to your network change.
>
> I though about it but in that case why the old (and before upgrade) server
> works ?
>
> > Can you send me the return of "ceph report" ?
>
> Nothing related to the old subnet. (see attach file)
>
> Regards.
>
> JAS
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> mar. 16 juil. 2024 15:14:21 CEST
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to mount with 18.2.2

2024-07-17 Thread David C.
Hi Frédéric,

The curiosity of Albert's cluster is that (msgr) v1 and v2 are both present
on the mons, as well as on the OSDs' back-end (cluster) network.

But v2 is absent on the OSD and MDS public network.

The specific point is that the public network has been changed.

At first, I thought it was the order of declaration in mon_host (v1 before
v2), but apparently that's not it.
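
For reference, a quick way to see which protocols each daemon registered:

  ceph mon dump -f json | jq '.mons[].public_addrs'
  ceph report | jq '.osdmap.osds[].public_addrs'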


Le mer. 17 juil. 2024 à 09:21, Frédéric Nass 
a écrit :

> Hi David,
>
> Redeploying 2 out of 3 MONs a few weeks back (to have them using RocksDB
> to be ready for Quincy) prevented some clients from connecting to the
> cluster and mounting cephfs volumes.
>
> Before the redeploy, these clients were using port 6789 (v1) explicitly as
> connections wouldn't work with port 3300 (v2).
> After the redeploy, removing port 6789 from mon_ips fixed the situation.
>
> Seems like msgr v2 activation did only occur after all 3 MONs were
> redeployed and used RocksDB. Not sure why this happened though.
>
> @Albert, if this cluster has been upgrade several times, you might want to
> check /var/lib/ceph/$(ceph fsid)/kv_backend, redeploy the MONS if leveldb,
> make sure all clients use the new mon_host syntax in ceph.conf
> ([v2::3300,v1::6789],etc.]) and check their
> ability to connect to port 3300.
>
> Cheers,
> Frédéric.
>
> - Le 16 Juil 24, à 17:53, David david.cas...@aevoo.fr a écrit :
>
> > Albert,
> >
> > The network is ok.
> >
> > However, strangely, the osd and mds did not activate msgr v2 (msgr v2 was
> > activated on mon).
> >
> > It is possible to bypass by adding the "ms_mode=legacy" option but you
> need
> > to find out why msgr v2 is not activated
> >
> >
> > Le mar. 16 juil. 2024 à 15:18, Albert Shih  a
> écrit :
> >
> >> Le 16/07/2024 à 15:04:05+0200, David C. a écrit
> >> Hi,
> >>
> >> >
> >> > I think it's related to your network change.
> >>
> >> I though about it but in that case why the old (and before upgrade)
> server
> >> works ?
> >>
> >> > Can you send me the return of "ceph report" ?
> >>
> >> Nothing related to the old subnet. (see attach file)
> >>
> >> Regards.
> >>
> >> JAS
> >> --
> >> Albert SHIH 🦫 🐸
> >> Observatoire de Paris
> >> France
> >> Heure locale/Local time:
> >> mar. 16 juil. 2024 15:14:21 CEST
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to mount with 18.2.2

2024-07-17 Thread David C.
Hi,

It would seem that the order of declaration of the mon addresses (v2 first,
then v1, and not the other way around) is important.

Albert restarted all services after this modification and everything is
back to normal.
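
In other words, when setting mon addresses by hand, list v2 first (IP is a
placeholder):

  ceph mon set-addrs <mon_name> [v2:<IP>:3300/0,v1:<IP>:6789/0]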


Le mer. 17 juil. 2024 à 09:40, David C.  a écrit :

> Hi Frédéric,
>
> The curiosity of Albert's cluster is that (msgr) v1 and v2 are present on
> the mons, as well as on the osds backend.
>
> But v2 is absent on the public OSD and MDS network
>
> The specific point is that the public network has been changed.
>
> At first, I thought it was the order of declaration of my_host (v1 before
> v2) but apparently, that's not it.
>
>
> Le mer. 17 juil. 2024 à 09:21, Frédéric Nass <
> frederic.n...@univ-lorraine.fr> a écrit :
>
>> Hi David,
>>
>> Redeploying 2 out of 3 MONs a few weeks back (to have them using RocksDB
>> to be ready for Quincy) prevented some clients from connecting to the
>> cluster and mounting cephfs volumes.
>>
>> Before the redeploy, these clients were using port 6789 (v1) explicitly
>> as connections wouldn't work with port 3300 (v2).
>> After the redeploy, removing port 6789 from mon_ips fixed the situation.
>>
>> Seems like msgr v2 activation did only occur after all 3 MONs were
>> redeployed and used RocksDB. Not sure why this happened though.
>>
>> @Albert, if this cluster has been upgrade several times, you might want
>> to check /var/lib/ceph/$(ceph fsid)/kv_backend, redeploy the MONS if
>> leveldb, make sure all clients use the new mon_host syntax in ceph.conf
>> ([v2::3300,v1::6789],etc.]) and check their
>> ability to connect to port 3300.
>>
>> Cheers,
>> Frédéric.
>>
>> - Le 16 Juil 24, à 17:53, David david.cas...@aevoo.fr a écrit :
>>
>> > Albert,
>> >
>> > The network is ok.
>> >
>> > However, strangely, the osd and mds did not activate msgr v2 (msgr v2
>> was
>> > activated on mon).
>> >
>> > It is possible to bypass by adding the "ms_mode=legacy" option but you
>> need
>> > to find out why msgr v2 is not activated
>> >
>> >
>> > Le mar. 16 juil. 2024 à 15:18, Albert Shih  a
>> écrit :
>> >
>> >> Le 16/07/2024 à 15:04:05+0200, David C. a écrit
>> >> Hi,
>> >>
>> >> >
>> >> > I think it's related to your network change.
>> >>
>> >> I though about it but in that case why the old (and before upgrade)
>> server
>> >> works ?
>> >>
>> >> > Can you send me the return of "ceph report" ?
>> >>
>> >> Nothing related to the old subnet. (see attach file)
>> >>
>> >> Regards.
>> >>
>> >> JAS
>> >> --
>> >> Albert SHIH 🦫 🐸
>> >> Observatoire de Paris
>> >> France
>> >> Heure locale/Local time:
>> >> mar. 16 juil. 2024 15:14:21 CEST
>> >>
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Small issue with perms

2024-07-18 Thread David C.
Hi Albert,

perhaps a conflict with the udev rules of locally installed packages.

Try uninstalling ceph-*
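
For example, on Debian/Ubuntu, something like this (package names as listed
later in this thread; keep cephadm, and ceph-common if you still want the CLI
on the host):

  dpkg -l 'ceph*'
  apt remove ceph-mon ceph-mgr ceph-osd ceph-mds ceph-volume ceph-fuse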

Le jeu. 18 juil. 2024 à 09:57, Albert Shih  a écrit :

> Hi everyone.
>
> After my upgrade from 17.2.7 to 18.2.2 I notice after each time I restart I
> got a issue with perm on
>
>   /var/lib/ceph/FSID/crash
>
> after the restart the owner/group are on nobody and I got
>
>   Error scraping /var/lib/ceph/crash: [Errno 13] Permission denied:
> '/var/lib/ceph/crash'
>
> I need to do a chown ceph:ceph on the directory. But on the next reboot the
> dir return to nobody.
>
> Anyway to fix (beside some cron + chown) this minor bug ?
>
> Regards
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> jeu. 18 juil. 2024 09:53:40 CEST
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Small issue with perms

2024-07-18 Thread David C.
Your ceph processes are in containers.

You don't need the ceph-* packages on the host hosting the containers



Cordialement,

*David CASIER*




*Ligne directe: +33(0) 9 72 61 98 29*




Le jeu. 18 juil. 2024 à 10:34, Albert Shih  a écrit :

> Le 18/07/2024 à 10:27:09+0200, David C. a écrit
> Hi,
>
> >
> > perhaps a conflict with the udev rules of locally installed packages.
> >
> > Try uninstalling ceph-*
>
> Sorry...not sure I understandyou want me to uninstall ceph ?
>
> Regards.
>
> JAS
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> jeu. 18 juil. 2024 10:33:54 CEST
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Small issue with perms

2024-07-18 Thread David C.
you can test on a host and restart it, to validate that everything is fine
(with ceph orch host maintenance enter  [or noout]).

But yes, you should be able to do it without breaking anything.
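
For example (hostname is a placeholder):

  ceph orch host maintenance enter <host>
  ... uninstall / reboot / check ...
  ceph orch host maintenance exit <host>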


Le jeu. 18 juil. 2024 à 11:08, Albert Shih  a écrit :

> Le 18/07/2024 à 11:00:56+0200, Albert Shih a écrit
> > Le 18/07/2024 à 10:56:33+0200, David C. a écrit
> >
> Hi,
>
> >
> > > Your ceph processes are in containers.
> >
> > Yes I know but in my install process I just install
> >
> >   ceph-common
> >   ceph-base
> >
> > then cephadm from with the wget.
> >
> > I didn't install manually the other packages like :
> >
> > ii  ceph-fuse18.2.2-1~bpo12+1
> amd64FUSE-based client for the Ceph distributed file system
> > ii  ceph-mds 18.2.2-1~bpo12+1
> amd64metadata server for the ceph distributed file system
> > ii  ceph-mgr 18.2.2-1~bpo12+1
> amd64manager for the ceph distributed storage system
> > ii  ceph-mgr-cephadm 18.2.2-1~bpo12+1
> all  cephadm orchestrator module for ceph-mgr
> > ii  ceph-mgr-dashboard   18.2.2-1~bpo12+1
> all  dashboard module for ceph-mgr
> > ii  ceph-mgr-diskprediction-local18.2.2-1~bpo12+1
> all  diskprediction-local module for ceph-mgr
> > ii  ceph-mgr-k8sevents   18.2.2-1~bpo12+1
> all  kubernetes events module for ceph-mgr
> > ii  ceph-mgr-modules-core18.2.2-1~bpo12+1
> all  ceph manager modules which are always enabled
> > ii  ceph-mon 18.2.2-1~bpo12+1
> amd64monitor server for the ceph storage system
> > ii  ceph-osd 18.2.2-1~bpo12+1
> amd64OSD server for the ceph storage system
> > ii  ceph-volume  18.2.2-1~bpo12+1
> all  tool to facilidate OSD deploymentYou don't need the
> ceph-* packages on the host hosting the containers
> >
> > I've no idea how they end-up installed, you mean I can securely
> uninstall all of them on all my node ?
>
> I also got a package called
>
>   ceph
>
> should I also uninstalled it ?
>
> Regards.
>
> JAS
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> jeu. 18 juil. 2024 11:08:07 CEST
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Small issue with perms

2024-07-18 Thread David C.
Thanks Christian,

I see the fix is on the postinst, so probably the reboot shouldn't put
"nobody" back, right?


Le jeu. 18 juil. 2024 à 11:44, Christian Rohmann <
christian.rohm...@inovex.de> a écrit :

> On 18.07.24 9:56 AM, Albert Shih wrote:
> >Error scraping /var/lib/ceph/crash: [Errno 13] Permission denied:
> '/var/lib/ceph/crash'
> There is / was a bug with the permissions for ceph-crash, see
>
> *
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/VACLBNVXTYNSXJSNXJSRAQNZHCHABDF4/
> * https://tracker.ceph.com/issues/64548
> * Reef backport (NOT merged yet): https://github.com/ceph/ceph/pull/58458
>
>
> Maybe your issue is somewhat related?
>
>
> Regards
>
>
> Christian
>
>
>
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Unable to mount with 18.2.2

2024-07-18 Thread David C.
Thank you for your research, Frédéric,

We looked and the conf files were up to date, in the form
[v1:(...),v2:(...)]

I managed to reproduce the "incident":

[aevoo-test - ceph-0]# ceph mon dump -f json|jq '.mons[].public_addrs'
dumped monmap epoch 2
{
  "addrvec": [
{
  "type": "v2",
  "addr": "(IP):3300",
  "nonce": 0
},
{
  "type": "v1",
  "addr": "(IP):6789",
  "nonce": 0
}
  ]
}
[aevoo-test - ceph-0]# ceph report |jq '.osdmap.osds[].public_addrs'
report 1958031869
{
  "addrvec": [
{
  "type": "v2",
  "addr": "(IP):6800",
  "nonce": 116888
},
{
  "type": "v1",
  "addr": "(IP):6801",
  "nonce": 116888
}
  ]
}
[aevoo-test - ceph-0]# ceph mon set-addrs home [v1:(IP):6789/0,v2:(IP):3300/0]
[aevoo-test - ceph-0]# ceph mon dump -f json|jq '.mons[].public_addrs'
dumped monmap epoch 3
{
  "addrvec": [
{
  "type": "v1",
  "addr": "(IP):6789",
  "nonce": 0
},
{
  "type": "v2",
  "addr": "(IP):3300",
  "nonce": 0
}
  ]
}

After osd restart :

[aevoo-test - ceph-0]# ceph report |jq '.osdmap.osds[].public_addrs'
report 2993464839
{
  "addrvec": [
{
  "type": "v1",
  "addr": "(IP):6801",
  "nonce": 117895
}
  ]
}




Le jeu. 18 juil. 2024 à 12:30, Frédéric Nass 
a écrit :

> Hi Albert, David,
>
> I came across this: https://github.com/ceph/ceph/pull/47421
>
> "OSDs have a config file that includes addresses for the mon daemons.
> We already have in place logic to cause a reconfig of OSDs if the mon map
> changes, but when we do we aren't actually regenerating the config
> so it's never updated with the new mon addresses. This change is to
> have us recreate the OSD config when we redeploy or reconfig an OSD
> so it gets the new mon addresses."
>
> You mentioned a network change. Maybe the orch failed to update
> /var/lib/ceph/$(ceph fsid)/*/config of some services, came back later and
> succeeded.
>
> Maybe that explains it.
>
> Cheers,
> Frédéric.
>
> - Le 17 Juil 24, à 16:22, Frédéric Nass frederic.n...@univ-lorraine.fr
> a écrit :
>
> > - Le 17 Juil 24, à 15:53, Albert Shih albert.s...@obspm.fr a écrit :
> >
> >> Le 17/07/2024 à 09:40:59+0200, David C. a écrit
> >> Hi everyone.
> >>
> >>>
> >>> The curiosity of Albert's cluster is that (msgr) v1 and v2 are present
> on the
> >>> mons, as well as on the osds backend.
> >>>
> >>> But v2 is absent on the public OSD and MDS network
> >>>
> >>> The specific point is that the public network has been changed.
> >>>
> >>> At first, I thought it was the order of declaration of my_host (v1
> before v2)
> >>> but apparently, that's not it.
> >>>
> >>>
> >>> Le mer. 17 juil. 2024 à 09:21, Frédéric Nass <
> frederic.n...@univ-lorraine.fr> a
> >>> écrit :
> >>>
> >>> Hi David,
> >>>
> >>> Redeploying 2 out of 3 MONs a few weeks back (to have them using
> RocksDB to
> >>> be ready for Quincy) prevented some clients from connecting to the
> cluster
> >>> and mounting cephfs volumes.
> >>>
> >>> Before the redeploy, these clients were using port 6789 (v1)
> explicitly as
> >>> connections wouldn't work with port 3300 (v2).
> >>> After the redeploy, removing port 6789 from mon_ips fixed the
> situation.
> >>>
> >>> Seems like msgr v2 activation did only occur after all 3 MONs were
> >>> redeployed and used RocksDB. Not sure why this happened though.
> >>>
> >>> @Albert, if this cluster has been upgrade several times, you might
> want to
> >>> check /var/lib/ceph/$(ceph fsid)/kv_backend, redeploy the MONS if
> leveldb,
> >>> make sure all clients use the new mon_host syntax in ceph.conf
> ([v2:
> >>> :3300,v1::6789],etc.]) and check their
> ability to
> >>> connect to port 3300.
> >>
> >> So it's working now, I can mount from all my clients the cephfs.
> >>
> >> Because I'm not sure what really happens and where was the 

[ceph-users] Urgent help needed please - MDS offline

2020-10-22 Thread David C
Hi All

My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
overnight; metadata is on a separate pool which didn't hit capacity, but the
filesystem stopped working, which I'd expect. I increased the osd full-ratio
(see the note further down) to give me some breathing room to get some data
deleted once the filesystem is back online. When I attempt to restart the MDS
service, I see the usual stuff I'd expect in the log, but then:

heartbeat_map is_healthy 'MDSRank' had timed out after 15


Followed by:

mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> acked 4.00013s ago); MDS internal heartbeat is not healthy!


Eventually I get:

>
> mds.beacon.hostnamecephssd01 is_laggy 29.372 > 15 since last acked beacon
> mds.0.90884 skipping upkeep work because connection to Monitors appears
> laggy
> mds.hostnamecephssd01 Updating MDS map to version 90885 from mon.0
> mds.beacon.hostnamecephssd01  MDS is no longer laggy


The "MDS is no longer laggy" appears to be where the service fails

Meanwhile a ceph -s is showing:

>
> cluster:
> id: 5c5998fd-dc9b-47ec-825e-beaba66aad11
> health: HEALTH_ERR
> 1 filesystem is degraded
> insufficient standby MDS daemons available
> 67 backfillfull osd(s)
> 11 nearfull osd(s)
> full ratio(s) out of order
> 2 pool(s) backfillfull
> 2 pool(s) nearfull
> 6 scrub errors
> Possible data damage: 5 pgs inconsistent
>   services:
> mon: 3 daemons, quorum hostnameceph01,hostnameceph02,hostnameceph03
> mgr: hostnameceph03(active), standbys: hostnameceph02, hostnameceph01
> mds: cephfs-1/1/1 up  {0=hostnamecephssd01=up:replay}
> osd: 172 osds: 161 up, 161 in
>   data:
> pools:   5 pools, 8384 pgs
> objects: 76.25M objects, 124TiB
> usage:   373TiB used, 125TiB / 498TiB avail
> pgs: 8379 active+clean
>  5active+clean+inconsistent
>   io:
> client:   676KiB/s rd, 0op/s rd, 0op/s w


The 5 inconsistent PGs are not a new issue; they're from past scrubs and I
just haven't gotten around to manually clearing them, although I suppose they
could be related to my issue.

The cluster has no clients connected

I did notice in the ceph.log, some OSDs that are in the same host as the
MDS service briefly went down when trying to restart the MDS but examining
the logs of those particular OSDs isn't showing any glaring issues.
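
(For reference, the full-ratio bump mentioned above was along the lines of
"ceph osd set-full-ratio 0.97"; the default is 0.95.)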

Full MDS log at debug 5 (can go higher if needed):

2020-10-22 11:27:10.987652 7f6f696f5240  0 set uid:gid to 167:167
(ceph:ceph)
2020-10-22 11:27:10.987669 7f6f696f5240  0 ceph version 12.2.10
(177915764b752804194937482a39e95e0ca3de94) luminous (stable), process
ceph-mds, pid 2022582
2020-10-22 11:27:10.990567 7f6f696f5240  0 pidfile_write: ignore empty
--pid-file
2020-10-22 11:27:11.027981 7f6f62616700  1 mds.hostnamecephssd01 Updating
MDS map to version 90882 from mon.0
2020-10-22 11:27:15.097957 7f6f62616700  1 mds.hostnamecephssd01 Updating
MDS map to version 90883 from mon.0
2020-10-22 11:27:15.097989 7f6f62616700  1 mds.hostnamecephssd01 Map has
assigned me to become a standby
2020-10-22 11:27:15.101071 7f6f62616700  1 mds.hostnamecephssd01 Updating
MDS map to version 90884 from mon.0
2020-10-22 11:27:15.105310 7f6f62616700  1 mds.0.90884 handle_mds_map i am
now mds.0.90884
2020-10-22 11:27:15.105316 7f6f62616700  1 mds.0.90884 handle_mds_map state
change up:boot --> up:replay
2020-10-22 11:27:15.105325 7f6f62616700  1 mds.0.90884 replay_start
2020-10-22 11:27:15.105333 7f6f62616700  1 mds.0.90884  recovery set is
2020-10-22 11:27:15.105344 7f6f62616700  1 mds.0.90884  waiting for osdmap
73745 (which blacklists prior instance)
2020-10-22 11:27:15.149092 7f6f5be09700  0 mds.0.cache creating system
inode with ino:0x100
2020-10-22 11:27:15.149693 7f6f5be09700  0 mds.0.cache creating system
inode with ino:0x1
2020-10-22 11:27:41.021708 7f6f63618700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029290 7f6f5f610700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:27:43.029297 7f6f5f610700  0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00013s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:27:45.866711 7f6f5fe11700  1 heartbeat_map reset_timeout
'MDSRank' had timed out after 15
2020-10-22 11:28:01.021965 7f6f63618700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029862 7f6f5f610700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:03.029885 7f6f5f610700  0 mds.beacon.hostnamecephssd01
Skipping beacon heartbeat to monitors (last acked 4.00113s ago); MDS
internal heartbeat is not healthy!
2020-10-22 11:28:06.022033 7f6f63618700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029955 7f6f5f610700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 11:28:07.029961 7f6f5f610700  0 mds.beacon.host

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
Dan, many thanks for the response.

I was going down the route of looking at mds_beacon_grace but I now
realise when I start my MDS, it's swallowing up memory rapidly and
looks like the oom-killer is eventually killing the mds. With debug
upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
the filesystem and I can't see any obvious issues.

This server has 128GB ram with 111GB free with the MDS stopped

The mds_cache_memory_limit is currently set to 32GB

Could this be a case of simply reducing the mds cache until I can get
this started again or is there another setting I should be looking at?
Is it safe to reduce the cache memory limit at this point?
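
(On Luminous that would presumably be a ceph.conf change plus an MDS restart,
or something like "ceph tell mds.<name> injectargs '--mds_cache_memory_limit
8589934592'" if the daemon stays up long enough to accept it; 8589934592 is
8 GiB, value just for illustration.)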

The standby is currently down and has been deliberately down for a while now.

Log excerpt from debug 10 just before MDS is killed (path/to/dir
refers to a real path in my FS)

2020-10-22 13:29:49.527372 7fc72d39f700 10
mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
/path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
dirtyparent s
=0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay inotable tablev 481253 <= table 481328
2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay sessionmap v 240341131 <= table 240378576
2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
EOpen [metab
lob 0x10009e1ec8e, 1881 dirs], 16748 open files
2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay 1881 dirlumps by unknown.0
2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay dir 0x10009e1ec8e
2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b17592
89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
0x56043c4bd100]
2020-10-22 13:29:50.275864 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
2020-10-22 13:29:51.026368 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
rtt 0.750024
2020-10-22 13:29:51.026377 7fc73732e700  0
mds.beacon.hostnamecephssd01  MDS is no longer laggy
2020-10-22 13:29:54.275993 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
2020-10-22 13:29:54.277360 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
rtt 0.0013
2020-10-22 13:29:58.276117 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
2020-10-22 13:29:58.277322 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
rtt 0.0013
2020-10-22 13:30:02.276313 7fc731ba8700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
2020-10-22 13:30:02.477973 7fc73732e700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
rtt 0.202007

Thanks,
David

On Thu, Oct 22, 2020 at 1:41 PM Dan van der Ster  wrote:
>
> You can disable that beacon by increasing mds_beacon_grace to 300 or
> 600. This will stop the mon from failing that mds over to a standby.
> I don't know if that is set on the mon or mgr, so I usually set it on both.
> (You might as well disable the standby too -- no sense in something
> failing back and forth between two mdses).
>
> Next -- looks like your mds is in active:replay. Is it doing anything?
> Is it using lots of CPU/RAM? If you increase debug_mds do you see some
> progress?
>
> -- dan
>
>
> On Thu, Oct 22, 2020 at 2:01 PM David C  wrote:
> >
> > Hi All
> >
> > My main CephFS data pool on a Luminous 12.2.10 cluster hit capacity
> > overnight, metadata is on a separate pool which didn't hit capacity but the
> > filesystem stopped working which I'd expect. I increased the osd full-ratio
> > to give me some breathing room to get some data deleted once the filesystem
> > is back online. When I attempt to restart the MDS service, I see the usual
> > stuff I'd expect in the log but then:
> >
> > heartbeat_map is_healthy 'MDSRank' had timed out after 15
> >
> >
> > Followed by:
> >
> > mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors (last
> > > acked 4.00013s ago); MDS internal heartbeat is not healthy!
> >
> >
> > Eventually I get:
> >
> > >
> >

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
I've not touched the journal segments, current value of
mds_log_max_segments is 128. Would you recommend I increase (or
decrease) that value? And do you think I should change
mds_log_max_expiring to match that value?

On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster  wrote:
>
> You could decrease the mds_cache_memory_limit but I don't think this
> will help here during replay.
>
> You can see a related tracker here: https://tracker.ceph.com/issues/47582
> This is possibly caused by replaying a very large journal. Did you
> increase the journal segments?
>
> -- dan
>
>
>
>
>
>
>
> -- dan
>
> On Thu, Oct 22, 2020 at 3:35 PM David C  wrote:
> >
> > Dan, many thanks for the response.
> >
> > I was going down the route of looking at mds_beacon_grace but I now
> > realise when I start my MDS, it's swallowing up memory rapidly and
> > looks like the oom-killer is eventually killing the mds. With debug
> > upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
> > the filesystem and I can't see any obvious issues.
> >
> > This server has 128GB ram with 111GB free with the MDS stopped
> >
> > The mds_cache_memory_limit is currently set to 32GB
> >
> > Could this be a case of simply reducing the mds cache until I can get
> > this started again or is there another setting I should be looking at?
> > Is it safe to reduce the cache memory limit at this point?
> >
> > The standby is currently down and has been deliberately down for a while 
> > now.
> >
> > Log excerpt from debug 10 just before MDS is killed (path/to/dir
> > refers to a real path in my FS)
> >
> > 2020-10-22 13:29:49.527372 7fc72d39f700 10
> > mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> > 2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> > /path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
> > dirtyparent s
> > =0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
> > 2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay inotable tablev 481253 <= table 481328
> > 2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> > 2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> > 2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.0.log _replay
> > 57437755528637~11764673 / 57441334490146 2020-10-22 09:08:56.198798:
> > EOpen [metab
> > lob 0x10009e1ec8e, 1881 dirs], 16748 open files
> > 2020-10-22 13:29:49.530106 7fc72d39f700 10 mds.0.journal EOpen.replay
> > 2020-10-22 13:29:49.530107 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay 1881 dirlumps by unknown.0
> > 2020-10-22 13:29:49.530109 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay dir 0x10009e1ec8e
> > 2020-10-22 13:29:49.530111 7fc72d39f700 10 mds.0.journal
> > EMetaBlob.replay updated dir [dir 0x10009e1ec8e /path/to/dir/ [2,head]
> > auth v=904150 cv=0/0 state=1073741824 f(v0 m2020-10-22 08:46:44.932805
> > 89215=89215+0) n(v2 rc2020-10-22 08:46:44.932805 b17592
> > 89215=89215+0) hs=42927+1178,ss=0+0 dirty=2376 | child=1
> > 0x56043c4bd100]
> > 2020-10-22 13:29:50.275864 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 13
> > 2020-10-22 13:29:51.026368 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 13
> > rtt 0.750024
> > 2020-10-22 13:29:51.026377 7fc73732e700  0
> > mds.beacon.hostnamecephssd01  MDS is no longer laggy
> > 2020-10-22 13:29:54.275993 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 14
> > 2020-10-22 13:29:54.277360 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
> > rtt 0.0013
> > 2020-10-22 13:29:58.276117 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
> > 2020-10-22 13:29:58.277322 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
> > rtt 0.0013
> > 2020-10-22 13:30:02.276313 7fc731ba8700  5
> > mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
> > 2020-10-22 13:30:02.477973 7fc73732e700  5
> > mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
> > rtt 0.202007
> >
> > Thanks,
> > David
> >
> > On Thu, Oct 22, 2020 at 1:41 PM Dan van der Ster  
> > wrote:
> > >
> > 

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
0  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 14
rtt 0.0013
2020-10-22 16:44:07.783586 7f424ada7700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 15
2020-10-22 16:44:07.784097 7f424fd2c700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 15
rtt 0.0013
2020-10-22 16:44:11.783678 7f424ada7700  5
mds.beacon.hostnamecephssd01 Sending beacon up:replay seq 16
2020-10-22 16:44:11.784223 7f424fd2c700  5
mds.beacon.hostnamecephssd01 received beacon reply up:replay seq 16
rtt 0.0013
2020-10-22 16:44:15.783788 7f424ada7700  1 heartbeat_map is_healthy
'MDSRank' had timed out after 15
2020-10-22 16:44:15.783814 7f424ada7700  0
mds.beacon.hostnamecephssd01 Skipping beacon heartbeat to monitors
(last acked 4.00013s ago); MDS internal heartbeat is not healthy!

On Thu, Oct 22, 2020 at 3:30 PM Dan van der Ster  wrote:
>
> I wouldn't adjust it.
> Do you have the impression that the mds is replaying the exact same ops every
> time the mds is restarting? or is it progressing and trimming the
> journal over time?
>
> The only other advice I have is that 12.2.10 is quite old, and might
> miss some important replay/mem fixes.
> I'm thinking of one particular memory bloat issue we suffered (it
> manifested on a multi-mds cluster, so I am not sure if it is the root
> cause here https://tracker.ceph.com/issues/45090 )
> I don't know enough about the changelog diffs to suggest upgrading
> right now in the middle of this outage.
>
>
> -- dan
>
> On Thu, Oct 22, 2020 at 4:14 PM David C  wrote:
> >
> > I've not touched the journal segments, current value of
> > mds_log_max_segments is 128. Would you recommend I increase (or
> > decrease) that value? And do you think I should change
> > mds_log_max_expiring to match that value?
> >
> > On Thu, Oct 22, 2020 at 3:06 PM Dan van der Ster  
> > wrote:
> > >
> > > You could decrease the mds_cache_memory_limit but I don't think this
> > > will help here during replay.
> > >
> > > You can see a related tracker here: https://tracker.ceph.com/issues/47582
> > > This is possibly caused by replaying a very large journal. Did you
> > > increase the journal segments?
> > >
> > > -- dan
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > -- dan
> > >
> > > On Thu, Oct 22, 2020 at 3:35 PM David C  wrote:
> > > >
> > > > Dan, many thanks for the response.
> > > >
> > > > I was going down the route of looking at mds_beacon_grace but I now
> > > > realise when I start my MDS, it's swallowing up memory rapidly and
> > > > looks like the oom-killer is eventually killing the mds. With debug
> > > > upped to 10, I can see it's doing EMetaBlob.replays on various dirs in
> > > > the filesystem and I can't see any obvious issues.
> > > >
> > > > This server has 128GB ram with 111GB free with the MDS stopped
> > > >
> > > > The mds_cache_memory_limit is currently set to 32GB
> > > >
> > > > Could this be a case of simply reducing the mds cache until I can get
> > > > this started again or is there another setting I should be looking at?
> > > > Is it safe to reduce the cache memory limit at this point?
> > > >
> > > > The standby is currently down and has been deliberately down for a 
> > > > while now.
> > > >
> > > > Log excerpt from debug 10 just before MDS is killed (path/to/dir
> > > > refers to a real path in my FS)
> > > >
> > > > 2020-10-22 13:29:49.527372 7fc72d39f700 10
> > > > mds.0.cache.ino(0x1000e4c0ff4) mark_dirty_parent
> > > > 2020-10-22 13:29:49.527374 7fc72d39f700 10 mds.0.journal
> > > > EMetaBlob.replay noting opened inode [inode 0x1000e4c0ff4 [2,head]
> > > > /path/to/dir/{dc97bb9c-4600-48bb-b232-23f9e45caa6e}.tmp auth v904149
> > > > dirtyparent s
> > > > =0 n(v0 1=1+0) (iversion lock) | dirtyparent=1 dirty=1 0x561c23d66e00]
> > > > 2020-10-22 13:29:49.527378 7fc72d39f700 10 mds.0.journal
> > > > EMetaBlob.replay inotable tablev 481253 <= table 481328
> > > > 2020-10-22 13:29:49.527380 7fc72d39f700 10 mds.0.journal
> > > > EMetaBlob.replay sessionmap v 240341131 <= table 240378576
> > > > 2020-10-22 13:29:49.527383 7fc72d39f700 10 mds.0.journal
> > > > EMetaBlob.replay request client.16250824:1416595263 trim_to 1416595263
> > > > 2020-10-22 13:29:49.530097 7fc72d39f700 10 mds.

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
Thanks, guys

I can't add more RAM right now or have access to a server that does,
I'd fear it wouldn't be enough anyway. I'll give the swap idea a go
and try and track down the thread you mentioned, Frank.

'cephfs-journal-tool journal inspect' tells me the journal is fine. I
was able to back it up cleanly, however the apparent size of the file
reported by du is 53TB, does that sound right to you? The actual size
is 3.7GB.

'cephfs-journal-tool event get list' starts listing events but
eventually gets killed as expected.

'cephfs-journal-tool event get summary'
Events by type:
  OPEN: 314260
  SUBTREEMAP: 1134
  UPDATE: 547973
Errors: 0

Those numbers seem really high to me - for reference this is an approx
128TB (usable space) cluster, 505 objects in metadata pool.
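
(For anyone following along: the backup was taken with something like
"cephfs-journal-tool journal export backup.bin"; the next steps from the
disaster-recovery docs Dan linked would be roughly "cephfs-journal-tool event
recover_dentries summary" followed by "cephfs-journal-tool journal reset".)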

On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder  wrote:
>
> If you can't add RAM, you could try provisioning SWAP on a reasonably fast 
> drive. There is a thread from this year where someone had a similar problem, 
> the MDS running out of memory during replay. He could quickly add sufficient 
> swap and the MDS managed to come up. Took a long time though, but might be 
> faster than getting more RAM and will not loose data.
>
> Your clients will not be able to do much, if anything during recovery though.
>
> Best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________
> From: Dan van der Ster 
> Sent: 22 October 2020 18:11:57
> To: David C
> Cc: ceph-devel; ceph-users
> Subject: [ceph-users] Re: Urgent help needed please - MDS offline
>
> I assume you aren't able to quickly double the RAM on this MDS ? or
> failover to a new MDS with more ram?
>
> Failing that, you shouldn't reset the journal without recovering
> dentries, otherwise the cephfs_data objects won't be consistent with
> the metadata.
> The full procedure to be used is here:
> https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
>  backup the journal, recover dentires, then reset the journal.
> (the steps after might not be needed)
>
> That said -- maybe there is a more elegant procedure than using
> cephfs-journal-tool.  A cephfs dev might have better advice.
>
> -- dan
>
>
> On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
> >
> > I'm pretty sure it's replaying the same ops every time, the last
> > "EMetaBlob.replay updated dir" before it dies is always referring to
> > the same directory. Although interestingly that particular dir shows
> > up in the log thousands of times - the dir appears to be where a
> > desktop app is doing some analytics collecting - I don't know if
> > that's likely to be a red herring or the reason why the journal
> > appears to be so long. It's a dir I'd be quite happy to lose changes
> > to or remove from the file system altogether.
> >
> > I'm loath to update during an outage although I have seen people
> > update the MDS code independently to get out of a scrape - I suspect
> > you wouldn't recommend that.
> >
> > I feel like this leaves me with having to manipulate the journal in
> > some way, is there a nuclear option where I can choose to disregard
> > the uncommitted events? I assume that would be a journal reset with
> > the cephfs-journal-tool but I'm unclear on the impact of that, I'd
> > expect to lose any metadata changes that were made since my cluster
> > filled up but are there further implications? I also wonder what's the
> > riskier option, resetting the journal or attempting an update.
> >
> > I'm very grateful for your help so far
> >
> > Below is more of the debug 10 log with ops relating to the
> > aforementioned dir (name changed but inode is accurate):
> >
> > 2020-10-22 16:44:00.488850 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay updated dir [dir 0x10009e1ec8d /path/to/desktop/app/
> > [2,head] auth v=911968 cv=0/0 state=1610612736 f(v0 m2020-10-14
> > 16:32:42.596652 1=0+1) n(v6164 rc2020-10-22 08:46:44.932805 b17592
> > 89216=89215+1)/n(v6164 rc2020-10-22 08:46:43.950805 b17592
> > 89214=89213+1) hs=1+0,ss=0+0 dirty=1 | child=1 dirty=1 0x5654f8288300]
> > 2020-10-22 16:44:00.488864 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay for [2,head] had [dentry
> > #0x1/path/to/desktop/app/Upload [2,head] auth (dversion lock) v=911967
> > inode=0x5654f8288a00 state=1610612736 | inodepin=1 dirty=1
> > 0x5654f82794a0]
> > 2020-10-22 16:44:00.488873 7f424659e700 10 mds.0.journal
> > EMetaBlob.replay fo

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-22 Thread David C
On Thu, Oct 22, 2020 at 6:09 PM Dan van der Ster  wrote:
>
>
>
> On Thu, 22 Oct 2020, 19:03 David C,  wrote:
>>
>> Thanks, guys
>>
>> I can't add more RAM right now or have access to a server that does,
>> I'd fear it wouldn't be enough anyway. I'll give the swap idea a go
>> and try and track down the thread you mentioned, Frank.
>>
>> 'cephfs-journal-tool journal inspect' tells me the journal is fine. I
>> was able to back it up cleanly, however the apparent size of the file
>> reported by du is 53TB, does that sound right to you? The actual size
>> is 3.7GB.
>
>
> IIRC it's a sparse file. So yes that sounds normal.
>
>
>>
>> 'cephfs-journal-tool event get list' starts listing events but
>> eventually gets killed as expected.
>
>
>
> Does it go oom too?

Yep, the cephfs-journal-tool process gets killed
>
> .. dan
>
>
>
>>
>> 'cephfs-journal-tool event get summary'
>> Events by type:
>>   OPEN: 314260
>>   SUBTREEMAP: 1134
>>   UPDATE: 547973
>> Errors: 0
>>
>> Those numbers seem really high to me - for reference this is an approx
>> 128TB (usable space) cluster, 505 objects in metadata pool.
>>
>> On Thu, Oct 22, 2020 at 5:23 PM Frank Schilder  wrote:
>> >
>> > If you can't add RAM, you could try provisioning SWAP on a reasonably fast 
>> > drive. There is a thread from this year where someone had a similar 
>> > problem, the MDS running out of memory during replay. He could quickly add 
>> > sufficient swap and the MDS managed to come up. Took a long time though, 
>> > but might be faster than getting more RAM and will not lose data.
>> >
>> > Your clients will not be able to do much, if anything during recovery 
>> > though.
>> >
>> > Best regards,
>> > =
>> > Frank Schilder
>> > AIT Risø Campus
>> > Bygning 109, rum S14
>> >
>> > 
>> > From: Dan van der Ster 
>> > Sent: 22 October 2020 18:11:57
>> > To: David C
>> > Cc: ceph-devel; ceph-users
>> > Subject: [ceph-users] Re: Urgent help needed please - MDS offline
>> >
>> > I assume you aren't able to quickly double the RAM on this MDS ? or
>> > failover to a new MDS with more ram?
>> >
>> > Failing that, you shouldn't reset the journal without recovering
>> > dentries, otherwise the cephfs_data objects won't be consistent with
>> > the metadata.
>> > The full procedure to be used is here:
>> > https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>> >
>> >  backup the journal, recover dentries, then reset the journal.
>> > (the steps after might not be needed)
>> >
>> > That said -- maybe there is a more elegant procedure than using
>> > cephfs-journal-tool.  A cephfs dev might have better advice.
>> >
>> > -- dan
>> >
>> >
>> > On Thu, Oct 22, 2020 at 6:03 PM David C  wrote:
>> > >
>> > > I'm pretty sure it's replaying the same ops every time, the last
>> > > "EMetaBlob.replay updated dir" before it dies is always referring to
>> > > the same directory. Although interestingly that particular dir shows
>> > > up in the log thousands of times - the dir appears to be where a
>> > > desktop app is doing some analytics collecting - I don't know if
>> > > that's likely to be a red herring or the reason why the journal
>> > > appears to be so long. It's a dir I'd be quite happy to lose changes
>> > > to or remove from the file system altogether.
>> > >
>> > > I'm loath to update during an outage although I have seen people
>> > > update the MDS code independently to get out of a scrape - I suspect
>> > > you wouldn't recommend that.
>> > >
>> > > I feel like this leaves me with having to manipulate the journal in
>> > > some way, is there a nuclear option where I can choose to disregard
>> > > the uncommitted events? I assume that would be a journal reset with
>> > > the cephfs-journal-tool but I'm unclear on the impact of that, I'd
>> > > expect to lose any metadata changes that were made since my cluster
>> > > filled up but are there further implications? I also wonder what's the
>> > &

[ceph-users] Re: Urgent help needed please - MDS offline

2020-10-23 Thread David C
Success!

I remembered I had a server I'd taken out of the cluster to
investigate some issues, which had some good-quality 800GB Intel DC
SSDs. I dedicated an entire drive to swap, tuned up min_free_kbytes,
added an MDS on that server and let it run. It took 3-4 hours, but the
MDS eventually came back online. It used the 128GB of RAM and about
250GB of the swap.

Dan, thanks so much for steering me down this path, I would have more
than likely started hacking away at the journal otherwise!

Frank, thanks for pointing me towards that other thread, I used your
min_free_kbytes tip
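For anyone hitting the same thing later, the whole setup boiled down to
something like this - device name, sizes and values are examples from my
case, not recommendations:

# dedicate a fast SSD entirely to swap
mkswap /dev/sdX && swapon /dev/sdX
# keep headroom for the kernel while the MDS balloons during replay
sysctl vm.min_free_kbytes=4194304
# stop the mons from marking the silent, replaying MDS as laggy
ceph config set global mds_beacon_grace 600   # or set it in ceph.conf on older releases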

I now need to consider updating - I wonder whether the risk-averse CephFS
operator would go for the latest Nautilus or the latest Octopus. It used
to be that the newer CephFS code meant the most stable, but I don't know
if that's still the case.

Thanks, again
David

On Thu, Oct 22, 2020 at 7:06 PM Frank Schilder  wrote:
>
> The post was titled "mds behind on trimming - replay until memory exhausted".
>
> > Load up with swap and try the up:replay route.
> > Set the beacon to 10 until it finishes.
>
> Good point! The MDS will not send beacons for a long time. Same was necessary 
> in the other case.
>
> Good luck!
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: fixing future rctime

2023-10-20 Thread David C.
Someone correct me if I'm saying something stupid, but from what I see in
the code there is a check each time to make sure rctime doesn't go
backwards, which seems logical to me because otherwise you would have to
go through all the children to determine the correct ctime.

I don't have the impression that a scrub on the fs will solve the problem.

Apart from writing a script that manipulates the metadata (dangerous) or
restoring from a backup, I don't have much of an idea (except, as a joke,
to wait until 2040).
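For what it's worth, the values can be watched directly from a client
mount while testing - the mount point and path here are only examples:

# recursive ctime exposed by CephFS as a virtual xattr
getfattr -n ceph.dir.rctime /mnt/cephfs/path/to/dir
# compare with the ctime of the newest child
stat -c '%Z %n' /mnt/cephfs/path/to/dir/*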


Cordialement,

*David CASIER*





Le ven. 20 oct. 2023 à 11:04, MARTEL Arnaud  a écrit :

> Hi all,
>
> I have some trouble with my backup script because there are a few files, in
> a deep sub-directory, with a creation/modification date in the future (for
> example: 2040-02-06 18:00:00). As my script uses the ceph.dir.rctime
> extended attribute to identify the files and directories to back up, it now
> browses and syncs a lot of unchanged sub-directories…
> I tried a lot of things, including removing and recreating the files so they
> now have the current datetime, but rctime is never updated.  Even when I
> remove the last directory (ie: the one where the files are located), rctime
> is not updated for the parent directories.
>
> Has someone a trick to reset rctime to current datetime (or any other
> solution to remove this inconsistent value of rctime) with quincy (17.2.6) ?
>
> Regards,
> Arnaud
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: fixing future rctime

2023-10-20 Thread David C.
(Re) Hi Arnaud,

Work by Mouratidis Theofilos (closed/not merged) :
https://github.com/ceph/ceph/pull/37938

Maybe ask him if he found a trick


Cordialement,

*David CASIER*





Le ven. 20 oct. 2023 à 13:08, David C.  a écrit :

> Someone correct me if I'm saying something stupid but from what I see in
> the code, there is a check each time to make sure rctime doesn't go back.
> Which seems logical to me because otherwise you would have to go through
> all the children to determine the correct ctime.
>
> I don't have the impression that a scrub on the fs will solve the problem.
>
> Apart from creating a script that manipulates the metas (dangerous) or the
> restoration, I don't have much of an idea. (except [joke] to wait until
> 2040)
> 
>
> Cordialement,
>
> *David CASIER*
>
> 
>
>
>
> Le ven. 20 oct. 2023 à 11:04, MARTEL Arnaud  a
> écrit :
>
>> Hi all,
>>
>> I have some trouble with my backup script because there are a few files,
>> in a deep sub-directory, with a creation/modification date in the future
>> (for example: 2040-02-06 18:00:00). As my script uses the ceph.dir.rctime
>> extended attribute to identify the files and directories to back up, it
>> now browses and syncs a lot of unchanged sub-directories…
>> I tried a lot of things, including removing and recreating the files so
>> they now have the current datetime, but rctime is never updated.  Even when I
>> remove the last directory (ie: the one where the files are located), rctime
>> is not updated for the parent directories.
>>
>> Has someone a trick to reset rctime to current datetime (or any other
>> solution to remove this inconsistent value of rctime) with quincy (17.2.6) ?
>>
>> Regards,
>> Arnaud
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: failure to enable mgr rgw module if not --force

2023-10-24 Thread David C.
Hi Michel,

(I'm just discovering the existence of this module, so it's possible I'm
making mistakes)

The rgw module is new and only seems to be there to configure multisite.

It is present on the v17.2.6 branch but I don't see it in the container for
this version.

In any case, if you're not using multisite, you shouldn't need this module
to access the RGW functionality in the dashboard.
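To see what your mgr actually ships, something like this should be enough
(hedging here - I only looked at the branch, not at a running 17.2.6
container):

ceph mgr module ls | grep -i rgw     # is the module shipped/enabled at all?
ceph versions                        # confirm every mgr runs the same build
ceph mgr module enable rgw --force   # only if you really need the multisite helper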


Cordialement,

*David CASIER*





Le mar. 24 oct. 2023 à 17:12, Michel Jouvin 
a écrit :

> Hi,
>
> I'm trying to use the rgw mgr module to configure RGWs. Unfortunately it
> is not present in 'ceph mgr module ls' list and any attempt to enable it
> suggests that one mgr doesn't support it and that --force should be
> added. Adding --force effectively enabled it.
>
> It is strange as it is a brand new cluster, created in Quincy, using
> cephadm. Why this need for --force? And it seems that even if the module
> is listed as enabled, the 'ceph rgw' command is not recognized and the
> help is not available for the rgw subcommand? What are we doing wrong?
>
> Cheers,
>
> Michel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Quincy: failure to enable mgr rgw module if not --force

2023-10-24 Thread David C.
Correction, it's not so new but doesn't seem to be maintained :

https://github.com/ceph/ceph/commits/v17.2.6/src/pybind/mgr/rgw



Cordialement,

*David CASIER*





Le mar. 24 oct. 2023 à 18:11, David C.  a écrit :

> Hi Michel,
>
> (I'm just discovering the existence of this module, so it's possible I'm
> making mistakes)
>
> The rgw module is new and only seems to be there to configure multisite.
>
> It is present on the v17.2.6 branch but I don't see it in the container
> for this version.
>
> In any case, if you're not using multisite, you shouldn't need this module
> to access the RGW functionality in the dashboard.
> 
>
> Cordialement,
>
> *David CASIER*
> 
>
>
>
>
> Le mar. 24 oct. 2023 à 17:12, Michel Jouvin 
> a écrit :
>
>> Hi,
>>
>> I'm trying to use the rgw mgr module to configure RGWs. Unfortunately it
>> is not present in 'ceph mgr module ls' list and any attempt to enable it
>> suggests that one mgr doesn't support it and that --force should be
>> added. Adding --force effectively enabled it.
>>
>> It is strange as it is a brand new cluster, created in Quincy, using
>> cephadm. Why this need for --force? And it seems that even if the module
>> is listed as enabled, the 'ceph rgw' command is not recognized and the
>> help is not available for the rgw subcommand? What are we doing wrong?
>>
>> Cheers,
>>
>> Michel
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: radosgw - octopus - 500 Bad file descriptor on upload

2023-10-25 Thread David C.
Hi Hubert,

It's an error "125" (ECANCELED)  (and there may be many reasons for it).

I see a high latency (144sec), is the object big ?
No network problems ?


Cordialement,

*David CASIER*





Le mer. 25 oct. 2023 à 16:37, BEAUDICHON Hubert (Acoss) <
hubert.beaudic...@acoss.fr> a écrit :

> Hi,
> We encountered the same kind of error for one of our users.
> CEPH Version : 16.2.10
>
> 2023-10-24T17:57:22.438+0200 7fc27ab44700  0 WARNING: set_req_state_err
> err_no=125 resorting to 500
> 2023-10-24T17:57:22.439+0200 7fc584957700  0 req 12200560481916573577
> 143.735748291s ERROR: RESTFUL_IO(s)->complete_header() returned err=Bad
> file descriptor
> 2023-10-24T17:57:22.439+0200 7fbecfaed700  1 == req done
> req=0x7fbdb86ab600 op status=-125 http_status=500 latency=143.735748291s
> ==
> 2023-10-24T17:57:22.439+0200 7fbecfaed700  1 beast: 0x7fbdb86ab600:
> 10.227.131.117 - dev-centralog-save [24/Oct/2023:17:54:58.703 +0200] "PUT
> &partNumber=1 HTTP/1.1" 500 58720313 - "" -
> latency=143.735748291s
>
> I haven't got any clue on the cause...
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Emergency, I lost 4 monitors but all osd disk are safe

2023-11-02 Thread David C.
Hi Mohamed,

I understand there's still one operational monitor, is that right?
If so, you need to reprovision the other monitors with an empty store so
that they synchronize with the only remaining monitor.
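Roughly, the documented way to do that looks like the sketch below - the
daemon names are placeholders and it has to be done on the surviving
monitor only, with its daemon stopped; don't copy it blindly:

# shrink the monmap so the surviving mon can form a quorum of one
ceph-mon -i <surviving-id> --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm <dead-mon-1> --rm <dead-mon-2> --rm <dead-mon-3>
ceph-mon -i <surviving-id> --inject-monmap /tmp/monmap
# restart it, wait for quorum, then redeploy the others with empty stores
ceph orch daemon add mon <host>:<ip>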


Cordialement,

*David CASIER*





Le jeu. 2 nov. 2023 à 13:42, Mohamed LAMDAOUAR 
a écrit :

> Hi robert,
>
> when I ran this command, I got this error (because the database of the osd
> was on the boot disk)
>
> ceph-objectstore-tool \
> > --type bluestore \
> > --data-path /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9 \
> > --op update-mon-db \
> > --mon-store-path /home/enyx-admin/backup-osd-9 \
> > --no-mon-config --debug
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bdev(0x560257b42400
> /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block) open path
> /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bdev(0x560257b42400
> /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block) open size
> 1827154432 (0x9187fc0, 9.1 TiB) block_size 4096 (4 KiB) rotational
> device, discard not supported
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bluestore(/var/lib/ceph/
> c80891ba-55f3-11ed-9389-919f4368965c/osd.9) _set_cache_sizes cache_size
> 1073741824 meta 0.45 kv 0.45 data 0.06
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bdev(0x560257b42c00
> /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block) open path
> /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bdev(0x560257b42c00
> /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block) open size
> 1827154432 (0x9187fc0, 9.1 TiB) block_size 4096 (4 KiB) rotational
> device, discard not supported
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bluefs add_block_device bdev 1
> path /var/lib/ceph/c80891ba-55f3-11ed-9389-919f4368965c/osd.9/block size
> 9.1 TiB
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bluefs mount
>
> 2023-11-02T10:59:33.381+ 7f6724da71c0  1 bluefs _init_alloc shared, id
> 1, capacity 0x9187fc0, block size 0x1
>
> 2023-11-02T10:59:33.441+ 7f6724da71c0 -1 bluefs _replay 0x0: stop: uuid
> 369c96dd-2df1-8d88-2722-3f8334920e83 != super.uuid
> ba94c6e8-394b-4a78-84d6-9afe1cbc280b,
> block dump:
>   42 9c 61 78 69 ec 36 9c  96 dd 2d f1 8d 88 27 22
>  |B.axi.6...-...'"|
> 0010  3f 83 34 92 0e 83 e6 0f  8f 17 fc 3e ec 86 c5 15
>  |?.4>|
> 0020  39 91 13 2e b0 14 92 86  65 75 5c 8e c1 ee fc 18
>  |9...eu\.|
> 0030  f1 7b b2 37 f7 75 70 e2  5e da 79 cd e6 ad 27 40
>  |.{.7.up.^.y...'@|
> 0040  d6 b8 3b da 81 1f 9b ba  c6 e8 b7 68 bc a1 77 ac
>  |..;h..w.|
> 0050  7b a9 a3 cd 9d da b6 57  aa 40 bd ab d0 89 ec e6  |{..W.@
> ..|
> 0060  71 a2 2b 4d 87 74 2f ff  0a bf 3b da 3d da 93 52
>  |q.+M.t/...;.=..R|
> 0070  1c ea f2 fb 8d e0 a1 e6  ef b5 42 5e 85 87 27 df
>  |..B^..'.|
> 0080  ac f1 ae 08 9d c5 71 6f  0f f7 68 ce 28 3d 3e 6e
>  |..qo..h.(=>n|
> 0090  94 b2 1a dc 3b f0 9e e9  6e 77 dd 95 b6 9e 94 56
>  |;...nw.V|
> 00a0  f2 dd 9a 35 a0 65 78 05  bb a9 5f a1 99 6a 5c a1
>  |...5.ex..._..j\.|
> 00b0  5d e9 6d 02 83 be 9d 60  d1 82 fc 6c 66 40 11 17  |].m`...lf@
> ..|
> 00c0  3a 4d 9d 73 f6 ec fb ed  41 db e2 39 15 e1 5f 28
>  |:M.sA..9.._(|
> 00d0  c4 ce cf eb 93 f2 88 d5  af ae 11 14 d6 97 74 ff
>  |..t.|
> 00e0  4b 7e 73 fe 97 4c 06 2a  3a bc b3 7f 04 94 6c 1d
>  |K~s..L.*:.l.|
> 00f0  60 bf b1 42 fa 76 b0 df  33 ff bf 84 36 b1 b5 b3
>  |`..B.v..3...6...|
> 0100  17 36 d6 b7 7d 4c d4 37  fa 7f 8e 59 1f 72 53 d5
>  |.6..}L.7...Y.rS.|
> 0110  c4 d0 de d8 4e 13 ca c6  0a 60 87 3c e4 21 2b 1b
>  |N`.<.!+.|
> 0120  00 f2 67 cf 0a 02 01 20  ec ec 7f c1 8f e3 df f8  |..g
> |
> 0130  3f db 7f 60 28 14 8a fa  48 cb c6 f6 c7 9a 3f 71
>  |?..`(...H.?q|
> 0140  bf 61 36 30 08 c0 f1 e7  f8 af b5 7f d2 fc ad a1
>  |.a60|
> 0150  72 b2 40 ff 82 ff a3 c7  5f f0 a3 0e 8f b2 fe b6  |r.@
> ._...|
> 0160  ee 2f 5d fe 90 8b fa 28  8f 95 03 fa 5b ee e3 9c
>  |./]([...|
> 0170  36 ea 3f 6a 1e c0 fe bb  c2 80 4a 56 ca 96 26 8f
>  |6.?j..JV..&.|
> 0180  85 03 e0 f8 67 c9 3d a8  fa 97 af c5 c0 00 ce 7f
>  |g.=.|
> 0190  cd 83 ff 36 ff c0 1c f0  7b c1 03 cf b7 b6 56 06
>  |...6{.V.|
> 01a0  8a 30 7b 4d e0 5b 11 31  a0 12 cc d9 5e fb 7f 2d
>  |.0{M.[.1^..-|
> 01b0  fb 47 04 df ea 1b 3d 3e  6c 1f f7 07 96 df 97 cf
>  |.G=>l...|
> 01c0  15 60 76 56 0e b6 06 30  3b c0 6f 11 0a 40 19 98  |.`vV...0;.o..@
> ..|
> 01d0  a1 89 fe e3 b6 f3 28 80  83 e5 c1 1d 9c ac da 40
>  |..(@|
> 01e0  71 5b 2b 07 eb 07 2e 8a  0f 71 7f 88 aa f5 23 0b
>  |q[+..q#.|
> 01f0  0

[ceph-users] Re: Emergency, I lost 4 monitors but all osd disk are safe

2023-11-02 Thread David C.
Hi,

I've just checked with the team and the situation is much more serious than
it seems: the lost disks contained the MON AND OSD databases (5 servers
down out of 8, replica 3).

It seems that the team fell victim to a bad batch of Samsung 980 Pros (I'm
not a big fan of this "Pro" range, but that's not the point), which have
never been able to restart since the incident.

Someone please correct me, but as far as I'm concerned, the cluster is lost.


Cordialement,

*David CASIER*




*Ligne directe: +33(0) 9 72 61 98 29*




Le jeu. 2 nov. 2023 à 15:49, Anthony D'Atri  a écrit :

> This admittedly is the case throughout the docs.
>
> > On Nov 2, 2023, at 07:27, Joachim Kraftmayer - ceph ambassador <
> joachim.kraftma...@clyso.com> wrote:
> >
> > Hi,
> >
> > another short note regarding the documentation, the paths are designed
> for a package installation.
> >
> > the paths for container installation look a bit different e.g.:
> /var/lib/ceph//osd.y/
> >
> > Joachim
> >
> > ___
> > ceph ambassador DACH
> > ceph consultant since 2012
> >
> > Clyso GmbH - Premier Ceph Foundation Member
> >
> > https://www.clyso.com/
> >
> > Am 02.11.23 um 12:02 schrieb Robert Sander:
> >> Hi,
> >>
> >> On 11/2/23 11:28, Mohamed LAMDAOUAR wrote:
> >>
> >>>I have 7 machines on CEPH cluster, the service ceph runs on a docker
> >>> container.
> >>>   Each machine has 4 hdd of data (available) and 2 nvme sssd (bricked)
> >>>During a reboot, the ssd bricked on 4 machines, the data are
> available on
> >>> the HDD disk but the nvme is bricked and the system is not available.
> is it
> >>> possible to recover the data of the cluster (the data disk are all
> >>> available)
> >>
> >> You can try to recover the MON db from the OSDs, as they keep a copy of
> it:
> >>
> >>
> https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-mon/#monitor-store-failures
> >>
> >> Regards
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Ceph dashboard reports CephNodeNetworkPacketErrors

2023-11-07 Thread David C.
Hi Dominique,

The consistency of the data should not be at risk with such a problem.
But on the other hand, it's better to solve the network problem.

Perhaps look at the state of bond0 :
cat /proc/net/bonding/bond0
As well as the usual network checks
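Typically something like this (interface names taken from your mail,
adapt as needed):

cat /proc/net/bonding/bond0                  # slave state and link failure counts
ip -s link show eno5                         # RX/TX error and drop counters
ethtool -S eno5 | grep -iE 'err|drop|crc'    # per-NIC counters, often points at cabling/SFPs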


Cordialement,

*David CASIER*





Le mar. 7 nov. 2023 à 11:20, Dominique Ramaekers <
dominique.ramaek...@cometal.be> a écrit :

> Hi,
>
> I'm using Ceph on a 4-host cluster for a year now. I recently discovered
> the Ceph Dashboard :-)
>
> Now I see that the Dashboard reports CephNodeNetworkPacketErrors >0.01% or
> >10 packets/s...
>
> Although all systems work great, I'm worried.
>
> 'ip -s link show eno5' results:
> 2: eno5:  mtu 1500 qdisc mq master
> bond0 state UP mode DEFAULT group default qlen 1000
> link/ether 7a:3b:79:9c:f6:d1 brd ff:ff:ff:ff:ff:ff permaddr
> 5c:ba:2c:08:b3:90
> RX: bytes   packets errors dropped  missed   mcast
>  734153938129 645770129  20160   0   0  342301
> TX: bytes   packets errors dropped carrier collsns
> 1085134190597 923843839  0   0   0   0
> altname enp178s0f0
>
> So in average 0,0003% of RX packet errors!
>
> All four hosts use the same 10Gb HP switch. The hosts themselves are
> HP Proliant G10 servers. I would expect 0% packet loss...
>
> Anyway. Should I be worried about data consistency? Or can Ceph handle
> this amount of packet errors?
>
> Greetings,
>
> Dominique.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 100.00 Usage for ssd-pool (maybe after: ceph osd crush move .. root=default)

2023-11-08 Thread David C.
Hi,

It seems to me that before removing buckets from the crushmap, it is
necessary to do the migration first.
I think you should restore the initial crushmap by adding the default root
next to it and only then do the migration.
There should be some backfill (probably a lot).
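If it helps, the usual way to edit the crushmap offline is something like
this - a sketch, and the edited map should be tested before it is
injected:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# re-add the old ssd/sata roots next to "default", adjust the rules, then:
crushtool -c crushmap.txt -o crushmap-new.bin
crushtool -i crushmap-new.bin --test --rule 0 --num-rep 3 --show-statistics
ceph osd setcrushmap -i crushmap-new.bin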


Cordialement,

*David CASIER*




*Ligne directe: +33(0) 9 72 61 98 29*




Le mer. 8 nov. 2023 à 11:27, Denny Fuchs  a écrit :

> Hello,
>
> we upgraded to Quincy and tried to remove an obsolete part:
>
> In the beginning of Ceph, there where no device classes and we created
> rules, to split them into hdd and ssd on one of our datacenters.
>
>
> https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>
> So we had different "roots" for SSD and HDD. Two weeks ago .. we tried
> to move the hosts to the root=default and checked .. what happens ..
> nothing .. all was fine and working. But we did not checked the "ceph
> df":
>
> ==
> root@fc-r02-ceph-osd-01:[~]: ceph osd df tree
> ID   CLASS  WEIGHTREWEIGHT  SIZE RAW USE  DATA OMAP META
>   AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
> -140 -  0 B  0 B  0 B  0 B
> 0 B  0 B  0 0-  root sata
> -180 -  0 B  0 B  0 B  0 B
> 0 B  0 B  0 0-  datacenter fc-sata
> -160 -  0 B  0 B  0 B  0 B
> 0 B  0 B  0 0-  rack r02-sata
> -130 -  0 B  0 B  0 B  0 B
> 0 B  0 B  0 0-  root ssds
> -170 -  0 B  0 B  0 B  0 B
> 0 B  0 B  0 0-  datacenter fc-ssds
> -150 -  0 B  0 B  0 B  0 B
> 0 B  0 B  0 0-  rack r02-ssds
>   -1 23.99060 -   23 TiB   13 TiB   12 TiB  6.0 GiB32
> GiB   11 TiB  54.17  1.00-  root default
>   -6  4.00145 -  3.9 TiB  2.1 TiB  2.1 TiB  2.1 MiB   7.2
> GiB  1.7 TiB  54.87  1.01-  host fc-r02-ceph-osd-01
>0ssd   0.45470   1.0  447 GiB  236 GiB  235 GiB  236 KiB   794
> MiB  211 GiB  52.80  0.97  119  up  osd.0
>1ssd   0.45470   1.0  447 GiB  222 GiB  221 GiB  239 KiB   808
> MiB  225 GiB  49.67  0.92  108  up  osd.1
>2ssd   0.45470   1.0  447 GiB  245 GiB  244 GiB  254 KiB   819
> MiB  202 GiB  54.85  1.01  118  up  osd.2
>3ssd   0.45470   1.0  447 GiB  276 GiB  276 GiB  288 KiB   903
> MiB  171 GiB  61.83  1.14  135  up  osd.3
>4ssd   0.45470   1.0  447 GiB  268 GiB  267 GiB  272 KiB   913
> MiB  180 GiB  59.85  1.10  132  up  osd.4
>5ssd   0.45470   1.0  447 GiB  204 GiB  203 GiB  181 KiB   684
> MiB  243 GiB  45.56  0.84  108  up  osd.5
>   41ssd   0.36388   1.0  373 GiB  211 GiB  210 GiB  207 KiB   818
> MiB  161 GiB  56.69  1.05  104  up  osd.41
>   42ssd   0.45470   1.0  447 GiB  220 GiB  219 GiB  214 KiB   791
> MiB  227 GiB  49.26  0.91  107  up  osd.42
>   48ssd   0.45470   1.0  447 GiB  284 GiB  284 GiB  281 KiB   864
> MiB  163 GiB  63.62  1.17  139  up  osd.48
>   -2  3.98335 -  3.9 TiB  2.1 TiB  2.1 TiB  1.0 GiB   5.0
> GiB  1.7 TiB  54.82  1.01-  host fc-r02-ceph-osd-02
>   36   nvme   0.36388   1.0  373 GiB  239 GiB  238 GiB  163 MiB   460
> MiB  134 GiB  64.10  1.18  127  up  osd.36
>6ssd   0.45470   1.0  447 GiB  247 GiB  246 GiB  114 MiB   585
> MiB  200 GiB  55.20  1.02  121  up  osd.6
>7ssd   0.45470   1.0  447 GiB  260 GiB  259 GiB  158 MiB   590
> MiB  187 GiB  58.19  1.07  126  up  osd.7
>8ssd   0.45470   1.0  447 GiB  196 GiB  195 GiB  165 MiB   471
> MiB  251 GiB  43.85  0.81  101  up  osd.8
>9ssd   0.45470   1.0  447 GiB  203 GiB  202 GiB  168 MiB   407
> MiB  244 GiB  45.34  0.84  104  up  osd.9
>   10ssd   0.43660   1.0  447 GiB  284 GiB  283 GiB  287 KiB   777
> MiB  163 GiB  63.49  1.17  142  up  osd.10
>   29ssd   0.45470   1.0  447 GiB  241 GiB  240 GiB  147 MiB   492
> MiB  206 GiB  53.93  1.00  124  up  osd.29
>   43ssd   0.45470   1.0  447 GiB  257 GiB  256 GiB  151 MiB   509
> MiB  190 GiB  57.48  1.06  131  up  osd.43
>   49ssd   0.45470   1.0  447 GiB  239 GiB  238 GiB  242 KiB   820
> MiB  209 GiB  53.35  0.98  123  up  osd.49
>   -5  4.00145 -  3.9 TiB  2.1 TiB  2.1 TiB  1.3 GiB   4.9
> GiB  1.7 TiB  55.41  1.02-  host fc-r02-ceph-osd-03
>   40   nvme   0.36388  

[ceph-users] Re: 100.00 Usage for ssd-pool (maybe after: ceph osd crush move .. root=default)

2023-11-08 Thread David C.
I've probably answered too quickly if the migration is complete and there
are no incidents.

Are the PGs active+clean?


Cordialement,

*David CASIER*





Le mer. 8 nov. 2023 à 11:50, David C.  a écrit :

> Hi,
>
> It seems to me that before removing buckets from the crushmap, it is
> necessary to do the migration first.
> I think you should restore the initial crushmap by adding the default root
> next to it and only then do the migration.
> There should be some backfill (probably a lot).
> 
>
> Cordialement,
>
> *David CASIER*
>
> 
>
>
>
> Le mer. 8 nov. 2023 à 11:27, Denny Fuchs  a écrit :
>
>> Hello,
>>
>> we upgraded to Quincy and tried to remove an obsolete part:
>>
>> In the beginning of Ceph, there where no device classes and we created
>> rules, to split them into hdd and ssd on one of our datacenters.
>>
>>
>> https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>>
>> So we had different "roots" for SSD and HDD. Two weeks ago .. we tried
>> to move the hosts to the root=default and checked .. what happens ..
>> nothing .. all was fine and working. But we did not checked the "ceph
>> df":
>>
>> ==
>> root@fc-r02-ceph-osd-01:[~]: ceph osd df tree
>> ID   CLASS  WEIGHTREWEIGHT  SIZE RAW USE  DATA OMAP META
>>   AVAIL%USE   VAR   PGS  STATUS  TYPE NAME
>> -140 -  0 B  0 B  0 B  0 B
>> 0 B  0 B  0 0-  root sata
>> -180 -  0 B  0 B  0 B  0 B
>> 0 B  0 B  0 0-  datacenter fc-sata
>> -160 -  0 B  0 B  0 B  0 B
>> 0 B  0 B  0 0-  rack r02-sata
>> -130 -  0 B  0 B  0 B  0 B
>> 0 B  0 B  0 0-  root ssds
>> -170 -  0 B  0 B  0 B  0 B
>> 0 B  0 B  0 0-  datacenter fc-ssds
>> -150 -  0 B  0 B  0 B  0 B
>> 0 B  0 B  0 0-  rack r02-ssds
>>   -1 23.99060 -   23 TiB   13 TiB   12 TiB  6.0 GiB32
>> GiB   11 TiB  54.17  1.00-  root default
>>   -6  4.00145 -  3.9 TiB  2.1 TiB  2.1 TiB  2.1 MiB   7.2
>> GiB  1.7 TiB  54.87  1.01-  host fc-r02-ceph-osd-01
>>0ssd   0.45470   1.0  447 GiB  236 GiB  235 GiB  236 KiB   794
>> MiB  211 GiB  52.80  0.97  119  up  osd.0
>>1ssd   0.45470   1.0  447 GiB  222 GiB  221 GiB  239 KiB   808
>> MiB  225 GiB  49.67  0.92  108  up  osd.1
>>2ssd   0.45470   1.0  447 GiB  245 GiB  244 GiB  254 KiB   819
>> MiB  202 GiB  54.85  1.01  118  up  osd.2
>>3ssd   0.45470   1.0  447 GiB  276 GiB  276 GiB  288 KiB   903
>> MiB  171 GiB  61.83  1.14  135  up  osd.3
>>4ssd   0.45470   1.0  447 GiB  268 GiB  267 GiB  272 KiB   913
>> MiB  180 GiB  59.85  1.10  132  up  osd.4
>>5ssd   0.45470   1.0  447 GiB  204 GiB  203 GiB  181 KiB   684
>> MiB  243 GiB  45.56  0.84  108  up  osd.5
>>   41ssd   0.36388   1.0  373 GiB  211 GiB  210 GiB  207 KiB   818
>> MiB  161 GiB  56.69  1.05  104  up  osd.41
>>   42ssd   0.45470   1.0  447 GiB  220 GiB  219 GiB  214 KiB   791
>> MiB  227 GiB  49.26  0.91  107  up  osd.42
>>   48ssd   0.45470   1.0  447 GiB  284 GiB  284 GiB  281 KiB   864
>> MiB  163 GiB  63.62  1.17  139  up  osd.48
>>   -2  3.98335 -  3.9 TiB  2.1 TiB  2.1 TiB  1.0 GiB   5.0
>> GiB  1.7 TiB  54.82  1.01-  host fc-r02-ceph-osd-02
>>   36   nvme   0.36388   1.0  373 GiB  239 GiB  238 GiB  163 MiB   460
>> MiB  134 GiB  64.10  1.18  127  up  osd.36
>>6ssd   0.45470   1.0  447 GiB  247 GiB  246 GiB  114 MiB   585
>> MiB  200 GiB  55.20  1.02  121  up  osd.6
>>7ssd   0.45470   1.0  447 GiB  260 GiB  259 GiB  158 MiB   590
>> MiB  187 GiB  58.19  1.07  126  up  osd.7
>>8ssd   0.45470   1.0  447 GiB  196 GiB  195 GiB  165 MiB   471
>> MiB  251 GiB  43.85  0.81  101  up  osd.8
>&

[ceph-users] Re: 100.00 Usage for ssd-pool (maybe after: ceph osd crush move .. root=default)

2023-11-08 Thread David C.
so the next step is to place the pools on the right rule :

ceph osd pool set db-pool  crush_rule fc-r02-ssd
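Then check that the pool really picked it up and let the backfill run
(pool and rule names as in your output):

ceph osd pool get db-pool crush_rule   # should now report fc-r02-ssd
ceph -s                                # watch the backfill progress
ceph osd df tree                       # usage should level out under root "default"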


Le mer. 8 nov. 2023 à 12:04, Denny Fuchs  a écrit :

> hi,
>
> I've forget to write the command, I've used:
>
> =
> ceph osd crush move fc-r02-ceph-osd-01 root=default
> ceph osd crush move fc-r02-ceph-osd-01 root=default
> ...
> =
>
> and I've found also this param:
>
> ===
> root@fc-r02-ceph-osd-01:[~]: ceph osd crush tree --show-shadow
> ID   CLASS  WEIGHTTYPE NAME
> -39   nvme   1.81938  root default~nvme
> -30   nvme 0  host fc-r02-ceph-osd-01~nvme
> -31   nvme   0.36388  host fc-r02-ceph-osd-02~nvme
>   36   nvme   0.36388  osd.36
> -32   nvme   0.36388  host fc-r02-ceph-osd-03~nvme
>   40   nvme   0.36388  osd.40
> -33   nvme   0.36388  host fc-r02-ceph-osd-04~nvme
>   37   nvme   0.36388  osd.37
> -34   nvme   0.36388  host fc-r02-ceph-osd-05~nvme
>   38   nvme   0.36388  osd.38
> -35   nvme   0.36388  host fc-r02-ceph-osd-06~nvme
>   39   nvme   0.36388  osd.39
> -38   nvme 0  root ssds~nvme
> -37   nvme 0  datacenter fc-ssds~nvme
> -36   nvme 0  rack r02-ssds~nvme
> -29   nvme 0  root sata~nvme
> -28   nvme 0  datacenter fc-sata~nvme
> -27   nvme 0  rack r02-sata~nvme
> -24ssd 0  root ssds~ssd
> -23ssd 0  datacenter fc-ssds~ssd
> -21ssd 0  rack r02-ssds~ssd
> -22ssd 0  root sata~ssd
> -19ssd 0  datacenter fc-sata~ssd
> -20ssd 0  rack r02-sata~ssd
> -140  root sata
> -180  datacenter fc-sata
> -160  rack r02-sata
> -130  root ssds
> -170  datacenter fc-ssds
> -150  rack r02-ssds
>   -4ssd  22.17122  root default~ssd
>   -7ssd   4.00145  host fc-r02-ceph-osd-01~ssd
>0ssd   0.45470  osd.0
>1ssd   0.45470  osd.1
>2ssd   0.45470  osd.2
>3ssd   0.45470  osd.3
>4ssd   0.45470  osd.4
>5ssd   0.45470  osd.5
>   41ssd   0.36388  osd.41
>   42ssd   0.45470  osd.42
>   48ssd   0.45470  osd.48
>   -3ssd   3.61948  host fc-r02-ceph-osd-02~ssd
>6ssd   0.45470  osd.6
>7ssd   0.45470  osd.7
>8ssd   0.45470  osd.8
>9ssd   0.45470  osd.9
>   10ssd   0.43660  osd.10
>   29ssd   0.45470  osd.29
>   43ssd   0.45470  osd.43
>   49ssd   0.45470  osd.49
>   -8ssd   3.63757  host fc-r02-ceph-osd-03~ssd
>   11ssd   0.45470  osd.11
>   12ssd   0.45470  osd.12
>   13ssd   0.45470  osd.13
>   14ssd   0.45470  osd.14
>   15ssd   0.45470  osd.15
>   16ssd   0.45470  osd.16
>   44ssd   0.45470  osd.44
>   50ssd   0.45470  osd.50
> -10ssd   3.63757  host fc-r02-ceph-osd-04~ssd
>   30ssd   0.45470  osd.30
>   31ssd   0.45470  osd.31
>   32ssd   0.45470  osd.32
>   33ssd   0.45470  osd.33
>   34ssd   0.45470  osd.34
>   35ssd   0.45470  osd.35
>   45ssd   0.45470  osd.45
>   51ssd   0.45470  osd.51
> -12ssd   3.63757  host fc-r02-ceph-osd-05~ssd
>   17ssd   0.45470  osd.17
>   18ssd   0.45470  osd.18
>   19ssd   0.45470  osd.19
>   20ssd   0.45470  osd.20
>   21ssd   0.45470  osd.21
>   22ssd   0.45470  osd.22
>   46ssd   0.45470  osd.46
>   52ssd   0.45470  osd.52
> -26ssd   3.63757  host fc-r02-ceph-osd-06~ssd
>   23ssd   0.45470  osd.23
>   24ssd   0.45470  osd.24
>   25ssd   0.45470  osd.25
>   26ssd   0.45470  osd.26
>   27ssd   0.45470  osd.27
>   28ssd   0.45470  osd.28
>   47ssd   0.45470  osd.47
>   53ssd   0.45470  osd.53
>   -1 23.99060  root default
>   -6  4.00145  host fc-r02-ceph-osd-01
>0ssd   0.45470  osd.0
>1ssd   0.45470  osd.1
>2ssd   0.45470  osd.2
>3ssd   0.45470  osd.3
>4ssd   0.45470  osd.4
>5ssd   0.45470  osd.5
>   41ssd   0.36388  osd.41
>   42ssd   0.45470  osd.42
>   48ssd   0.45470  osd.48
>   -2  3.98335  host fc-r02-ceph-osd-02
>   36   nvme   0.36388  osd.36
>6ssd   0.45470  osd.6
>7ssd   0.45470  osd.7
>8ssd   0.45470  osd.8
>9ssd   0.45470  osd.9
>   10ssd   0.43660  osd.10
>   29ssd   0.45470 

[ceph-users] Re: HDD cache

2023-11-08 Thread David C.
Without (raid/jbod) controller ?
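If a controller sits in between, the setting can be silently overridden.
A few checks that usually help (the device name is an example):

sdparm -g WCE /dev/sdb       # current and saved value of the write-cache bit
smartctl -g wcache /dev/sdb  # what the drive itself reports
lsscsi                       # shows which HBA/RAID controller the disk hangs off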

Le mer. 8 nov. 2023 à 18:36, Peter  a écrit :

> Hi All,
>
> I note that HDD cluster commit delay improves after I turn off the HDD cache.
> However, I also note that not all HDDs are able to turn off the cache.
> In particular, I found that with two HDDs of the same model number, one can
> turn the cache off and the other doesn't. I guess my system config or some
> other setting differs between the two HDDs.
> Below is my command to turn off the HDD cache.
>
> root@lahost008:~# sdparm --set WCE=0 /dev/sdb
>  /dev/sdb: ATA ST1NE000-3AP EN01
> root@lahost008:~# cat /sys/block/sdb/queue/write_cache
>  write through
>
> I also tried using sdparm to run "sdparm --set WCE=0 /dev/sdb" and got the
> same result.
>
> Can anyone who has experienced this advise?
>
> Thanks a lot
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush map & rule

2023-11-08 Thread David C.
Hi Albert,

What would be the number of replicas (in total and on each row) and their
distribution on the tree ?


Le mer. 8 nov. 2023 à 18:45, Albert Shih  a écrit :

> Hi everyone,
>
> I'm totally newbie with ceph, so sorry if I'm asking some stupid question.
>
> I'm trying to understand how the crush map & rule work, my goal is to have
> two groups of 3 servers, so I'm using “row” bucket
>
> ID   CLASS  WEIGHTTYPE NAME STATUS  REWEIGHT  PRI-AFF
>  -1 59.38367  root default
> -15 59.38367  zone City
> -17 29.69183  row primary
>  -3  9.89728  host server1
>   0ssd   3.49309  osd.0 up   1.0  1.0
>   1ssd   1.74660  osd.1 up   1.0  1.0
>   2ssd   1.74660  osd.2 up   1.0  1.0
>   3ssd   2.91100  osd.3 up   1.0  1.0
>  -5  9.89728  host server2
>   4ssd   1.74660  osd.4 up   1.0  1.0
>   5ssd   1.74660  osd.5 up   1.0  1.0
>   6ssd   2.91100  osd.6 up   1.0  1.0
>   7ssd   3.49309  osd.7 up   1.0  1.0
>  -7  9.89728  host server3
>   8ssd   3.49309  osd.8 up   1.0  1.0
>   9ssd   1.74660  osd.9 up   1.0  1.0
>  10ssd   2.91100  osd.10up   1.0  1.0
>  11ssd   1.74660  osd.11up   1.0  1.0
> -19 29.69183  row secondary
>  -9  9.89728  host server4
>  12ssd   1.74660  osd.12up   1.0  1.0
>  13ssd   1.74660  osd.13up   1.0  1.0
>  14ssd   3.49309  osd.14up   1.0  1.0
>  15ssd   2.91100  osd.15up   1.0  1.0
> -11  9.89728  host server5
>  16ssd   1.74660  osd.16up   1.0  1.0
>  17ssd   1.74660  osd.17up   1.0  1.0
>  18ssd   3.49309  osd.18up   1.0  1.0
>  19ssd   2.91100  osd.19up   1.0  1.0
> -13  9.89728  host server6
>  20ssd   1.74660  osd.20up   1.0  1.0
>  21ssd   1.74660  osd.21up   1.0  1.0
>  22ssd   2.91100  osd.22up   1.0  1.0
>
> and I want to create a some rules, first I like to have
>
>   a rule «replica» (over host) inside the «row» primary
>   a rule «erasure» (over host)  inside the «row» primary
>
> but also two crush rule between primary/secondary, meaning I like to have a
> replica (with only 1 copy of course) of pool from “row” primary to
> secondary.
>
> How can I achieve that ?
>
> Regards
>
>
>
> --
> Albert SHIH 🦫 🐸
> mer. 08 nov. 2023 18:37:54 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Crush map & rule

2023-11-09 Thread David C.
(I wrote it freehand, test before applying)
If your goal is to have a replication of 3 on a row and to be able to
switch to the secondary row, then you need 2 rules and you change the crush
rule on the pool side :

rule primary_location {
(...)
   step take primary class ssd
   step chooseleaf firstn 0 type host
   step emit
}

rule secondary_loc {
(...)
  step take secondary ...

If the aim is to make a replica 2 on 2 rows (not recommended) :

rule row_repli {
(...)
  step take default class ssd
  step chooseleaf firstn 0 type row
  step emit
}

If the aim is to distribute replications over the 2 rows (for example 2*2
or 2*3 replica) :

type replicated
step take primary  class ssd
step chooseleaf firstn 2 type host
step emit
step take secondary  class ssd
step chooseleaf firstn 2 type host
step emit

as far as erasure code is concerned, I really don't see what's reasonably
possible on this architecture.
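Whichever variant you pick, compile and dry-run it before injecting - the
rule id and pool name below are assumptions, check them against your
decompiled map:

crushtool -c crushmap.txt -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 3 --show-mappings | head
ceph osd setcrushmap -i crushmap.bin
ceph osd pool set <pool> crush_rule primary_location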


Cordialement,

*David CASIER*






Le jeu. 9 nov. 2023 à 08:48, Albert Shih  a écrit :

> Le 08/11/2023 à 19:29:19+0100, David C. a écrit
> Hi David.
>
> >
> > What would be the number of replicas (in total and on each row) and their
> > distribution on the tree ?
>
> Well “inside” a row that would be 3 in replica mode.
>
> Between row...well two ;-)
>
> Beside to understanding how to write a rule a little more complex than the
> example in the official documentation, they are another purpose and it's
> to try to have
> a protocole for changing the hardware.
>
> For example if «row primary» are only with old bare metal server, and I
> have some new server I put inside the ceph and want to copy everything
> from the “row primary” to “row secondary”.
>
> Regards
>
> >
> >
> > Le mer. 8 nov. 2023 à 18:45, Albert Shih  a
> écrit :
> >
> > Hi everyone,
> >
> > I'm totally newbie with ceph, so sorry if I'm asking some stupid
> question.
> >
> > I'm trying to understand how the crush map & rule work, my goal is
> to have
> > two groups of 3 servers, so I'm using “row” bucket
> >
> > ID   CLASS  WEIGHTTYPE NAME STATUS  REWEIGHT
> PRI-AFF
> >  -1 59.38367  root default
> > -15 59.38367  zone City
> > -17 29.69183  row primary
> >  -3  9.89728  host server1
> >   0ssd   3.49309  osd.0 up   1.0
> 1.0
> >   1ssd   1.74660  osd.1 up   1.0
> 1.0
> >   2ssd   1.74660  osd.2 up   1.0
> 1.0
> >   3ssd   2.91100  osd.3 up   1.0
> 1.0
> >  -5  9.89728  host server2
> >   4ssd   1.74660  osd.4 up   1.0
> 1.0
> >   5ssd   1.74660  osd.5 up   1.0
> 1.0
> >   6ssd   2.91100  osd.6 up   1.0
> 1.0
> >   7ssd   3.49309  osd.7 up   1.0
> 1.0
> >  -7  9.89728  host server3
> >   8ssd   3.49309  osd.8 up   1.0
> 1.0
> >   9ssd   1.74660  osd.9 up   1.0
> 1.0
> >  10ssd   2.91100  osd.10up   1.0
> 1.0
> >  11ssd   1.74660  osd.11up   1.0
> 1.0
> > -19 29.69183  row secondary
> >  -9  9.89728  host server4
> >  12ssd   1.74660  osd.12up   1.0
> 1.0
> >  13ssd   1.74660  osd.13up   1.0
> 1.0
> >  14ssd   3.49309  osd.14up   1.0
> 1.0
> >  15ssd   2.91100  osd.15up   1.0
> 1.0
> > -11  9.89728  host server5
> >  16ssd   1.74660  osd.16up   1.0
> 1.0
> >  17ssd   1.74660  osd.17up   1.0
> 1.0
> >  18ssd   3.49309  osd.18up   1.0
> 1.0
> >  19ssd   2.91100  osd.19up   1.0
> 1.0
> > -13  9.89728  host server6
> >  20ssd   1.74660  osd.20up   1.0
> 1.0
> >  21ssd   1.74660  osd.21u

[ceph-users] Re: IO stalls when primary OSD device blocks in 17.2.6

2023-11-10 Thread David C.
Hi Daniel,

it's perfectly normal for a PG to freeze when the primary osd is not stable.

It can sometimes happen that the disk fails but doesn't immediately send
back I/O errors (which crash the osd).

 When the OSD is stopped, there's a 5-minute delay before it goes down in
the crushmap.



Le ven. 10 nov. 2023 à 11:43, Daniel Schreiber <
daniel.schrei...@hrz.tu-chemnitz.de> a écrit :

> Dear cephers,
>
> we are sometimes observing stalling IO on our ceph 17.2.6 cluster when
> the backing device for the primary OSD of a PG fails and seems to block
> read IO to objects from that pg. If I set the OSD with the broken device
> to down, the IO continues. Setting the OSD to down is not sufficient.
>
> The cluster is running on Debian 11, the pool is an erasure coded cephfs
> data pool. The OSD has a HDD data device and an SSD db device. The data
> devices is the one which failed and was blocking IO.
>
> The OSD was reporting slow ops and short time after that smartd notified
> about unreadable sectors.
>
> Has anyone seen such behaviour? Are there some tweaks that I missed?
>
> Kind regards,
>
> Daniel
> --
> Daniel Schreiber
> Facharbeitsgruppe Systemsoftware
> Universitaetsrechenzentrum
>
> Technische Universität Chemnitz
> Straße der Nationen 62 (Raum B303)
> 09111 Chemnitz
> Germany
>
> Tel: +49 371 531 35444
> Fax: +49 371 531 835444
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem while upgrade 17.2.6 to 17.2.7

2023-11-14 Thread David C.
Hi Jean Marc,

maybe look at this parameter, "rgw_enable_apis", and check whether the
values you have correspond to the default (changing it needs an RGW
restart):

https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_enable_apis
ceph config get client.rgw rgw_enable_apis
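If "admin" is missing from that list, the dashboard cannot reach the RGW
admin API. Something along these lines should put it back - the exact
default list can differ per release (check the doc above) and the service
name is a placeholder:

ceph config set client.rgw rgw_enable_apis "s3, s3website, swift, swift_auth, admin, sts, iam, notifications"
ceph orch restart rgw.<service-name>   # the RGW daemons need a restart to pick it up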



Cordialement,

*David CASIER*





Le mar. 14 nov. 2023 à 11:45, Jean-Marc FONTANA 
a écrit :

> Hello everyone,
>
> We operate two clusters that we installed with ceph-deploy in Nautilus
> version on Debian 10. We use them for external S3 storage (owncloud) and
> rbd disk images.We had them upgraded to Octopus and Pacific versions on
> Debian 11 and recently converted them to cephadm and upgraded to Quincy
> (17.2.6).
>
> As we now have the orchestrator, we tried updating to 17.2.7 using the
> command# ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.7
>
> Everything went well, both clusters work perfectly for our use, except
> that the Rados gateway configuration is no longer accessible from the
> dashboard with the following error messageError connecting to Object
> Gateway: RGW REST API failed request with status code 404.
>
> We tried a few solutions found on the internet (reset rgw credentials,
> restart rgw adnd mgr, reenable dashboard, ...), unsuccessfully.
>
> Does somebody have an idea ?
>
> Best regards,
>
> Jean-Marc Fontana
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: per-rbd snapshot limitation

2023-11-15 Thread David C.
rbd create testpool/test3 --size=100M
rbd snap limit set testpool/test3 --limit 3
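A quick way to check that the limit is enforced (same pool/image names as
above):

rbd snap create testpool/test3@s1
rbd snap create testpool/test3@s2
rbd snap create testpool/test3@s3
rbd snap create testpool/test3@s4    # should now be refused (quota exceeded)
rbd snap limit clear testpool/test3  # removes the limit again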


Le mer. 15 nov. 2023 à 17:58, Wesley Dillingham  a
écrit :

> looking into how to limit snapshots at the ceph level for RBD snapshots.
> Ideally ceph would enforce an arbitrary number of snapshots allowable per
> rbd.
>
> Reading the man page for rbd command I see this option:
> https://docs.ceph.com/en/quincy/man/8/rbd/#cmdoption-rbd-limit
>
> --limit
>
> Specifies the limit for the number of snapshots permitted.
>
> Seems perfect. But on attempting to use it as such I get an error:
>
> admin@rbdtest:~$ rbd create testpool/test3 --size=100M --limit=3
> rbd: unrecognised option '--limit=3'
>
> Where am I going wrong here? Is there another way to enforce a limit of
> snapshots for RBD? Thanks.
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: per-rbd snapshot limitation

2023-11-15 Thread David C.
I don't think this parameter exists (today)

Le mer. 15 nov. 2023 à 19:25, Wesley Dillingham  a
écrit :

> Are you aware of any config item that can be set (perhaps in the ceph.conf
> or config db) so the limit is enforced immediately at creation time without
> needing to set it for each rbd?
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>
>
> On Wed, Nov 15, 2023 at 1:14 PM David C.  wrote:
>
>> rbd create testpool/test3 --size=100M
>> rbd snap limit set testpool/test3 --limit 3
>>
>>
>> Le mer. 15 nov. 2023 à 17:58, Wesley Dillingham 
>> a écrit :
>>
>>> looking into how to limit snapshots at the ceph level for RBD snapshots.
>>> Ideally ceph would enforce an arbitrary number of snapshots allowable per
>>> rbd.
>>>
>>> Reading the man page for rbd command I see this option:
>>> https://docs.ceph.com/en/quincy/man/8/rbd/#cmdoption-rbd-limit
>>>
>>> --limit
>>>
>>> Specifies the limit for the number of snapshots permitted.
>>>
>>> Seems perfect. But on attempting to use it as such I get an error:
>>>
>>> admin@rbdtest:~$ rbd create testpool/test3 --size=100M --limit=3
>>> rbd: unrecognised option '--limit=3'
>>>
>>> Where am I going wrong here? Is there another way to enforce a limit of
>>> snapshots for RBD? Thanks.
>>>
>>> Respectfully,
>>>
>>> *Wes Dillingham*
>>> w...@wesdillingham.com
>>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to use hardware

2023-11-17 Thread David C.
Hi Albert,

5 MONs instead of 3 will allow you to limit the impact if you break a MON
(for example, when its file system fills up).

5 MDSs instead of 3 makes sense if the workload can be distributed over
several trees in your file system. Sometimes it can also make sense to
have several FSs in order to limit the consequences of an infrastructure
with several active MDSs.

Concerning performance, if you see a node that is too busy which impacts
the cluster, you can always think about relocating certain services.
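If you deploy with cephadm, moving daemons later is mostly a matter of
changing the placement, e.g. (hostnames and labels are placeholders):

ceph orch host label add srv4 mds
ceph orch apply mds cephfs --placement="label:mds"
ceph orch apply mon --placement="srv1,srv2,srv3"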



Le ven. 17 nov. 2023 à 11:00, Albert Shih  a écrit :

> Hi everyone,
>
> In the purpose to deploy a medium size of ceph cluster (300 OSD) we have 6
> bare-metal server for the OSD, and 5 bare-metal server for the service
> (MDS, Mon, etc.)
>
> Those 5 bare-metal server have each 48 cores and 256 Gb.
>
> What would be the smartest way to use those 5 server, I see two way :
>
>   first :
>
> Server 1 : MDS,MON, grafana, prometheus, webui
> Server 2:  MON
> Server 3:  MON
> Server 4 : MDS
> Server 5 : MDS
>
>   so 3 MDS, 3 MON. and we can loose 2 servers.
>
>   Second
>
> KVM on each server
>   Server 1 : 3 VM : One for grafana & CIe, and 1 MDS, 2 MON
>   other server : 1 MDS, 1 MON
>
>   in total :  5 MDS, 5 MON and we can loose 4 servers.
>
> So on paper it's seem the second are smarter, but it's also more complex,
> so my question are «is it worth the complexity to have 5 MDS/MON for 300
> OSD».
>
> Important : The main goal of this ceph cluster are not to get the maximum
> I/O speed, I would not say the speed is not a factor, but it's not the main
> point.
>
> Regards.
>
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> ven. 17 nov. 2023 10:49:27 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Problem while upgrade 17.2.6 to 17.2.7

2023-11-17 Thread David C.
Hi,

Don't you have a traceback below that?

You probably have a communication problem (SSL?) between the dashboard
and the RGW.

Maybe check the settings: ceph dashboard get-rgw-api-*
=>
https://docs.ceph.com/en/quincy/mgr/dashboard/#enabling-the-object-gateway-management-frontend
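For example (these are the knobs I'd look at first - self-signed
certificates are a common culprit):

ceph dashboard get-rgw-api-ssl-verify
ceph dashboard set-rgw-api-ssl-verify False   # if the RGW endpoint uses a self-signed cert
ceph dashboard set-rgw-credentials            # re-create the dashboard's RGW access keys
ceph mgr module disable dashboard && ceph mgr module enable dashboard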




Le ven. 17 nov. 2023 à 11:22, Jean-Marc FONTANA 
a écrit :

> Hello, everyone,
>
> There's nothing cephadm.log in /var/log/ceph.
>
> To get something else, we tried what David C. proposed (thanks to him !!)
> and found:
>
> nov. 17 10:53:54 svtcephmonv3 ceph-mgr[727]: [balancer ERROR root] execute
> error: r = -1, detail = min_compat_client jewel < luminous, which is
> required for pg-upmap. Try 'ceph osd set-require-min-compat-client
> luminous' before using the new interface
> nov. 17 10:54:54 svtcephmonv3 ceph-mgr[727]: [balancer ERROR root] execute
> error: r = -1, detail = min_compat_client jewel < luminous, which is
> required for pg-upmap. Try 'ceph osd set-require-min-compat-client
> luminous' before using the new interface
> nov. 17 10:55:56 svtcephmonv3 ceph-mgr[727]: [dashboard ERROR exception]
> Internal Server Error
> nov. 17 10:55:56 svtcephmonv3 ceph-mgr[727]: [dashboard ERROR request]
> [:::192.168.114.32:53414] [GET] [500] [0.026s] [testadmin] [513.0B]
> /api/rgw/daemon
> nov. 17 10:55:56 svtcephmonv3 ceph-mgr[727]: [dashboard ERROR request]
> [b'{"status": "500 Internal Server Error", "detail": "The server
> encountered an unexpected condition which prevented it from fulfilling the
> request.", "request_id":
> "961b2a25-5c14-4c67-a82a-431f08684f80"}
> ']
> nov. 17 10:55:56 svtcephmonv3 ceph-mgr[727]: [dashboard ERROR exception]
> Internal Server Error
> nov. 17 10:55:56 svtcephmonv3 ceph-mgr[727]: [dashboard ERROR request]
> [:::192.168.114.32:53409] [GET] [500] [0.012s] [testadmin] [513.0B]
> /api/rgw/daemon
> nov. 17 10:55:56 svtcephmonv3 ceph-mgr[727]: [dashboard ERROR request]
> [b'{"status": "500 Internal Server Error", "detail": "The server
> encountered an unexpected condition which prevented it from fulfilling the
> request.", "request_id": "baf41a81-1e6b-4422-97a7-bd96b832dc5a"}
>
> The error about min_compat_client has been fixed with the suggested
> command ( that is a nice result :) ),
> but the web interface still keeps on going on error.
>
> Thanks for your helping,
>
> JM
> Le 17/11/2023 à 07:33, Nizamudeen A a écrit :
>
> Hi,
>
> I think it should be in /var/log/ceph/ceph-mgr..log, probably you
> can reproduce this error again and hopefully
> you'll be able to see a python traceback or something related to rgw in the
> mgr logs.
>
> Regards
>
> On Thu, Nov 16, 2023 at 7:43 PM Jean-Marc FONTANA  
> 
> wrote:
>
>
> Hello,
>
> These are the last lines of /var/log/ceph/cephadm.log of the active mgr
> machine after an error occured.
> As I don't feel this will be very helpfull, would you please tell us where
> to look ?
>
> Best regards,
>
> JM Fontana
>
> 2023-11-16 14:45:08,200 7f341eae8740 DEBUG
> 
> cephadm ['--timeout', '895', 'gather-facts']
> 2023-11-16 14:46:10,406 7fca81386740 DEBUG
> 
> cephadm ['--timeout', '895', 'gather-facts']
> 2023-11-16 14:47:12,594 7fd48f814740 DEBUG
> 
> cephadm ['--timeout', '895', 'gather-facts']
> 2023-11-16 14:48:14,857 7fd0b24b1740 DEBUG
> 
> cephadm ['--timeout', '895', 'check-host']
> 2023-11-16 14:48:14,990 7fd0b24b1740 INFO podman (/usr/bin/podman) version
> 3.0.1 is present
> 2023-11-16 14:48:14,992 7fd0b24b1740 INFO systemctl is present
> 2023-11-16 14:48:14,993 7fd0b24b1740 INFO lvcreate is present
> 2023-11-16 14:48:15,041 7fd0b24b1740 INFO Unit chrony.service is enabled
> and running
> 2023-11-16 14:48:15,043 7fd0b24b1740 INFO Host looks OK
> 2023-11-16 14:48:15,655 7f36b81fd740 DEBUG
> 
> cephadm ['--image', 
> 'quay.io/ceph/ceph@sha256:56984a149e89ce282e9400ca53371ff7df74b1c7f5e979b6ec651b751931483a',
> '--timeout', '895', 'ls']
> 2023-11-16 14:48:17,662 7f17bfc28740 DEBUG
> 

[ceph-users] Re: cephadm user on cephadm rpm package

2023-11-17 Thread David C.
Hi,

You can use the cephadm account (instead of root) to control machines with
the orchestrator.
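Concretely, something like the following - a sketch; the key still has to
be authorized on every host and the user needs passwordless sudo:

ceph cephadm set-user cephadm                 # tell the orchestrator which SSH user to use
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub cephadm@<host>   # repeat for every host in the cluster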


Le ven. 17 nov. 2023 à 13:30, Luis Domingues  a
écrit :

> Hi,
>
> I noticed when installing the cephadm rpm package, to bootstrap a cluster
> for example, that a user cephadm was created. But I do not see it used
> anywhere.
>
> What is the purpose of creating a user on the machine we install the local
> binary of cephadm?
>
> Luis Domingues
> Proton AG
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm user on cephadm rpm package

2023-11-17 Thread David C.
If you provision the cephadm binary (a Python script) and your users
yourself, you should be able to do without the cephadm rpm.
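
For example, something along these lines should work (the download URL is an
assumption, adjust it to the release you actually run):

  curl --silent --remote-name --location \
    https://download.ceph.com/rpm-18.2.0/el9/noarch/cephadm
  chmod +x cephadm
  ./cephadm version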



Le ven. 17 nov. 2023 à 14:04, Luis Domingues  a
écrit :

> So I guess I need to install the cephadm rpm packages on all my machines
> then?
>
> I like the idea of not having a root user, and in fact we do it on our
> clusters. But as we need to push ssh keys to the user config, we manage
> users outside of Ceph, during OS provisioning.
> So it looks a little bit redundant to have the cephadm package create that
> user, when we need to figure out how to enable cephadm's access to the
> machines.
>
> Anyway, thanks for your reply.
>
> Luis Domingues
> Proton AG
>
>
> On Friday, 17 November 2023 at 13:55, David C. 
> wrote:
>
>
> > Hi,
> >
> > You can use the cephadm account (instead of root) to control machines
> with
> > the orchestrator.
> >
> >
> > Le ven. 17 nov. 2023 à 13:30, Luis Domingues luis.doming...@proton.ch a
> >
> > écrit :
> >
> > > Hi,
> > >
> > > I noticed when installing the cephadm rpm package, to bootstrap a
> cluster
> > > for example, that a user cephadm was created. But I do not see it used
> > > anywhere.
> > >
> > > What is the purpose of creating a user on the machine we install the
> local
> > > binary of cephadm?
> > >
> > > Luis Domingues
> > > Proton AG
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to use hardware

2023-11-18 Thread David C.
Hello Albert,

5 vs 3 MON => you won't notice any difference
5 vs 3 MGR => by default, only 1 will be active


Le sam. 18 nov. 2023 à 09:28, Albert Shih  a écrit :

> Le 17/11/2023 à 11:23:49+0100, David C. a écrit
>
> Hi,
>
> >
> > 5 instead of 3 mon will allow you to limit the impact if you break a mon
> (for
> > example, with the file system full)
> >
> > 5 instead of 3 MDS, this makes sense if the workload can be distributed
> over
> > several trees in your file system. Sometimes it can also make sense to
> have
> > several FSs in order to limit the consequences of an infrastructure with
> > several active MDSs.
>
> So there is no disadvantage to having 5 instead of 3?
>
> > Concerning performance, if you see a node that is too busy which impacts
> the
> > cluster, you can always think about relocating certain services.
>
> Ok, thanks for the answer.
>
> Regards.
>
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> sam. 18 nov. 2023 09:26:56 CET
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Issue with CephFS (mds stuck in clientreplay status) since upgrade to 18.2.0.

2023-11-27 Thread David C.
Hi Giuseppe,

Do you perhaps have clients that heavily load the MDS with concurrent access
on the same trees?

Perhaps, also, look at the stability of all your clients (even if there are
many) [dmesg -T, ...]

How are your 4 active MDS configured (pinning)?

Probably unrelated, but is it normal for 2 MDS to be on the same host
"monitor-02"?



Cordialement,

*David CASIER*





Le lun. 27 nov. 2023 à 10:09, Lo Re Giuseppe  a
écrit :

> Hi,
> We have upgraded one ceph cluster from 17.2.7 to 18.2.0. Since then we are
> having CephFS issues.
> For example this morning:
> “””
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
> id: 63334166-d991-11eb-99de-40a6b72108d0
> health: HEALTH_WARN
> 1 filesystem is degraded
> 3 clients failing to advance oldest client/flush tid
> 3 MDSs report slow requests
> 6 pgs not scrubbed in time
> 29 daemons have recently crashed
> …
> “””
>
> The ceph orch, ceph crash and ceph fs status commands were hanging.
>
> After a “ceph mgr fail” those commands started to respond.
> Then I have noticed that there was one mds with most of the slow
> operations,
>
> “””
> [WRN] MDS_SLOW_REQUEST: 3 MDSs report slow requests
> mds.cephfs.naret-monitor01.nuakzo(mds.0): 18 slow requests are blocked
> > 30 secs
> mds.cephfs.naret-monitor01.uvevbf(mds.1): 1683 slow requests are
> blocked > 30 secs
> mds.cephfs.naret-monitor02.exceuo(mds.2): 1 slow requests are blocked
> > 30 secs
> “””
>
> Then I tried to restart it with
>
> “””
> [root@naret-monitor01 ~]# ceph orch daemon restart
> mds.cephfs.naret-monitor01.uvevbf
> Scheduled to restart mds.cephfs.naret-monitor01.uvevbf on host
> 'naret-monitor01'
> “””
>
> After the cephfs entered into this situation:
> “””
> [root@naret-monitor01 ~]# ceph fs status
> cephfs - 198 clients
> ==
> RANK STATE   MDS  ACTIVITY DNS
> INOS   DIRS   CAPS
> 0   active cephfs.naret-monitor01.nuakzo  Reqs:0 /s  17.2k
> 16.2k  1892   14.3k
> 1   active cephfs.naret-monitor02.ztdghf  Reqs:0 /s  28.1k
> 10.3k   752   6881
> 2clientreplay  cephfs.naret-monitor02.exceuo 63.0k
> 6491541 66
> 3   active cephfs.naret-monitor03.lqppte  Reqs:0 /s  16.7k
> 13.4k  8233990
>   POOL  TYPE USED  AVAIL
>cephfs.cephfs.meta metadata  5888M  18.5T
>cephfs.cephfs.data   data 119G   215T
> cephfs.cephfs.data.e_4_2data2289G  3241T
> cephfs.cephfs.data.e_8_3data9997G   470T
>  STANDBY MDS
> cephfs.naret-monitor03.eflouf
> cephfs.naret-monitor01.uvevbf
> MDS version: ceph version 18.2.0
> (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable)
> “””
>
> The file system is totally unresponsive (we can mount it on client nodes
> but any operations like a simple ls hangs).
>
> During the night we had a lot of mds crashes, I can share the content.
>
> Does anybody have an idea on how to tackle this problem?
>
> Best,
>
> Giuseppe
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to identify the index pool real usage?

2023-12-01 Thread David C.
Hi,

It looks like a trim/discard problem.

I would try my luck by activating the discard on a disk, to validate.

I have no feedback on the reliability of the bdev_*_discard parameters.
Maybe dig a little deeper into the subject, or see if anyone has feedback...
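
For the record, a sketch of what a test on a single OSD could look like (the
OSD id is just an example from your listing; bdev_enable_discard is assumed to
need an OSD restart to take effect):

  ceph config set osd.195 bdev_enable_discard true
  ceph config set osd.195 bdev_async_discard true
  ceph orch daemon restart osd.195
  # check the running values
  ceph config show osd.195 | grep discard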



Cordialement,

*David CASIER*





Le ven. 1 déc. 2023 à 16:15, Szabo, Istvan (Agoda) 
a écrit :

> Hi,
>
> Today we had a big issue with slow ops on the nvme drives which holding
> the index pool.
>
> Why does the nvme show full if in ceph it is barely utilized? Which one
> should I believe?
>
> When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme
> drive has 4x osds on it):
>
> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
>   AVAIL%USE   VAR   PGS  STATUS
> 195   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   656
> MiB  400 GiB  10.47  0.21   64  up
> 252   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   845
> MiB  401 GiB  10.35  0.21   64  up
> 253   nvme  0.43660   1.0  447 GiB   46 GiB  229 MiB   45 GiB   662
> MiB  401 GiB  10.26  0.21   66  up
> 254   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.3
> GiB  401 GiB  10.26  0.21   65  up
> 255   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   1.2
> GiB  400 GiB  10.58  0.21   64  up
> 288   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.2
> GiB  401 GiB  10.25  0.21   64  up
> 289   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   641
> MiB  401 GiB  10.33  0.21   64  up
> 290   nvme  0.43660   1.0  447 GiB   45 GiB  229 MiB   44 GiB   668
> MiB  402 GiB  10.14  0.21   65  up
>
> However in nvme list it says full:
> Node SN   Model
> Namespace Usage  Format   FW Rev
>  
>  -
> --  
> /dev/nvme0n1 90D0A00XTXTR KCD6XLUL1T92
>  1   1.92  TB /   1.92  TB512   B +  0 B   GPK6
> /dev/nvme1n1 60P0A003TXTR KCD6XLUL1T92
>  1   1.92  TB /   1.92  TB512   B +  0 B   GPK6
>
> With some other node the test was like:
>
>   *   if none of the disk full, no slow ops.
>   *   If 1x disk full and the other not, has slow ops but not too much
>   *   if none of the disk full, no slow ops.
>
> The full disks are very highly utilized during recovery and they are
> holding back the operations from the other nvmes.
>
> What's the reason that even if the pgs are the same in the cluster +/-1
> regarding space they are not equally utilized.
>
> Thank you
>
>
>
> 
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by copyright
> or other legal rules. If you have received it by mistake please let us know
> by reply email and delete it from your system. It is prohibited to copy
> this message or disclose its content to anyone. Any confidentiality or
> privilege is not waived or lost by any mistaken delivery or unauthorized
> disclosure of the message. All messages sent to and from Agoda may be
> monitored to ensure compliance with company policies, to protect the
> company's interests and to remove potential malware. Electronic messages
> may be intercepted, amended, lost or deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to identify the index pool real usage?

2023-12-04 Thread David C.
Hi,

A flash system needs free space to work efficiently.

Hence my hypothesis that fully allocated disks need to be notified of free
blocks (trim)


Cordialement,

*David CASIER*





Le lun. 4 déc. 2023 à 06:01, Szabo, Istvan (Agoda) 
a écrit :

> With the nodes that has some free space on that namespace, we don't have
> issue, only with this which is weird.
> --
> *From:* Anthony D'Atri 
> *Sent:* Friday, December 1, 2023 10:53 PM
> *To:* David C. 
> *Cc:* Szabo, Istvan (Agoda) ; Ceph Users <
> ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] How to identify the index pool real usage?
>
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> 
>
> >>
> >> Today we had a big issue with slow ops on the nvme drives which holding
> >> the index pool.
> >>
> >> Why the nvme shows full if on ceph is barely utilized? Which one I
> should
> >> belive?
> >>
> >> When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme
> >> drive has 4x osds on it):
>
> Why split each device into 4 very small OSDs?  You're losing a lot of
> capacity to overhead.
>
> >>
> >> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP
> META  AVAIL%USE   VAR   PGS  STATUS
> >> 195   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   656
> MiB  400 GiB  10.47  0.21   64  up
> >> 252   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   845
> MiB  401 GiB  10.35  0.21   64  up
> >> 253   nvme  0.43660   1.0  447 GiB   46 GiB  229 MiB   45 GiB   662
> MiB  401 GiB  10.26  0.21   66  up
> >> 254   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.3
> GiB  401 GiB  10.26  0.21   65  up
> >> 255   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   1.2
> GiB  400 GiB  10.58  0.21   64  up
> >> 288   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.2
> GiB  401 GiB  10.25  0.21   64  up
> >> 289   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   641
> MiB  401 GiB  10.33  0.21   64  up
> >> 290   nvme  0.43660   1.0  447 GiB   45 GiB  229 MiB   44 GiB   668
> MiB  402 GiB  10.14  0.21   65  up
> >>
> >> However in nvme list it says full:
> >> Node SN   ModelNamespace
> Usage  Format   FW Rev
> >>  
> --- -
> >> --  
>
> >> /dev/nvme0n1 90D0A00XTXTR KCD6XLUL1T92 1
> 1.92  TB /   1.92  TB512   B +  0 B   GPK6
> >> /dev/nvme1n1 60P0A003TXTR KCD6XLUL1T92 1
> 1.92  TB /   1.92  TB512   B +  0 B   GPK6
>
> That command isn't telling you what you think it is.  It has no awareness
> of actual data, it's looking at NVMe namespaces.
>
> >>
> >> With some other node the test was like:
> >>
> >>  *   if none of the disk full, no slow ops.
> >>  *   If 1x disk full and the other not, has slow ops but not too much
> >>  *   if none of the disk full, no slow ops.
> >>
> >> The full disks are very highly utilized during recovery and they are
> >> holding back the operations from the other nvmes.
> >>
> >> What's the reason that even if the pgs are the same in the cluster +/-1
> >> regarding space they are not equally utilized.
> >>
> >> Thank you
> >>
> >>
> >>
> >> 
> >> This message is confidential and is for the sole use of the intended
> >> recipient(s). It may also be privileged or otherwise protected by
> copyright
> >> or other legal rules. If you have received it by mistake please let us
> know
> >> by reply email and delete it from your system. It is prohibited to copy
> >> this message or disclose its content to anyone. Any confidentiality or
> >> privilege is not waived or lost by any mistaken delivery or unauthorized
> >> disclosure of the message. All messages sent to and from Agoda may be
> >> monitored to ensure compliance with company policies, to protect the
> >> company's interests and to remove potential malware. Electronic messages
> >> may be intercepted, amended, lost or deleted, or contai

[ceph-users] Re: How to identify the index pool real usage?

2023-12-04 Thread David C.
Yes that's right.
Test them on a single OSD, to validate.
Does your platform write a lot and everywhere?
From what I just saw, it seems to me that the discard only applies to
transactions (and not the entire disk).
If you can report back the results, that would be great.
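
Before that, it may also be worth confirming the drive actually advertises
discard support:

  lsblk --discard /dev/nvme0n1
  # non-zero DISC-GRAN / DISC-MAX values mean the device accepts discards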



Cordialement,

*David CASIER*




Le lun. 4 déc. 2023 à 10:14, Szabo, Istvan (Agoda) 
a écrit :

> Shouldn't these values be true to be able to do trimming?
>
> "bdev_async_discard": "false",
> "bdev_enable_discard": "false",
>
>
>
> Istvan Szabo
> Staff Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> -------
>
>
> --
> *From:* David C. 
> *Sent:* Monday, December 4, 2023 3:44 PM
> *To:* Szabo, Istvan (Agoda) 
> *Cc:* Anthony D'Atri ; Ceph Users <
> ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] How to identify the index pool real usage?
>
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> --
> Hi,
>
> A flash system needs free space to work efficiently.
>
> Hence my hypothesis that fully allocated disks need to be notified of free
> blocks (trim)
> 
>
> Cordialement,
>
> *David CASIER*
> 
>
>
>
>
> Le lun. 4 déc. 2023 à 06:01, Szabo, Istvan (Agoda) 
> a écrit :
>
> With the nodes that has some free space on that namespace, we don't have
> issue, only with this which is weird.
> --
> *From:* Anthony D'Atri 
> *Sent:* Friday, December 1, 2023 10:53 PM
> *To:* David C. 
> *Cc:* Szabo, Istvan (Agoda) ; Ceph Users <
> ceph-users@ceph.io>
> *Subject:* Re: [ceph-users] How to identify the index pool real usage?
>
> Email received from the internet. If in doubt, don't click any link nor
> open any attachment !
> 
>
> >>
> >> Today we had a big issue with slow ops on the nvme drives which holding
> >> the index pool.
> >>
> >> Why the nvme shows full if on ceph is barely utilized? Which one I
> should
> >> belive?
> >>
> >> When I check the ceph osd df it shows 10% usage of the osds (1x 2TB nvme
> >> drive has 4x osds on it):
>
> Why split each device into 4 very small OSDs?  You're losing a lot of
> capacity to overhead.
>
> >>
> >> ID   CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP
> META  AVAIL%USE   VAR   PGS  STATUS
> >> 195   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   656
> MiB  400 GiB  10.47  0.21   64  up
> >> 252   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   845
> MiB  401 GiB  10.35  0.21   64  up
> >> 253   nvme  0.43660   1.0  447 GiB   46 GiB  229 MiB   45 GiB   662
> MiB  401 GiB  10.26  0.21   66  up
> >> 254   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.3
> GiB  401 GiB  10.26  0.21   65  up
> >> 255   nvme  0.43660   1.0  447 GiB   47 GiB  161 MiB   46 GiB   1.2
> GiB  400 GiB  10.58  0.21   64  up
> >> 288   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   44 GiB   1.2
> GiB  401 GiB  10.25  0.21   64  up
> >> 289   nvme  0.43660   1.0  447 GiB   46 GiB  161 MiB   45 GiB   641
> MiB  401 GiB  10.33  0.21   64  up
> >> 290   nvme  0.43660   1.0  447 GiB   45 GiB  229 MiB   44 GiB   668
> MiB  402 GiB  10.14  0.21   65  up
> >>
> >> However in nvme list it says full:
> >> Node SN   ModelNamespace
> Usage  Format   FW Rev
> >>  
> --- -
> >> --  
>
> >> /dev/nvme0n1 90D0A00XTXTR KCD6XLUL1T92 1
> 1.92  TB /   1.92  TB512   B +  0 B   GPK6
> >> /dev/nvme1n1 60P0A003TXTR KCD6XLUL1T92 1
> 1.92  TB /   1.92  TB512   B +  0 B   GPK6
>
> That command isn't telling you what you think it is.  It has no awareness
> of actual data, it's looking at NVMe namespaces.
>
> >>
> >> With some other node the test was like:
> >>
> >>  *   if none of the disk full, no slow ops.
>

[ceph-users] Re: EC Profiles & DR

2023-12-05 Thread David C.
Hi Matthew,

To make a simplistic comparison, it is generally not recommended to use RAID 5
with large disks (>1 TB) due to the probability (low but not zero) of
losing another disk during the rebuild.
So imagine losing a host full of disks.

Additionally, min_size=1 means you can no longer maintain your cluster
(updates, etc.); it's dangerous.

Unless you can afford to lose/rebuild your cluster, you should never have a
min_size < 2.
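
As a quick check on an existing pool (pool and profile names are
placeholders):

  ceph osd pool get <pool> min_size
  ceph osd erasure-code-profile get <profile>   # shows k, m, failure domain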


Cordialement,

*David CASIER*





Le mar. 5 déc. 2023 à 10:03, duluxoz  a écrit :

> Thanks David, I knew I had something wrong  :-)
>
> Just for my own edification: Why is k=2, m=1 not recommended for
> production? Considered too "fragile", or something else?
>
> Cheers
>
> Dulux-Oz
>
> On 05/12/2023 19:53, David Rivera wrote:
> > First problem here is you are using crush-failure-domain=osd when you
> > should use crush-failure-domain=host. With three hosts, you should use
> > k=2, m=1; this is not recommended in  production environment.
> >
> > On Mon, Dec 4, 2023, 23:26 duluxoz  wrote:
> >
> > Hi All,
> >
> > Looking for some help/explanation around erasure code pools, etc.
> >
> > I set up a 3-node Ceph (Quincy) cluster with each box holding 7 OSDs
> > (HDDs) and each box running Monitor, Manager, and iSCSI Gateway.
> > For the
> > record the cluster runs beautifully, without resource issues, etc.
> >
> > I created an Erasure Code Profile, etc:
> >
> > ~~~
> > ceph osd erasure-code-profile set my_ec_profile plugin=jerasure
> > k=4 m=2
> > crush-failure-domain=osd
> > ceph osd crush rule create-erasure my_ec_rule my_ec_profile
> > ceph osd crush rule create-replicated my_replicated_rule default host
> > ~~~
> >
> > My Crush Map is:
> >
> > ~~~
> > # begin crush map
> > tunable choose_local_tries 0
> > tunable choose_local_fallback_tries 0
> > tunable choose_total_tries 50
> > tunable chooseleaf_descend_once 1
> > tunable chooseleaf_vary_r 1
> > tunable chooseleaf_stable 1
> > tunable straw_calc_version 1
> > tunable allowed_bucket_algs 54
> >
> > # devices
> > device 0 osd.0 class hdd
> > device 1 osd.1 class hdd
> > device 2 osd.2 class hdd
> > device 3 osd.3 class hdd
> > device 4 osd.4 class hdd
> > device 5 osd.5 class hdd
> > device 6 osd.6 class hdd
> > device 7 osd.7 class hdd
> > device 8 osd.8 class hdd
> > device 9 osd.9 class hdd
> > device 10 osd.10 class hdd
> > device 11 osd.11 class hdd
> > device 12 osd.12 class hdd
> > device 13 osd.13 class hdd
> > device 14 osd.14 class hdd
> > device 15 osd.15 class hdd
> > device 16 osd.16 class hdd
> > device 17 osd.17 class hdd
> > device 18 osd.18 class hdd
> > device 19 osd.19 class hdd
> > device 20 osd.20 class hdd
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 chassis
> > type 3 rack
> > type 4 row
> > type 5 pdu
> > type 6 pod
> > type 7 room
> > type 8 datacenter
> > type 9 zone
> > type 10 region
> > type 11 root
> >
> > # buckets
> > host ceph_1 {
> >id -3# do not change unnecessarily
> >id -4 class hdd  # do not change unnecessarily
> ># weight 38.09564
> >alg straw2
> >hash 0  # rjenkins1
> >item osd.0 weight 5.34769
> >item osd.1 weight 5.45799
> >item osd.2 weight 5.45799
> >item osd.3 weight 5.45799
> >item osd.4 weight 5.45799
> >item osd.5 weight 5.45799
> >item osd.6 weight 5.45799
> > }
> > host ceph_2 {
> >id -5# do not change unnecessarily
> >id -6 class hdd  # do not change unnecessarily
> ># weight 38.09564
> >alg straw2
> >hash 0  # rjenkins1
> >item osd.7 weight 5.34769
> >item osd.8 weight 5.45799
> >item osd.9 weight 5.45799
> >item osd.10 weight 5.45799
> >item osd.11 weight 5.45799
> >item osd.12 weight 5.45799
> >item osd.13 weight 5.45799
> > }
> > host ceph_3 {
> >id -7# do not change unnecessarily
> >id -8 class hdd  # do not change unnecessarily
> ># weight 38.09564
> >alg straw2
> >hash 0  # rjenkins1
> >item osd.14 weight 5.34769
> >item osd.15 weight 5.45799
> >item osd.16 weight 5.45799
> >item osd.17 weight 5.45799
> >item osd.18 weight 5.45799
> >item osd.19 weight 5.45799
> >item osd.20 weight 5.45799
> > }
> > root default {
> >id -1# do not change unnecessarily
> >id -2 class hdd  # do not change unnecessarily
> ># weight 114.28693
> >alg straw2
> >hash 0  # rjenkins1
> >item ceph_1 weight 38.09564
> >item cep

[ceph-users] Re: EC Profiles & DR

2023-12-05 Thread David C.
Hi,

To return to my comparison with SANs, on a SAN you have spare disks to
repair a failed disk.

On Ceph, you therefore need at least one more host (k+m+1).

If we take into consideration the formalities/delivery times for a new
server, k+m+2 is not a luxury (depending on the growth of your volume).
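
A purely illustrative sizing example with k=4, m=2 and failure domain = host:

  shards per object                        : k + m     = 6 hosts minimum
  self-healing after losing one host       : k + m + 1 = 7 hosts
  margin while a replacement is delivered  : k + m + 2 = 8 hosts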



Cordialement,

*David CASIER*





Le mar. 5 déc. 2023 à 11:17, Patrick Begou <
patrick.be...@univ-grenoble-alpes.fr> a écrit :

> Hi Robert,
>
> Le 05/12/2023 à 10:05, Robert Sander a écrit :
> > On 12/5/23 10:01, duluxoz wrote:
> >> Thanks David, I knew I had something wrong  :-)
> >>
> >> Just for my own edification: Why is k=2, m=1 not recommended for
> >> production? Considered to "fragile", or something else?
> >
> > It is the same as a replicated pool with size=2. Only one host can go
> > down. After that you risk to lose data.
> >
> > Erasure coding is possible with a cluster size of 10 nodes or more.
> > With smaller clusters you have to go with replicated pools.
> >
> Could you explain why 10 nodes are required for EC ?
>
> On my side, I'm working on building my first (small) Ceph cluster using
> E.C. and I was thinking about 5 nodes and k=4 m=2. With a failure domain
> on host and several osd by nodes, in my mind this setup may run degraded
> with 3 nodes using 2 distincts osd by node and the ultimate possibility
> to loose an additional node without loosing data.  Of course with
> sufficient free storage available.
>
> Am I totally wrong in my first ceph approach ?
>
> Patrick
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC Profiles & DR

2023-12-05 Thread David C.
Hi Patrick,

If your hardware is new, you are confident in your hardware support and you
can consider future expansion, you can possibly start with k=3 and m=2.
It is true that we generally prefer k to be a power of two, but k=3 does the
job.

Be careful, it is difficult/painful to change profiles later (need data
migration).
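
For what it's worth, a minimal sketch of such a setup (names are placeholders,
failure domain host assumed):

  ceph osd erasure-code-profile set ec_3_2 k=3 m=2 crush-failure-domain=host
  ceph osd pool create cephfs_data_ec erasure ec_3_2
  ceph osd pool set cephfs_data_ec allow_ec_overwrites true  # needed for CephFS/RBD data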


Cordialement,

*David CASIER*





Le mar. 5 déc. 2023 à 12:35, Patrick Begou <
patrick.be...@univ-grenoble-alpes.fr> a écrit :

> Ok, so I've misunderstood the meaning of failure domain. If there is no
> way to request using 2 osd/node and node as failure domain, with 5 nodes
> k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a
> raid1 setup. A little bit better than replication from the point of view of
> global storage capacity.
>
> Patrick
>
> Le 05/12/2023 à 12:19, David C. a écrit :
>
> Hi,
>
> To return to my comparison with SANs, on a SAN you have spare disks to
> repair a failed disk.
>
> On Ceph, you therefore need at least one more host (k+m+1).
>
> If we take into consideration the formalities/delivery times of a new
> server, k+m+2 is not luxury (Depending on the growth of your volume).
>
> 
>
> Cordialement,
>
> *David CASIER*
>
> 
>
>
>
> Le mar. 5 déc. 2023 à 11:17, Patrick Begou <
> patrick.be...@univ-grenoble-alpes.fr> a écrit :
>
>> Hi Robert,
>>
>> Le 05/12/2023 à 10:05, Robert Sander a écrit :
>> > On 12/5/23 10:01, duluxoz wrote:
>> >> Thanks David, I knew I had something wrong  :-)
>> >>
>> >> Just for my own edification: Why is k=2, m=1 not recommended for
>> >> production? Considered to "fragile", or something else?
>> >
>> > It is the same as a replicated pool with size=2. Only one host can go
>> > down. After that you risk to lose data.
>> >
>> > Erasure coding is possible with a cluster size of 10 nodes or more.
>> > With smaller clusters you have to go with replicated pools.
>> >
>> Could you explain why 10 nodes are required for EC ?
>>
>> On my side, I'm working on building my first (small) Ceph cluster using
>> E.C. and I was thinking about 5 nodes and k=4 m=2. With a failure domain
>> on host and several osd by nodes, in my mind this setup may run degraded
>> with 3 nodes using 2 distincts osd by node and the ultimate possibility
>> to loose an additional node without loosing data.  Of course with
>> sufficient free storage available.
>>
>> Am I totally wrong in my first ceph approach ?
>>
>> Patrick
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Osd full

2023-12-11 Thread David C.
Hi Mohamed,

Changing weights is no longer a good practice.

The balancer is supposed to do the job.

The number of pg per osd is really tight on your infrastructure.

Can you share the output of the "ceph osd tree" command?
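
In the meantime, two read-only checks that help here:

  ceph balancer status
  ceph osd df tree    # fill level and PG count per OSD, grouped by host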


Cordialement,

*David CASIER*




*Ligne directe: +33(0) 9 72 61 98 29*




Le lun. 11 déc. 2023 à 11:06, Mohamed LAMDAOUAR 
a écrit :

> Hello the team,
>
> We initially had a cluster of 3 machines with 4 osd on each machine, we
> added 4 machines in the cluster (each machine with 4 osd)
> We launched the balancing but it never finished, still in progress. But the
> big issue: we have an osd full and all the pools on this osd are read only.
>
> *ceph osd df *:
>
> ID  CLASS  WEIGHT   REWEIGHT  SIZE RAW USE  DATA OMAP META
>  AVAIL%USE   VAR   PGS  STATUS
> 20hdd  9.09569   1.0  9.1 TiB  580 GiB  576 GiB  1.2 GiB  3.1 GiB
> 8.5 TiB   6.23  0.32  169  up
> 21hdd  9.09569   1.0  9.1 TiB  1.5 TiB  1.5 TiB  252 MiB  7.7 GiB
> 7.6 TiB  16.08  0.82  247  up
> 22hdd  9.09569   1.0  9.1 TiB  671 GiB  667 GiB  204 MiB  4.1 GiB
> 8.4 TiB   7.21  0.37  136  up
> 23hdd  9.09569   1.0  9.1 TiB  665 GiB  660 GiB  270 MiB  4.5 GiB
> 8.4 TiB   7.14  0.37  124  up
>  0hdd  9.09569   1.0  9.1 TiB  1.2 TiB  1.2 TiB   87 MiB  6.0 GiB
> 7.9 TiB  13.30  0.68  230  up
>  1hdd  9.09569   1.0  9.1 TiB  1.3 TiB  1.3 TiB  347 MiB  6.6 GiB
> 7.8 TiB  14.01  0.72  153  up
>  2hdd  9.09569   0.65009  9.1 TiB  1.8 TiB  1.8 TiB  443 MiB  7.3 GiB
> 7.3 TiB  20.00  1.03  147  up
>  3hdd  9.09569   1.0  9.1 TiB  617 GiB  611 GiB  220 MiB  5.8 GiB
> 8.5 TiB   6.62  0.34  101  up
>  4hdd  9.09569   0.80005  9.1 TiB  2.0 TiB  2.0 TiB  293 MiB  8.2 GiB
> 7.1 TiB  22.12  1.13  137  up
>  5hdd  9.09569   1.0  9.1 TiB  857 GiB  852 GiB  157 MiB  4.9 GiB
> 8.3 TiB   9.20  0.47  155  up
>  6hdd  9.09569   1.0  9.1 TiB  580 GiB  575 GiB  678 MiB  4.4 GiB
> 8.5 TiB   6.23  0.32  114  up
>  7hdd  9.09569   0.5  9.1 TiB  7.7 TiB  7.7 TiB  103 MiB   16 GiB
> 1.4 TiB  85.03  4.36  201  up
> 24hdd  9.09569   1.0  9.1 TiB  1.2 TiB  1.2 TiB  133 MiB  6.2 GiB
> 7.9 TiB  13.11  0.67  225  up
> 25hdd  9.09569   0.34999  9.1 TiB  8.3 TiB  8.2 TiB  101 MiB   17 GiB
> 860 GiB  90.77  4.66  159  up
> 26hdd  9.09569   1.0  9.1 TiB  665 GiB  661 GiB  292 MiB  3.8 GiB
> 8.4 TiB   7.14  0.37  107  up
> 27hdd  9.09569   1.0  9.1 TiB  427 GiB  423 GiB  241 MiB  3.4 GiB
> 8.7 TiB   4.58  0.24  103  up
>  8hdd  9.09569   1.0  9.1 TiB  845 GiB  839 GiB  831 MiB  5.9 GiB
> 8.3 TiB   9.07  0.47  163  up
>  9hdd  9.09569   1.0  9.1 TiB  727 GiB  722 GiB  162 MiB  4.8 GiB
> 8.4 TiB   7.80  0.40  169  up
> 10hdd  9.09569   0.80005  9.1 TiB  1.9 TiB  1.9 TiB  742 MiB  7.5 GiB
> 7.2 TiB  21.01  1.08  136  up
> 11hdd  9.09569   1.0  9.1 TiB  733 GiB  727 GiB  498 MiB  5.2 GiB
> 8.4 TiB   7.87  0.40  163  up
> 12hdd  9.09569   1.0  9.1 TiB  892 GiB  886 GiB  318 MiB  5.6 GiB
> 8.2 TiB   9.58  0.49  254  up
> 13hdd  9.09569   1.0  9.1 TiB  759 GiB  755 GiB   37 MiB  4.0 GiB
> 8.4 TiB   8.15  0.42  134  up
> 14hdd  9.09569   0.85004  9.1 TiB  2.3 TiB  2.3 TiB  245 MiB  7.7 GiB
> 6.8 TiB  24.96  1.28  142  up
> 15hdd  9.09569   1.0  9.1 TiB  7.3 TiB  7.3 TiB  435 MiB   16 GiB
> 1.8 TiB  80.17  4.11  213  up
> 16hdd  9.09569   1.0  9.1 TiB  784 GiB  781 GiB  104 MiB  3.6 GiB
> 8.3 TiB   8.42  0.43  247  up
> 17hdd  9.09569   1.0  9.1 TiB  861 GiB  856 GiB  269 MiB  5.1 GiB
> 8.3 TiB   9.25  0.47  102  up
> 18hdd  9.09569   1.0  9.1 TiB  1.9 TiB  1.9 TiB  962 MiB  8.2 GiB
> 7.2 TiB  21.15  1.09  283  up
> 19hdd  9.09569   1.0  9.1 TiB  893 GiB  888 GiB  291 MiB  4.6 GiB
> 8.2 TiB   9.59  0.49  148  up
>TOTAL  255 TiB   50 TiB   49 TiB  9.7 GiB  187 GiB
> 205 TiB  19.49
> MIN/MAX VAR: 0.24/4.66  STDDEV: 19.63
>
>
>
>
> *ceph health detail |grep -i wrn*
> [WRN] OSDMAP_FLAGS: nodeep-scrub flag(s) set
> [WRN] OSD_NEARFULL: 2 nearfull osd(s)
> [WRN] PG_BACKFILL_FULL: Low space hindering backfill (add storage if this
> doesn't resolve itself): 16 pgs backfill_toofull
> [WRN] PG_NOT_DEEP_SCRUBBED: 1360 pgs not deep-scrubbed in time
> [WRN] PG_NOT_SCRUBBED: 53 pgs not scrubbed in time
> [WRN] POOL_NEARFULL: 36 pool(s) nearfull
>
>
> Thanks  the team ;)
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: FS down - mds degraded

2023-12-20 Thread David C.
Hi Sake,

I would start by decrementing max_mds by 1:
ceph fs set atlassian-prod max_mds 2

Does mds.1 no longer restart?
Any logs?
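
To pull the logs of that daemon on its host, something like this (the daemon
name and fsid are placeholders, check them first):

  cephadm ls | grep mds                        # exact daemon name
  cephadm logs --name mds.atlassian-prod.<host>.<id>
  # or directly via systemd:
  journalctl -u ceph-<fsid>@mds.atlassian-prod.<host>.<id>.service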





Le jeu. 21 déc. 2023 à 08:11, Sake Ceph  a écrit :

> Starting a new thread, forgot subject in the previous.
> So our FS down. Got the following error, what can I do?
>
> # ceph health detail
> HEALTH_ERR 1 filesystem is degraded; 1 mds daemon damaged
> [WRN] FS_DEGRADED: 1 filesystem is degraded
> fs atlassian/prod is degraded
> [ERR] MDS_DAMAGE: 1 mds daemon damaged
> fs atlassian-prod mds.1 is damaged
>
> # ceph fs get atlassian-prod
> Filesystem 'atlassian-prod' (2)
> fs_name atlassian-prod
> epoch   43440
> flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
> created 2023-05-10T08:45:46.911064+
> modified2023-12-21T06:47:19.291154+
> tableserver 0
> root0
> session_timeout 60
> session_autoclose   300
> max_file_size   1099511627776
> required_client_features{}
> last_failure0
> last_failure_osd_epoch  29480
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 3
> in  0,1,2
> up  {0=1073573,2=1073583}
> failed
> damaged 1
> stopped
> data_pools  [5]
> metadata_pool   4
> inline_data disabled
> balancer
> standby_count_wanted1
> [mds.atlassian-prod.pwsoel13142.egsdfl{0:1073573} state up:resolve seq 573
> join_fscid=2 addr [v2:
> 10.233.127.22:6800/61692284,v1:10.233.127.22:6801/61692284] compat
> {c=[1],r=[1],i=[7ff]}]
> [mds.atlassian-prod.pwsoel13143.qlvypn{2:1073583} state up:resolve seq 571
> join_fscid=2 addr [v2:
> 10.233.127.18:6800/3627858294,v1:10.233.127.18:6801/3627858294] compat
> {c=[1],r=[1],i=[7ff]}]
>
> Best regards,
> Sake
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm discovery service certificate absent after upgrade.

2024-01-23 Thread David C.
Hello Nicolas,

I don't know if it's an upgrade issue.

If this is not a problem for you, you can consider redeploying
grafana/prometheus.

It is also possible to inject your own certificates :

https://docs.ceph.com/en/latest/cephadm/services/monitoring/#example

https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
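
If you go the redeploy route, it would look something like this:

  ceph orch redeploy prometheus
  ceph orch redeploy grafana
  ceph orch ps | grep -E 'prometheus|grafana'   # check they came back up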



Cordialement,

*David CASIER*




Le mar. 23 janv. 2024 à 10:56, Nicolas FOURNIL 
a écrit :

>  Hello,
>
> I've just fresh upgrade from Quincy to Reef and my graphs are now blank...
> after investigations, it seems that discovery service is not working
> because of no certificate :
>
> # ceph orch sd dump cert
> Error EINVAL: No certificate found for service discovery
>
> Maybe an upgrade issue ?
>
> Is there a way to generate or replace the certificate properly ?
>
> Regards
>
> Nicolas F.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm discovery service certificate absent after upgrade.

2024-01-23 Thread David C.
Is the cephadm http server service starting correctly (in the mgr logs)?

IPv6 ?


Cordialement,

*David CASIER*





Le mar. 23 janv. 2024 à 16:29, Nicolas FOURNIL 
a écrit :

> Hello,
>
> Thanks for advice but Prometheus cert is ok, (Self signed) and tested with
> curl and web navigator.
>
>  it seems to be the "Service discovery" certificate from cephadm who is
> missing but I cannot figure out how to set it.
>
> There's in the code a function to create this certificate inside the Key
> store but how ... that's the point :-(
>
> Regards.
>
>
>
> Le mar. 23 janv. 2024 à 15:52, David C.  a écrit :
>
>> Hello Nicolas,
>>
>> I don't know if it's an update issue.
>>
>> If this is not a problem for you, you can consider redeploying
>> grafana/prometheus.
>>
>> It is also possible to inject your own certificates :
>>
>> https://docs.ceph.com/en/latest/cephadm/services/monitoring/#example
>>
>>
>> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
>>
>> 
>>
>> Cordialement,
>>
>> *David CASIER*
>> 
>>
>>
>>
>> Le mar. 23 janv. 2024 à 10:56, Nicolas FOURNIL 
>> a écrit :
>>
>>>  Hello,
>>>
>>> I've just fresh upgrade from Quincy to Reef and my graphs are now
>>> blank...
>>> after investigations, it seems that discovery service is not working
>>> because of no certificate :
>>>
>>> # ceph orch sd dump cert
>>> Error EINVAL: No certificate found for service discovery
>>>
>>> Maybe an upgrade issue ?
>>>
>>> Is there a way to generate or replace the certificate properly ?
>>>
>>> Regards
>>>
>>> Nicolas F.
>>> ___
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephadm discovery service certificate absent after upgrade.

2024-01-23 Thread David C.
According to the source code, the certificates are generated automatically at
startup. Hence my question whether the service started correctly.

I also had problems with IPv6-only setups, but I don't have more info at hand.


Cordialement,

*David CASIER*



Le mar. 23 janv. 2024 à 17:46, Nicolas FOURNIL 
a écrit :

> IPv6 only: Yes, ms_bind_ipv6=true is already set.
>
> I had tried a rotation of the keys for node-exporter and I get this :
>
> 2024-01-23T16:43:56.098796+ mgr.srv06-r2b-fl1.foxykh (mgr.342408)
> 87074 : cephadm [INF] Rotating authentication key for
> node-exporter.srv06-r2b-fl1
> 2024-01-23T16:43:56.099224+ mgr.srv06-r2b-fl1.foxykh (mgr.342408)
> 87075 : cephadm [ERR] unknown daemon type node-exporter
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1039, in _check_daemons
> self.mgr._daemon_action(daemon_spec, action=action)
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 2203, in
> _daemon_action
> return self._rotate_daemon_key(daemon_spec)
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 2147, in
> _rotate_daemon_key
> 'entity': daemon_spec.entity_name(),
>   File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 108,
> in entity_name
> return get_auth_entity(self.daemon_type, self.daemon_id,
> host=self.host)
>   File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line 47,
> in get_auth_entity
> raise OrchestratorError(f"unknown daemon type {daemon_type}")
> orchestrator._interface.OrchestratorError: unknown daemon type
> node-exporter
>
> Tried to remove & recreate service : it's the same ... how to stop the
> rotation now :-/
>
>
>
> Le mar. 23 janv. 2024 à 17:18, David C.  a écrit :
>
>> Is the cephadm http server service starting correctly (in the mgr logs)?
>>
>> IPv6 ?
>> 
>>
>> Cordialement,
>>
>> *David CASIER*
>> 
>>
>>
>>
>>
>> Le mar. 23 janv. 2024 à 16:29, Nicolas FOURNIL 
>> a écrit :
>>
>>> Hello,
>>>
>>> Thanks for advice but Prometheus cert is ok, (Self signed) and tested
>>> with curl and web navigator.
>>>
>>>  it seems to be the "Service discovery" certificate from cephadm who is
>>> missing but I cannot figure out how to set it.
>>>
>>> There's in the code a function to create this certificate inside the Key
>>> store but how ... that's the point :-(
>>>
>>> Regards.
>>>
>>>
>>>
>>> Le mar. 23 janv. 2024 à 15:52, David C.  a
>>> écrit :
>>>
>>>> Hello Nicolas,
>>>>
>>>> I don't know if it's an update issue.
>>>>
>>>> If this is not a problem for you, you can consider redeploying
>>>> grafana/prometheus.
>>>>
>>>> It is also possible to inject your own certificates :
>>>>
>>>> https://docs.ceph.com/en/latest/cephadm/services/monitoring/#example
>>>>
>>>>
>>>> https://github.com/ceph/ceph/blob/main/src/pybind/mgr/cephadm/templates/services/prometheus/prometheus.yml.j2
>>>>
>>>> 
>>>>
>>>> Cordialement,
>>>>
>>>> *David CASIER*
>>>> 
>>>>
>>>>
>>>>
>>>> Le mar. 23 janv. 2024 à 10:56, Nicolas FOURNIL <
>>>> nicolas.four...@gmail.com> a écrit :
>>>>
>>>>>  Hello,
>>>>>
>>>>> I've just fresh upgrade from Quincy to Reef and my graphs are now
>>>>> blank...
>>>>> after investigations, it seems that discovery service is not working
>>>>> because of no certificate :
>>>>>
>>>>> # ceph orch sd dump cert
>>>>> Error EINVAL: No certificate found for service discovery
>>>>>
>>>>> Maybe an upgrade issue ?
>>>>>
>>>>> Is there a way to generate or replace the certificate properly ?
>>>>>
>>>>> Regards
>>>>>
>>>>> Nicolas F.
>>>>> ___
>>>>> ceph-users mailing list -- ceph-users@ceph.io
>>>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>>>
>>>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How many pool for cephfs

2024-01-24 Thread David C.
Hi Albert,

In this scenario, it is more consistent to work with subvolumes.

Regarding security, you can use namespaces to isolate access at the OSD
level.

What Robert emphasizes is that creating pools dynamically is not without
effect on the number of PGs and (therefore) on the architecture (PG per
OSD, balancer, pg autoscaling, etc.)
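
A minimal sketch of that approach (volume/subvolume/client names are
placeholders):

  # one subvolume per former dataset, isolated in its own RADOS namespace
  ceph fs subvolume create cephfs dataset01 --namespace-isolated
  ceph fs subvolume getpath cephfs dataset01

  # a client key restricted to that subtree
  ceph fs authorize cephfs client.dataset01 /volumes/_nogroup/dataset01 rw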


Cordialement,

*David CASIER*




Le mer. 24 janv. 2024 à 10:10, Albert Shih  a écrit :

> Le 24/01/2024 à 09:45:56+0100, Robert Sander a écrit
> Hi
>
> >
> > On 1/24/24 09:40, Albert Shih wrote:
> >
> > > Knowing I got two class of osd (hdd and ssd), and I have a need of ~
> 20/30
> > > cephfs (currently and that number will increase with time).
> >
> > Why do you need 20 - 30 separate CephFS instances?
>
> 99.99% because I'm a newbie with ceph and don't clearly understand how
> the authorization works with cephfs ;-)
>
> If I say 20-30 it's because I currently have on my classic ZFS/NFS server
> around 25 «datasets» exported to various server.
>
> But thanks to your question I understand I can put many exports «inside» one
> cephfs.
>
> > > and put all my cephfs inside two of them. Or should I create for each
> > > cephfs a couple of pool metadata/data ?
> >
> > Each CephFS instance needs their own pools, at least two (data +
> metadata)
> > per instance. And each CephFS needs at least one MDS running, better
> with an
> > additional cold or even hot standby MDS.
>
> Ok. I have two sets of servers for my ceph cluster: the first set is for
> services (mgr, mon, etc.) with SSDs and doesn't currently run any osd (but
> still has 2 unused SSDs); the second set of servers has HDDs and 2 SSDs.
> The data pool will be on the second set (with HDDs). Where should I run the
> MDS, and on which osd?
>
> >
> > > Il will also need to have ceph S3 storage, same question, should I
> have a
> > > designated pool for S3 storage or can/should I use the same
> > > cephfs_data_replicated/erasure pool ?
> >
> > No, S3 needs its own pools. It cannot re-use CephFS pools.
>
> Ok thanks.
>
> Regards
> --
> Albert SHIH 🦫 🐸
> France
> Heure locale/Local time:
> mer. 24 janv. 2024 09:55:26 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Questions about the CRUSH details

2024-01-24 Thread David C.
Hi,

The client calculates the location (PG) of an object from its name and the
crushmap.
This is what makes it possible to parallelize the flows directly from the
client.

The client also has the map of the PGs which are relocated to other OSDs
(upmap, temp, etc.)
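
One way to see that deterministic mapping for a given object name:

  ceph osd map <pool> <objectname>
  # prints the PG id plus the up/acting OSD sets the object maps to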


Cordialement,

*David CASIER*




Le mer. 24 janv. 2024 à 17:49, Henry lol  a
écrit :

> Hello, I'm new to ceph and sorry in advance for the naive questions.
>
> 1.
> As far as I know, CRUSH utilizes the cluster map consisting of the PG
> map and others.
> I don't understand why CRUSH computation is required on client-side,
> even though PG-to-OSDs mapping can be acquired from the PG map.
>
> 2.
> how does the client get a valid(old) OSD set when the PG is being
> remapped to a new ODS set which CRUSH returns?
>
> thanks.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread David C.
Albert,
Never used EC for (root) data pool.

Le jeu. 25 janv. 2024 à 12:08, Albert Shih  a écrit :

> Le 25/01/2024 à 08:42:19+, Eugen Block a écrit
> > Hi,
> >
> > it's really as easy as it sounds (fresh test cluster on 18.2.1 without
> any
> > pools yet):
> >
> > ceph:~ # ceph fs volume create cephfs
>
> Yes...I already try that with the label and works fine.
>
> But I prefer to use «my» pools, because I have ssd/hdd and also want to try
> an «erasure coding» pool for the data.
>

> I also need to set the pg_num and pgp_num (I know I can do that after the
> creation).


> So I managed to do... half of what I want...
>
> In fact
>
>   ceph fs volume create thing
>
> will create two pools
>
>   cephfs.thing.meta
>   cephfs.thing.data
>
> and if those pools already exist it will use them.
>
> But that's only if the data is replicated, not with erasure coding (maybe
> I'm forgetting some config on the pool).
>
> Well I will currently continue my test with replicated data.
>
> > The pools and the daemons are created automatically (you can control the
> > placement of the daemons with the --placement option). Note that the
> > metadata pool needs to be on fast storage, so you might need to change
> the
> > ruleset for the metadata pool after creation in case you have HDDs in
> place.
> > Changing pools after the creation can be done via ceph fs commands:
> >
> > ceph:~ # ceph osd pool create cephfs_data2
> > pool 'cephfs_data2' created
> >
> > ceph:~ # ceph fs add_data_pool cephfs cephfs_data2
> >   Pool 'cephfs_data2' (id '4') has pg autoscale mode 'on' but is not
> marked
> > as bulk.
> >   Consider setting the flag by running
> > # ceph osd pool set cephfs_data2 bulk true
> > added data pool 4 to fsmap
> >
> > ceph:~ # ceph fs status
> > cephfs - 0 clients
> > ==
> > RANK  STATE MDS   ACTIVITY DNSINOS   DIRS
> > CAPS
> >  0active  cephfs.soc9-ceph.uqcybj  Reqs:0 /s10 13 12
> > 0
> >POOL   TYPE USED  AVAIL
> > cephfs.cephfs.meta  metadata  64.0k  13.8G
> > cephfs.cephfs.datadata   0   13.8G
> >cephfs_data2   data   0   13.8G
> >
> >
> > You can't remove the default data pool, though (here it's
> > cephfs.cephfs.data). If you want to control the pool creation you can
> fall
> > back to the method you mentioned, create pools as you require them and
> then
> > create a new cephfs, and deploy the mds service.
>
> Yes, but I'm guessing the
>
>   ceph fs volume
>
> are the «future», so it would be super nice to add (at least) the option to
> choose the pair of pools...
>
> >
> > I haven't looked too deep into changing the default pool yet, so there
> might
> > be a way to switch that as well.
>
> Ok. I will also try but...well...newbie ;-)
>
> Anyway thanks.
>
> regards
>
> --
> Albert SHIH 🦫 🐸
> France
> Heure locale/Local time:
> jeu. 25 janv. 2024 12:00:08 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread David C.
In case the root data pool is EC, it is likely not possible to apply the
disaster recovery procedure (no layout/parent xattrs on the data pool).
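
This is why the usual pattern is a replicated default data pool, with the EC
pool attached as an additional data pool and selected per directory via a
file layout (pool/path names are placeholders):

  ceph fs add_data_pool cephfs cephfs_data_ec
  setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/ec_data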



Cordialement,

*David CASIER*



Le jeu. 25 janv. 2024 à 13:03, Eugen Block  a écrit :

> I'm not sure if using EC as the default data pool for cephfs is still
> discouraged, as stated in the output when attempting to do that; the
> docs don't mention it (at least not in the link I sent in the last
> mail):
>
> ceph:~ # ceph fs new cephfs cephfs_metadata cephfs_data
> Error EINVAL: pool 'cephfs_data' (id '8') is an erasure-coded pool.
> Use of an EC pool for the default data pool is discouraged; see the
> online CephFS documentation for more information. Use --force to
> override.
>
> ceph:~ # ceph fs new cephfs cephfs_metadata cephfs_data --force
> new fs with metadata pool 6 and data pool 8
>
> CC'ing Zac here to hopefully clear that up.
>
> Zitat von "David C." :
>
> > Albert,
> > Never used EC for (root) data pool.
> >
> > Le jeu. 25 janv. 2024 à 12:08, Albert Shih  a
> écrit :
> >
> >> Le 25/01/2024 à 08:42:19+, Eugen Block a écrit
> >> > Hi,
> >> >
> >> > it's really as easy as it sounds (fresh test cluster on 18.2.1 without
> >> any
> >> > pools yet):
> >> >
> >> > ceph:~ # ceph fs volume create cephfs
> >>
> >> Yes...I already try that with the label and works fine.
> >>
> >> But I prefer to use «my» pools. Because I have ssd/hdd and want also try
> >> «erasure coding» pool for the data.
> >>
> >
> >> I also need to set the pg_num and pgp_num (I know I can do that after
> the
> >> creation).
> >
> >
> >> So I manage to do ... half what I want...
> >>
> >> In fact
> >>
> >>   ceph fs volume create thing
> >>
> >> will create two pools
> >>
> >>   cephfs.thing.meta
> >>   cephfs.thing.data
> >>
> >> and if those pool already existe it will use them.
> >>
> >> But that's only if the data are replicated no with erasure
> coding(maybe
> >> I forget something config on the pool).
> >>
> >> Well I will currently continue my test with replicated data.
> >>
> >> > The pools and the daemons are created automatically (you can control
> the
> >> > placement of the daemons with the --placement option). Note that the
> >> > metadata pool needs to be on fast storage, so you might need to change
> >> the
> >> > ruleset for the metadata pool after creation in case you have HDDs in
> >> place.
> >> > Changing pools after the creation can be done via ceph fs commands:
> >> >
> >> > ceph:~ # ceph osd pool create cephfs_data2
> >> > pool 'cephfs_data2' created
> >> >
> >> > ceph:~ # ceph fs add_data_pool cephfs cephfs_data2
> >> >   Pool 'cephfs_data2' (id '4') has pg autoscale mode 'on' but is not
> >> marked
> >> > as bulk.
> >> >   Consider setting the flag by running
> >> > # ceph osd pool set cephfs_data2 bulk true
> >> > added data pool 4 to fsmap
> >> >
> >> > ceph:~ # ceph fs status
> >> > cephfs - 0 clients
> >> > ==
> >> > RANK  STATE MDS   ACTIVITY DNSINOS
>  DIRS
> >> > CAPS
> >> >  0active  cephfs.soc9-ceph.uqcybj  Reqs:0 /s10 13
>  12
> >> > 0
> >> >POOL   TYPE USED  AVAIL
> >> > cephfs.cephfs.meta  metadata  64.0k  13.8G
> >> > cephfs.cephfs.datadata   0   13.8G
> >> >cephfs_data2   data   0   13.8G
> >> >
> >> >
> >> > You can't remove the default data pool, though (here it's
> >> > cephfs.cephfs.data). If you want to control the pool creation you can
> >> fall
> >> > back to the method you mentioned, create pools as you require them and
> >> then
> >> > create a new cephfs, and deploy the mds service.
> >>
> >> Yes, but I'm guessing the
> >>
> >>   ceph fs volume
> >>
> >> are the «future» so it would be super nice to add (at least) the option
> to
> >> choose the couple of pool...
> >>
> >> >
> >> > I haven't looked too deep into changing the default pool yet, so there
> >> might
> >> > be a way to switch that as well.
> >>
> >> Ok. I will also try but...well...newbie ;-)
> >>
> >> Anyway thanks.
> >>
> >> regards
> >>
> >> --
> >> Albert SHIH 🦫 🐸
> >> France
> >> Heure locale/Local time:
> >> jeu. 25 janv. 2024 12:00:08 CET
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Stupid question about ceph fs volume

2024-01-25 Thread David C.
It would be a pleasure to complete the documentation but we would need to
test or have someone confirm what I have assumed.

Concerning the warning, I think we should not talk about the disaster
recovery procedure.
While the disaster recovery procedure has already saved some, it has also
put others at risk when misused.


Cordialement,

*David CASIER*




Le jeu. 25 janv. 2024 à 14:45, Eugen Block  a écrit :

> Oh right, I forgot about that, good point! But if that is (still) true
> then this should definitely be in the docs as a warning for EC pools
> in cephfs!
>
> Zitat von "David C." :
>
> > In case the root is EC, it is likely that is not possible to apply the
> > disaster recovery procedure, (no xattr layout/parent on the data pool).
> >
> > 
> >
> > Cordialement,
> >
> > *David CASIER*
> > 
> >
> >
> > Le jeu. 25 janv. 2024 à 13:03, Eugen Block  a écrit :
> >
> >> I'm not sure if using EC as default data pool for cephfs is still
> >> discouraged as stated in the output when attempting to do that, the
> >> docs don't mention that (at least not in the link I sent in the last
> >> mail):
> >>
> >> ceph:~ # ceph fs new cephfs cephfs_metadata cephfs_data
> >> Error EINVAL: pool 'cephfs_data' (id '8') is an erasure-coded pool.
> >> Use of an EC pool for the default data pool is discouraged; see the
> >> online CephFS documentation for more information. Use --force to
> >> override.
> >>
> >> ceph:~ # ceph fs new cephfs cephfs_metadata cephfs_data --force
> >> new fs with metadata pool 6 and data pool 8
> >>
> >> CC'ing Zac here to hopefully clear that up.
> >>
> >> Zitat von "David C." :
> >>
> >> > Albert,
> >> > Never used EC for (root) data pool.
> >> >
> >> > Le jeu. 25 janv. 2024 à 12:08, Albert Shih  a
> >> écrit :
> >> >
> >> >> Le 25/01/2024 à 08:42:19+, Eugen Block a écrit
> >> >> > Hi,
> >> >> >
> >> >> > it's really as easy as it sounds (fresh test cluster on 18.2.1
> without
> >> >> any
> >> >> > pools yet):
> >> >> >
> >> >> > ceph:~ # ceph fs volume create cephfs
> >> >>
> >> >> Yes...I already try that with the label and works fine.
> >> >>
> >> >> But I prefer to use «my» pools. Because I have ssd/hdd and want also
> try
> >> >> «erasure coding» pool for the data.
> >> >>
> >> >
> >> >> I also need to set the pg_num and pgp_num (I know I can do that after
> >> the
> >> >> creation).
> >> >
> >> >
> >> >> So I manage to do ... half what I want...
> >> >>
> >> >> In fact
> >> >>
> >> >>   ceph fs volume create thing
> >> >>
> >> >> will create two pools
> >> >>
> >> >>   cephfs.thing.meta
> >> >>   cephfs.thing.data
> >> >>
> >> >> and if those pool already existe it will use them.
> >> >>
> >> >> But that's only if the data are replicated no with erasure
> >> coding(maybe
> >> >> I forget something config on the pool).
> >> >>
> >> >> Well I will currently continue my test with replicated data.
> >> >>
> >> >> > The pools and the daemons are created automatically (you can
> control
> >> the
> >> >> > placement of the daemons with the --placement option). Note that
> the
> >> >> > metadata pool needs to be on fast storage, so you might need to
> change
> >> >> the
> >> >> > ruleset for the metadata pool after creation in case you have HDDs
> in
> >> >> place.
> >> >> > Changing pools after the creation can be done via ceph fs commands:
> >> >> >
> >> >> > ceph:~ # ceph osd pool create cephfs_data2
> >> >> > pool 'cephfs_data2' created
> >> >> >
> >> >> > ceph:~ # ceph fs add_data_pool cephfs cephfs_data2
> >> >> >   Pool 'cephfs_data2' (id 

[ceph-users] Re: cephadm discovery service certificate absent after upgrade.

2024-01-25 Thread David C.
It would be cool, actually, to have the metrics working in 18.2.2 for
IPv6-only setups.

Otherwise, everything works fine on my side.


Cordialement,

*David CASIER*




Le jeu. 25 janv. 2024 à 16:12, Nicolas FOURNIL 
a écrit :

> Gotcha !
>
> I've got the point, after restarting the CA certificate creation with :
> ceph restful create-self-signed-cert
>
> I get this error :
> Module 'cephadm' has failed: Expected 4 octets in
> 'fd30:::0:1101:2:0:501'
>
>
> *Ouch, 4 octets = IPv4 address expected... some nice code ahead.*
>
> I go through podman to get more traces :
>
>   File "/usr/share/ceph/mgr/cephadm/ssl_cert_utils.py", line 49, in
> generate_root_cert
> [x509.IPAddress(ipaddress.IPv4Address(addr))]
>   File "/lib64/python3.6/ipaddress.py", line 1284, in __init__
> self._ip = self._ip_int_from_string(addr_str)
>   File "/lib64/python3.6/ipaddress.py", line 1118, in _ip_int_from_string
> raise AddressValueError("Expected 4 octets in %r" % ip_str)
> ipaddress.AddressValueError: Expected 4 octets in
> 'fd30:::0:1101:2:0:501'
>
> So I github this and find this fix in 19.0.0 (with backport not yet
> released) :
>
>
> https://github.com/ceph/ceph/commit/647b5d67a8a800091acea68d20e87354373b0fac
>
> This example shows that it's impossible to get any metrics in an IPv6-only
> network (discovery is impossible), and it's visible at install time, so is
> there no test for IPv6-only environments before release?
>
> Now I'm seriously considering adding a crappy IPv4 subnet just for my
> ceph cluster, because it's always a headache to get it working in an IPv6
> environment.
>
>
> Le mar. 23 janv. 2024 à 17:58, David C.  a écrit :
>
>> According to sources, the certificates are generated automatically at
>> startup. Hence my question if the service started correctly.
>>
>> I also had problems with IPv6 only, but I don't immediately have more info
>> 
>>
>> Cordialement,
>>
>> *David CASIER*
>> 
>>
>>
>> Le mar. 23 janv. 2024 à 17:46, Nicolas FOURNIL 
>> a écrit :
>>
>>> IPv6 only : Yes, the -ms_bind_ipv6=true is already set-
>>>
>>> I had tried a rotation of the keys for node-exporter and I get this :
>>>
>>> 2024-01-23T16:43:56.098796+ mgr.srv06-r2b-fl1.foxykh (mgr.342408)
>>> 87074 : cephadm [INF] Rotating authentication key for
>>> node-exporter.srv06-r2b-fl1
>>> 2024-01-23T16:43:56.099224+ mgr.srv06-r2b-fl1.foxykh (mgr.342408)
>>> 87075 : cephadm [ERR] unknown daemon type node-exporter
>>> Traceback (most recent call last):
>>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1039, in
>>> _check_daemons
>>> self.mgr._daemon_action(daemon_spec, action=action)
>>>   File "/usr/share/ceph/mgr/cephadm/module.py", line 2203, in
>>> _daemon_action
>>> return self._rotate_daemon_key(daemon_spec)
>>>   File "/usr/share/ceph/mgr/cephadm/module.py", line 2147, in
>>> _rotate_daemon_key
>>> 'entity': daemon_spec.entity_name(),
>>>   File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line
>>> 108, in entity_name
>>> return get_auth_entity(self.daemon_type, self.daemon_id,
>>> host=self.host)
>>>   File "/usr/share/ceph/mgr/cephadm/services/cephadmservice.py", line
>>> 47, in get_auth_entity
>>> raise OrchestratorError(f"unknown daemon type {daemon_type}")
>>> orchestrator._interface.OrchestratorError: unknown daemon type
>>> node-exporter
>>>
>>> I tried removing & recreating the service: it's the same... how do I stop
>>> the rotation now :-/
>>>
>>>
>>>
>>> Le mar. 23 janv. 2024 à 17:18, David C.  a
>>> écrit :
>>>
>>>> Is the cephadm http server service starting correctly (in the mgr logs)?
>>>>
>>>> IPv6 ?
>>>> 
>>>>
>>>> Cordialement,
>>>>
>>>> *David CASIER*
>>>> 
>>>>
>>>>
>>>>
>>>>
>>>> Le mar. 23 janv. 2024 à 16:29, Nicolas FOURNIL <
>&g

[ceph-users] Re: How check local network

2024-01-29 Thread David C.
Hello Albert,

this should return the sockets used on the cluster network:

ceph report | jq '.osdmap.osds[] | .cluster_addrs.addrvec[] | .addr'
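
To cross-check per OSD (for example to spot hosts sitting on the wrong VLAN),
the OSD metadata also exposes the back-end address. A quick sketch, assuming
jq is available on the admin node:

# address each OSD uses on the cluster (back) network
ceph osd metadata | jq '.[] | {id: .id, host: .hostname, back_addr: .back_addr}'

# the cluster network currently configured
ceph config get osd cluster_network

Any OSD whose back_addr is not in the expected subnet is not using the
private network.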



Cordialement,

*David CASIER*





Le lun. 29 janv. 2024 à 22:52, Albert Shih  a écrit :

> Le 29/01/2024 à 22:43:46+0100, Albert Shih a écrit
> > Hi
> >
> > When I deployed my cluster I didn't notice that on two of my servers the
> > private network was not working (wrong VLAN). Now it's working, but how
> > can I check that it's indeed being used (currently I don't have data)?
>
> I mean... that Ceph is going to use it... sorry.
>
> --
> Albert SHIH 🦫 🐸
> France
> Heure locale/Local time:
> lun. 29 janv. 2024 22:50:36 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread David C.
Hi,
The problem seems to come from the clients (reconnect).

Test by disabling metrics on all clients:
echo Y > /sys/module/ceph/parameters/disable_send_metrics
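
If that helps, one way to make it persistent across client reboots (a sketch,
assuming the in-kernel CephFS client and that you manage /etc/modprobe.d on
the clients) is a module option:

# on each CephFS kernel client
echo "options ceph disable_send_metrics=Y" > /etc/modprobe.d/ceph-metrics.conf

# runtime check
cat /sys/module/ceph/parameters/disable_send_metrics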



Cordialement,

*David CASIER*





Le ven. 23 févr. 2024 à 10:20, Eugen Block  a écrit :

> This seems to be the relevant stack trace:
>
> ---snip---
> Feb 23 15:18:39 cephgw02 conmon[2158052]: debug -1>
> 2024-02-23T08:18:39.609+ 7fccc03c0700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h:
> In function 'std::ostream& operator<<(std::ostream&, const
> ClientMetricType&)' thread 7fccc03c0700 time
> 2024-02-23T08:18:39.609581+
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUILD/ceph-16.2.4/src/include/cephfs/metrics/Types.h:
> 56: ceph_abort_msg("abort()
> called")
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  ceph version 16.2.4
> (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  1: (ceph::__ceph_abort(char
> const*, int, char const*, std::__cxx11::basic_string std::char_traits, std::allocator > const&)+0xe5)
> [0x7fccc9021cdc]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  2:
> (operator<<(std::ostream&, ClientMetricType const&)+0x10e)
> [0x7fccc92a642e]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  3:
> (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7fccc92a6601]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  4:
> (DispatchQueue::pre_dispatch(boost::intrusive_ptr
> const&)+0x710) [0x7fccc9259c30]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  5:
> (DispatchQueue::entry()+0xdeb) [0x7fccc925b69b]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  6:
> (DispatchQueue::DispatchThread::entry()+0x11) [0x7fccc930bb71]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  7:
> /lib64/libpthread.so.0(+0x814a) [0x7fccc7dc314a]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  8: clone()
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> Feb 23 15:18:39 cephgw02 conmon[2158052]: debug  0>
> 2024-02-23T08:18:39.610+ 7fccc03c0700 -1 *** Caught signal
> (Aborted) **
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  in thread 7fccc03c0700
> thread_name:ms_dispatch
> Feb 23 15:18:39 cephgw02 conmon[2158052]:
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  ceph version 16.2.4
> (3cbe25cde3cfa028984618ad32de9edc4c1eaed0) pacific (stable)
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  1:
> /lib64/libpthread.so.0(+0x12b20) [0x7fccc7dcdb20]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  2: gsignal()
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  3: abort()
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  4: (ceph::__ceph_abort(char
> const*, int, char const*, std::__cxx11::basic_string std::char_traits, std::allocator > const&)+0x1b6)
> [0x7fccc9021dad]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  5: (opera
> Feb 23 15:18:39 cephgw02 conmon[2158052]: tor<<(std::ostream&,
> ClientMetricType const&)+0x10e) [0x7fccc92a642e]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  6:
> (MClientMetrics::print(std::ostream&) const+0x1a1) [0x7fccc92a6601]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  7:
> (DispatchQueue::pre_dispatch(boost::intrusive_ptr
> const&)+0x710) [0x7fccc9259c30]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  8:
> (DispatchQueue::entry()+0xdeb) [0x7fccc925b69b]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  9:
> (DispatchQueue::DispatchThread::entry()+0x11) [0x7fccc930bb71]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  10:
> /lib64/libpthread.so.0(+0x814a) [0x7fccc7dc314a]
> Feb 23 15:18:39 cephgw02 conmon[2158052]:  11: clone()
> ---snip---
>
> But I can't really help here, hopefully someone else can chime in and
> interpret it.
>
>
> Zitat von nguyenvand...@baoviet.com.vn:
>
> >
> https://drive.google.com/file/d/1OIN5O2Vj0iWfEMJ2fyHN_xV6fpknBmym/view?usp=sharing
> >
> > Pls check my mds log which generate by command
> >
> > cephadm logs --name mds.cephfs.cephgw02.qqsavr --fsid
> > 258af72a-cff3-11eb-a261-d4f5ef25154c
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-23 Thread David C.
Look at ALL CephFS kernel clients (this has no effect on RGW).

Le ven. 23 févr. 2024 à 16:38,  a écrit :

> And we don't have a parameters directory.
>
> cd /sys/module/ceph/
> [root@cephgw01 ceph]# ls
> coresize  holders  initsize  initstate  notes  refcnt  rhelversion
> sections  srcversion  taint  uevent
>
> My Ceph is 16.2.4
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread David C.
Do you have the possibility to stop/unmount the CephFS clients?

If so, do that and restart the MDS.
It should restart.

Have the clients restart one by one and check that the MDS does not crash
(by monitoring the logs)
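
As a rough sequence (a sketch only; the daemon name and fsid below are taken
from your earlier cephadm logs command, the mount point is an example):

# on each client
umount /mnt/cephfs

# restart the MDS daemon
ceph orch daemon restart mds.cephfs.cephgw02.qqsavr

# follow the MDS log while the clients come back one by one
cephadm logs --fsid 258af72a-cff3-11eb-a261-d4f5ef25154c \
    --name mds.cephfs.cephgw02.qqsavr -- -f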



Cordialement,

*David CASIER*




*Ligne directe: +33(0) 9 72 61 98 29*




Le sam. 24 févr. 2024 à 10:01,  a écrit :

> Hi Mathew
>
> Pls chekc my ceph -s
>
> ceph -s
>   cluster:
> id: 258af72a-cff3-11eb-a261-d4f5ef25154c
> health: HEALTH_WARN
> 3 failed cephadm daemon(s)
> 1 filesystem is degraded
> insufficient standby MDS daemons available
> 1 nearfull osd(s)
> Low space hindering backfill (add storage if this doesn't
> resolve itself):
> 21 pgs backfill_toofull
> 15 pool(s) nearfull
> 11 daemons have recently crashed
>
>   services:
> mon: 6 daemons, quorum
> cephgw03,cephosd01,cephgw01,cephosd03,cephgw02,cephosd02 (age 30h)
> mgr: cephgw01.vwoffq(active, since 17h), standbys:
> cephgw02.nauphz,
> cephgw03.aipvii
> mds: 1/1 daemons up
> osd: 29 osds: 29 up (since 40h), 29 in (since 29h); 402
> remapped pgs
> rgw: 2 daemons active (2 hosts, 1 zones)
> tcmu-runner: 18 daemons active (2 hosts)
>
>   data:
> volumes: 0/1 healthy, 1 recovering
> pools:   15 pools, 1457 pgs
> objects: 36.87M objects, 25 TiB
> usage:   75 TiB used, 41 TiB / 116 TiB avail
> pgs: 17759672/110607480 objects misplaced (16.056%)
>  1055 active+clean
>  363  active+remapped+backfill_wait
>  18   active+remapped+backfilling
>  14   active+remapped+backfill_toofull
>  7
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [Urgent] Ceph system Down, Ceph FS volume in recovering

2024-02-24 Thread David C.
If rebalancing tasks have been launched it's not a big deal, but I don't
think that's the priority.
The priority is to get the MDS back on its feet.
I haven't seen an answer to this question: can you stop/unmount the CephFS
clients or not?

There are other solutions, but as you are not comfortable with Ceph I am
suggesting the simplest one on the Ceph side, even if it is not the most
comfortable one on the business side.

Is there no one (in Vietnam?) who could help you more seriously (and in a
lasting way)?



Le sam. 24 févr. 2024 à 15:55,  a écrit :

> Hi David,
>
> I ll follow your suggestion. Do you have Telegram ? If yes, could you pls
> add my Telegram, +84989177619. Thank you so much
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Seperate metadata pool in 3x MDS node

2024-02-24 Thread David C.
Hello,

Does each rack work on different trees, or is everything parallelized?
Would the metadata pools be distributed over racks 1, 2, 4 and 5?
If they are distributed, then even if the MDS being addressed is on the same
switch as the client, that MDS will still consult/write the (NVMe) OSDs on
the other racks (among 1, 2, 4, 5).

In any case, the exercise is interesting.
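
For reference, if the goal is metadata replicas on the NVMe devices spread
across racks, a sketch of the pool setup could be (rule and pool names are
illustrative, not taken from your design):

# replicated rule: rack as failure domain, restricted to the nvme device class
ceph osd crush rule create-replicated meta-nvme-rack default rack nvme

# apply it to the CephFS metadata pool
ceph osd pool set cephfs_metadata crush_rule meta-nvme-rack

Whatever the rule, an MDS in rack 1 will still read/write metadata objects on
OSDs located in the other racks, which is the point raised above.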



Le sam. 24 févr. 2024 à 19:56, Özkan Göksu  a écrit :

> Hello folks!
>
> I'm designing a new Ceph storage from scratch and I want to increase CephFS
> speed and decrease latency.
> Usually I always build (WAL+DB on NVME with Sas-Sata SSD's) and I deploy
> MDS and MON's on the same servers.
> This time a weird idea came to my mind and I think it has great potential
> and will perform better on paper with my limited knowledge.
>
> I have 5 racks and the 3nd "middle" rack is my storage and management rack.
>
> - At RACK-3 I'm gonna locate 8x 1u OSD server (Spec: 2x E5-2690V4, 256GB,
> 4x 25G, 2x 1.6TB PCI-E NVME "MZ-PLK3T20", 8x 4TB SATA SSD)
>
> - My Cephfs kernel clients are 40x GPU nodes located at RACK1,2,4,5
>
> With my current workflow, all the clients;
> 1- visit the rack data switch
> 2- jump to main VPC switch via 2x100G,
> 3- talk with MDS servers,
> 4- Go back to the client with the answer,
> 5- To access data follow the same HOP's and visit the OSD's everytime.
>
> If I deploy separate metadata pool by using 4x MDS server at top of
> RACK-1,2,4,5 (Spec: 2x E5-2690V4, 128GB, 2x 10G(Public), 2x 25G (cluster),
> 2x 960GB U.2 NVME "MZ-PLK3T20")
> Then all the clients will make the request directly in-rack 1 HOP away MDS
> servers and if the request is only metadata, then the MDS node doesn't need
> to redirect the request to OSD nodes.
> Also by locating MDS servers with seperated metadata pool across all the
> racks will reduce the high load on main VPC switch at RACK-3
>
> If I'm not missing anything then only Recovery workload will suffer with
> this topology.
>
> What do you think?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread David C.
I came across an enterprise NVMe used for the BlueFS DB whose performance
dropped sharply a few months after delivery (I won't mention the brand
here, but it was not among these three: Intel, Samsung, Micron).
It is clear that enabling bdev_enable_discard impacted performance, but
this option also saved the platform after a few days of discard.

IMHO the most important thing is to validate the behavior once the entire
flash media has been written to.
But this option has the merit of existing.

It seems to me that the ideal would be not to have several bdev_*discard
options, and for this task to be asynchronous, issuing the discard
instructions during a calmer period of activity (I see no impact if the
instructions are lost during an OSD reboot).
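
For anyone who wants to experiment, both options can be set from the
centralized config (a sketch; as discussed in this thread, test carefully and
note that the OSDs pick these up at startup):

# cluster-wide
ceph config set osd bdev_enable_discard true
ceph config set osd bdev_async_discard true

# or only for a given device class
ceph config set osd/class:nvme bdev_enable_discard true
ceph config set osd/class:nvme bdev_async_discard true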


Le ven. 1 mars 2024 à 19:17, Igor Fedotov  a écrit :

> I played with this feature a while ago and recall it had visible
> negative impact on user operations due to the need to submit tons of
> discard operations - effectively each data overwrite operation triggers
> one or more discard operation submission to disk.
>
> And I doubt this has been widely used if any.
>
> Nevertheless recently we've got a PR to rework some aspects of thread
> management for this stuff, see https://github.com/ceph/ceph/pull/55469
>
> The author claimed they needed this feature for their cluster so you
> might want to ask him about their user experience.
>
>
> W.r.t documentation - actually there are just two options
>
> - bdev_enable_discard - enables issuing discard to disk
>
> - bdev_async_discard - instructs whether discard requests are issued
> synchronously (along with disk extents release) or asynchronously (using
> a background thread).
>
> Thanks,
>
> Igor
>
> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> > Is there any update on this? Did someone test the option and has
> > performance values before and after?
> > Is there any good documentation regarding this option?
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: has anyone enabled bdev_enable_discard?

2024-03-02 Thread David C.
Could we not consider setting up a “bluefstrim” which could be orchestrated?

This would avoid having a continuous stream of (D)iscard instructions on
the disks during activity.

A weekly (or even just monthly) bluefstrim would probably be enough for
platforms that really need it.


Le sam. 2 mars 2024 à 12:58, Matt Vandermeulen  a
écrit :

> We've had a specific set of drives that we've had to enable
> bdev_enable_discard and bdev_async_discard for in order to maintain
> acceptable performance on block clusters. I wrote the patch that Igor
> mentioned in order to try and send more parallel discards to the
> devices, but these ones in particular seem to process them in serial
> (based on observed discard counts and latency going to the device),
> which is unfortunate. We're also testing new firmware that suggests it
> should help alleviate some of the initial concerns we had about discards
> not keeping up which prompted the patch in the first place.
>
> Most of our drives do not need discards enabled (and definitely not
> without async) in order to maintain performance unless we're doing a
> full disk fio test or something like that where we're trying to find its
> cliff profile. We've used OSD classes to help target the options being
> applied to specific OSDs via centralized conf which helps when we would
> add new hosts that may have different drives so that the options weren't
> applied globally.
>
> Based on our experience, I wouldn't enable it unless you're seeing some
> sort of cliff-like behaviour as your OSDs run low on free space, or are
> heavily fragmented. I would also deem bdev_async_enabled = 1 to be a
> requirement so that it doesn't block user IO. Keep an eye on your
> discards being sent to devices and the discard latency, as well (via
> node_exporter, for example).
>
> Matt
>
>
> On 2024-03-02 06:18, David C. wrote:
> > I came across an enterprise NVMe used for BlueFS DB whose performance
> > dropped sharply after a few months of delivery (I won't mention the
> > brand
> > here but it was not among these 3: Intel, Samsung, Micron).
> > It is clear that enabling bdev_enable_discard impacted performance, but
> > this option also saved the platform after a few days of discard.
> >
> > IMHO the most important thing is to validate the behavior when there
> > has
> > been a write to the entire flash media.
> > But this option has the merit of existing.
> >
> > it seems to me that the ideal would be not to have several options on
> > bdev_*discard, and that this task should be asynchronous and with the
> > (D)iscard instructions during a calmer period of activity (I do not see
> > any
> > impact if the instructions are lost during an OSD reboot)
> >
> >
> > Le ven. 1 mars 2024 à 19:17, Igor Fedotov  a
> > écrit :
> >
> >> I played with this feature a while ago and recall it had visible
> >> negative impact on user operations due to the need to submit tons of
> >> discard operations - effectively each data overwrite operation
> >> triggers
> >> one or more discard operation submission to disk.
> >>
> >> And I doubt this has been widely used if any.
> >>
> >> Nevertheless recently we've got a PR to rework some aspects of thread
> >> management for this stuff, see https://github.com/ceph/ceph/pull/55469
> >>
> >> The author claimed they needed this feature for their cluster so you
> >> might want to ask him about their user experience.
> >>
> >>
> >> W.r.t documentation - actually there are just two options
> >>
> >> - bdev_enable_discard - enables issuing discard to disk
> >>
> >> - bdev_async_discard - instructs whether discard requests are issued
> >> synchronously (along with disk extents release) or asynchronously
> >> (using
> >> a background thread).
> >>
> >> Thanks,
> >>
> >> Igor
> >>
> >> On 01/03/2024 13:06, jst...@proxforge.de wrote:
> >> > Is there any update on this? Did someone test the option and has
> >> > performance values before and after?
> >> > Is there any good documentation regarding this option?
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] All MGR loop crash

2024-03-07 Thread David C.
Hello everybody,

I'm encountering strange behavior on an infrastructure (it's pre-production,
but it's very ugly). After a "drain" on a monitor (and a manager), the MGRs
all crash on startup:

Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 standby
mgrmap(e 1310) v1
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map received
map epoch 1310
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map active in
map: 1 active is 99148504
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map Activating!
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map I am now
activating
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 mgrmap(e
1310) v1
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Got map
version 1310
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Active mgr
is now
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc reconnect No active mgr
available yet
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr init waiting for OSDMap...
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _renew_subs
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _send_mon_message
to mon.idf-pprod-osd3 at v2:X.X.X.X:3300/0 
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _reopen_session
rank -1
Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: *** Caught signal (Aborted) **
  in thread 7f9a07a27640
thread_name:mgr-fin

  ceph version
17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
  1:
/lib64/libc.so.6(+0x54db0) [0x7f9a2364ddb0]
  2:
/lib64/libc.so.6(+0xa154c) [0x7f9a2369a54c]
  3: raise()
  4: abort()
  5:
/usr/lib64/ceph/libceph-common.so.2(+0x1c1fa8) [0x7f9a23ce2fa8]
  6:
/usr/lib64/ceph/libceph-common.so.2(+0x25) [0x7f9a23f65425]
  7:
/usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
  8:
/usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
  9:
(MonClient::_add_conns()+0x242) [0x7f9a23f5fa42]
  10:
(MonClient::_reopen_session(int)+0x428) [0x7f9a23f60518]
  11: (Mgr::init()+0x384)
[0x5604667a6434]
  12:
/usr/bin/ceph-mgr(+0x1af271) [0x5604667ae271]
  13:
/usr/bin/ceph-mgr(+0x11364d) [0x56046671264d]
  14:
(Finisher::finisher_thread_entry()+0x175) [0x7f9a23d10645]
  15:
/lib64/libc.so.6(+0x9f802) [0x7f9a23698802]
  16:
/lib64/libc.so.6(+0x3f450) [0x7f9a23638450]
  NOTE: a copy of the
executable, or `objdump -rdS ` is needed to interpret this.

I have the impression that the MGRs are ejected by the monitors; however,
after debugging a monitor, I don't see anything abnormal on the monitor side
(unless I missed something).

All we can see is that we get an exception in the "_add_conn" method
(https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc#L775)

Version : 17.2.6-170.el9cp (RHCS6)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All MGR loop crash

2024-03-07 Thread David C.
I took the wrong line =>
https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc#L822


Le jeu. 7 mars 2024 à 18:21, David C.  a écrit :

>
> Hello everybody,
>
> I'm encountering strange behavior on an infrastructure (it's
> pre-production but it's very ugly). After a "drain" on monitor (and a
> manager). MGRs all crash on startup:
>
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 standby
> mgrmap(e 1310) v1
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map received
> map epoch 1310
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map active in
> map: 1 active is 99148504
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map Activating!
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map I am now
> activating
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 mgrmap(e
> 1310) v1
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Got map
> version 1310
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Active
> mgr is now
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc reconnect No active mgr
> available yet
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr init waiting for OSDMap...
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _renew_subs
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _send_mon_message
> to mon.idf-pprod-osd3 at v2:X.X.X.X:3300/0 <http://10.191.10.3:3300/0>
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _reopen_session
> rank -1
> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: *** Caught signal (Aborted) **
>   in thread 7f9a07a27640
> thread_name:mgr-fin
>
>   ceph version
> 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
>   1:
> /lib64/libc.so.6(+0x54db0) [0x7f9a2364ddb0]
>   2:
> /lib64/libc.so.6(+0xa154c) [0x7f9a2369a54c]
>   3: raise()
>   4: abort()
>   5:
> /usr/lib64/ceph/libceph-common.so.2(+0x1c1fa8) [0x7f9a23ce2fa8]
>   6:
> /usr/lib64/ceph/libceph-common.so.2(+0x25) [0x7f9a23f65425]
>   7:
> /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>   8:
> /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>   9:
> (MonClient::_add_conns()+0x242) [0x7f9a23f5fa42]
>   10:
> (MonClient::_reopen_session(int)+0x428) [0x7f9a23f60518]
>   11: (Mgr::init()+0x384)
> [0x5604667a6434]
>   12:
> /usr/bin/ceph-mgr(+0x1af271) [0x5604667ae271]
>   13:
> /usr/bin/ceph-mgr(+0x11364d) [0x56046671264d]
>   14:
> (Finisher::finisher_thread_entry()+0x175) [0x7f9a23d10645]
>   15:
> /lib64/libc.so.6(+0x9f802) [0x7f9a23698802]
>   16:
> /lib64/libc.so.6(+0x3f450) [0x7f9a23638450]
>   NOTE: a copy of the
> executable, or `objdump -rdS ` is needed to interpret this.
>
> I have the impression that the MGRs are ejected by the monitors, however
> after debugging monitor, I don't see anything abnormal on the monitor side
> (if I haven't missed something).
>
> All we can see is that we get an exception on the "_add_conn" method (
> https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc #L775)
>
> Version : 17.2.6-170.el9cp (RHCS6)
>
>
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: All MGR loop crash

2024-03-07 Thread David C.
Ok, got it :

[root@pprod-admin:/var/lib/ceph/]# ceph mon dump -f json-pretty
|egrep "name|weigh"
dumped monmap epoch 14
"min_mon_release_name": "quincy",
"name": "pprod-mon2",
"weight": 10,
"name": "pprod-mon3",
"weight": 10,
"name": "pprod-osd2",
"weight": 0,
"name": "pprod-osd1",
"weight": 0,
"name": "pprod-osd3",
"weight": 0,

ceph mon set-weight pprod-mon2 0
ceph mon set-weight pprod-mon3 0

And restart ceph-mgr

Le jeu. 7 mars 2024 à 18:25, David C.  a écrit :

> I took the wrong ligne =>
> https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc#L822
>
>
> Le jeu. 7 mars 2024 à 18:21, David C.  a écrit :
>
>>
>> Hello everybody,
>>
>> I'm encountering strange behavior on an infrastructure (it's
>> pre-production but it's very ugly). After a "drain" on monitor (and a
>> manager). MGRs all crash on startup:
>>
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 standby
>> mgrmap(e 1310) v1
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map received
>> map epoch 1310
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map active in
>> map: 1 active is 99148504
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map
>> Activating!
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map I am now
>> activating
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 mgrmap(e
>> 1310) v1
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Got map
>> version 1310
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Active
>> mgr is now
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc reconnect No active mgr
>> available yet
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr init waiting for
>> OSDMap...
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _renew_subs
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _send_mon_message
>> to mon.idf-pprod-osd3 at v2:X.X.X.X:3300/0 <http://10.191.10.3:3300/0>
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _reopen_session
>> rank -1
>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: *** Caught signal (Aborted)
>> **
>>   in thread 7f9a07a27640
>> thread_name:mgr-fin
>>
>>   ceph version
>> 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy (stable)
>>   1:
>> /lib64/libc.so.6(+0x54db0) [0x7f9a2364ddb0]
>>   2:
>> /lib64/libc.so.6(+0xa154c) [0x7f9a2369a54c]
>>   3: raise()
>>   4: abort()
>>   5:
>> /usr/lib64/ceph/libceph-common.so.2(+0x1c1fa8) [0x7f9a23ce2fa8]
>>   6:
>> /usr/lib64/ceph/libceph-common.so.2(+0x25) [0x7f9a23f65425]
>>   7:
>> /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>>   8:
>> /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>>   9:
>> (MonClient::_add_conns()+0x242) [0x7f9a23f5fa42]
>>   10:
>> (MonClient::_reopen_session(int)+0x428) [0x7f9a23f60518]
>>   11: (Mgr::init()+0x384)
>> [0x5604667a6434]
>>   12:
>> /usr/bin/ceph-mgr(+0x1af271) [0x5604667ae271]
>>   13:
>> /usr/bin/ceph-mgr(+0x11364d) [0x56046671264d]
>>   14:
>> (Finisher::finisher_thread_entry()+0x175) [0x7f9a23d10645]
>>   15:
>> /lib64/libc.so.6(+0x9f802) [0x7f9a23698802]
>>   16:
>> /lib64/libc.so.6(+0x3f450) [0x7f9a23638450]
>>   NOTE: a copy of the
>> executable, or `objdump -rdS ` is needed to interpret this.
>>
>> I have the impression that the MGRs are ejected by the monitors, however
>> after debugging monitor, I don't see anything abnormal on the monitor side
>> (if I haven't missed something).
>>
>> All we can see is that we get an exception on the "_add_conn" method (
>> https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc #L775)
>>
>> Version : 17.2.6-170.el9cp (RHCS6)
>>
>>
>>
>>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] All MGR loop crash

2024-03-07 Thread David C.
Some monitors have existed for many years (weight 10); others have been
added (weight 0).

=> https://github.com/ceph/ceph/commit/2d113dedf851995e000d3cce136b69bfa94b6fe0

Le jeudi 7 mars 2024, Eugen Block  a écrit :

> I’m curious how the weights might have been changed. I’ve never touched a
> mon weight myself, do you know how that happened?
>
> Zitat von "David C." :
>
> Ok, got it :
>>
>> [root@pprod-admin:/var/lib/ceph/]# ceph mon dump -f json-pretty
>> |egrep "name|weigh"
>> dumped monmap epoch 14
>> "min_mon_release_name": "quincy",
>> "name": "pprod-mon2",
>> "weight": 10,
>> "name": "pprod-mon3",
>> "weight": 10,
>> "name": "pprod-osd2",
>> "weight": 0,
>> "name": "pprod-osd1",
>> "weight": 0,
>> "name": "pprod-osd3",
>> "weight": 0,
>>
>> ceph mon set-weight pprod-mon2 0
>> ceph mon set-weight pprod-mon3 0
>>
>> And restart ceph-mgr
>>
>> Le jeu. 7 mars 2024 à 18:25, David C.  a écrit :
>>
>> I took the wrong ligne =>
>>> https://github.com/ceph/ceph/blob/v17.2.6/src/mon/MonClient.cc#L822
>>>
>>>
>>> Le jeu. 7 mars 2024 à 18:21, David C.  a écrit :
>>>
>>>
>>>> Hello everybody,
>>>>
>>>> I'm encountering strange behavior on an infrastructure (it's
>>>> pre-production but it's very ugly). After a "drain" on monitor (and a
>>>> manager). MGRs all crash on startup:
>>>>
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 standby
>>>> mgrmap(e 1310) v1
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map received
>>>> map epoch 1310
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map active
>>>> in
>>>> map: 1 active is 99148504
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map
>>>> Activating!
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr handle_mgr_map I am now
>>>> activating
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr ms_dispatch2 mgrmap(e
>>>> 1310) v1
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Got map
>>>> version 1310
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc handle_mgr_map Active
>>>> mgr is now
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgrc reconnect No active
>>>> mgr
>>>> available yet
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: mgr init waiting for
>>>> OSDMap...
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _renew_subs
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient:
>>>> _send_mon_message
>>>> to mon.idf-pprod-osd3 at v2:X.X.X.X:3300/0 <http://10.191.10.3:3300/0>
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: monclient: _reopen_session
>>>> rank -1
>>>> Mar 07 17:06:47 pprod-mon1 ceph-mgr[564045]: *** Caught signal (Aborted)
>>>> **
>>>>   in thread 7f9a07a27640
>>>> thread_name:mgr-fin
>>>>
>>>>   ceph version
>>>> 17.2.6-170.el9cp (59bbeb8815ec3aeb3c8bba1e1866f8f6729eb840) quincy
>>>> (stable)
>>>>   1:
>>>> /lib64/libc.so.6(+0x54db0) [0x7f9a2364ddb0]
>>>>   2:
>>>> /lib64/libc.so.6(+0xa154c) [0x7f9a2369a54c]
>>>>   3: raise()
>>>>   4: abort()
>>>>   5:
>>>> /usr/lib64/ceph/libceph-common.so.2(+0x1c1fa8) [0x7f9a23ce2fa8]
>>>>   6:
>>>> /usr/lib64/ceph/libceph-common.so.2(+0x25) [0x7f9a23f65425]
>>>>   7:
>>>> /usr/lib64/ceph/libceph-common.so.2(+0x4442e0) [0x7f9a23f652e0]
>>>>   8:
>>>> /usr/lib64/ceph/libceph-common.so.2(

[ceph-users] Re: Erasure Code with Autoscaler and Backfill_toofull

2024-03-27 Thread David C.
Hi Daniel,

Changing pg_num when some OSD is almost full is not a good strategy (it can
even be dangerous).

What is causing this backfilling? The loss of an OSD? The balancer? Something else?

What is the fill level of the least-used OSD? (sort -nrk17)

Is the balancer activated? (upmap?)

Once the situation stabilizes, it becomes interesting to think about the
number of pg/osd =>
https://docs.ceph.com/en/latest/rados/operations/placement-groups/#managing-pools-that-are-flagged-with-bulk
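
A few commands that help answer those questions (a sketch; the column index
used by sort depends on the release, check where %USE sits in your output):

# most- and least-full OSDs
ceph osd df tree | sort -nrk17 | head
ceph osd df tree | sort -nrk17 | tail

# is the balancer active, and in which mode?
ceph balancer status

# what is backfilling right now
ceph pg dump pgs_brief | grep backfill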


Le mer. 27 mars 2024 à 09:41, Daniel Williams  a
écrit :

> Hey,
>
> I'm running ceph version 18.2.1 (reef) but this problem must have existed a
> long time before reef.
>
> The documentation says the autoscaler will target 100 pgs per OSD but I'm
> only seeing ~10. My erasure encoding is a stripe of 6 data 3 parity.
> Could that be the reason? PGs numbers for that EC pool are therefore
> multiplied by k+m by the autoscaler calculations?
>
> Is backfill_toofull calculated against the total size of the PG against
> every OSD it is destined for? For my case I have ~1TiB PGs because the
> autoscaler is creating only 10 per host, and then backfill too full is
> considering that one of my OSDs only has 500GiB free, although that doesn't
> quite add up either because two 1TiB PGs are backfilling two pg's that have
> OSD 1 in them. My backfill full ratio is set to 97%.
>
> Would it be correct for me to change the autoscaler to target ~700 pgs per
> osd and bias for storagefs and all EC pools to k+m? Should that be the
> default or the documentation recommended value?
>
> How scary is changing PG_NUM while backfilling misplaced PGs? It seems like
> there's a chance the backfill might succeed so I think I can wait.
>
> Any help is greatly appreciated, I've tried to include as much of the
> relevant debugging output as I can think of.
>
> Daniel
>
> # ceph osd ls | wc -l
> 44
> # ceph pg ls | wc -l
> 484
>
> # ceph osd pool autoscale-status
> POOL SIZE  TARGET SIZE   RATE  RAW CAPACITY   RATIO
>  TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
> .rgw.root  216.0k 3.0480.2T  0.
>  1.0  32  on False
> default.rgw.control0  3.0480.2T  0.
>  1.0  32  on False
> default.rgw.meta   0  3.0480.2T  0.
>  1.0  32  on False
> default.rgw.log 1636k 3.0480.2T  0.
>  1.0  32  on False
> storagefs  233.5T 1.5480.2T  0.7294
>  1.0 256  on False
> storagefs-meta 850.2M 4.0480.2T  0.
>  4.0  32  on False
> storagefs_wide 355.3G   1.375480.2T  0.0010
>  1.0  32  on False
> .mgr   457.3M 3.0480.2T  0.
>  1.0   1  on False
> mgr-backup-2022-08-19  370.6M 3.0480.2T  0.
>  1.0  32  on False
>
> # ceph osd pool ls detail | column -t
> pool  15  '.rgw.root'  replicated  size 3min_size  2
> crush_rule  0  object_hash  rjenkins  pg_num   32pgp_num  32
> autoscale_mode  on
> pool  16  'default.rgw.control'replicated  size 3min_size  2
> crush_rule  0  object_hash  rjenkins  pg_num   32pgp_num  32
> autoscale_mode  on
> pool  17  'default.rgw.meta'   replicated  size 3min_size  2
> crush_rule  0  object_hash  rjenkins  pg_num   32pgp_num  32
> autoscale_mode  on
> pool  18  'default.rgw.log'replicated  size 3min_size  2
> crush_rule  0  object_hash  rjenkins  pg_num   32pgp_num  32
> autoscale_mode  on
> pool  36  'storagefs'  erasure profile  6.3  size  9
> min_size7  crush_rule   2 object_hash  rjenkins  pg_num   256
>  pgp_num 256  autoscale_mode  on
> pool  37  'storagefs-meta' replicated  size 4min_size  1
> crush_rule  0  object_hash  rjenkins  pg_num   32pgp_num  32
> autoscale_mode  on
> pool  45  'storagefs_wide' erasure profile  8.3  size  11
>  min_size9  crush_rule   8 object_hash  rjenkins  pg_num   32
> pgp_num 32   autoscale_mode  on
> pool  46  '.mgr'   replicated  size 3min_size  2
> crush_rule  0  object_hash  rjenkins  pg_num   1 pgp_num  1
>  autoscale_mode  on
> pool  48  'mgr-backup-2022-08-19'  replicated  size 3min_size  2
> crush_rule  0  object_hash  rjenkins  pg_num   32pgp_num  32
> aut

[ceph-users] Re: Impact of Slow OPS?

2024-04-06 Thread David C.
Hi,

Do slow ops impact data integrity => No
Can I generally ignore it => No :)

This means that some client transactions are blocked for 120 sec (that's a
lot).
This could be a lock on the client side (CephFS, essentially), an incident
on the infrastructure side (a disk about to fail, network instability,
etc.), ...

When this happens, you need to look at the blocked requests.
If you systematically see the same OSD ID, then look at dmesg and the SMART
data of that disk.
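
A sketch of that investigation (replace osd.X and /dev/sdX with the daemon and
device you suspect):

# which requests are blocked, and on which OSDs
ceph health detail

# on the OSD host, inspect the suspect daemon
ceph daemon osd.X dump_ops_in_flight
ceph daemon osd.X dump_historic_slow_ops

# then the kernel log and the disk itself
dmesg -T | egrep -i 'error|reset|timeout'
smartctl -a /dev/sdX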

This can also be an architectural problem (for example, high IOPS load with
osdmap on HDD, all multiplied by the erasure code)

*David*


Le ven. 5 avr. 2024 à 19:42, adam.ther  a écrit :

> Hello,
>
> Do slow ops impact data integrity or can I generally ignore it? I'm
> loading 3 hosts with a 10GB link and it saturating the disks or the OSDs.
>
> 2024-04-05T15:33:10.625922+ mon.CEPHADM-1 [WRN] Health check
> update: 3 slow ops, oldest one blocked for 117 sec, daemons
> [osd.0,osd.13,osd.14,osd.17,osd.3,osd.4,osd.9] have slow ops.
> (SLOW_OPS)
>
> 2024-04-05T15:33:15.628271+ mon.CEPHADM-1 [WRN] Health check
> update: 2 slow ops, oldest one blocked for 123 sec, daemons
> [osd.0,osd.1,osd.14,osd.17,osd.3,osd.4,osd.9] have slow ops. (SLOW_OPS)
>
> I guess more to the point, what the impact here?
>
> Thanks,
>
> Adam
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS performance

2022-11-22 Thread David C
My understanding is BeeGFS doesn't offer data redundancy by default,
you have to configure mirroring. You've not said how your Ceph cluster
is configured but my guess is you have the recommended 3x replication
- I wouldn't be surprised if BeeGFS was much faster than Ceph in this
case. I'd be interested to see your results after ensuring equivalent
data redundancy between the platforms.

On Thu, Oct 20, 2022 at 9:02 PM quag...@bol.com.br  wrote:
>
> Hello everyone,
> I have some considerations and doubts to ask...
>
> I work at an HPC center and my doubts stem from performance in this 
> environment. All clusters here were suffering from NFS performance problems 
> and also from the single point of failure it has.
>
> At that time, we decided to evaluate some available SDS and the chosen 
> one was Ceph (first for its resilience and later for its performance).
> I deployed CephFS in a small cluster: 6 nodes and 1 HDD per machine with 
> 1Gpbs connection.
> The performance was as good as a large NFS we have on another cluster 
> (spending much less). In addition, it was able to evaluate all the benefits 
> of resiliency that Ceph offers (such as activating an OSD, MDS, MON or MGR 
> server) and the objects/services to settle on other nodes. All this in a way 
> that the user did not even notice.
>
> Given this information, a new storage cluster was acquired last year with 
> 6 machines and 22 disks (HDDs) per machine. The need was for the amount of 
> available GBs. The amount of IOPs was not so important at that time.
>
> Right at the beginning, I had a lot of work to optimize the performance 
> in the cluster (the main deficiency was in the performance in the 
> access/write of metadata). The problem was not at the job execution, but the 
> user's perception of slowness when executing interactive commands (my 
> perception was in the slowness of Ceph metadata).
> There were a few months of high loads in which storage was the bottleneck 
> of the environment.
>
> After a lot of research in documentation, I made several optimizations on 
> the available parameters and currently CephFS is able to reach around 10k 
> IOPS (using size=2).
>
> Anyway, my boss asked for other solutions to be evaluated to verify the 
> performance issue.
> First of all, it was suggested to put the metadata on SSD disks for a 
> higher amount of IOPS.
> In addition, a test environment was set up and the solution that made the 
> most difference in performance was with BeeGFS.
>
> In some situations, BeeGFS is many times faster than Ceph in the same 
> tests and under the same hardware conditions. This happens in both the 
> throuput (BW) and IOPS.
>
> We tested it using io500 as follows:
> 1-) An individual process
> 2-) 8 processes (4 processes on 2 different machines)
> 3-) 16 processes (8 processes on 2 different machines)
>
> I did tests configuring CephFS to use:
> * HDD only (for both data and metadata)
> * Metadata on SSD
> * Using Linux FSCache features
> * With some optimizations (increasing MDS memory, client memory, inflight 
> parameters, etc)
> * Cache tier with SSD
>
> Even so, the benchmark score was lower than the BeeGFS installed without 
> any optimization. This difference becomes even more evident as the number of 
> simultaneous accesses increases.
>
> The two best results of CephFS were using metadata on SSD and also doing 
> TIER on SSD.
>
> Here is the result of Ceph's performance when compared to BeeGFS:
>
> Bandwith Test (bw is in GB/s):
>
> ==
> |fs|bw|process|
> ==
> |beegfs-metassd|0.078933|01|
> |beegfs-metassd|0.051855|08|
> |beegfs-metassd|0.039459|16|
> ==
> |cephmetassd|0.022489|01|
> |cephmetassd|0.009789|08|
> |cephmetassd|0.002957|16|
> ==
> |cephcache|0.023966|01|
> |cephcache|0.021131|08|
> |cephcache|0.007782|16|
> ==
>
> IOPS Test:
>
> ==
> |fs|iops|process|
> ==
> |beegfs-metassd|0.740658|01|
> |beegfs-metassd|3.508879|08|
> |beegfs-metassd|6.514768|16|
> ==
> |cephmetassd|1.224963|01|
> |cephmetassd|3.762794|08|
> |cephmetassd|3.188686|16|
> ==
> |cephcache|1.829107|01|
> |cephcache|

[ceph-users] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread David C
Hi All

I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
cluster is primarily used for CephFS, mix of Filestore and Bluestore
OSDs, mons/osds collocated, running on CentOS 7 nodes

My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
EL8 on the nodes (probably Rocky) -> Upgrade to Pacific

I assume the cleanest way to update the node OS would be to drain the
node and remove from the cluster, install Rocky 8, add back to cluster
as effectively a new node

I have a relatively short maintenance window and was hoping to speed
up OS upgrade with the following approach on each node:

- back up ceph config/systemd files etc.
- set noout etc.
- deploy Rocky 8, being careful not to touch OSD block devices
- install Nautilus binaries (ensuring I use same version as pre OS upgrade)
- copy ceph config back over

In theory I could then start up the daemons and they wouldn't care
that we're now running on a different OS
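
For the flag/backup part, I'm thinking of something like this on each node
(a rough sketch, paths to adjust):

# before taking the node down
ceph osd set noout
ceph osd set norebalance

# keep a copy of the local config and bootstrap keyrings
tar czf /root/ceph-node-backup.tgz /etc/ceph /var/lib/ceph/bootstrap-*

# once the node is reinstalled, daemons are up and PGs are active+clean
ceph osd unset norebalance
ceph osd unset noout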

Does anyone see any issues with that approach? I plan to test on a dev
cluster anyway but would be grateful for any thoughts

Thanks,
David
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread David C
Hi Wolfpaw, thanks for the response

> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>

Good to hear you had success with the ELevate tool; I'd looked at that but it
seemed a bit risky. The tool supports Rocky so I may give it a look.

>
> - you will need to move those Filestore OSDs to BlueStore before hitting
> Pacific; it might even be part of the Nautilus upgrade. This takes some time
> if I remember correctly.
>

This one is surprising since in theory Pacific still supports Filestore,
there is at least one thread on the list where someone upgraded to Pacific
and is still running some Filestore OSDs - on the other hand, there's also
a recent thread where someone ran into problems and was forced to upgrade
to Bluestore - did you experience issues yourself or was this advice you
picked up? I do ultimately want to get all my OSDs on Bluestore but was
hoping to do that after the Ceph version upgrade.


> - You may need to upgrade monitors to RocksDB too.


Thanks, I wasn't aware of this  - I suppose I'll do that when I'm on
Nautilus


On Tue, Dec 6, 2022 at 3:22 PM Wolfpaw - Dale Corse 
wrote:

> We did this (over a longer timespan).. it worked ok.
>
> A couple things I’d add:
>
> - I'd upgrade to Nautilus on CentOS 7 before moving to EL8. We then used
> AlmaLinux ELevate to move from 7 to 8 without a reinstall. Rocky has a
> similar path I think.
>
> - you will need to move those Filestore OSDs to BlueStore before hitting
> Pacific; it might even be part of the Nautilus upgrade. This takes some time
> if I remember correctly.
>
> - You may need to upgrade monitors to RocksDB too.
>
> Sent from my iPhone
>
> > On Dec 6, 2022, at 7:59 AM, David C  wrote:
> >
> > Hi All
> >
> > I'm planning to upgrade a Luminous 12.2.10 cluster to Pacific 16.2.10,
> > cluster is primarily used for CephFS, mix of Filestore and Bluestore
> > OSDs, mons/osds collocated, running on CentOS 7 nodes
> >
> > My proposed upgrade path is: Upgrade to Nautilus 14.2.22 -> Upgrade to
> > EL8 on the nodes (probably Rocky) -> Upgrade to Pacific
> >
> > I assume the cleanest way to update the node OS would be to drain the
> > node and remove from the cluster, install Rocky 8, add back to cluster
> > as effectively a new node
> >
> > I have a relatively short maintenance window and was hoping to speed
> > up OS upgrade with the following approach on each node:
> >
> > - back up ceph config/systemd files etc.
> > - set noout etc.
> > - deploy Rocky 8, being careful not to touch OSD block devices
> > - install Nautilus binaries (ensuring I use same version as pre OS
> upgrade)
> > - copy ceph config back over
> >
> > In theory I could then start up the daemons and they wouldn't care
> > that we're now running on a different OS
> >
> > Does anyone see any issues with that approach? I plan to test on a dev
> > cluster anyway but would be grateful for any thoughts
> >
> > Thanks,
> > David
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: [SPAM] Ceph upgrade advice - Luminous to Pacific with OS upgrade

2022-12-06 Thread David C
>
> I don't think this is necessary. It _is_ necessary to convert all
> leveldb to rocksdb before upgrading to Pacific, on both mons and any
> filestore OSDs.


Thanks, Josh, I guess that explains why some people had issues with
Filestore OSDs post Pacific upgrade

On Tue, Dec 6, 2022 at 4:07 PM Josh Baergen 
wrote:

> > - you will need to move those Filestore OSDs to BlueStore before
> > hitting Pacific; it might even be part of the Nautilus upgrade. This takes
> > some time if I remember correctly.
>
> I don't think this is necessary. It _is_ necessary to convert all
> leveldb to rocksdb before upgrading to Pacific, on both mons and any
> filestore OSDs.
>
> Quincy will warn you about filestore OSDs, and Reef will no longer
> support filestore.
>
> Josh
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Impacts on doubling the size of pgs in a rbd pool?

2023-10-03 Thread David C.
 Hi,

Michel,

The pool already appears to have autoscaling enabled ("autoscale_mode on").
If you're worried (if, for example, the platform is having trouble handling
a large data shift), then you can set the mode to warn (like the
rjenkis pool).

If not, as Hervé says, the transition to 2048 pg will be smoother if it's
automatic.

To answer your questions:

1/ There's not much point in doing it before adding the OSDs. In any case,
there will be a significant but gradual movement of the data, and it's
unlikely you will see nearfull with the amount of data you've reported.

2/ The recommendation would be to leave the default settings (pg autoscale,
osd_max_backfills, recovery, ...). If there really is a concern, then leave
it at 1024 and set autoscale_mode to warn.
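
Concretely, the knobs on that pool would be (a sketch; pool name taken from
your ceph osd pool ls detail output):

# let the autoscaler drive the change (current behaviour)
ceph osd pool set rbd_backup_vms pg_autoscale_mode on

# or only be warned, and raise pg_num yourself once the new OSDs are in
ceph osd pool set rbd_backup_vms pg_autoscale_mode warn
ceph osd pool set rbd_backup_vms pg_num 2048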


Le mar. 3 oct. 2023 à 17:13, Michel Jouvin 
a écrit :

> Hi Herve,
>
> Why you don't use the automatic adjustment of the number of PGs. This
> makes life much easier and works well.
>
> Cheers,
>
> Michel
>
> Le 03/10/2023 à 17:06, Hervé Ballans a écrit :
> > Hi all,
> >
> > Sorry for the reminder, but does anyone have any advice on how to deal
> > with this?
> >
> > Many thanks!
> > Hervé
> >
> > Le 29/09/2023 à 11:34, Hervé Ballans a écrit :
> >> Hi all,
> >>
> >> I have a Ceph cluster on Quincy (17.2.6), with 3 pools (1 rbd and 1
> >> CephFS volume), each configured with 3 replicas.
> >>
> >> $ sudo ceph osd pool ls detail
> >> pool 7 'cephfs_data_home' replicated size 3 min_size 2 crush_rule 1
> >> object_hash rjenkins pg_num 512 pgp_num 512 autoscale_mode on
> >> last_change 6287147 lfor 0/5364613/5364611 flags hashpspool
> >> stripe_width 0 application cephfs
> >> pool 8 'cephfs_metadata_home' replicated size 3 min_size 2 crush_rule
> >> 3 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on
> >> last_change 641 lfor 0/641/639 flags hashpspool
> >> stripe_width 0 application cephfs
> >> pool 9 'rbd_backup_vms' replicated size 3 min_size 2 crush_rule 2
> >> object_hash rjenkins pg_num 1024 pgp_num 1024 autoscale_mode on
> >> last_change 6365131 lfor 0/211948/249421 flags
> >> hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> >> pool 10 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash
> >> rjenkins pg_num 1 pgp_num 1 autoscale_mode warn last_change 6365131
> >> flags hashpspool stripe_width 0 pg_num_min 1 application
> >> mgr,mgr_devicehealth
> >>
> >> $ sudo ceph df
> >> --- RAW STORAGE ---
> >> CLASS SIZEAVAIL USED  RAW USED  %RAW USED
> >> hdd306 TiB  186 TiB  119 TiB   119 TiB  39.00
> >> nvme   4.4 TiB  4.3 TiB  118 GiB   118 GiB   2.63
> >> TOTAL  310 TiB  191 TiB  119 TiB   119 TiB  38.49
> >>
> >> --- POOLS ---
> >> POOL  ID   PGS  STORED  OBJECTSUSED  %USED MAX AVAIL
> >> cephfs_data_home   7   512  12 TiB   28.86M  12 TiB 12.85 27 TiB
> >> cephfs_metadata_home   832  33 GiB3.63M  33 GiB 0.79 1.3 TiB
> >> rbd_backup_vms 9  1024  24 TiB6.42M  24 TiB 58.65 5.6 TiB
> >> .mgr  10 1  35 MiB9  35 MiB 0 12 TiB
> >>
> >> I am going to extend the rbd pool (rbd_backup_vms), currently used at
> >> 60%.
> >> This pool contains 60 disks, i.e. 20 disks by rack in the crushmap.
> >> This pool is used for storing VM disk images (available to a separate
> >> ProxmoxVE cluster)
> >>
> >> For this purpose, I am going to add 42 disks of the same size as
> >> those currently in the pool, i.e. 14 additional disks on each rack.
> >>
> >> Currently, this pool is configured with 1024 pgs.
> >> Before this operation, I would like to extend the number of pgs,
> >> let's say 2048 (i.e. double).
> >>
> >> I wonder about the overall impact of this change on the cluster. I
> >> guess that the heavy moves in the pgs will have a strong impact
> >> regarding the iops?
> >>
> >> I have two questions:
> >>
> >> 1) Is it useful to make this modification before adding the new OSDs?
> >> (I'm afraid of warnings about full or nearfull pgs if not)
> >>
> >> 2) are there any configuration recommendations in order to minimize
> >> these anticipated impacts?
> >>
> >> Thank you!
> >>
> >> Cheers,
> >> Hervé
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dump/Add users yaml/json

2024-12-03 Thread David C.
Hi Albert,

(open question, without judgment)
What is the purpose of importing users recurrently?
It seems to me that import is the complement of export, i.e. for restoring.
Isn't creating the user in Ceph and then exporting (possibly in JSON format) enough?

Le mar. 3 déc. 2024 à 13:29, Albert Shih  a écrit :

> Le 01/12/2024 à 13:13:35+0100, Joachim Kraftmayer a écrit
> Hi,
>
> >
> > I think json format is not supported to add or change users.
>
> S*it ;-) ;-)
>
> > But you could use:
> >
> >
> https://docs.ceph.com/en/reef/rados/operations/user-management/#importing-a-user
> >
> > and the -i is also available for ceph auth add and other commands.
> >
> >
> https://docs.ceph.com/en/reef/rados/operations/user-management/#adding-a-user
>
> Yeah... I already checked that. But that means I would need to edit an
> INI-style file, and I don't like that, especially when the purpose is to do
> it with a script.
>
> Anyway thanks.
>
> Regards
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> mar. 03 déc. 2024 13:27:23 CET
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Dump/Add users yaml/json

2024-12-04 Thread David C.
Hi,

In this case, the tool that adds the account should perform a caps check
(for security reasons) and probably use get-or-create / auth caps (rather
than import).

Le mer. 4 déc. 2024 à 10:42, Albert Shih  a écrit :

> Le 03/12/2024 à 18:27:57+0100, David C. a écrit
> Hi,
>
> >
> > (open question, without judgment)
> > What is the purpose of importing users recurrently ?
>
> No, the point is not to import users recurrently.
>
>
> > It seems to me that import is the complement of export, to restore.
> > Creating in ceph and exporting (possibly) in json format is not enough ?
>
> We use Puppet to manage all our infrastructure. In our team each person has
> his own speciality, but everyone has a set of «standard» procedures to
> answer «standard» tickets (or level 1 tickets if you prefer).
>
> So with NFS we use Puppet to export to clients; anyone in our team can add a
> client through Puppet. They just have to edit a «standard» file in Puppet and
> Puppet goes off and does its thing to configure the NFS server and the client.
> It's doable even for someone who knows nothing about how NFS works.
>
> With CephFS I would like to do the same thing. I was able to do that for the
> «first» export by using some ceph commands, but that only works the first
> time (linked to the way I do it with Puppet).
>
> So the first time, to configure a CephFS export for a client, a member of
> the team currently just has to add something like
>
> client_hostname:
>   volume_name:
> right: rw
> subvolume: erasure
> mountpoint: /mountpoint
>
> The Puppet module creates/installs/configures everything. My colleagues
> don't need to learn any ceph commands.
>
> But...only the first time...something like
> client_hostname:
>   volume_name1:
> right: rw
> subvolume: erasure
> mountpoint: /mountpoint
>   volume_name2:
> right: rw
> subvolume: erasure
> mountpoint: /mountpoint2
>
> would not work.
>
> To be more flexible I need to be able to launch ceph commands through Ruby,
> and it's much easier to do that with YAML/JSON because it's native to
> Puppet; there's no need to add a library to parse INI files. It's also easier
> to check whether something already exists (so as not to redo something that's
> already there).
>
> Regards.
>
> JAS
> --
> Albert SHIH 🦫 🐸
> Observatoire de Paris
> France
> Heure locale/Local time:
> mer. 04 déc. 2024 10:27:57 CET
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: HEALTH_ERR: 1 MDSs report damaged metadata - damage_type=dentry

2025-04-18 Thread David C.
I also tend to think that the disk has nothing to do with the problem.

My reading is that the inode associated with the dentry is missing.
Can anyone correct me?

Christophe informed me that the directories were emptied before the
incident.

I don't understand why scrubbing doesn't repair the metadata.
Perhaps because the directory is empty?
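
For reference, this is the sequence I would use to inspect the damage table and retry the repair (a sketch only; the fs name and rank follow the earlier messages in this thread, the path follows the testdir1/testdir2 directories seen in the MDS log, and the exact output fields may differ between releases):

    # list the recorded damage entries (id, damage type, inode, dentry name)
    ceph tell mds.cfs_irods_test:0 damage ls

    # re-run a repairing scrub on one of the affected directories and watch it
    ceph tell mds.cfs_irods_test:0 scrub start /testdir1 recursive,repair,force
    ceph tell mds.cfs_irods_test:0 scrub status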

Le jeu. 17 avr. 2025 à 19:06, Anthony D'Atri  a
écrit :

> HPE rebadges drives from manufacturers.  A quick search supports the idea
> that this SKU is fulfilled at least partly by Kioxia, so not likely a PLP
> issue.
>
>
> > On Apr 17, 2025, at 11:39 AM, Christophe DIARRA <
> christophe.dia...@idris.fr> wrote:
> >
> > Hello David,
> >
> > The SSD model is VO007680JWZJL.
> >
> > I will delay the 'ceph tell mds.cfs_irods_test:0 damage rm 241447932'
> for the moment. If no other solution is found I will be obliged to use
> this command.
> >
> > I found 'dentry' in the logs when the cephfs cluster started:
> >
> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.cfs_irods_test.mon-02.awuygq
> Updating MDS map to version 15613 from mon.2
> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map i am
> now mds.0.15612
> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 handle_mds_map state
> change up:starting --> up:active
> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.15612 active_start
> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir2)
> loaded already *corrupt dentry*: [dentry #0x1/testdir2 [2,head] rep@0.0
> NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8280]
> >> Apr 16 17:29:53 mon-02 ceph-mds[2367]: mds.0.cache.den(0x1 testdir1)
> loaded already *corrupt dentry*: [dentry #0x1/testdir1 [2,head] rep@0.0
> NULL (dversion lock) pv=0 v=4442 ino=(nil) state=0 0x5617e18c8500]
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1
> filesystem is offline (MDS_ALL_DOWN)
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check failed: 1
> filesystem is online with fewer MDS than max_mds (MDS_UP_LESS_THAN_MAX)
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: from='client.?
> xx.xx.xx.8:0/3820885518' entity='client.admin' cmd='[{"prefix": "fs set",
> "fs_name": "cfs_irods_test", "var": "down", "val":
> >> "false"}]': finished
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon
> mds.cfs_irods_test.mon-02.awuygq assigned to filesystem cfs_irods_test as
> rank 0 (now has 1 ranks)
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared:
> MDS_ALL_DOWN (was: 1 filesystem is offline)
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: Health check cleared:
> MDS_UP_LESS_THAN_MAX (was: 1 filesystem is online with fewer MDS than
> max_mds)
> >> Apr 16 17:29:53 mon-02 ceph-mon[2288]: daemon
> mds.cfs_irods_test.mon-02.awuygq is now active in filesystem cfs_irods_test
> as rank 0
> >> Apr 16 17:29:54 mon-02 ceph-mgr[2444]: log_channel(cluster) log [DBG] :
> pgmap v1721: 4353 pgs: 4346 active+clean, 7 active+clean+scrubbing+deep;
> 3.9 TiB data, 417 TiB used, 6.4 PiB / 6.8 PiB avail; 1.4 KiB/s rd, 1 op/s
> >>
> > If you need more extract from the log file please let me know.
> >
> > Thanks for your help,
> >
> > Christophe
> >
> >
> >
> > On 17/04/2025 13:39, David C. wrote:
> >> If I'm not mistaken, this is a fairly rare situation.
> >>
> >> The fact that it's the result of a power outage makes me think of a bad
> SSD (like "S... Pro").
> >>
> >> Does a grep of the dentry id in the MDS logs return anything?
> >> Maybe some interesting information around this grep
> >>
> >> In the heat of the moment, I have no other idea than to delete the
> dentry.
> >>
> >> ceph tell mds.cfs_irods_test:0 damage rm 241447932
> >>
> >> However, in production, this results in the content (of dir
> /testdir[12]) being abandoned.
> >>
> >>
> >> Le jeu. 17 avr. 2025 à 12:44, Christophe DIARRA <
> christophe.dia...@idris.fr> a écrit :
> >>
> >>Hello David,
> >>
> >>Thank you for the tip about the scrubbing. I have tried the
> >>commands found in the documentation but it seems to have no effect:
> >>
> >>[root@mon-01 ~]#*ceph tell mds.cfs_irods_test:0 scrub start /
> recursive,repair,force*
> >>2025-04-17T12:07:20.958+0200 7fd4157fa640  0 client.8630

[ceph-users] Re: HEALTH_ERR: 1 MDSs report damaged metadata - damage_type=dentry

2025-04-17 Thread David C.
If I'm not mistaken, this is a fairly rare situation.

The fact that it's the result of a power outage makes me think of a bad SSD
(like "S... Pro").

Does a grep for the dentry id in the MDS logs return anything?
There may be some interesting information around that grep.
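
Something along these lines (a sketch; the log path assumes a cephadm-style layout and should be adapted to the actual deployment):

    # look in the MDS logs for the dentry names reported by `damage ls`
    grep -nE 'testdir1|testdir2|corrupt dentry' /var/log/ceph/*/ceph-mds.*.log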

In the heat of the moment, I have no other idea than to delete the dentry.

ceph tell mds.cfs_irods_test:0 damage rm 241447932

However, in production, this results in the content (of dir /testdir[12])
being abandoned.
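
To be explicit about where that id comes from (a sketch; the exact JSON fields of the damage table may vary between releases), I would dump the damage list first and only remove the entry once it clearly matches the empty test directory:

    # inspect the damage table: each entry carries an id, a damage_type and the dentry it refers to
    ceph tell mds.cfs_irods_test:0 damage ls

    # only then drop the matching entry
    ceph tell mds.cfs_irods_test:0 damage rm 241447932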


Le jeu. 17 avr. 2025 à 12:44, Christophe DIARRA 
a écrit :

> Hello David,
>
> Thank you for the tip about the scrubbing. I have tried the commands found
> in the documentation but it seems to have no effect:
>
> [root@mon-01 ~]# *ceph tell mds.cfs_irods_test:0 scrub start / 
> recursive,repair,force*
> 2025-04-17T12:07:20.958+0200 7fd4157fa640  0 client.86301 ms_handle_reset on 
> v2:130.84.80.10:6800/3218663047
> 2025-04-17T12:07:20.979+0200 7fd4157fa640  0 client.86307 ms_handle_reset on 
> v2:130.84.80.10:6800/3218663047
> {
> "return_code": 0,
> "scrub_tag": "733b1c6d-a418-4c83-bc8e-b28b556e970c",
> "mode": "asynchronous"
> }
>
> [root@mon-01 ~]#* ceph tell mds.cfs_irods_test:0 scrub status*
> 2025-04-17T12:07:30.734+0200 7f26cdffb640  0 client.86319 ms_handle_reset on 
> v2:130.84.80.10:6800/3218663047
> 2025-04-17T12:07:30.753+0200 7f26cdffb640  0 client.86325 ms_handle_reset on 
> v2:130.84.80.10:6800/3218663047
> {
> "status": "no active scrubs running",
> "scrubs": {}
> }
> [root@mon-01 ~]# ceph -s
>   cluster:
> id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
> *health: HEALTH_ERR
> 1 MDSs report damaged metadata*
>
>   services:
> mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 19h)
> mgr: mon-02.mqaubn(active, since 19h), standbys: mon-03.gvywio, 
> mon-01.xhxqdi
> mds: 1/1 daemons up, 2 standby
> osd: 368 osds: 368 up (since 18h), 368 in (since 3w)
>
>   data:
> volumes: 1/1 healthy
> pools:   10 pools, 4353 pgs
> objects: 1.25M objects, 3.9 TiB
> usage:   417 TiB used, 6.4 PiB / 6.8 PiB avail
> pgs: 4353 active+clean
>
> Did I miss something ?
>
> The server didn't crash. I don't understand what you mean by "there
> may be a design flaw in the infrastructure (insecure cache, for example)".
> How do we know whether we have a design problem? What should we check?
>
> Best regards,
>
> Christophe
> On 17/04/2025 11:07, David C. wrote:
>
> Hello Christophe,
>
> Check the file system scrubbing procedure =>
> https://docs.ceph.com/en/latest/cephfs/scrub/ But this doesn't guarantee
> data recovery.
>
> Did the cluster crash?
> Ceph should be able to handle it; there may be a design flaw in the
> infrastructure (insecure cache, for example).
>
> David
>
> Le jeu. 17 avr. 2025 à 10:44, Christophe DIARRA <
> christophe.dia...@idris.fr> a écrit :
>
>> Hello,
>>
>> After an electrical maintenance I restarted our ceph cluster but it
>> remains in an unhealthy state: HEALTH_ERR 1 MDSs report damaged metadata.
>>
>> How to repair this damaged metadata ?
>>
>> To bring down the cephfs cluster I unmounted the fs from the client
>> first and then did: ceph fs set cfs_irods_test down true
>>
>> To bring up the cephfs cluster I did: ceph fs set cfs_irods_test down
>> false
>>
> >> Fortunately the cfs_irods_test fs is almost empty and is a fs for
> >> tests. The Ceph cluster is not in production yet.
>>
>> Following is the current status:
>>
>> [root@mon-01 ~]# ceph health detail
>> HEALTH_ERR 1 MDSs report damaged metadata
>> *[ERR] MDS_DAMAGE: 1 MDSs report damaged metadata
>>  mds.cfs_irods_test.mon-03.vlmeuz(mds.0): Metadata damage detected*
>>
>> [root@mon-01 ~]# ceph -s
>>cluster:
>>  id: b87276e0-1d92-11ef-a9d6-507c6f66ae2e
>>  health: HEALTH_ERR
>>  1 MDSs report damaged metadata
>>
>>services:
>>  mon: 3 daemons, quorum mon-01,mon-03,mon-02 (age 17h)
>>  mgr: mon-02.mqaubn(active, since 17h), standbys: mon-03.gvywio,
>> mon-01.xhxqdi
>>  mds: 1/1 daemons up, 2 standby
>>  osd: 368 osds: 368 up (since 17h), 368 in (since 3w)
>>
>>data:
>>  volumes: 1/1 healthy
>>  pools:   10 pools, 4353 pgs
>>  objects: 1.25M objects, 3.9 TiB