[ceph-users] switch restart facilitating cluster/client network.

2022-01-25 Thread Marc


If the switch needs an update and has to be restarted (expected downtime: 2 minutes),
can I just leave the cluster as it is, because Ceph will handle this correctly?
Or should I e.g. put some VMs I am running in pause mode, or even stop them?
What happens to the monitors? Can they handle this, or would it be better to switch
from 3 monitors to 1?








___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] How to remove stuck daemon?

2022-01-25 Thread Fyodor Ustinov
Hi!

I have a Ceph cluster running version 16.2.7 with this error:

root@s-26-9-19-mon-m1:~# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.91 on s-26-8-2-1 is in error state

But I don't have that osd anymore. I deleted it.

root@s-26-9-19-mon-m1:~# ceph orch ps|grep s-26-8-2-1
crash.s-26-8-2-1 s-26-8-2-1 running (2d)
 1h ago   3M9651k-  16.2.7   cc266d6139f4  2ed049f74b66  
node-exporter.s-26-8-2-1 s-26-8-2-1*:9100   running (2d)
 1h ago   3M24.3M-  0.18.1   e5a616e4b9cf  817cc5370e7e  
osd.90   s-26-8-2-1 running (2d)
 1h ago   3M25.6G4096M  16.2.7   cc266d6139f4  beb2ea3efb3b  

root@s-26-8-2-1:~# cephadm ls|grep osd
"name": "osd.90",
"systemd_unit": "ceph-1ef45b26-dbac-11eb-a357-616c355f48cb@osd.90",
"service_name": "osd",

Can you please tell me how to reset this error message?

WBR,
Fyodor
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Frédéric Nass

Hello,

I've just heard about storage classes and imagined how we could use them 
to migrate all S3 objects within a placement pool from an ec pool to a 
replicated pool (or vice-versa) for data resiliency reasons, not to save 
space.


It looks possible since:

1. data pools are associated with storage classes in a placement pool
2. bucket lifecycle policies can take care of moving data from one storage 
class to another
3. we can set a user's default_storage_class to have all new objects 
written by this user land in the new storage class / data pool.
4. after all objects have been transitioned to the new storage class, we 
can delete the old storage class, rename the new storage class to 
STANDARD so that it's used by default, and unset any user's 
default_storage_class setting.
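
In case it's useful for the discussion, here is roughly what I have in mind
for steps 1-3 (just a sketch: the 'default-placement' placement id, the
'STANDARD_REPLICATED' storage class name, and the pool/bucket/endpoint names
are all made up and would need adjusting):

radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id default-placement --storage-class STANDARD_REPLICATED
radosgw-admin zone placement add --rgw-zone default \
    --placement-id default-placement --storage-class STANDARD_REPLICATED \
    --data-pool default.rgw.buckets.data-replicated
# (plus 'radosgw-admin period update --commit' when running with a realm)

cat > transition.json <<'EOF'
{ "Rules": [ { "ID": "move-to-replicated", "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [ { "Days": 1, "StorageClass": "STANDARD_REPLICATED" } ] } ] }
EOF
aws --endpoint-url http://my-rgw:8080 s3api put-bucket-lifecycle-configuration \
    --bucket my-bucket --lifecycle-configuration file://transition.json

For step 3, I think the user default can be set with 'radosgw-admin user
modify --uid <uid> --placement-id default-placement --storage-class
STANDARD_REPLICATED', but that part is to be confirmed.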


Would that work?

Anyone tried this with success yet?

Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: switch restart facilitating cluster/client network.

2022-01-25 Thread Janne Johansson
If you can stop the VMs, it will help. Even if the cluster recovers
quickly, VMs take great offense if a write does not finish within
120s, and many will put their filesystems in read-only mode if writes are
delayed for that long. So if there is a 120s outage of I/O, the VMs will
be stuck/useless anyhow, and you might as well stop them before and
restart them after the outage.

On Tue, 25 Jan 2022 at 10:27, Marc  wrote:
>
>
> If the switch needs an update and needs to be restarted (expected 2 minutes). 
> Can I just leave the cluster as it is, because ceph will handle this 
> correctly? Or should I eg. put some vm's I am running in pause mode, or even 
> stop them. What happens to the monitors? Can they handle this, or maybe 
> better to switch from 3 to 1 one?
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

2022-01-25 Thread Dan van der Ster
Hi Benjamin,

Apologies that I can't help with the bluestore issue.

But that huge 100GB per-OSD memory consumption could be related to similar
reports linked here: https://tracker.ceph.com/issues/53729

Does your cluster have the pglog_hardlimit set?

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

Do you have PGs with really long pglogs?

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail



-- Dan

On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
 wrote:
>
> I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> same stack trace (see pasted output below).  The cluster is in a very bad
> state with this many OSDs down, unsurprisingly.
>
> The day before this problem showed up, the k8s cluster was under extreme
> memory pressure and a lot of pods were OOM killed, including some of the
> Ceph OSDs, but after the memory pressure abated everything seemed to
> stabilize for about a day.
>
> Then we attempted to set a 4gb memory limit on the OSD pods, because they
> had been using upwards of 100gb of ram(!) per OSD after about a month of
> uptime, and this was a contributing factor in the cluster-wide OOM
> situation.  Everything seemed fine for a few minutes after Rook rolled out
> the memory limit, but then OSDs gradually started to crash, a few at a
> time, up to about 30 of them.  At this point I reverted the memory limit,
> but I don't think the OSDs were hitting their memory limits at all.  In an
> attempt to stabilize the cluster, we eventually the Rook operator and set
> the osd norebalance, nobackfill, noout, and norecover flags, but at this
> point there were 46 OSDs down and pools were hitting BackFillFull.
>
> This is a Rook-ceph deployment on bare-metal kubernetes cluster of 12
> nodes.  Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5
> BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to be
> fine with a 4gb memory target, right?).  The crash we're seeing looks very
> much like the one in this bug report: https://tracker.ceph.com/issues/52220
>
> I don't know how to proceed from here, so any advice would be very much
> appreciated.
>
> Ceph version: 16.2.6
> Rook version: 1.7.6
> Kubernetes version: 1.21.5
> Kernel version: 5.4.156-1.el7.elrepo.x86_64
> Distro: CentOS 7.9
>
> I've also attached the full log output from one of the crashing OSDs, in
> case that is of any use.
>
> begin stack trace paste
> debug -1> 2022-01-24T22:09:09.405+ 7ff8b4315700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> In function 'void ECUtil::HashInfo::append(uint64_t, std::map ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> 2022-01-24T22:09:09.398961+
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
>
>  ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x564f88db554c]
>  2: ceph-osd(+0x56a766) [0x564f88db5766]
>  3: (ECUtil::HashInfo::append(unsigned long, std::map ceph::buffer::v15_2_0::list, std::less, std::allocator const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
>  4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
> std::shared_ptr&, std::set,
> std::allocator > const&, unsigned long, ceph::buffer::v15_2_0::list,
> unsigned int, std::shared_ptr, interval_map long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map ceph::os::Transaction, std::less,
> std::allocator > >*,
> DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
>  5: ceph-osd(+0xa5a611) [0x564f892a5611]
>  6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> std::shared_ptr&, pg_t, ECUtil::stripe_info_t
> const&, std::map ceph::buffer::v15_2_0::list, bl_split_merge>, std::less,
> std::allocator ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> std::vector >&,
> std::map ceph::buffer::v15_2_0::list, bl_split_merge>, std::less,
> std::allocator ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map ceph::os::Transaction, std::less,
> std::allocator > >*,
> std::set, std::allocator >*,
> std::set, std::allocator >*,
> DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
>  7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
>  8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
>  9: (CallClientContexts::finish(std::pair ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
>  10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> RecoveryMessages*)+0x8f) [0x564f8926dfaf]
>  11: (ECBackend::handle_sub_read_reply(pg_shard

[ceph-users] Re: CephFS keyrings for K8s

2022-01-25 Thread Frédéric Nass

Hello Michal,

With cephfs and a single filesystem shared across multiple k8s clusters, 
you should use subvolumegroups to limit data exposure. You'll find an 
example of how to use subvolumegroups in the ceph-csi-cephfs helm chart 
[1]. Essentially you just have to set the subvolumeGroup to whatever you 
like and then create the associated cephfs keyring with the following caps:


ceph auth get-or-create client.cephfs.k8s-cluster-1.admin mon "allow r" 
osd "allow rw tag cephfs *=*" mds "allow rw 
path=/volumes/csi-k8s-cluster-1" mgr "allow rw" -o 
/etc/ceph/client.cephfs.k8s-cluster-1.admin.keyring


    caps: [mds] allow rw path=/volumes/csi-k8s-cluster-1
    caps: [mgr] allow rw
    caps: [mon] allow r
    caps: [osd] allow rw tag cephfs *=*

The subvolume group will be created by ceph-csi-cephfs if I remember 
correctly but you can also take care of this on the ceph side with 'ceph 
fs subvolumegroup create cephfs csi-k8s-cluster-1'.
PVs will then be created as subvolumes in this subvolumegroup. To list 
them, use 'ceph fs subvolume ls cephfs --group_name=csi-k8s-cluster-1'.


To achieve the same goal with RBD images, you should use rados 
namespaces. The current helm chart [2] seems to lack information about 
the radosNamespace setting, but it works, provided you set 
it as below:


csiConfig:
  - clusterID: "<cluster fsid>"
    monitors:
      - "<mon1 address>"
      - "<mon2 address>"
    radosNamespace: "k8s-cluster-1"
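
(Note that the rados namespace itself has to exist on the Ceph side first;
if I remember correctly something like this creates it, adjust the pool name:
rbd namespace create <rbd pool>/k8s-cluster-1
rbd namespace ls <rbd pool>
)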

ceph auth get-or-create client.rbd.name.admin mon "profile rbd" osd 
"allow rwx pool <rbd pool> object_prefix rbd_info, allow rwx pool 
<rbd pool> namespace k8s-cluster-1" mgr "profile rbd 
pool=<rbd pool> namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.admin.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=<rbd pool> namespace=k8s-cluster-1


ceph auth get-or-create client.rbd.name.user mon "profile rbd" osd 
"allow class-read object_prefix rbd_children, allow rwx 
pool=<rbd pool> namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.user.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=<rbd pool> namespace=k8s-cluster-1


Capabilities required for ceph-csi-cephfs and ceph-csi-rbd are described 
here [3].


This should get you started. Let me know if you see any clever/safer 
caps to use.


Regards,

Frédéric.

[1] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-cephfs/values.yaml#L20
[2] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-rbd/values.yaml#L20

[3] https://github.com/ceph/ceph-csi/blob/devel/docs/capabilities.md

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

On 20/01/2022 at 09:26, Michal Strnad wrote:

Hi,

We are using CephFS in our Kubernetes clusters and now we are trying 
to optimize permissions/caps in keyrings. Every guide we found 
contains something like "Create the file system by specifying the 
desired settings for the metadata pool, data pool and admin keyring 
with access to the entire file system ..." Is there a better way where we 
don't need the admin key, but a restricted key only? What are you using in 
your environments?


Multiple file systems aren't an option for us.

Thanks for your help

Regards,
Michal Strnad


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS keyrings for K8s

2022-01-25 Thread Frédéric Nass


On 25/01/2022 at 12:09, Frédéric Nass wrote:


Hello Michal,

With cephfs and a single filesystem shared across multiple k8s 
clusters, you should use subvolumegroups to limit data exposure. You'll 
find an example of how to use subvolumegroups in the ceph-csi-cephfs 
helm chart [1]. Essentially you just have to set the subvolumeGroup to 
whatever you like and then create the associated cephfs keyring with 
the following caps:


ceph auth get-or-create client.cephfs.k8s-cluster-1.admin mon "allow 
r" osd "allow rw tag cephfs *=*" mds "allow rw 
path=/volumes/csi-k8s-cluster-1" mgr "allow rw" -o 
/etc/ceph/client.cephfs.k8s-cluster-1.admin.keyring


    caps: [mds] allow rw path=/volumes/csi-k8s-cluster-1
    caps: [mgr] allow rw
    caps: [mon] allow r
    caps: [osd] allow rw tag cephfs *=*

The subvolume group will be created by ceph-csi-cephfs if I remember 
correctly but you can also take care of this on the ceph side with 
'ceph fs subvolumegroup create cephfs csi-k8s-cluster-1'.
PVs will then be created as subvolumes in this subvolumegroup. To list 
them, use 'ceph fs subvolume ls cephfs --group_name=csi-k8s-cluster-1'.


To achieve the same goal with RBD images, you should use rados 
namespaces. The current helm chart [2] seems to lack information about 
the radosNamespace setting, but it works, provided you 
set it as below:


csiConfig:
  - clusterID: "<cluster fsid>"
    monitors:
      - "<mon1 address>"
      - "<mon2 address>"
    radosNamespace: "k8s-cluster-1"

ceph auth get-or-create client.rbd.name.admin mon "profile rbd" osd 
"allow rwx pool <rbd pool> object_prefix rbd_info, allow rwx pool 
<rbd pool> namespace k8s-cluster-1" mgr "profile rbd 
pool=<rbd pool> namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.admin.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=<rbd pool> namespace=k8s-cluster-1


Sorry, the admin caps should read:

    caps: [mgr] profile rbd pool=<rbd pool> namespace=k8s-cluster-1
    caps: [mon] profile rbd
    caps: [osd] allow rwx pool <rbd pool> object_prefix rbd_info, 
allow rwx pool <rbd pool> namespace k8s-cluster-1


Regards,

Frédéric.



ceph auth get-or-create client.rbd.name.user mon "profile rbd" osd 
"allow class-read object_prefix rbd_children, allow rwx 
pool=<rbd pool> namespace=k8s-cluster-1" -o 
/etc/ceph/client.rbd.name.user.keyring


    caps: [mon] profile rbd
    caps: [osd] allow class-read object_prefix rbd_children, allow rwx 
pool=<rbd pool> namespace=k8s-cluster-1


Capabilities required for ceph-csi-cephfs and ceph-csi-rbd are 
described here [3].


This should get you started. Let me know if you see any clever/safer 
caps to use.


Regards,

Frédéric.

[1] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-cephfs/values.yaml#L20
[2] 
https://github.com/ceph/ceph-csi/blob/devel/charts/ceph-csi-rbd/values.yaml#L20

[3] https://github.com/ceph/ceph-csi/blob/devel/docs/capabilities.md


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: switch restart facilitating cluster/client network.

2022-01-25 Thread Tyler Stachecki
I would still set noout on relevant parts of the cluster in case something
goes south and it does take longer than 2 minutes. Otherwise OSDs will
start outing themselves after 10 minutes or so by default and then you have
a lot of churn going on.
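
Something along these lines around the maintenance window (a sketch; on
recent releases you can also scope it to the affected hosts with
'ceph osd set-group noout <host> [<host>..]' instead of cluster-wide):

ceph osd set noout
... switch update / reboot ...
ceph osd unset noout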

The monitors will be fine unless you lose quorum, but even so
they'll just recover once the switch comes back. You just won't be able to
make changes to the cluster if you lose mon quorum, nor will the OSDs start
recovering etc. until that occurs.

Depending on which version of Ceph/libvirt/etc. you are running, I have
seen issues with older releases of the same where a handful of VMs get
indefinitely stuck with really high I/O wait afterwards and needed to be
manually rebooted on occasion when doing something like this.

As another user mentioned, the kernel's hung-task watchdog kicks in after
120 seconds by default, so you'll see lots of stack traces in the VMs due to
processes blocked on I/O if the reboot and re-peering doesn't all happen
within those two minutes.
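
(For reference, that 120s figure is the guest kernel's hung-task watchdog;
assuming a Linux guest you can check or temporarily raise it with sysctl,
e.g. 'sysctl kernel.hung_task_timeout_secs' and
'sysctl -w kernel.hung_task_timeout_secs=300', though raising it is a
workaround rather than a fix.)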

If you can afford to shut down all the VMs in the cluster, it might be for
the best as they'll be losing I/O...

On Tue, Jan 25, 2022, 4:27 AM Marc  wrote:

>
> If the switch needs an update and needs to be restarted (expected 2
> minutes). Can I just leave the cluster as it is, because ceph will handle
> this correctly? Or should I eg. put some vm's I am running in pause mode,
> or even stop them. What happens to the monitors? Can they handle this, or
> maybe better to switch from 3 to 1 one?
>
>
>
>
>
>
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Using s3website with ceph orch?

2022-01-25 Thread Manuel Holtgrewe
Thanks,

I had another review of the configuration and it appears that the
configuration *is* properly propagated to the daemon (also visible in
my second link).
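
(For the record, the check was roughly 'ceph config dump | grep rgw_dns' on
the admin node, plus a 'config show' against the daemon's admin socket from
inside the radosgw container, as in my second link.)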

I traced down my issues further and it looks like I have first tripped
over the following issue again...

https://tracker.ceph.com/issues/52826

After changing the deployment host spec I now apparently crash the
radosgw server through ceph-s3-website-ext.example.com as described
here.

https://tracker.ceph.com/issues/54012

I guess I'm doing something wrong and would be happy to learn what it
is and how I can do it right.

On Mon, Jan 24, 2022 at 6:15 PM Sebastian Wagner  wrote:
>
> can you do a config dump? I'm curious what is actually set by cephadm
>
> Am 24.01.22 um 17:35 schrieb Manuel Holtgrewe:
> > Dear all,
> >
> > I'm trying to configure the s3website with a site managed by
> > ceph-orch. I'm trying to follow [1] in spirit. I have configured two
> > ingress.rgw services "ingress.rgw.ext" and "ingress.rgw.ext-website"
> > and point to them via ceph-s3-ext.example.com and
> > ceph-s3-website-ext.example.com in DNS. I'm attempting to pass the
> > configuration from below.
> >
> > However, looking at the configuration of the daemons via the admin
> > socket tells me that the website-related configuration is not applied.
> >
> > Is this configuration supported? Would there be a workaround?
> >
> > Best wishes,
> > Manuel
> >
> > # cat  rgw.ext.yml
> > service_type: rgw
> > service_id: ext
> > service_name: rgw.ext
> > placement:
> >   hosts:
> > - osd-1
> > # count_per_host: 1
> > # label: rgw
> > spec:
> >   rgw_frontend_port: 8100
> >   rgw_realm: ext
> >   rgw_zone: ext-default-primary
> >   config:
> > rgw_dns_name: ceph-s3-ext.example.com
> > rgw_dns_s3website_name: ceph-s3-website-ext.example.com
> > rgw_enable_apis: s3, swift, swift_auth, admin
> > rgw_enable_static_website: true
> > rgw_expose_bucket: true
> > rgw_resolve_cname: true
> > # cat rgw.ext-website.yml
> > service_type: rgw
> > service_id: ext-website
> > service_name: rgw.ext-website
> > placement:
> >   hosts:
> > - osd-1
> > # count_per_host: 1
> > # label: rgw
> > spec:
> >   rgw_frontend_port: 8200
> >   rgw_realm: ext
> >   rgw_zone: ext-default-primary
> >   config:
> > rgw_dns_name: ceph-s3-ext.example.com
> > rgw_dns_s3website_name: ceph-s3-website-ext.example.com
> > rgw_enable_apis: s3website
> > rgw_enable_static_website: true
> > rgw_resolve_cname: true
> >
> >
> > [1] https://gist.github.com/robbat2/ec0a66eed28e5f0e1ef7018e9c77910c
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Casey Bodley
On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
 wrote:
>
> Hello,
>
> I've just heard about storage classes and imagined how we could use them
> to migrate all S3 objects within a placement pool from an ec pool to a
> replicated pool (or vice-versa) for data resiliency reasons, not to save
> space.
>
> It looks possible since ;
>
> 1. data pools are associated to storage classes in a placement pool
> 2. bucket lifecycle policies can take care of moving data from a storage
> class to another
> 3. we can set a user's default_storage_class to have all new objects
> written by this user reach the new storage class / data pool.
> 4. after all objects have been transitioned to the new storage class, we
> can delete the old storage class, rename the new storage class to
> STANDARD so that it's been used by default and unset any user's
> default_storage_class setting.

i don't think renaming the storage class will work the way you're
hoping. this storage class string is stored in each object and used to
locate its data, so renaming it could render the transitioned objects
unreadable

>
> Would that work?
>
> Anyone tried this with success yet?
>
> Best regards,
>
> Frédéric.
>
> --
> Cordialement,
>
> Frédéric Nass
> Direction du Numérique
> Sous-direction Infrastructures et Services
>
> Tél : 03.72.74.11.35
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] January Ceph Science Virtual User Group

2022-01-25 Thread Kevin Hrpcek

Hey all,

Sorry for the late notice. We will be having a Ceph science/research/big 
cluster call on Wednesday January 26th. If anyone wants to discuss 
something specific they can add it to the pad linked below. If you have 
questions or comments you can contact me.


This is an informal open call of community members mostly from 
hpc/htc/research environments where we discuss whatever is on our minds 
regarding ceph. Updates, outages, features, maintenance, etc...there is 
no set presenter but I do attempt to keep the conversation lively.


https://pad.ceph.com/p/Ceph_Science_User_Group_20220126 



We try to keep it to an hour or less.

Ceph calendar event details:
January 26, 2022
15:00 UTC
4pm Central European
9am Central US

Description: Main pad for discussions: 
https://pad.ceph.com/p/Ceph_Science_User_Group_Index

Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone: 
https://bluejeans.com/908675367?src=calendarLink

To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
    1.) Dial: 199.48.152.152 or bjn.vc
    2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
    1.) Dial one of the following numbers: 408-915-6466 (US)
    See all numbers: https://www.redhat.com/en/conference-numbers
    2.) Enter Meeting ID: 908675367
    3.) Press #
Want to test your video connection? https://bluejeans.com/111


Kevin

--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS/TROPICS
Space Science & Engineering Center
University of Wisconsin-Madison

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Monitoring ceph cluster

2022-01-25 Thread Michel Niyoyita
Hello team,

I would like to monitor my ceph cluster using one of the
monitoring tools; does someone have advice on that?

Michel
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

2022-01-25 Thread Dan van der Ster
On Tue, Jan 25, 2022 at 4:07 PM Frank Schilder  wrote:
>
> Hi Dan,
>
> in several threads I have now seen statements like "Does your cluster have 
> the pglog_hardlimit set?". In this context, I would be grateful if you could 
> shed some light on the following:
>
> 1) How do I check that?
>
> There is no equivalent "osd get pglog_hardlimit".

I showed how to query for it:

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

>
> 2) What is the recommendation?

Since pacific it should be on by default, but I haven't had any user
confirm this fact.
(On our clusters we have enabled it manually when it was added to nautilus).

>
> In the ceph documentation, the only occurrence of the term pglog_hardlimit 
> are release notes for luminous and mimic, stating (mimic)
>
> > A flag called pglog_hardlimit has been introduced, which is off by default. 
> > Enabling this flag will limit the
> > length of the pg log. In order to enable that, the flag must be set by 
> > running ceph osd set pglog_hardlimit
> > after completely upgrading to 13.2.2. Once the cluster has this flag set, 
> > the length of the pg log will be
> > capped by a hard limit. Once set, this flag must not be unset anymore. In 
> > luminous, this feature was
> > introduced in 12.2.11. Users who are running 12.2.11, and want to continue 
> > to use this feature, should
> > upgrade to 13.2.5 or later.
>
> How do I know if I want to use this feature? I would need a bit of 
> information about pros and cons. Or should one have this enabled in any case? 
> Would be great if you could provide some insight here.

Normally a pg log with even 1 entries consumes just a couple
hundred MBs of memory. (See the osd_pglog mempool).
The pg log length can be queried like I showed earlier:

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail

(those are the LOG columns in the pg output).

In the past I've seen pg logs with millions of entries. Those are
surely a root cause for huge memory usage, especially at OSD boot
time.
Such pglogs would need to be trimmed, e.g. with the
ceph-objectstore-tool recipes that have been shared around on the
list.
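
For completeness, that recipe looks roughly like this (a sketch only; the
OSD has to be stopped first, the osd path / pgid adjusted, and of course
ceph-objectstore-tool handled with care):

systemctl stop ceph-osd@<id>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    --pgid <pgid> --op trim-pg-log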
The pglog_hardlimit is meant to limit the growth of the PG log.

On the other hand: it is clear that even with reasonably sized PG
logs, the memory can balloon for some unknown reason.
The devs have asked a couple of times for dumps of the pg logs that
cause this huge memory usage when replayed.

In this case -- Benjamin's issue -- I'm trying to understand if this
is related to:
* a huge pg log -- would need trimming -- perhaps the pglog_hardlimit
isn't on by default as designed
* normal sized pg log, with some entries that are consuming huge
amounts of memory (due to a yet-unsolved bug).

Thanks,
Dan



>
> Thanks and best regards,
> =
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> 
> From: Dan van der Ster 
> Sent: 25 January 2022 11:56:38
> To: Benjamin Staffin
> Cc: Ceph Users; Matthew Wilder; Tara Fly
> Subject: [ceph-users] Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)
>
> Hi Benjamin,
>
> Apologies that I can't help for the bluestore issue.
>
> But that huge 100GB OSD consumption could be related to similar
> reports linked here: https://tracker.ceph.com/issues/53729
>
> Does your cluster have the pglog_hardlimit set?
>
> # ceph osd dump | grep pglog
> flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
>
> Do you have PGs with really long pglogs?
>
> # ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
>
>
>
> -- Dan
>
> On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
>  wrote:
> >
> > I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> > same stack trace (see pasted output below).  The cluster is in a very bad
> > state with this many OSDs down, unsurprisingly.
> >
> > The day before this problem showed up, the k8s cluster was under extreme
> > memory pressure and a lot of pods were OOM killed, including some of the
> > Ceph OSDs, but after the memory pressure abated everything seemed to
> > stabilize for about a day.
> >
> > Then we attempted to set a 4gb memory limit on the OSD pods, because they
> > had been using upwards of 100gb of ram(!) per OSD after about a month of
> > uptime, and this was a contributing factor in the cluster-wide OOM
> > situation.  Everything seemed fine for a few minutes after Rook rolled out
> > the memory limit, but then OSDs gradually started to crash, a few at a
> > time, up to about 30 of them.  At this point I reverted the memory limit,
> > but I don't think the OSDs were hitting their memory limits at all.  In an
> > attempt to stabilize the cluster, we eventually stopped the Rook operator and set
> > the osd norebalance, nobackfill, noout, and norecover flags, but at this
> > point there were 46 OSDs down and pools were hitting BackFillFull.
> >
> > This is a Rook-ceph deployment on bare-metal kuber

[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Frédéric Nass


On 25/01/2022 at 14:48, Casey Bodley wrote:

On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
 wrote:

Hello,

I've just heard about storage classes and imagined how we could use them
to migrate all S3 objects within a placement pool from an ec pool to a
replicated pool (or vice-versa) for data resiliency reasons, not to save
space.

It looks possible since ;

1. data pools are associated to storage classes in a placement pool
2. bucket lifecycle policies can take care of moving data from a storage
class to another
3. we can set a user's default_storage_class to have all new objects
written by this user reach the new storage class / data pool.
4. after all objects have been transitioned to the new storage class, we
can delete the old storage class, rename the new storage class to
STANDARD so that it's been used by default and unset any user's
default_storage_class setting.

i don't think renaming the storage class will work the way you're
hoping. this storage class string is stored in each object and used to
locate its data, so renaming it could render the transitioned objects
unreadable


Hello Casey,

Thanks for pointing that out.

Do you believe this scenario would work if we stopped at step 3? (keeping 
default_storage_class set on users' profiles and not renaming the new 
storage class to STANDARD. Could we delete the STANDARD storage class, 
btw, since we would not use it anymore?)


If there is no way to define the default storage class of a placement 
pool without naming it STANDARD, could we imagine transitioning all 
objects again by:


4. deleting the storage class named STANDARD
5. creating a new one named STANDARD (using a ceph pool of the same data 
placement scheme than the one used by the temporary storage class 
created above)
6. transitioning all objects again to the new STANDARD storage class. 
Then delete the temporary storage class.


?

Best regards,

Frédéric.




Would that work?

Anyone tried this with success yet?

Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Casey Bodley
On Tue, Jan 25, 2022 at 11:59 AM Frédéric Nass
 wrote:
>
>
> Le 25/01/2022 à 14:48, Casey Bodley a écrit :
> > On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
> >  wrote:
> >> Hello,
> >>
> >> I've just heard about storage classes and imagined how we could use them
> >> to migrate all S3 objects within a placement pool from an ec pool to a
> >> replicated pool (or vice-versa) for data resiliency reasons, not to save
> >> space.
> >>
> >> It looks possible since ;
> >>
> >> 1. data pools are associated to storage classes in a placement pool
> >> 2. bucket lifecycle policies can take care of moving data from a storage
> >> class to another
> >> 3. we can set a user's default_storage_class to have all new objects
> >> written by this user reach the new storage class / data pool.
> >> 4. after all objects have been transitioned to the new storage class, we
> >> can delete the old storage class, rename the new storage class to
> >> STANDARD so that it's been used by default and unset any user's
> >> default_storage_class setting.
> > i don't think renaming the storage class will work the way you're
> > hoping. this storage class string is stored in each object and used to
> > locate its data, so renaming it could render the transitioned objects
> > unreadable
>
> Hello Casey,
>
> Thanks for pointing that out.
>
> Do you believe this scenario would work if stopped at step 3.? (keeping
> default_storage_class set on users's profiles and not renaming the new
> storage class to STANDARD. Could we delete the STANDARD storage class
> btw since we would not use it anymore?).
>
> If there is no way to define the default storage class of a placement
> pool without naming it STANDARD, could we imagine transitioning all
> objects again by:
>
> 4. deleting the storage class named STANDARD
> 5. creating a new one named STANDARD (using a ceph pool of the same data
> placement scheme than the one used by the temporary storage class
> created above)

instead of deleting/recreating STANDARD, you could probably just
modify its data pool. only do this once you're certain that there are
no more objects in the old data pool. you might need to wait for
garbage collection to clean up the tail objects there too (or force it
with 'radosgw-admin gc process --include-all')

> 6. transitioning all objects again to the new STANDARD storage class.
> Then delete the temporary storage class.

i think this step 6 would run into the
https://tracker.ceph.com/issues/50974 that Konstantin shared - if the
two storage classes have the same pool name, the transition doesn't
actually take effect. you might consider leaving this 'temporary'
storage class around, but pointing the defaults back at STANDARD

>
> ?
>
> Best regards,
>
> Frédéric.
>
> >
> >> Would that work?
> >>
> >> Anyone tried this with success yet?
> >>
> >> Best regards,
> >>
> >> Frédéric.
> >>
> >> --
> >> Cordialement,
> >>
> >> Frédéric Nass
> >> Direction du Numérique
> >> Sous-direction Infrastructures et Services
> >>
> >> Tél : 03.72.74.11.35
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
>

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Multipath and cephadm

2022-01-25 Thread Thomas Roth

Would like to know that as well.

I have the same setup - cephadm, Pacific, CentOS8, and a host with a number of 
HDDs which are all connected by 2 paths.
No way to use these without multipath

> ceph orch daemon add osd serverX:/dev/sdax

> Cannot update volume group ceph-51f8b9b0-2917-431d-8a6d-8ff90440641b with 
duplicate PV devices

(because sdax == sdce, etc.)

and with multipath, it fails with

> ceph orch daemon add osd serverX:/dev/mapper/mpathbq

> podman: stderr -->  IndexError: list index out of range


Quite strange that the 'future of storage' does not know how to handle 
multipath devices?

Regards,
Thomas


On 12/23/21 18:40, Michal Strnad wrote:

Hi all.

We have a problem using disks accessible via multipath. We are using cephadm for deployment, the Pacific version for containers, CentOS 8 Stream on the servers, 
and the following LVM configuration.


devices {
     multipath_component_detection = 1
}



We tried several methods.

1.) Direct approach.

cephadm shell
ceph orch daemon add osd <server>:/dev/mapper/mpatha


Errors are attached in 1.output file.



2.) With the help of OSD specifications where mpathX devices are used.

service_type: osd
service_id: osd-spec-serverX
placement:
   host_pattern: 'serverX'
spec:
   data_devices:
     paths:
   - /dev/mapper/mpathaj
   - /dev/mapper/mpathan
   - /dev/mapper/mpatham
   db_devices:
     paths:
   - /dev/sdc
encrypted: true

Errors are attached in 2.output file.


3.) With the help of OSD specifications where dm-X devices are used.

service_type: osd
service_id: osd-spec-serverX
placement:
   host_pattern: 'serverX'
spec:
   data_devices:
     paths:
   - /dev/dm-1
   - /dev/dm-2
   - /dev/dm-3
   - /dev/dm-X
   db_devices:
     size: ':2TB'
encrypted: true

Errors are attached in 3.output file.

What is the right method for multipath deployments? I didn't find much on this 
topic.

Thank you

Michal

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



--

Thomas Roth
HPC Department

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstr. 1, 64291 Darmstadt, http://www.gsi.de/

Gesellschaft mit beschraenkter Haftung

Sitz der Gesellschaft / Registered Office:Darmstadt
Handelsregister   / Commercial Register:
Amtsgericht Darmstadt, HRB 1528

Geschaeftsfuehrung/ Managing Directors:
 Professor Dr. Paolo Giubellino, Ursula Weyrich, Jörg Blaurock

Vorsitzender des GSI-Aufsichtsrates /
  Chairman of the Supervisory Board:
   Staatssekretaer / State Secretary Dr. Georg Schütte
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Moving all s3 objects from an ec pool to a replicated pool using storage classes.

2022-01-25 Thread Frédéric Nass


On 25/01/2022 at 18:28, Casey Bodley wrote:

On Tue, Jan 25, 2022 at 11:59 AM Frédéric Nass
 wrote:


On 25/01/2022 at 14:48, Casey Bodley wrote:

On Tue, Jan 25, 2022 at 4:49 AM Frédéric Nass
 wrote:

Hello,

I've just heard about storage classes and imagined how we could use them
to migrate all S3 objects within a placement pool from an ec pool to a
replicated pool (or vice-versa) for data resiliency reasons, not to save
space.

It looks possible since ;

1. data pools are associated to storage classes in a placement pool
2. bucket lifecycle policies can take care of moving data from a storage
class to another
3. we can set a user's default_storage_class to have all new objects
written by this user reach the new storage class / data pool.
4. after all objects have been transitioned to the new storage class, we
can delete the old storage class, rename the new storage class to
STANDARD so that it's been used by default and unset any user's
default_storage_class setting.

i don't think renaming the storage class will work the way you're
hoping. this storage class string is stored in each object and used to
locate its data, so renaming it could render the transitioned objects
unreadable

Hello Casey,

Thanks for pointing that out.

Do you believe this scenario would work if stopped at step 3.? (keeping
default_storage_class set on users's profiles and not renaming the new
storage class to STANDARD. Could we delete the STANDARD storage class
btw since we would not use it anymore?).

If there is no way to define the default storage class of a placement
pool without naming it STANDARD, could we imagine transitioning all
objects again by:

4. deleting the storage class named STANDARD
5. creating a new one named STANDARD (using a ceph pool of the same data
placement scheme than the one used by the temporary storage class
created above)

instead of deleting/recreating STANDARD, you could probably just
modify its data pool. only do this once you're certain that there are
no more objects in the old data pool. you might need to wait for
garbage collection to clean up the tail objects there too (or force it
with 'radosgw-admin gc process --include-all')


Interesting scenario. So in the end we'd have objects tagged with both 
storage classes in the same ceph pool, the old ones carrying the new 
storage class name and the new ones being written with the STANDARD 
storage class, right?





6. transitioning all objects again to the new STANDARD storage class.
Then delete the temporary storage class.

i think this step 6 would run into the
https://tracker.ceph.com/issues/50974 that Konstantin shared - if the
two storage classes have the same pool name, the transition doesn't
actually take effect. you might consider leaving this 'temporary'
storage class around, but pointing the defaults back at STANDARD


Well, in step 6., I'd thought about using another new pool for the 
recreated STANDARD storage class (to avoid the issue shared by 
Konstantin, thanks to him btw) and moving all objects to this new pool 
again in a new global transition.


But I understand you'd recommend avoiding deleting/recreating STANDARD 
and just modifying the STANDARD data pool after GC execution, am I right?


Frédéric.




?

Best regards,

Frédéric.


Would that work?

Anyone tried this with success yet?

Best regards,

Frédéric.

--
Cordialement,

Frédéric Nass
Direction du Numérique
Sous-direction Infrastructures et Services

Tél : 03.72.74.11.35

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Disk Failure Prediction cloud module?

2022-01-25 Thread Yaarit Hatuka
Hi Jake,

Many thanks for contributing the data.

Indeed, our data scientists use the data from Backblaze too.

Have you found strong correlations between device health metrics (such as
reallocated sector count, or any combination of attributes) and read/write
errors in /var/log/messages from what you experienced so far?

How long does it take from the moment you indicate such errors until you
decide to remove the disk?

Thanks,
Yaarit


On Fri, Jan 21, 2022 at 7:14 AM Jake Grimmett  wrote:

> Hi Yaarit,
>
> Thanks for confirming.
>
> telemetry is enabled on our clusters, so we are contributing data on ~1270
> disks.
>
> Are you able to use data from backblaze?
>
> Deciding when an OSD is starting to fail is a dark art; we are still
> hoping that the Disk Failure Prediction module will take the guesswork
> out of this.
>
> We currently use smartctl to look for disks with outliers in
> Reallocated_Sector_Ct and then look for read or write errors in
> /var/log/messages.
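>
> Roughly, something like this (our usual one-liners, from memory; device
> names and log paths will differ):
>
>   smartctl -A /dev/sdX | grep -i Reallocated_Sector_Ct
>   grep -iE 'blk_update_request|I/O error' /var/log/messages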
>
> best regards,
>
> Jake
>
>
> On 1/20/22 16:43, Yaarit Hatuka wrote:
> > Hi Jake,
> >
> > diskprediction_cloud module is no longer available in Pacific.
> > There are efforts to enhance the diskprediction module, using our
> > anonymized device telemetry data, which is aimed at building a dynamic,
> > large, diverse, free and open data set to help data scientists create
> > accurate failure prediction models.
> >
> > See more details:
> > https://ceph.io/en/users/telemetry/device-telemetry/
> > 
> > https://docs.ceph.com/en/latest/mgr/telemetry/
> > 
> >
> > Please join these efforts by opting-in to telemetry with:
> > `ceph telemetry on`
> > or with the dashboard's wizard.
> > If for some reason you cannot or wish not to opt-in, please share the
> > reason with us.
> >
> > Thanks,
> > Yaarit
> >
> >
> > On Thu, Jan 20, 2022 at 6:39 AM Jake Grimmett  > > wrote:
> >
> > Dear All,
> >
> > Is the cloud option for the diskprediction module deprecated in
> Pacific?
> >
> > https://docs.ceph.com/en/pacific/mgr/diskprediction/
> > 
> >
> > If so, are prophetstor still contributing data to the local module,
> or
> > is this being updated by someone using data from Backblaze?
> >
> > Do people find this module useful?
> >
> > many thanks
> >
> > Jake
> >
> > --
> > Dr Jake Grimmett
> > Head Of Scientific Computing
> > MRC Laboratory of Molecular Biology
> > Francis Crick Avenue,
> > Cambridge CB2 0QH, UK.
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > 
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> > 
> >
>
>
> For help, read https://www.mrc-lmb.cam.ac.uk/scicomp/
> then contact unixad...@mrc-lmb.cam.ac.uk
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> Phone 01223 267019
> Mobile 0776 9886539
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

2022-01-25 Thread Benjamin Staffin
Thank you for your responses!

Since yesterday we found that several OSD pods still had memory limits set,
and in fact some of them (but far from all) were getting OOM killed, so we
have fully removed those limits again.  Unfortunately this hasn't helped
much and there are still 50ish OSDs down.  We're now experimenting on one
of the down OSDs with "ceph-bluestore-tool --command fsck --deep", followed
by "ceph-objectstore-tool --op fsck".  Those finished successfully but it
hasn't resulted in any fixes.

We've noticed messages like these in dmesg, but don't know what to make of
them yet.  Could these be indicative of a problem, or are they part of
normal operation?

begin paste
[Tue Jan 25 19:39:08 2022] libceph: wrong peer, want (1)
10.6.168.17:6825/-1988778847, got (1)0.0.0.0:6825/1335335775
[Tue Jan 25 19:39:08 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at
address
[Tue Jan 25 19:39:09 2022] libceph: wrong peer, want (1)
10.6.168.17:6825/-1988778847, got (1)10.6.168.17:6825/1335335775
[Tue Jan 25 19:39:09 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at
address
[Tue Jan 25 19:39:11 2022] libceph: wrong peer, want (1)
10.6.168.17:6825/-1988778847, got (1)10.6.168.17:6825/1335335775
[Tue Jan 25 19:39:11 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at
address
[Tue Jan 25 19:39:15 2022] libceph: wrong peer, want (1)
10.6.168.17:6825/-1988778847, got (1)10.6.168.17:6825/1335335775
[Tue Jan 25 19:39:15 2022] libceph: osd32 (1)10.6.168.17:6825 wrong peer at
address
[Tue Jan 25 20:04:58 2022] libceph: wrong peer, want (1)
10.6.168.17:6809/779850463, got (1)0.0.0.0:6809/-398783580
[Tue Jan 25 20:04:58 2022] libceph: osd34 (1)10.6.168.17:6809 wrong peer at
address
[Tue Jan 25 20:04:59 2022] libceph: wrong peer, want (1)
10.6.168.17:6809/779850463, got (1)10.6.168.17:6809/-398783580
[Tue Jan 25 20:04:59 2022] libceph: osd34 (1)10.6.168.17:6809 wrong peer at
address
[Tue Jan 25 20:32:49 2022] libceph: wrong peer, want (1)
10.6.168.11:6833/-483515092, got (1)0.0.0.0:6833/1446518624
[Tue Jan 25 20:32:49 2022] libceph: osd74 (1)10.6.168.11:6833 wrong peer at
address
[Tue Jan 25 20:32:50 2022] libceph: wrong peer, want (1)
10.6.168.11:6833/-483515092, got (1)10.6.168.11:6833/1446518624
[Tue Jan 25 20:32:50 2022] libceph: osd74 (1)10.6.168.11:6833 wrong peer at
address
end paste

(also see inline replies below)

On Tue, Jan 25, 2022 at 10:51 AM Dan van der Ster 
wrote:

> On Tue, Jan 25, 2022 at 4:07 PM Frank Schilder  wrote:
> >
> > Hi Dan,
> >
> > in several threads I have now seen statements like "Does your cluster
> have the pglog_hardlimit set?". In this context, I would be grateful if you
> could shed some light on the following:
> >
> > 1) How do I check that?
> >
> > There is no equivalent "osd get pglog_hardlimit".
>
> I showed how to query for it:
>
> # ceph osd dump | grep pglog
> flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit


Yes, pglog_hardlimit is enabled:

$ ceph osd dump|grep pglog
flags
norebalance,sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit


> > 2) What is the recommendation?
>
> Since pacific it should be on by default, but I haven't had any user
> confirm this fact.
> (On our clusters we have enabled it manually when it was added to
> nautilus).
>
> > In the ceph documentation, the only occurrence of the term
> pglog_hardlimit are release notes for luminous and mimic, stating (mimic)
> >
> > > A flag called pglog_hardlimit has been introduced, which is off by
> default. Enabling this flag will limit the
> > > length of the pg log. In order to enable that, the flag must be set by
> running ceph osd set pglog_hardlimit
> > > after completely upgrading to 13.2.2. Once the cluster has this flag
> set, the length of the pg log will be
> > > capped by a hard limit. Once set, this flag must not be unset anymore.
> In luminous, this feature was
> > > introduced in 12.2.11. Users who are running 12.2.11, and want to
> continue to use this feature, should
> > > upgrade to 13.2.5 or later.
> >
> > How do I know if I want to use this feature? I would need a bit of
> information about pros and cons. Or should one have this enabled in any
> case? Would be great if you could provide some insight here.
>
> Normally a pg log with even 1 entries consumes just a couple
> hundred MBs of memory. (See the osd_pglog mempool).
> The pg log length can be queried like I showed earlier:
>
> # ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
>
> (those are the LOG colums in the pg output).
>

It seems we don't have any above 2800 entries:

$ ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
dumped all
2682 2682 undersized+degraded+peered
2683 2683 down
2685 2685 down
2704 2704 down
2704 2704 down
2710 2710 active+undersized+degraded
2714 2714 down
2726 2726 active+undersized+degraded
2735 2735 undersized+degraded+peered
2737 2737 down


> In the past I've seen pg logs with millions of entries. Those are
> sure

[ceph-users] Re: Disk Failure Prediction cloud module?

2022-01-25 Thread Marc


Is there also (going to be) something available that works 'offline'? 


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 3 OSDs can not be started after a server reboot - rocksdb Corruption

2022-01-25 Thread Sebastian Mazza
Hey Igor,

thank you for your response!

>> 
>> Do you suggest to disable the HDD write-caching and / or the 
>> bluefs_buffered_io for productive clusters?
>> 
> Generally the upstream recommendation is to disable disk write caching; there 
> were multiple complaints that it might negatively impact the performance in some 
> setups.
> 
> As for bluefs_buffered_io - please keep it on, disabling it is known to 
> cause a performance drop.

Thanks for the explanation. Regarding the enabled disk write cache you only mentioned 
possible performance problems, but can an enabled disk write cache also lead to 
data corruption? Or make a problem more likely than with a disabled disk cache?
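
(For context, if we do end up disabling the volatile write cache on the
drives, I would do it roughly like this - hdparm for the SATA drives,
sdparm for SAS, device names to be adjusted:
hdparm -W 0 /dev/sdX
sdparm --set WCE=0 /dev/sdY
)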

> 
>> 
>>> When rebooting a node  - did you perform it by regular OS command (reboot 
>>> or poweroff) or by a power switch?
>> I never did a hard reset or used the power switch. I used `init 6` for 
>> performing a reboot. Each server has redundant power supplies with one 
>> connected to a battery backup and the other to the grid. Therefore, I do 
>> think that none of the servers ever faced a non clean shutdown or reboot.
>> 
> So the original reboot which caused the failures was made in the same manner, 
> right?

Yes, exactly.
And the OSD logs confirm that:

OSD 4:
2021-12-12T21:33:07.780+0100 7f464a944700 -1 received  signal: Terminated from 
/sbin/init  (PID: 1) UID: 0
2021-12-12T21:33:07.780+0100 7f464a944700 -1 osd.4 2606 *** Got signal 
Terminated ***
2021-12-12T21:33:07.780+0100 7f464a944700 -1 osd.4 2606 *** Immediate shutdown 
(osd_fast_shutdown=true) ***
2021-12-12T21:35:29.918+0100 7ffa5ce42f00  0 set uid:gid to 64045:64045 
(ceph:ceph)
2021-12-12T21:35:29.918+0100 7ffa5ce42f00  0 ceph version 16.2.6 
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, 
pid 1608
:...
2021-12-12T21:35:32.509+0100 7ffa5ce42f00 -1 rocksdb: Corruption: Bad table 
magic number: expected 9863518390377041911, found 0 in db/002145.sst
2021-12-12T21:35:32.509+0100 7ffa5ce42f00 -1 
bluestore(/var/lib/ceph/osd/ceph-4) _open_db erroring opening db: 


OSD 7:
2021-12-12T21:20:11.141+0100 7f9714894700 -1 received  signal: Terminated from 
/sbin/init  (PID: 1) UID: 0
2021-12-12T21:20:11.141+0100 7f9714894700 -1 osd.7 2591 *** Got signal 
Terminated ***
2021-12-12T21:20:11.141+0100 7f9714894700 -1 osd.7 2591 *** Immediate shutdown 
(osd_fast_shutdown=true) ***
2021-12-12T21:21:41.881+0100 7f63c6557f00  0 set uid:gid to 64045:64045 
(ceph:ceph)
2021-12-12T21:21:41.881+0100 7f63c6557f00  0 ceph version 16.2.6 
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, 
pid 1937
:...
2021-12-12T21:21:44.557+0100 7f63c6557f00 -1 rocksdb: Corruption: Bad table 
magic number: expected 9863518390377041911, found 0 in db/002182.sst
2021-12-12T21:21:44.557+0100 7f63c6557f00 -1 
bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db: 


OSD 8:
2021-12-12T21:20:11.141+0100 7fd1ccf01700 -1 received  signal: Terminated from 
/sbin/init  (PID: 1) UID: 0
2021-12-12T21:20:11.141+0100 7fd1ccf01700 -1 osd.8 2591 *** Got signal 
Terminated ***
2021-12-12T21:20:11.141+0100 7fd1ccf01700 -1 osd.8 2591 *** Immediate shutdown 
(osd_fast_shutdown=true) ***
2021-12-12T21:21:41.881+0100 7f6d18d2bf00  0 set uid:gid to 64045:64045 
(ceph:ceph)
2021-12-12T21:21:41.881+0100 7f6d18d2bf00  0 ceph version 16.2.6 
(1a6b9a05546f335eeeddb460fdc89caadf80ac7a) pacific (stable), process ceph-osd, 
pid 1938
:...
2021-12-12T21:21:44.577+0100 7f6d18d2bf00 -1 rocksdb: Corruption: Bad table 
magic number: expected 9863518390377041911, found 0 in db/002182.sst
2021-12-12T21:21:44.577+0100 7f6d18d2bf00 -1 
bluestore(/var/lib/ceph/osd/ceph-8) _open_db erroring opening db: 



Best regards,
Sebastian


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] problems with snap-schedule on 16.2.7

2022-01-25 Thread Kyriazis, George
Hello Ceph users,

I have a problem with scheduled snapshots on ceph 16.2.7 (in a Proxmox install).

While trying to understand how snap schedules work, I created more schedules 
than I needed to:

root@vis-mgmt:~# ceph fs  snap-schedule list /backups/nassie/NAS
/backups/nassie/NAS 1h 24h7d8w12m
/backups/nassie/NAS 7d 24h7d8w12m
/backups/nassie/NAS 4w 24h7d8w12m
/backups/nassie/NAS 6h 24h7d8w12m
root@vis-mgmt:~# 
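
(For context, the schedules above were created with commands along these
lines - reproduced from memory, so the exact invocations may have differed
slightly:
ceph fs snap-schedule add /backups/nassie/NAS 1h
ceph fs snap-schedule retention add /backups/nassie/NAS 24h7d8w12m
)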

I then went ahead and deleted the ones that I didn’t need:

root@vis-mgmt:~# ceph fs snap-schedule remove /backups/nassie/NAS 1h
Schedule removed for path /backups/nassie/NAS
root@vis-mgmt:~# ceph fs snap-schedule remove /backups/nassie/NAS 7d
Schedule removed for path /backups/nassie/NAS
root@vis-mgmt:~# ceph fs snap-schedule remove /backups/nassie/NAS 4w
Schedule removed for path /backups/nassie/NAS
root@vis-mgmt:~# ceph fs  snap-schedule list /backups/nassie/NAS
/backups/nassie/NAS 6h 24h7d8w12m
root@vis-mgmt:~# 

No problems there.  However, if I restart the ceph manager, the (old) deleted 
snapshot schedules come back.  Not only that, but after the mgr restart, it 
seems like the snap schedule status is not really telling the truth:

root@vis-mgmt:/ceph/backups/nassie/NAS/.snap# ceph fs snap-schedule status 
/backups/nassie/NAS
{"fs": "cephfs", "subvol": null, "path": "/backups/nassie/NAS", "rel_path": 
"/backups/nassie/NAS", "schedule": "6h", "retention": {"h": 24, "d": 7, "w": 8, 
"m": 12}, "start": "2022-01-14T00:00:00", "created": "2022-01-14T22:18:38", 
"first": null, "last": null, "last_pruned": null, "created_count": 0, 
"pruned_count": 0, "active": true}
root@vis-mgmt:/ceph/backups/nassie/NAS/.snap# ls
scheduled-2021-10-17-18_00_00  scheduled-2022-01-19-12_00_00  
scheduled-2022-01-22-12_00_00
scheduled-2021-10-24-18_00_00  scheduled-2022-01-19-18_00_00  
scheduled-2022-01-22-18_00_00
scheduled-2021-10-31-18_00_00  scheduled-2022-01-20-00_00_00  
scheduled-2022-01-23-00_00_00
scheduled-2021-11-07-18_00_00  scheduled-2022-01-20-06_00_00  
scheduled-2022-01-23-06_00_00
scheduled-2021-11-08-18_00_00  scheduled-2022-01-20-12_00_00  
scheduled-2022-01-23-12_00_00
scheduled-2021-11-09-00_00_00  scheduled-2022-01-20-18_00_00  
scheduled-2022-01-23-18_00_00
scheduled-2022-01-15-18_00_00  scheduled-2022-01-21-00_00_00  
scheduled-2022-01-24-00_00_00
scheduled-2022-01-16-18_00_00  scheduled-2022-01-21-06_00_00  
scheduled-2022-01-24-06_00_00
scheduled-2022-01-17-18_00_00  scheduled-2022-01-21-12_00_00  
scheduled-2022-01-24-12_00_00
scheduled-2022-01-18-18_00_00  scheduled-2022-01-21-18_00_00  
scheduled-2022-01-24-18_00_00
scheduled-2022-01-19-00_00_00  scheduled-2022-01-22-00_00_00
scheduled-2022-01-19-06_00_00  scheduled-2022-01-22-06_00_00
root@vis-mgmt:/ceph/backups/nassie/NAS/.snap# 

Note that today (1/26) is after the last snapshot (1/24), but “ceph fs 
snap-schedule status” reports that no snapshots were performed (“first” and 
“last” are null), which is obviously not true.  Moreover, no more snapshots are 
being performed after the mgr restart.

Any thoughts of what’s going on and how to fix it?

Thank you!

George

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Monitoring ceph cluster

2022-01-25 Thread Michel Niyoyita
Thank you for your email, Szabo. These can be helpful; can you provide
links so I can start working on it?

Michel.

On Tue, 25 Jan 2022, 18:51 Szabo, Istvan (Agoda), 
wrote:

> Which monitoring tool? Like prometheus or nagios style thing?
> We use sensu for keepalive and ceph health reporting + prometheus with
> grafana for metrics collection.
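>
> (The built-in route for the metrics side is the mgr prometheus module -
> roughly: 'ceph mgr module enable prometheus', then point Prometheus at
> port 9283 on the mgr hosts and import the Ceph Grafana dashboards.)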
>
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
>
> On 2022. Jan 25., at 22:38, Michel Niyoyita  wrote:
>
>
> Hello team,
>
> I would like to monitor my ceph cluster using one of the
> monitoring tool, does someone has a help on that ?
>
> Michel
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io