[ceph-users] MDS crashes after evicting client session

2022-09-22 Thread E Taka
Ceph 17.2.3 (dockerized in Ubuntu 20.04)

The subject says it. The MDS process always crashes after evicting. ceph -w
shows:

2022-09-22T13:26:23.305527+0200 mds.ksz-cephfs2.ceph00.kqjdwe [INF]
Evicting (and blocklisting) client session 5181680 (
10.149.12.21:0/3369570791)
2022-09-22T13:26:35.729317+0200 mon.ceph00 [INF] daemon
mds.ksz-cephfs2.ceph03.vsyrbk restarted
2022-09-22T13:26:36.039678+0200 mon.ceph00 [INF] daemon
mds.ksz-cephfs2.ceph01.xybiqv restarted
2022-09-22T13:29:21.000392+0200 mds.ksz-cephfs2.ceph04.ekmqio [INF]
Evicting (and blocklisting) client session 5249349 (
10.149.12.22:0/2459302619)
2022-09-22T13:29:32.069656+0200 mon.ceph00 [INF] daemon
mds.ksz-cephfs2.ceph01.xybiqv restarted
2022-09-22T13:30:00.000101+0200 mon.ceph00 [INF] overall HEALTH_OK
2022-09-22T13:30:20.710271+0200 mon.ceph00 [WRN] Health check failed: 1
daemons have recently crashed (RECENT_CRASH)

The crash info of the crashed MDS is:
# ceph crash info
2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166
{
   "assert_condition": "!mds->is_any_replay()",
   "assert_file":
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc",

   "assert_func": "void MDLog::_submit_entry(LogEvent*,
MDSLogContextBase*)",
   "assert_line": 283,
   "assert_msg":
"/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)'
thread 7f76fa8f6700 time
2022-09-22T11:26:23.992050+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
283: FAILED ceph_assert(!mds->is_any_replay())\n",
   "assert_thread_name": "ms_dispatch",
   "backtrace": [
   "/lib64/libpthread.so.0(+0x12ce0) [0x7f770231bce0]",
   "gsignal()",
   "abort()",
   "(ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1b0) [0x7f770333bcd2]",
   "/usr/lib64/ceph/libceph-common.so.2(+0x283e95) [0x7f770333be95]",
   "(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f)
[0x55991905efdf]",
   "(Server::journal_close_session(Session*, int, Context*)+0x78c)
[0x559918d7d63c]",
   "(Server::kill_session(Session*, Context*)+0x212) [0x559918d7dd92]",
   "(Server::apply_blocklist()+0x10d) [0x559918d7e04d]",
   "(MDSRank::apply_blocklist(std::set, std::allocator > const&, unsigned
int)+0x34) [0x559918d39d74]",
   "(MDSRankDispatcher::handle_osd_map()+0xf6) [0x559918d3a0b6]",
   "(MDSDaemon::handle_core_message(boost::intrusive_ptr
const&)+0x39b) [0x559918d2330b]",
   "(MDSDaemon::ms_dispatch2(boost::intrusive_ptr
const&)+0xc3) [0x559918d23cc3]",
   "(DispatchQueue::entry()+0x14fa) [0x7f77035c240a]",
   "(DispatchQueue::DispatchThread::entry()+0x11) [0x7f7703679481]",
   "/lib64/libpthread.so.0(+0x81ca) [0x7f77023111ca]",
   "clone()"
   ],
   "ceph_version": "17.2.3",
   "crash_id":
"2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166",
   "entity_name": "mds.ksz-cephfs2.ceph03.vsyrbk",
   "os_id": "centos",
   "os_name": "CentOS Stream",
   "os_version": "8",
   "os_version_id": "8",
   "process_name": "ceph-mds",
   "stack_sig":
"b75e46941b5f6b7c05a037f9af5d42bb19d82ab7fc6a3c168533fc31a42b4de8",
   "timestamp": "2022-09-22T11:26:24.013274Z",
   "utsname_hostname": "ceph03",
   "utsname_machine": "x86_64",
   "utsname_release": "5.4.0-125-generic",
   "utsname_sysname": "Linux",
   "utsname_version": "#141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022"
}

(Don't be confused by the time information, "ceph -w" is UTC+2, "crash
info" is UTC)

Should I report this as a bug, or did I miss something that caused the error?
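
(If it turns out to be a known issue, I assume I can later acknowledge the
crash report to clear the RECENT_CRASH warning with something like:

# ceph crash ls
# ceph crash archive 2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166
)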
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS crashes after evicting client session

2022-09-22 Thread Dhairya Parmar
- What operation was being carried out that led to the client eviction?
- Can you share the MDS-side logs from when the eviction happened? (One way
  to capture them is sketched below.)
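
If the MDS debug level isn't already raised, something along these lines
should capture the relevant events (values are only a suggestion, revert
them afterwards):

# ceph config set mds debug_mds 20
# ceph config set mds debug_ms 1
# ... reproduce the eviction and collect the MDS log, then revert:
# ceph config rm mds debug_mds
# ceph config rm mds debug_ms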

On Thu, Sep 22, 2022 at 5:12 PM E Taka <0eta...@gmail.com> wrote:

> Ceph 17.2.3 (dockerized in Ubuntu 20.04)
>
> The subject says it. The MDS process always crashes after evicting. ceph -w
> shows:
>
> 2022-09-22T13:26:23.305527+0200 mds.ksz-cephfs2.ceph00.kqjdwe [INF]
> Evicting (and blocklisting) client session 5181680 (
> 10.149.12.21:0/3369570791)
> 2022-09-22T13:26:35.729317+0200 mon.ceph00 [INF] daemon
> mds.ksz-cephfs2.ceph03.vsyrbk restarted
> 2022-09-22T13:26:36.039678+0200 mon.ceph00 [INF] daemon
> mds.ksz-cephfs2.ceph01.xybiqv restarted
> 2022-09-22T13:29:21.000392+0200 mds.ksz-cephfs2.ceph04.ekmqio [INF]
> Evicting (and blocklisting) client session 5249349 (
> 10.149.12.22:0/2459302619)
> 2022-09-22T13:29:32.069656+0200 mon.ceph00 [INF] daemon
> mds.ksz-cephfs2.ceph01.xybiqv restarted
> 2022-09-22T13:30:00.000101+0200 mon.ceph00 [INF] overall HEALTH_OK
> 2022-09-22T13:30:20.710271+0200 mon.ceph00 [WRN] Health check failed: 1
> daemons have recently crashed (RECENT_CRASH)
>
> The crash info of the crashed MDS is:
> # ceph crash info
> 2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166
> {
>"assert_condition": "!mds->is_any_replay()",
>"assert_file":
>
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc",
>
>"assert_func": "void MDLog::_submit_entry(LogEvent*,
> MDSLogContextBase*)",
>"assert_line": 283,
>"assert_msg":
>
> "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
> In function 'void MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)'
> thread 7f76fa8f6700 time
>
> 2022-09-22T11:26:23.992050+\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.3/rpm/el8/BUILD/ceph-17.2.3/src/mds/MDLog.cc:
> 283: FAILED ceph_assert(!mds->is_any_replay())\n",
>"assert_thread_name": "ms_dispatch",
>"backtrace": [
>"/lib64/libpthread.so.0(+0x12ce0) [0x7f770231bce0]",
>"gsignal()",
>"abort()",
>"(ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1b0) [0x7f770333bcd2]",
>"/usr/lib64/ceph/libceph-common.so.2(+0x283e95) [0x7f770333be95]",
>"(MDLog::_submit_entry(LogEvent*, MDSLogContextBase*)+0x3f)
> [0x55991905efdf]",
>"(Server::journal_close_session(Session*, int, Context*)+0x78c)
> [0x559918d7d63c]",
>"(Server::kill_session(Session*, Context*)+0x212) [0x559918d7dd92]",
>"(Server::apply_blocklist()+0x10d) [0x559918d7e04d]",
>"(MDSRank::apply_blocklist(std::set std::less, std::allocator > const&, unsigned
> int)+0x34) [0x559918d39d74]",
>"(MDSRankDispatcher::handle_osd_map()+0xf6) [0x559918d3a0b6]",
>"(MDSDaemon::handle_core_message(boost::intrusive_ptr
> const&)+0x39b) [0x559918d2330b]",
>"(MDSDaemon::ms_dispatch2(boost::intrusive_ptr
> const&)+0xc3) [0x559918d23cc3]",
>"(DispatchQueue::entry()+0x14fa) [0x7f77035c240a]",
>"(DispatchQueue::DispatchThread::entry()+0x11) [0x7f7703679481]",
>"/lib64/libpthread.so.0(+0x81ca) [0x7f77023111ca]",
>"clone()"
>],
>"ceph_version": "17.2.3",
>"crash_id":
> "2022-09-22T11:26:24.013274Z_b005f3fc-7704-4cfc-96c5-f2a9c993f166",
>"entity_name": "mds.ksz-cephfs2.ceph03.vsyrbk",
>"os_id": "centos",
>"os_name": "CentOS Stream",
>"os_version": "8",
>"os_version_id": "8",
>"process_name": "ceph-mds",
>"stack_sig":
> "b75e46941b5f6b7c05a037f9af5d42bb19d82ab7fc6a3c168533fc31a42b4de8",
>"timestamp": "2022-09-22T11:26:24.013274Z",
>"utsname_hostname": "ceph03",
>"utsname_machine": "x86_64",
>"utsname_release": "5.4.0-125-generic",
>"utsname_sysname": "Linux",
>"utsname_version": "#141-Ubuntu SMP Wed Aug 10 13:42:03 UTC 2022"
> }
>
> (Don't be confused by the time information, "ceph -w" is UTC+2, "crash
> info" is UTC)
>
> Should I report this a bug or did I miss something which caused the error?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
>

-- 
*Dhairya Parmar*

He/Him/His

Associate Software Engineer, CephFS

Red Hat Inc. 

dpar...@redhat.com

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Question about recovery priority

2022-09-22 Thread Fulvio Galeazzi

Hello all,
 taking advantage of the redundancy of my EC pool, I destroyed a 
couple of servers in order to reinstall them with a new operating system.
  I am on Nautilus (but will soon upgrade to Pacific), and today I am 
not in "emergency mode": this is just for my education.  :-)


"ceph pg dump" shows a couple pg's with 3 missing chunks, some other 
with 2, several with 1 missing chunk: that's fine and expected.
Having looked at it for a while, I think I understand the recovery queue 
is unique: there is no internal higher priority for 3-missing-chunks PGs 
wrt 1-missing-chunk PGs, right?
I tried to issue "ceph pg force-recovery" on the few worst-degraded PGs 
but, apparently, numbers of 3-missing 2-missing and 1-missing are going 
down at the same relative speed.

   Is this expected? Can I do something to "guide" the process?

Thanks for your hints

Fulvio

--
Fulvio Galeazzi
GARR-CSD Department
skype: fgaleazzi70
tel.: +39-334-6533-250
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Question about recovery priority

2022-09-22 Thread Josh Baergen
Hi Fulvio,

https://docs.ceph.com/en/quincy/dev/osd_internals/backfill_reservation/
describes the prioritization and reservation mechanism used for
recovery and backfill. AIUI, unless a PG is below min_size, all
backfills for a given pool will be at the same priority.
force-recovery will modify the PG priority but doing so can have a
very delayed effect because a given backfill can be waiting behind a
bunch of other backfills that have acquired partial reservations,
which in turn are waiting behind other backfills that have partial
reservations, and so on. Once you're doing degraded backfill, you've
lost a lot of control over the system.
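
For completeness, the per-PG knobs are roughly these; force-* bumps the
priority of the listed PGs and cancel-force-* reverts it:

# ceph pg force-recovery <pgid> [<pgid> ...]
# ceph pg force-backfill <pgid> [<pgid> ...]
# ceph pg cancel-force-recovery <pgid> [<pgid> ...]
# ceph pg cancel-force-backfill <pgid> [<pgid> ...]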

Rather than ripping out hosts like you did here, operators that want
to retain control will drain hosts without degradation.
https://github.com/digitalocean/pgremapper is one tool that can help
with this, though depending on the size of the system one can
sometimes simply downweight the host and then wait.
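
As a sketch (the bucket name is a placeholder), draining a host via CRUSH
before reinstalling would look something like:

# ceph osd crush reweight-subtree <host> 0
# wait until backfill finishes (no degraded PGs in "ceph -s"), then reinstall

pgremapper can make the same process more controlled and incremental.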

Josh

On Thu, Sep 22, 2022 at 6:35 AM Fulvio Galeazzi  wrote:
>
> Hallo all,
>   taking advantage of the redundancy of my EC pool, I destroyed a
> couple of servers in order to reinstall them with a new operating system.
>I am on Nautilus (but will evolve soon to Pacific), and today I am
> not in "emergency mode": this is just for my education.  :-)
>
> "ceph pg dump" shows a couple pg's with 3 missing chunks, some other
> with 2, several with 1 missing chunk: that's fine and expected.
> Having looked at it for a while, I think I understand the recovery queue
> is unique: there is no internal higher priority for 3-missing-chunks PGs
> wrt 1-missing-chunk PGs, right?
> I tried to issue "ceph pg force-recovery" on the few worst-degraded PGs
> but, apparently, numbers of 3-missing 2-missing and 1-missing are going
> down at the same relative speed.
> Is this expected? Can I do something to "guide" the process?
>
> Thanks for your hints
>
> Fulvio
>
> --
> Fulvio Galeazzi
> GARR-CSD Department
> skype: fgaleazzi70
> tel.: +39-334-6533-250
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Slow OSD startup and slow ops

2022-09-22 Thread Stefan Kooman

Hi,

On 9/21/22 18:00, Gauvain Pocentek wrote:

Hello all,

We are running several Ceph clusters and are facing an issue on one of
them, we would appreciate some input on the problems we're seeing.

We run Ceph in containers on Centos Stream 8, and we deploy using
ceph-ansible. While upgrading ceph from 16.2.7 to 16.2.10, we noticed that
OSDs were taking a very long time to restart on one of the clusters. (Other
clusters were not impacted at all.) 


Are the other clusters of similar size?


The OSD startup was so slow sometimes

that we ended up having slow ops, with 1 or 2 pg stuck in a peering state.
We've interrupted the upgrade and the cluster runs fine now, although we
have seen 1 OSD flapping recently, having trouble coming back to life.

We've checked a lot of things and read a lot of mails from this list, and
here are some info:

* this cluster has RBD pools for OpenStack and RGW pools; everything is
replicated x 3, except the RGW data pool which is EC 4+2
* we haven't found any hardware related issues; we run fully on SSDs and
they are all in good shape, no network issue, RAM and CPU are available on
all OSD hosts
* bluestore with an LVM collocated setup
* we have seen the slow restart with almost all the OSDs we've upgraded
(100 out of 350)
* on restart the ceph-osd process runs at 100% CPU but we haven't seen
anything weird on the host


Are the containers restricted to a certain amount of CPU? Do the 
OSDs, after ~10-20 seconds, increase their CPU usage to 200%? (If so, this 
is probably because of the rocksdb option max_background_compactions = 2.)



* no DB spillover
* we have other clusters with the same hardware, and we don't see problems
there

The only thing that we found that looks suspicious is the number of op logs
for the PGs of the RGW index pool. `osd_max_pg_log_entries` is set to 10k
but `ceph pg dump` shows PGs with more than 100k logs (the largest one has >
400k logs).

Could this be the reason for the slow startup of OSDs? If so is there a way
to trim these logs without too much impact on the cluster?


Not sure. We have ~ 2K logs per PG.



Let me know if additional info or logs are needed.


Do you have a log of slow ops and osd logs?

Do you have any non-standard configuration for the daemons? I.e. ceph 
daemon osd.$id config diff


We are running a Ceph Octopus (15.2.16) cluster with a similar 
configuration. We have *a lot* of slow ops when starting OSDs, also 
during peering. When the OSDs start they consume 100% CPU for up to ~10 
seconds, and after that consume 200% for a minute or more. During that 
time the OSDs perform a compaction. You should be able to find this in 
the OSD logs if it's the same in your case. After some time the OSDs are 
done initializing and start the boot process. As soon as they boot up and 
start peering, the slow ops start to kick in. Lots of "transitioning to 
Primary" and "transitioning to Stray" logging. Some time later the OSD 
becomes "active". While the OSD is busy with peering it's also busy 
compacting, as I also see RocksDB compaction logging. So it might be due 
to RocksDB compactions impacting OSD performance while it's already busy 
becoming primary (and/or secondary/tertiary) for its PGs.


We had norecover, nobackfill, norebalance active when booting the OSDs.

So, it might just take a long time to do the RocksDB compaction. In that 
case it might be better to do all needed RocksDB compactions first, and 
only then start booting. What might help is to set "ceph osd set noup": 
this prevents the OSDs from becoming active; wait for the RocksDB 
compactions to finish, and after that unset the flag.
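
Roughly (a sketch, adjust to your deployment):

# ceph osd set noup
# restart/upgrade the OSDs; they stay "down" while RocksDB compacts
# watch the OSD logs until compaction is done, then:
# ceph osd unset noup

An offline compaction with "ceph-kvstore-tool bluestore-kv
/var/lib/ceph/osd/ceph-$ID compact" (with the OSD stopped) should achieve
the same.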


If you try this, please let me know how it goes.

Gr. Stefan





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Telegraf plugin reset

2022-09-22 Thread Nikhil Mitra (nikmitra)
Greetings,

We are trying to use the telegraf module to send metrics to InfluxDB and we 
keep facing the below error. Any help will be appreciated, thank you.

# ceph telegraf config-show
Error EIO: Module 'telegraf' has experienced an error and cannot handle 
commands: invalid literal for int() with base 10: 'https'

# ceph config dump | grep -i telegraf
  mgr   advanced mgr/telegraf/address   
tcp://test.xyz.com:https *

ceph version 14.2.22-110.el7cp

--
Regards,
Nikhil Mitra
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Telegraf plugin reset

2022-09-22 Thread Curt
Hello,

If I had to guess, the ':' indicates a port number like :443, so it's expecting an
int and you are passing a string. Try changing https to 443.
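
Something like this (hostname taken from your output, the port value is just
an example to match https):

# ceph config set mgr mgr/telegraf/address tcp://test.xyz.com:443
# ceph mgr module disable telegraf
# ceph mgr module enable telegraf

The disable/enable (or a mgr restart) may be needed because the module is
currently stuck in an error state and refuses further commands.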

On Thu, Sep 22, 2022 at 8:24 PM Nikhil Mitra (nikmitra) 
wrote:

> Greetings,
>
> We are trying to use the telegraf module to send metrics to InfluxDB and
> we keep facing the below error. Any help will be appreciated, thank you.
>
> # ceph telegraf config-show
> Error EIO: Module 'telegraf' has experienced an error and cannot handle
> commands: invalid literal for int() with base 10: 'https'
>
> # ceph config dump | grep -i telegraf
>   mgr   advanced mgr/telegraf/address
>  tcp://test.xyz.com:https *
>
> ceph version 14.2.22-110.el7cp
>
> --
> Regards,
> Nikhil Mitra
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Telegraf plugin reset

2022-09-22 Thread Nikhil Mitra (nikmitra)
Tried that but still fails.

# ceph telegraf config-set address test.xyz.com:443
Error EIO: Module 'telegraf' has experienced an error and cannot handle 
commands: invalid literal for int() with base 10: 'https

--
Regards,
Nikhil Mitra

From: Curt 
Date: Thursday, September 22, 2022 at 12:34 PM
To: Nikhil Mitra (nikmitra) 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] Telegraf plugin reset
Hello,

If I had to guess : indicates a port number like :443, so it's expecting an int 
and you are passing a string.  Try changing https to 443

On Thu, Sep 22, 2022 at 8:24 PM Nikhil Mitra (nikmitra) 
mailto:nikmi...@cisco.com>> wrote:
Greetings,

We are trying to use the telegraf module to send metrics to InfluxDB and we 
keep facing the below error. Any help will be appreciated, thank you.

# ceph telegraf config-show
Error EIO: Module 'telegraf' has experienced an error and cannot handle 
commands: invalid literal for int() with base 10: 'https'

# ceph config dump | grep -i telegraf
  mgr   advanced mgr/telegraf/address   
tcp://test.xyz.com:https *

ceph version 14.2.22-110.el7cp

--
Regards,
Nikhil Mitra
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to 
ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 17.2.4 RC available

2022-09-22 Thread Neha Ojha
On Thu, Sep 22, 2022 at 12:55 PM Yuri Weinstein  wrote:
>
> We are publishing a release candidate this time for users to try
> for testing only.
>
> Please note this RC had only limited testing.  Full testing is being done now.

It might be worth sharing that the Gibba cluster has been upgraded to
17.2.4 RC successfully.

Thanks,
Neha

>
> The branch name:
>
> https://github.com/ceph/ceph/tree/quincy-release
>
> https://shaman.ceph.com/builds/ceph/quincy-release/7f52e260191d7656bf7a362048705c3e36370dad/
>
> To install dev packages see
> https://docs.ceph.com/en/quincy/install/get-packages/#ceph-development-packages
>
> The container build see - https://quay.ceph.io/ceph-ci/ceph:quincy-release
>
> Release notes: https://github.com/ceph/ceph/pull/48072
>
> ***Don’t use this RC on production clusters!***
>
> The goal is to give users time to test and give feedback on RC
> releases while our upstream long-running cluster also runs the same RC
> release during that time (period of one week).
>
> Please respond to this email to provide any feedback on issues found
> in this release.
>
> Thx
> YuriW
>
> ___
> Dev mailing list -- d...@ceph.io
> To unsubscribe send an email to dev-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Telegraf plugin reset

2022-09-22 Thread Stefan Kooman

On 9/22/22 18:23, Nikhil Mitra (nikmitra) wrote:

Greetings,

We are trying to use the telegraf module to send metrics to InfluxDB and we 
keep facing the below error. Any help will be appreciated, thank you.

# ceph telegraf config-show
Error EIO: Module 'telegraf' has experienced an error and cannot handle 
commands: invalid literal for int() with base 10: 'https'

# ceph config dump | grep -i telegraf
   mgr   advanced mgr/telegraf/address   
tcp://test.xyz.com:https *


We have never used this module to send data to influxdb over https 
directly. Not sure if it can do that (I don't think so). What you can do 
instead (and that's what we do) is to


- install telegraf on the Ceph manager node
- configure a "socket listener" in telegraf:

[[inputs.socket_listener]]
 service_address = "unixgram:///etc/telegraf/telegraf.sock"

And configure the https influxdb endpoint in the same configuration file.

- make sure the user ceph is able to write to that socket: chown ceph 
/etc/telegraf/telegraf.sock


- configure the ceph telegraf module to use that socket (ceph config set)

mgr  advanced  mgr/telegraf/address   unixgram:///etc/telegraf/telegraf.sock  *
mgr  advanced  mgr/telegraf/interval  5  *



Restart the mgr.
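
For example:

# ceph mgr fail <active mgr name>

or restart the mgr container / systemd unit.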

Gr. Stefan
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Freak issue every few weeks

2022-09-22 Thread J-P Methot

Hi,

We've been running into a mysterious issue on Ceph 16.2.7. Every few 
weeks or so (anywhere from 2 weeks to a month and a half), we get 
input/output errors on a random OSD. Here are the logs:


2022-09-22T15:54:11.600Z    syslog    debug -6> 
2022-09-22T15:41:05.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:11.600Z    syslog    debug -2> 
2022-09-22T15:54:09.918+ 7fec1fac5700 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _aio_thread got r=-5 ((5) Input/output 
error)
2022-09-22T15:54:11.600Z    syslog    debug -3> 
2022-09-22T15:50:54.170+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:11.600Z    syslog    debug -4> 
2022-09-22T15:47:36.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:11.600Z    syslog    debug -5> 
2022-09-22T15:44:22.178+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.804Z    syslog    debug -3> 
2022-09-22T15:50:54.170+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.804Z    syslog    debug -2> 
2022-09-22T15:54:09.918+ 7fec1fac5700 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _aio_thread got r=-5 ((5) Input/output 
error)
2022-09-22T15:54:10.803Z    syslog    debug -4> 
2022-09-22T15:47:36.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.803Z    syslog    debug -6> 
2022-09-22T15:41:05.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.803Z    syslog    debug -5> 
2022-09-22T15:44:22.178+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:09.995Z    syslog    [19067764.996820] 
blk_update_request: I/O error, dev sdf, sector 520168880 op 0x1:(WRITE) 
flags 0x8800 phys_seg 1 prio class 0
2022-09-22T15:54:09.995Z    syslog    debug 2022-09-22T15:54:09.918+ 
7fec1fac5700 -1 bdev(0x55bf3305a800 /var/lib/ceph/osd/ceph-31/block) 
_aio_thread got r=-5 ((5) Input/output error)
2022-09-22T15:54:09.977Z    syslog    [19067764.996688] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:53:37.229Z    syslog    [19067732.246603] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:53:04.477Z    syslog    [19067699.496476] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:52:31.725Z    syslog    [19067666.746368] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:51:59.080Z    syslog    [19067633.996243] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:51:25.725Z    syslog    [19067600.746160] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:50:54.327Z    syslog    debug 2022-09-22T15:50:54.170+ 
7fec2ebaa080 -1 bdev(0x55bf3305a800 /var/lib/ceph/osd/ceph-31/block) 
_sync_write sync_file_range error: (5) Input/output error
2022-09-22T15:50:54.226Z    syslog    [19067569.246060] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:50:54.226Z    syslog    [19067569.246209] 
blk_update_request: I/O error, dev sdf, sector 461504 op 0x1:(WRITE) 
flags 0x800 phys_seg 3 prio class 0
2022-09-22T15:50:18.477Z    syslog    [19067533.495929] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:49:45.725Z    syslog    [19067500.745820] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:49:12.977Z    syslog    [19067467.995714] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:48:39.977Z    syslog    [19067434.995608] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:48:08.977Z    syslog    [19067403.995482] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:47:36.826Z    syslog    debug 2022-09-22T15:47:36.678+ 
7fec2ebaa080 -1 bdev(0x55bf3305a800 /var/lib/ceph/osd/ceph-31/block) 
_sync_write sync_file_range error: (5) Input/output error
2022-09-22T15:47:36.725Z    syslog    [19067371.745553] 
blk_update_request: I/O error, dev sdf, sector 460544 op 0x1:(WRITE) 
flags 0x800 phys_seg 121 prio class 0


This never happens on the same OSD twice. When we check the drive, there's no 
issue to report. When this happens, the cluster either momentarily 
freezes or it glitches and marks the OSD as out. What could be the 
source of this issue? We're thinking it could be related either to the 
drive model or to the Ceph version. Here's some info regarding our 
hardware/software:



Drives: All Intel D

[ceph-users] Re: Freak issue every few weeks

2022-09-22 Thread Stefan Kooman

On 9/22/22 19:55, J-P Methot wrote:

Hi,

We've been running into a mysterious issue on Ceph 16.2.7. Every few 
weeks or so (can be from 2 weeks to a month and a half), we get 
input/output errors on a random OSD. Here's the logs :


2022-09-22T15:54:11.600Z    syslog    debug -6> 
2022-09-22T15:41:05.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:11.600Z    syslog    debug -2> 
2022-09-22T15:54:09.918+ 7fec1fac5700 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _aio_thread got r=-5 ((5) Input/output 
error)
2022-09-22T15:54:11.600Z    syslog    debug -3> 
2022-09-22T15:50:54.170+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:11.600Z    syslog    debug -4> 
2022-09-22T15:47:36.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:11.600Z    syslog    debug -5> 
2022-09-22T15:44:22.178+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.804Z    syslog    debug -3> 
2022-09-22T15:50:54.170+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.804Z    syslog    debug -2> 
2022-09-22T15:54:09.918+ 7fec1fac5700 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _aio_thread got r=-5 ((5) Input/output 
error)
2022-09-22T15:54:10.803Z    syslog    debug -4> 
2022-09-22T15:47:36.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.803Z    syslog    debug -6> 
2022-09-22T15:41:05.678+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:10.803Z    syslog    debug -5> 
2022-09-22T15:44:22.178+ 7fec2ebaa080 -1 bdev(0x55bf3305a800 
/var/lib/ceph/osd/ceph-31/block) _sync_write sync_file_range error: (5) 
Input/output error
2022-09-22T15:54:09.995Z    syslog    [19067764.996820] 
blk_update_request: I/O error, dev sdf, sector 520168880 op 0x1:(WRITE) 
flags 0x8800 phys_seg 1 prio class 0
2022-09-22T15:54:09.995Z    syslog    debug 2022-09-22T15:54:09.918+ 
7fec1fac5700 -1 bdev(0x55bf3305a800 /var/lib/ceph/osd/ceph-31/block) 
_aio_thread got r=-5 ((5) Input/output error)
2022-09-22T15:54:09.977Z    syslog    [19067764.996688] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:53:37.229Z    syslog    [19067732.246603] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:53:04.477Z    syslog    [19067699.496476] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:52:31.725Z    syslog    [19067666.746368] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:51:59.080Z    syslog    [19067633.996243] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:51:25.725Z    syslog    [19067600.746160] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:50:54.327Z    syslog    debug 2022-09-22T15:50:54.170+ 
7fec2ebaa080 -1 bdev(0x55bf3305a800 /var/lib/ceph/osd/ceph-31/block) 
_sync_write sync_file_range error: (5) Input/output error
2022-09-22T15:50:54.226Z    syslog    [19067569.246060] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:50:54.226Z    syslog    [19067569.246209] 
blk_update_request: I/O error, dev sdf, sector 461504 op 0x1:(WRITE) 
flags 0x800 phys_seg 3 prio class 0
2022-09-22T15:50:18.477Z    syslog    [19067533.495929] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:49:45.725Z    syslog    [19067500.745820] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:49:12.977Z    syslog    [19067467.995714] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:48:39.977Z    syslog    [19067434.995608] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:48:08.977Z    syslog    [19067403.995482] sd 0:0:5:0: 
Power-on or device reset occurred
2022-09-22T15:47:36.826Z    syslog    debug 2022-09-22T15:47:36.678+ 
7fec2ebaa080 -1 bdev(0x55bf3305a800 /var/lib/ceph/osd/ceph-31/block) 
_sync_write sync_file_range error: (5) Input/output error
2022-09-22T15:47:36.725Z    syslog    [19067371.745553] 
blk_update_request: I/O error, dev sdf, sector 460544 op 0x1:(WRITE) 
flags 0x800 phys_seg 121 prio class 0


This never happens on the same OSD. When we check the drive, there's no 
issue to report. When this happens, the cluster either momentarily 
freeze or it will glitch and mark the OSD as out. What could be the 
source of this issue? 


Just guessing here: have you configured "discard"?

bdev_enable_discard
bdev_async_discard
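
The current values can be checked on a running OSD with, e.g.:

# ceph daemon osd.<id> config get bdev_enable_discard
# ceph daemon osd.<id> config get bdev_async_discard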

We've see monitor slow o

[ceph-users] Balancer Distribution Help

2022-09-22 Thread Reed Dier
Hoping someone can point me to tunables that could help tighten my OSD 
distribution.

Cluster is currently
> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus 
> (stable)": 307
With plans to begin moving to pacific before end of year, with a possible 
interim stop at octopus.17 on the way.

Cluster was born on jewel, and is fully bluestore/straw2.
The upmap balancer works/is working, but not to the degree that I believe it 
could/should, which seems like it should be much closer to near-perfect than what 
I’m seeing.

https://imgur.com/a/lhtZswo  <- Histograms of my 
OSD distribution

https://pastebin.com/raw/dk3fd4GH  <- 
pastebin of cluster/pool/crush relevant bits

To put it succinctly, I’m hoping to get much tighter OSD distribution, but I’m 
not sure which knobs to try turning next, as the upmap balancer has gone as far 
as it can, and I end up playing “reweight the most full OSD” whack-a-mole as 
OSDs get nearfull.

My goal is obviously something akin to this perfect distribution like here: 
https://www.youtube.com/watch?v=niFNZN5EKvE&t=1353s 


I am looking to tweak the PG counts for a few pools.
Namely the ssd-radosobj has shrunk in size and needs far fewer PGs now.
Similarly hdd-cephfs shrunk in size as well and needs fewer PGs (as ceph health 
shows).
And on the flip side, ec*-cephfs likely need more PGs as they have grown in 
size.
However I was hoping to get more breathing room of free space on my most full 
OSDs before starting to do big PG expand/shrink.

I am assuming that my wacky mix of replicated vs. multiple EC storage pools 
coupled with hybrid SSD+HDD pools is throwing off the balance more than a more 
homogeneous crush ruleset would, but this is what exists and is what I’m 
working with.
Also, since it will look odd in the tree view, the crush rulesets for hdd pools 
are chooseleaf chassis, while ssd pools are chooseleaf host.

Any tips or help would be greatly appreciated.

Reed
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] CLT meeting summary 2022-09-21

2022-09-22 Thread David Orman
This was a short meeting, and in summary:

 * Testing of upgrades for 17.2.4 in Gibba commenced and slowness during 
upgrade has been investigated.
   * Workaround available; not a release blocker
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] questions about rgw gc max objs and rgw gc speed in general

2022-09-22 Thread Adrian Nicolae

 Hi,

 We have a system running Ceph Pacific with a large number of delete 
requests (several hundred thousand files per day), and I'm investigating 
how I can increase the gc speed to keep up with our deletes (right now 
there are 44 million objects in the gc list).


 I changed max_concurrent_io to 40, rgw_gc_max_trim_chunk to 1024, 
rgw_gc_processor_max_time to 300, rgw_gc_processor_period to 300, 
rgw_gc_obj_min_wait to 300 and rgw_gc_max_objs to 1000, but I only managed 
to stall the growth of the gc queue.
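
(For reference, this is how I'm looking at the queue, and -- I assume -- how
a manual pass could be forced:

# radosgw-admin gc list --include-all
# radosgw-admin gc process --include-all
)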


I'm curious about these parameters :

- if I understand the docs correctly, gc_max_objs = maximum number of 
objects that may be handled by garbage collection in one garbage 
collection processing cycle. So if the gc cycle is set to 5 minutes => 
every 24 hours we have 288 cycles => the max number of objects that 
can be deleted by gc every 24 hours is 288 * 1000 = 288,000? If so, 
is this the total number per cluster or per rgw instance (we have 30 rgw 
containers on different machines)?


- 44 million objects in the gc list => does this mean that we have ~160TB in 
the garbage queue, assuming one object = 4MB?


- how can I increase the gc speed beyond the current speed? My current 
settings are quite aggressive already.


- does the process of increasing pg_num/pgp_num impact the gc speed? 
I'm asking because we have been doing that for several weeks, manually 
increasing pg_num in small steps.


Thanks.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-22 Thread Bailey Allison
Hi Reed,

Just taking a quick glance at the Pastebin provided, I have to say your cluster 
balance is already pretty damn good all things considered. 

We've seen the upmap balancer at its best provide a deviation of about 10-20% 
across OSDs in practice, which seems to match up with your cluster. It does a 
better and better job as you add more nodes and OSDs of equal size to the 
cluster and as the PG count increases, but in practice about a 10% difference 
between OSDs is very normal.

Something to note about the video provided is that they were using a cluster with 
28 PB of storage available, so who knows how many OSDs/nodes/PGs per pool/etc. 
their cluster has the luxury and ability to balance across.

The only thing I can think to suggest is just increasing the PG count as you've 
already mentioned. The ideal setting is about 100 PGs per OSD, and looking at 
your cluster both the SSDs and the smaller HDDs have only about 50 PGs per OSD.

If you're able to get both of those device classes closer to a 100 PG per OSD 
ratio, it should help a lot more with the balancing. More PGs means more places 
to distribute data. 

It will be tricky in that, for the HDDs, I notice you have some hosts/chassis 
with 24 OSDs and others with 6, so getting the PG distribution more even for 
those will be challenging; for the SSDs it should be quite simple to get to 
100 PGs per OSD.
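
As a rough sketch (pool name and target are placeholders; pick targets that
land each device class near ~100 PGs per OSD):

# ceph osd pool set <pool> pg_num <target>
# ceph osd pool set <pool> pgp_num <target>

On Nautilus and later the actual split is applied gradually in the
background, so the change can be made in one step.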
 
Taking a further look, although the actual data stored is balanced well across 
the entire cluster, there are a couple of OSDs where the OMAP/metadata is not 
balanced as well as on the others.

Where you are using EC pools for CephFS, OMAP data cannot be stored within EC, 
so all of it will be stored in a replicated CephFS data pool, most likely your 
hdd-cephfs pool. 

Just something to keep in mind as not only is it important to make sure the 
data is balanced, but the OMAP data and metadata are balanced as well.

Otherwise, though, I would recommend just trying to get your cluster to a point 
where each of the OSDs has roughly 100 PGs, or at least as close to this as you 
are able to get given your cluster's crush rulesets. 

This should then help the balancer spread the data across the cluster, but 
again unless I overlooked something your cluster already appears to be 
extremely well balanced.

There is a PG calculator you can use online at: 

https://old.ceph.com/pgcalc/

There is also a PG calc on the Redhat website but it requires a subscription. 

Both calculators are essentially the same but I have noticed the free one will 
round down the PGs and the Redhat one will round up the PGs.

Regards,

Bailey

-Original Message-
From: Reed Dier  
Sent: September 22, 2022 4:48 PM
To: ceph-users 
Subject: [ceph-users] Balancer Distribution Help

Hoping someone can point me to possible tunables that could hopefully better 
tighten my OSD distribution.

Cluster is currently
> "ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) 
> octopus (stable)": 307
With plans to begin moving to pacific before end of year, with a possible 
interim stop at octopus.17 on the way.

Cluster was born on jewel, and is fully bluestore/straw2.
The upmap balancer works/is working, but not to the degree that I believe it 
could/should work, which seems should be much closer to near perfect than what 
I’m seeing.

https://imgur.com/a/lhtZswo  <- Histograms of my 
OSD distribution

https://pastebin.com/raw/dk3fd4GH  <- 
pastebin of cluster/pool/crush relevant bits

To put it succinctly, I’m hoping to get much tighter OSD distribution, but I’m 
not sure what knobs to try turning next, as the upmap balancer has gone as far 
as it can, and I end up playing “reweight the most full OSD whack-a-mole as 
OSD’s get nearful.”

My goal is obviously something akin to this perfect distribution like here: 
https://www.youtube.com/watch?v=niFNZN5EKvE&t=1353s 


I am looking to tweak the PG counts for a few pool.
Namely the ssd-radosobj has shrunk in size and needs far fewer PGs now.
Similarly hdd-cephfs shrunk in size as well and needs fewer PGs (as ceph health 
shows).
And on the flip side, ec*-cephfs likely need more PGs as they have grown in 
size.
However I was hoping to get more breathing room of free space on my most full 
OSDs before starting to do big PG expand/shrink.

I am assuming that my whacky mix of replicated vs multiple EC storage pools 
coupled with hybrid SSD+HDD pools is throwing off the balance more than if it 
was a more homogenous crush ruleset, but this is what exists and is what I’m 
working with.
Also, since it will look odd in the tree view, the crush rulesets for hdd pools 
are chooseleaf chassis, while ssd pools are chooseleaf host.

Any tips or help would be greatly appreciated.

[ceph-users] how to enable ceph fscache from kernel module

2022-09-22 Thread David Yang
Hi,
I am using the kernel client to mount a CephFS filesystem on CentOS 8.2,
but my ceph kernel module does not seem to include fscache support.


[root@host ~]# uname -r
5.4.163-1.el8.elrepo.x86_64
[root@host ~]# lsmod|grep ceph
ceph 446464 0
libceph 368640 1 ceph
dns_resolver 16384 1 libceph
libcrc32c 16384 2 xfs,libceph
[root@host ~]# modinfo ceph
filename: /lib/modules/5.4.163-1.el8.elrepo.x86_64/kernel/fs/ceph/ceph.ko.xz
license: GPL
description: Ceph filesystem for Linux
author: Patience Warnick 
author: Yehuda Sadeh 
author: Sage Weil 
alias: fs-ceph
srcversion: 0923A6EE91D4CE16BC32EA2
depends: libceph
retpoline: Y
intree: Y
name: ceph
vermagic: 5.4.163-1.el8.elrepo.x86_64 SMP mod_unload modversions


What should I do to enable fscache in the ceph module? Thanks.
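
I assume the relevant kernel config options are CONFIG_FSCACHE and
CONFIG_CEPH_FSCACHE, i.e. something like:

# grep -E 'CONFIG_FSCACHE|CONFIG_CEPH_FSCACHE' /boot/config-$(uname -r)

and, if they are enabled, running cachefilesd and mounting with "-o fsc" --
but I'm not sure whether the elrepo kernel ships the ceph module with
fscache compiled in or whether I need to rebuild it.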
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-22 Thread Stefan Kooman

On 9/22/22 21:48, Reed Dier wrote:



Any tips or help would be greatly appreciated.


Try JJ's Ceph balancer [1]. In our case it turned out to be *way* more 
efficient than the built-in balancer (faster convergence, fewer movements 
involved), and able to achieve a very good PG distribution and "reclaim" 
lots of space. I highly recommend it.
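
From memory the workflow is roughly the following (check the project README
for the exact flags, this is just a sketch):

# ./placementoptimizer.py -v balance --max-pg-moves 10 | tee /tmp/balance-upmaps
# review the generated pg-upmap-items commands, then:
# bash /tmp/balance-upmaps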


Gr. Stefan

[1]: https://github.com/TheJJ/ceph-balancer
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-22 Thread Eugen Block

Hi,

I can't speak from the developers' perspective, but we discussed this  
just recently internally and with a customer. We doubled the number of  
PGs on one of our customer's data pools from around 100 to 200 PGs/OSD  
(HDDs with RocksDB on SSDs). We're still waiting for the final  
conclusion on whether the performance has increased or not, but it seems  
to work as expected. We would probably double it again if the PG  
size/objects per PG affected the performance again. You just need  
to be aware of the mon_max_pg_per_osd and  
osd_max_pg_per_osd_hard_ratio configs in case of recovery. Otherwise  
we don't see any real issue with 200 or 400 PGs/OSD if the nodes can  
handle it.
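
If you do go higher, those settings can be raised e.g. like this (values
are only an example):

# ceph config set global mon_max_pg_per_osd 400
# ceph config set osd osd_max_pg_per_osd_hard_ratio 5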


Regards,
Eugen

Zitat von "Szabo, Istvan (Agoda)" :


Hi,

My question is: is there any technical limit to having 8 OSDs per SSD and  
100 PGs on each of them, if the memory and CPU resources are available  
(8 GB memory/OSD and 96 vcores)?
The IOPS and bandwidth on the disks are very low, so I don't see any  
issue with going this route.


In my cluster I'm using 15.3 TB SSDs. We have more than 2 billion  
objects in each of the 3 clusters.
The bottleneck is the number of PGs per OSD; the last time I had a  
serious issue, the solution was to bump the PGs of the data pool to the  
allowed maximum with 4:2 EC.


I’m curious of the developers opinion also.

Thank you,
Istvan


This message is confidential and is for the sole use of the intended  
recipient(s). It may also be privileged or otherwise protected by  
copyright or other legal rules. If you have received it by mistake  
please let us know by reply email and delete it from your system. It  
is prohibited to copy this message or disclose its content to  
anyone. Any confidentiality or privilege is not waived or lost by  
any mistaken delivery or unauthorized disclosure of the message. All  
messages sent to and from Agoda may be monitored to ensure  
compliance with company policies, to protect the company's interests  
and to remove potential malware. Electronic messages may be  
intercepted, amended, lost or deleted, or contain viruses.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Balancer Distribution Help

2022-09-22 Thread Eugen Block

+1 for increasing PG numbers, those are quite low.

Zitat von Bailey Allison :


Hi Reed,

Just taking a quick glance at the Pastebin provided I have to say  
your cluster balance is already pretty damn good all things  
considered.


We've seen the upmap balancer at it's best in practice provides a  
deviation of about 10-20% percent across OSDs which seems to be  
matching up on your cluster. It's something that as the more nodes  
and OSDs you add that are equal in size to the cluster, and as the  
PGs increase on the cluster it can do a better and better job of,  
but in practice about a 10% difference in OSDs is  very normal.


Something to note in the video provided is that they were using a  
cluster with 28PB of storage available, so who knows how many  
OSDs/nodes/PGs per pool/etc., that their cluster has the luxury and  
ability to balance across.


The only thing I can think to suggest is just increasing the PG  
count as you've already mentioned. The ideal setting is about 100  
PGs per OSD, and looking at your cluster both the SSDs and the  
smaller HDDs have only about 50 PGs per OSD.


If you're able to get both of those devices to a closer to 100 PG  
per OSD ratio it should help a lot more with the balancing. More PGs  
means more places to distribute data.


It will be tricky in that I am just noticing for the HDDs you have  
some hosts/chassis with 24 OSDs per and others with 6 HDDs per so  
getting the PG distribution more even for those will be challenging,  
but for the SSDs it should be quite simple to get those to be 100  
PGs per OSD.


Just taking a further look it does appear on some OSDs although I  
will say across the entire cluster the actual data stored is  
balanced good, there are a couple of OSDs where the OMAP/metadata is  
not balanced as well as the others.


Where you are using EC pools for CephFS, any OMAP data cannot be  
stored within EC so it will store all of that within a replication  
data cephfs pool, most likely your hdd_cephfs pool.


Just something to keep in mind as not only is it important to make  
sure the data is balanced, but the OMAP data and metadata are  
balanced as well.


Otherwise though I would recommended just trying to get your cluster  
to a point where each of the OSDs have roughly 100 PGs per OSD, or  
at least as close to this as you are able to given your clusters  
crush rulesets.


This should then help the balancer spread the data across the  
cluster, but again unless I overlooked something your cluster  
already appears to be extremely well balanced.


There is a PG calculator you can use online at:

https://old.ceph.com/pgcalc/

There is also a PG calc on the Redhat website but it requires a subscription.

Both calculators are essentially the same but I have noticed the  
free one will round down the PGs and the Redhat one will round up  
the PGs.


Regards,

Bailey

-Original Message-
From: Reed Dier 
Sent: September 22, 2022 4:48 PM
To: ceph-users 
Subject: [ceph-users] Balancer Distribution Help

Hoping someone can point me to possible tunables that could  
hopefully better tighten my OSD distribution.


Cluster is currently

"ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974)
octopus (stable)": 307
With plans to begin moving to pacific before end of year, with a  
possible interim stop at octopus.17 on the way.


Cluster was born on jewel, and is fully bluestore/straw2.
The upmap balancer works/is working, but not to the degree that I  
believe it could/should work, which seems should be much closer to  
near perfect than what I’m seeing.


https://imgur.com/a/lhtZswo  <-  
Histograms of my OSD distribution


https://pastebin.com/raw/dk3fd4GH  
 <- pastebin of  
cluster/pool/crush relevant bits


To put it succinctly, I’m hoping to get much tighter OSD  
distribution, but I’m not sure what knobs to try turning next, as  
the upmap balancer has gone as far as it can, and I end up playing  
“reweight the most full OSD whack-a-mole as OSD’s get nearful.”


My goal is obviously something akin to this perfect distribution  
like here: https://www.youtube.com/watch?v=niFNZN5EKvE&t=1353s  



I am looking to tweak the PG counts for a few pool.
Namely the ssd-radosobj has shrunk in size and needs far fewer PGs now.
Similarly hdd-cephfs shrunk in size as well and needs fewer PGs (as  
ceph health shows).
And on the flip side, ec*-cephfs likely need more PGs as they have  
grown in size.
However I was hoping to get more breathing room of free space on my  
most full OSDs before starting to do big PG expand/shrink.


I am assuming that my whacky mix of replicated vs multiple EC  
storage pools coupled with hybrid SSD+HDD pools is throwing off the  
balance more than if it was a more homogenous crush ruleset, but  
this is what exists and is what I’m working with.
Also, since it will look odd in the tr

[ceph-users] Re: Any disadvantage to go above the 100pg/osd or 4osd/disk?

2022-09-22 Thread Szabo, Istvan (Agoda)
Good to know, thank you. So in that case it is worth increasing those values 
during recovery, right?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---

-Original Message-
From: Eugen Block 
Sent: Friday, September 23, 2022 1:19 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: Any disadvantage to go above the 100pg/osd or 
4osd/disk?

Email received from the internet. If in doubt, don't click any link nor open 
any attachment !


Hi,

I can't speak from the developers perspective, but we discussed this just 
recently intenally and with a customer. We doubled the number of PGs on one of 
our customer's data pools from around 100 to 200 PGs/OSD (HDDs with rocksDB on 
SSDs). We're still waiting for the final conclusion if the performance has 
increased or not, but it seems to work as expected. We probably would double it 
again if the PG size/objects per PG would affect the performance again. You 
just need to be aware of the mon_max_pg_per_osd and 
osd_max_pg_per_osd_hard_ratio configs in case of recovery. Otherwise we don't 
see any real issue with 200 or 400 PGs/OSD if the nodes can handle it.

Regards,
Eugen

Zitat von "Szabo, Istvan (Agoda)" :

> Hi,
>
> My question is, is there any technical limit to have 8osd/ssd and on
> each of them 100pg if the memory and cpu resource available (8gb
> memory/osd and 96vcore)?
> The iops and bandwidth on the disks are very low so I don’t see any
> issue to go with this.
>
> In my cluster I’m using 15.3TB ssds. We have more than 2 billions of
> objects in each of the 3 clusters.
> The bottleneck is the pg/osd so last time when my serious issue solved
> the solution was to bump the pg-s of the data pool the allowed maximum
> with 4:2 ec.
>
> I’m curious of the developers opinion also.
>
> Thank you,
> Istvan
>
> 
> This message is confidential and is for the sole use of the intended
> recipient(s). It may also be privileged or otherwise protected by
> copyright or other legal rules. If you have received it by mistake
> please let us know by reply email and delete it from your system. It
> is prohibited to copy this message or disclose its content to anyone.
> Any confidentiality or privilege is not waived or lost by any mistaken
> delivery or unauthorized disclosure of the message. All messages sent
> to and from Agoda may be monitored to ensure compliance with company
> policies, to protect the company's interests and to remove potential
> malware. Electronic messages may be intercepted, amended, lost or
> deleted, or contain viruses.
> ___
> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an
> email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to 
ceph-users-le...@ceph.io


This message is confidential and is for the sole use of the intended 
recipient(s). It may also be privileged or otherwise protected by copyright or 
other legal rules. If you have received it by mistake please let us know by 
reply email and delete it from your system. It is prohibited to copy this 
message or disclose its content to anyone. Any confidentiality or privilege is 
not waived or lost by any mistaken delivery or unauthorized disclosure of the 
message. All messages sent to and from Agoda may be monitored to ensure 
compliance with company policies, to protect the company's interests and to 
remove potential malware. Electronic messages may be intercepted, amended, lost 
or deleted, or contain viruses.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io