[ceph-users] Re: CephFS Metadata Pool bandwidth usage

2021-12-10 Thread Andras Sali
Hi Greg,

As a follow-up, we see items similar to this pop up in the
objecter_requests output (when it's not empty). We're not sure if we're reading
it right, but some of the writes appear quite large (in the MB range?):

{
"ops": [
{
"tid": 9532804,
"pg": "3.f9c235d7",
"osd": 2,
"object_id": "200.02c7a084",
"object_locator": "@3",
"target_object_id": "200.02c7a084",
"target_object_locator": "@3",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "1121127.434264s",
"age": 0.0160001041,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2021-12-10T08:35:34.582215+",
"osd_ops": [
"write 0~4194304 [fadvise_dontneed] in=4194304b"
]
},
{
"tid": 9532806,
"pg": "3.abba2e66",
"osd": 2,
"object_id": "200.02c7a085",
"object_locator": "@3",
"target_object_id": "200.02c7a085",
"target_object_locator": "@3",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "1121127.438264s",
"age": 0.012781,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2021-12-10T08:35:34.589044+",
"osd_ops": [
"write 0~1236893 [fadvise_dontneed] in=1236893b"
]
},
{
"tid": 9532807,
"pg": "3.abba2e66",
"osd": 2,
"object_id": "200.02c7a085",
"object_locator": "@3",
"target_object_id": "200.02c7a085",
"target_object_locator": "@3",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "1121127.442264s",
"age": 0.0085206,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2021-12-10T08:35:34.592283+",
"osd_ops": [
"write 1236893~510649 [fadvise_dontneed] in=510649b"
]
},
{
"tid": 9532808,
"pg": "3.abba2e66",
"osd": 2,
"object_id": "200.02c7a085",
"object_locator": "@3",
"target_object_id": "200.02c7a085",
"target_object_locator": "@3",
"paused": 0,
"used_replica": 0,
"precalc_pgid": 0,
"last_sent": "1121127.442264s",
"age": 0.0085206,
"attempts": 1,
"snapid": "head",
"snap_context": "0=[]",
"mtime": "2021-12-10T08:35:34.592387+",
"osd_ops": [
"write 1747542~13387 [fadvise_dontneed] in=13387b"
]
}
],
"linger_ops": [],
"pool_ops": [],
"pool_stat_ops": [],
"statfs_ops": [],
"command_ops": []
}

Any suggestions would be much appreciated.

Kind regards,

András
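
For reference, a dump like the one above can be pulled straight from the MDS; a
minimal sketch, where "mds.a" is a placeholder for the active MDS name:

```
# On the host running the MDS, via the admin socket:
ceph daemon mds.a objecter_requests

# Depending on the release, the same may also work remotely via tell:
ceph tell mds.a objecter_requests
```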


On Thu, Dec 9, 2021 at 7:48 PM Andras Sali  wrote:

> Hi Greg,
>
> Much appreciated for the reply, the image is also available at:
> https://tracker.ceph.com/attachments/download/5808/Bytes_per_op.png
>
> How the graph is generated: we back the CephFS metadata pool with Azure
> Ultra SSD disks. For each disk, Azure reports per-minute averages of
> read/write IOPS (operations per second) and read/write throughput (in
> bytes per second).
>
> We then divide the write throughput by the write IOPS number - this is
> the average write bytes per operation (which we plot in the above graph). We
> observe that this increases to around 300 KB, whilst after restarting the
> MDS it stays around 32 KB for some time (then starts increasing). The read
> bytes per operation are consistently much smaller.
>
> The issue is that once we are in the "high" regime, for the same workload
> that does, for example, 1000 IOPS, we need 300 MB/s of throughput instead of
> the 30 MB/s we observe after a restart. The high throughput often
> results in hitting the VM-level limits in Azure, after which the queue
> depth explodes and operations begin stalling.
>
> We will do the dump and report it as well once we have it.
>
> Thanks again for any ideas on this.
>
> Kind regards,
>
> Andras
>
>
> On Thu, Dec 9, 2021, 15:07 Gregory Farnum  wrote:
>
>> Andras,
>>
>> Unfortunately your attachment didn't come through the list. (It might
>> work if you embed it inline? Not sure.) I don't know if anybody's
>> looked too hard at this before, and without the image I don't know
>> exactly what metric you're using to say something's 320KB in size. Can
>> you explain more?
>>
>> It might help if you dump the objecter_requests from the MDS and share
>> those — it'll display what objects are being written to with what
>> sizes

[ceph-users] Re: Local NTP servers on monitor node's.

2021-12-10 Thread mhnx
It's nice to hear I'm on the right track.

Thanks for the answers.

Anthony D'Atri wrote on Wed, 8 Dec 2021 at 12:13:
>
> I’ve had good success with this strategy, have the mons chime each other, and 
> perhaps have OSD / other nodes against the mons too.
> Chrony >> ntpd
> With modern interval backoff / iburst there’s no reason to not have a robust 
> set of peers.
>
> The public NTP pools rotate DNS on some period, so when the quality / jitter 
> varies a lot among a given pool you can experience swings.  So depending on 
> the scale of one’s organization, it often makes sense to have a set of 
> internal stratum 2 servers that servers chime against, which mesh among 
> themselves and against both geo-local public servers and a few hand-picked 
> quality *distant* servers.  Jitter matters more than latency AIUI.
>
> Local stratum 1 servers are cool, though getting coax to a DC roof and an 
> antenna mounted can be an expensive hassle.
>
> Success includes a variety of time sources, so that it doesn’t all go to hell 
> when some specific server goes weird or disappears, both of which happen.  
> E.g., if there's a window with sky access, even in an office area, add a couple 
> of these (or similar) to the mix, as a source for the workhorse server 
> stratum:
>
> https://www.netburner.com/products/network-time-server/pk70-ex-ntp-network-time-server/#
>
> Not a DC grade item, or a sole solution, but the bang for the buck is 
> unbeatable.
>
>
> Unless things have changed in the last few years, don’t run NTP servers on 
> VMs.  Some network gear can run a server, but be careful with the load it 
> presents and how many clients can be supported without impacting the primary 
> roles.
>
>
> On Dec 8, 2021, at 12:14 AM, Janne Johansson  wrote:
>
> On Wed, 8 Dec 2021 at 02:35, mhnx wrote:
>
> I've been building Ceph clusters since 2014 and the most annoying and
> worst failure is NTP server faults and having different times on the
> Ceph nodes.
>
> I've fixed a few clusters because of NTP failures:
> - Sometimes NTP servers can be unavailable,
> - Sometimes NTP servers can go crazy.
> - Sometimes NTP servers can respond but systemd-timesyncd cannot sync
> the time without manual help.
>
> I don't want to deal with another NTP problem, and because of that I've
> decided to build internal NTP servers for the cluster.
>
> I'm thinking of creating 3 NTP servers on the 3 monitor nodes to get
> an internal NTP server cluster.
> I will use the internal NTP cluster for the OSD nodes and other services.
> This way, I believe that I'll always have a stable and fast time server.
>
>
> We do something like this: the mons gather "calendar time" from outside
> NTP servers, but also peer against each other, so if/when they drift
> away the mons drift away by equal amounts. All OSDs/RGWs and Ceph
> clients then pull time from the mons, which serve internal NTP based on
> their idea of what time it is.
>
> Not using systemd here, but both chronyd and ntpd allow you to set peers
> with which you sync "sideways", just to keep the pace in-between hosts.
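
A minimal chrony sketch of that layout, with placeholder hostnames (mon1-mon3
are the monitor nodes) and a placeholder client subnet:

```
# /etc/chrony.conf on each mon: chime against outside servers, peer "sideways" with the other mons
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
peer mon2 iburst
peer mon3 iburst
allow 192.168.0.0/16      # let OSD/RGW/client nodes use this mon as their time source

# /etc/chrony.conf on OSD and client nodes: only the mons
server mon1 iburst
server mon2 iburst
server mon3 iburst
```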
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: reinstalled node with OSD

2021-12-10 Thread bbk
Hi,

I'll answer my own question :-) I finally found the rest of my documentation...
So after reinstalling the OS, the OSD config must be created as well.

Here is what I did; maybe this helps someone:

--

Get the information:

```
cephadm ceph-volume lvm list
ceph config generate-minimal-conf
ceph auth get osd.[ID]
```

Now create a minimal osd config:

```
vi osd.[ID].json
```

```
{
    "config": "# minimal ceph.conf for 6d0ecf22-9155-4684-971a-2f6cde8628c8\n[global]\n\tfsid = 6d0ecf22-9155-4684-971a-2f6cde8628c8\n\tmon_host = [v2:192.168.6.21:3300/0,v1:192.168.6.21:6789/0] [v2:192.168.6.22:3300/0,v1:192.168.6.22:6789/0] [v2:192.168.6.23:3300/0,v1:192.168.6.23:6789/0] [v2:192.168.6.24:3300/0,v1:192.168.6.24:6789/0] [v2:192.168.6.25:3300/0,v1:192.168.6.25:6789/0]\n",
    "keyring": "[osd.XXX]\n\tkey = \n"
}
```

Deploy the OSD daemon:

```
cephadm deploy --fsid 6d0ecf22-9155-4684-971a-2f6cde8628c8 --osd-fsid [ID] \
  --name osd.[ID] --config-json osd.[ID].json
```
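
Afterwards it may be worth verifying that the daemon actually came up; roughly
along these lines (same placeholders as above):

```
systemctl status ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8@osd.[ID].service
cephadm ls | grep osd          # the daemon should be listed as running
ceph osd tree                  # the OSD should be back up/in
```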

Yours,
bbk

On Thu, 2021-12-09 at 18:35 +0100, bbk wrote:
> After reading my mail it may not be clear that I reinstalled the OS of
> a node with OSDs.
> 
> On Thu, 2021-12-09 at 18:10 +0100, bbk wrote:
> > Hi,
> > 
> > the last time I reinstalled a node with OSDs, I added the disks
> > with the following command. But unfortunately this time I ran into an
> > error.
> > 
> > It seems like this time the command doesn't create the container. I
> > am able to run `cephadm shell`, and other daemons (mon, mgr, mds) are
> > running.
> > 
> > I don't know if that is the right way to do it?
> > 
> > 
> > ~# cephadm deploy --fsid 6d0ecf22-9155-4684-971a-2f6cde8628c8 --osd-
> > fsid 941c6cb6-6898-4aa2-a33a-cec3b6a95cf1 --name osd.9
> > 
> > Non-zero exit code 125 from /usr/bin/podman container inspect --
> > format {{.State.Status}} ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8-
> > osd-9
> > /usr/bin/podman: stderr Error: error inspecting object: no such
> > container ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8-osd-9
> > Non-zero exit code 125 from /usr/bin/podman container inspect --
> > format {{.State.Status}} ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8-
> > osd.9
> > /usr/bin/podman: stderr Error: error inspecting object: no such
> > container ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8-osd.9
> > Deploy daemon osd.9 ...
> > Non-zero exit code 1 from systemctl start
> > ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8@osd.9
> > systemctl: stderr Job for
> > ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8@osd.9.service failed
> > because the control process exited with error code.
> > systemctl: stderr See "systemctl status
> > ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8@osd.9.service" and
> > "journalctl -xe" for details.
> > Traceback (most recent call last):
> >   File "/usr/sbin/cephadm", line 8571, in 
> >     main()
> >   File "/usr/sbin/cephadm", line 8559, in main
> >     r = ctx.func(ctx)
> >   File "/usr/sbin/cephadm", line 1787, in _default_image
> >     return func(ctx)
> >   File "/usr/sbin/cephadm", line 4549, in command_deploy
> >     ports=daemon_ports)
> >   File "/usr/sbin/cephadm", line 2677, in deploy_daemon
> >     c, osd_fsid=osd_fsid, ports=ports)
> >   File "/usr/sbin/cephadm", line 2906, in deploy_daemon_units
> >     call_throws(ctx, ['systemctl', 'start', unit_name])
> >   File "/usr/sbin/cephadm", line 1467, in call_throws
> >     raise RuntimeError('Failed command: %s' % ' '.join(command))
> > RuntimeError: Failed command: systemctl start
> > ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8@osd.9
> > 
> > 
> > ~# cephadm ceph-volume lvm list
> > 
> > == osd.9 ===
> > 
> >   [block]   /dev/ceph-07fa2bb7-628f-40c0-8725-0266926371c0/osd-
> > block-941c6cb6-6898-4aa2-a33a-cec3b6a95cf1
> > 
> >   block device  /dev/ceph-07fa2bb7-628f-40c0-8725-
> > 0266926371c0/osd-block-941c6cb6-6898-4aa2-a33a-cec3b6a95cf1
> >   block uuid    mVEhfF-LK4E-Dtmb-Jj23-tn8x-lpLy-
> > KiUy1a
> >   cephx lockbox secret  
> >   cluster fsid  6d0ecf22-9155-4684-971a-2f6cde8628c8
> >   cluster name  ceph
> >   crush device class    None
> >   encrypted 0
> >   osd fsid  941c6cb6-6898-4aa2-a33a-cec3b6a95cf1
> >   osd id    9
> >   type  block
> >   vdo   0
> >   devices   /dev/sdd
> > 
> > 
> > ~# podman --version
> > podman version 3.2.3
> > 
> > 
> > ~# cephadm version
> > Using recent ceph image
> > quay.io/ceph/ceph@sha256:2f7f0af8663e73a422f797de605e769ae44eb0297f2a
> > 79324739404cc1765728
> > ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503)
> > pacific (stable)
> > 
> > 
> > ~# lsb_release -a
> > LSB Version::core-4.1-amd64:core-4.1-noarch
> > Distributor ID: RedHatEnterprise
> > Description:Red Hat Enterprise Linux release 8.5 (Ootpa)
> > Release:8.5
> > Codename:   Ootpa
> > 
> > 
> > ~# cep

[ceph-users] Re: 16.2.6 Convert Docker to Podman?

2021-12-10 Thread 胡玮文
On Fri, Dec 10, 2021 at 01:12:56AM +0100, Roman Steinhart wrote:
> hi,
> 
> recently I had to switch the other way around (from podman to docker).
> I just...
> - stopped all daemons on a host with "systemctl stop ceph-{uuid}@*"
> - purged podman
> - triggered a redeploy for every daemon with "ceph orch daemon redeploy
> osd.{id}"
> 
> ~ Roman

We switched to podman with a similar process: stop the systemd units, install
podman, redeploy, and done.
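
For the docker-to-podman direction Marco asked about, the per-host sequence
would presumably mirror the above; a rough, untested sketch, where the fsid,
package names and daemon names are placeholders:

```
# One host at a time:
systemctl stop 'ceph-{fsid}@*'        # stop all Ceph daemons on this host
apt-get remove --purge docker.io      # remove docker (package name varies by install method)
apt-get install podman                # install podman (on Ubuntu 20.04 this may need an extra repo)

# Then, from a node with an admin keyring, redeploy each daemon that lives on this host:
ceph orch daemon redeploy osd.{id}
ceph orch daemon redeploy mon.{host}
```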

Weiwen Hu

> On Thu, 9 Dec 2021 at 16:27, Marco Pizzolo  wrote:
> 
> > Hello Everyone,
> >
> > In an attempt to futureproof, I am beginning to look for information on how
> > one would go about moving to podman from docker on a cephadm 16.2.6
> > installation on ubuntu 20.04.3.
> >
> > I would be interested to know if anyone else has contemplated or performed
> > something similar, and what their findings were.
> >
> > Appreciate any insight you can share.
> >
> > Thanks,
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] OSD storage not balancing properly when crush map uses multiple device classes

2021-12-10 Thread Erik Lindahl
Hi,

We are experimenting with using manually created crush maps to pick one SSD
as primary and two HDD devices. Since all our HDDs have the DB & WAL on
NVMe drives, this gives us a nice combination of pretty good write
performance, and great read performance while keeping costs manageable for
hundreds of TB of storage.

We have 16 nodes with ~300 HDDs and four separate nodes with 64 7.6TB SSDs.

However, we're noticing that the usage on the SSDs isn't very balanced at
all - it ranges from 26% to 52% for some reason (the balancer is active
and seems to be happy).


I suspect this might have to do with the placement groups now being mixed
(i.e., each pg uses 1x SSD and 2x HDD). Is there anything we can do about
this to achieve balanced SSD usage automatically?

I've included the crush map below, just in case we/I screwed up something
there instead :-)


Cheers,

Erik


{
"rule_id": 11,
"rule_name": "1ssd_2hdd",
"ruleset": 11,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -52,
"item_name": "default~ssd"
},
{
"op": "chooseleaf_firstn",
"num": 1,
"type": "host"
},
{
"op": "emit"
},
{
"op": "take",
"item": -24,
"item_name": "default~hdd"
},
{
"op": "chooseleaf_firstn",
"num": -1,
"type": "host"
},
{
"op": "emit"
}
]
}
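
For checking how the SSD class alone is being used, and what the balancer is
actually doing, something along these lines should work (assuming a release
that supports filtering "ceph osd df" by device class):

```
ceph osd df tree class ssd   # per-OSD utilization and PG count, SSD class only
ceph balancer status         # balancer mode and whether it considers the cluster optimized
```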

-- 
Erik Lindahl 
Science for Life Laboratory, Box 1031, 17121 Solna, Sweden
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: v16.2.6 PG peering indefinitely after cluster power outage

2021-12-10 Thread Eric Alba
So I did an export of the PG using ceph-objectstore-tool in hopes that I
could push Ceph to forget about the rest of the data there. The export was
successful, but we'll see how the import goes. I already tried to import it on
one OSD but got a message that the PG already exists - am I doing
something wrong? Do I have to remove all fragments of the PG and force it
to go NOENT before trying an import?
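
For reference, the general shape of an export/remove/import cycle with
ceph-objectstore-tool looks roughly like the sketch below. This is not advice
for this specific cluster: the data paths and OSD ids are assumptions based on
this thread (and differ under Rook), and the affected OSD must be stopped while
the tool runs. An import is refused while a (possibly stale) copy of the PG is
still present on the target OSD, which matches the error seen above.

```
# On the stopped OSD that still holds a usable copy, export the PG:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 --pgid 2.11c \
    --op export --file /tmp/pg-2.11c.export

# On the (stopped) target OSD, remove the existing copy, then import the export:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --pgid 2.11c --op remove --force
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 --pgid 2.11c \
    --op import --file /tmp/pg-2.11c.export
```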

On Wednesday, December 8, 2021, Eric Alba  wrote:

> I've been trying to get Ceph to force the PG into a good state, but it
> continues to give me a single PG peering. This is a rook-ceph cluster on
> VMs (the hosts went out for a brief period) and I can't figure out how to get
> this 1GB or so of data to become available to the client. This occurred
> during a cluster expansion, which added the joy of clock skew. Another
> 3 disks were being added to extend the cluster storage. This is a 3-node
> cluster with six 200GB disks per node.
>
>   cluster:
> id: 31689324-f5ba-4aa4-8244-aa09a1119dc3
> health: HEALTH_OK
>
>   services:
> mon: 3 daemons, quorum a,c,d (age 41m)
> mgr: a(active, since 2h)
> osd: 18 osds: 18 up (since 17m), 18 in (since 2h); 1 remapped pgs
>
>   data:
> pools:   2 pools, 768 pgs
> objects: 184.38k objects, 707 GiB
> usage:   1.4 TiB used, 2.1 TiB / 3.5 TiB avail
> pgs: 0.130% pgs not active
>  767 active+clean
>  1   peering
>
>   io:
> client:   14 KiB/s wr, 0 op/s rd, 1 op/s wr
>
>   progress:
> Global Recovery Event (2h)
>   [===.] (remaining: 9s)
>
> PG map for problematic PG
>
> [root@rook-ceph-tools-5b54fb98c-kjt5t /]# ceph pg map 2.11c
> osdmap e11328 pg 2.11c (2.11c) -> up [17,12] acting [4]
>
> Don't see OSD 4 in pg dump:
> [root@rook-ceph-tools-5b54fb98c-kjt5t /]# ceph pg dump 2> /dev/null |
> egrep '2\.11c|MISS'
> PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED  MISPLACED  UNFOUND  BYTES
>   OMAP_BYTES*  OMAP_KEYS*  LOG   DISK_LOG  STATE STATE_STAMP
>VERSION REPORTEDUP  UP_PRIMARY
>  ACTING  ACTING_PRIMARY  LAST_SCRUB SCN
> 2.11c249   0 0  00
>  10237583360   0  5352  5352   peering
>  2021-12-08T22:44:36.691204+3669'8643974  11329:7414
> [17,12]  17 [17,12]  17   3162'8635517  200
>
>
>
> ceph pg 2.11c query
> {
> "snap_trimq": "[]",
> "snap_trimq_len": 0,
> "state": "unknown",
> "epoch": 12191,
> "up": [
> 17,
> 12
> ],
> "acting": [
> 17,
> 12
> ],
> "info": {
> "pgid": "2.11c",
> "last_update": "3669'8643974",
> "last_complete": "3669'8643974",
> "log_tail": "3174'8638622",
> "last_user_version": 8643974,
> "last_backfill": "2:38e2dedc:::rbd_data.1d091ec6540582.
> 158e:head",
> "purged_snaps": [],
> "history": {
> "epoch_created": 3286,
> "epoch_pool_created": 13,
> "last_epoch_started": 3725,
> "last_interval_started": 3724,
> "last_epoch_clean": 3544,
> "last_interval_clean": 3543,
> "last_epoch_split": 3286,
> "last_epoch_marked_full": 0,
> "same_up_since": 9923,
> "same_interval_since": 12191,
> "same_primary_since": 12191,
> "last_scrub": "3162'8635517",
> "last_scrub_stamp": "2021-12-06T22:16:26.082945+",
> "last_deep_scrub": "3122'8608768",
> "last_deep_scrub_stamp": "2021-12-04T02:35:48.882906+",
> "last_clean_scrub_stamp": "2021-12-06T22:16:26.082945+",
> "prior_readable_until_ub": 0
> },
> "stats": {
> "version": "3669'8643974",
> "reported_seq": 8277,
> "reported_epoch": 12191,
> "state": "peering",
> "last_fresh": "2021-12-08T22:59:01.450813+",
> "last_change": "2021-12-08T22:59:01.450813+",
> "last_active": "0.00",
> "last_peered": "0.00",
> "last_clean": "0.00",
> "last_became_active": "0.00",
> "last_became_peered": "0.00",
> "last_unstale": "2021-12-08T22:59:01.450813+",
> "last_undegraded": "2021-12-08T22:59:01.450813+",
> "last_fullsized": "2021-12-08T22:59:01.450813+",
> "mapping_epoch": 12191,
> "log_start": "3174'8638622",
> "ondisk_log_start": "3174'8638622",
> "created": 3286,
> "last_epoch_clean": 3544,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "3162'8635517",
> "last_scrub_stamp": "2021-12-06T22:16:26.082945+",
> "last_deep_scrub": "3122'8608768",
> "last_

[ceph-users] Re: 16.2.6 Convert Docker to Podman?

2021-12-10 Thread Marco Pizzolo
Robert, Roman and Weiwen Hu,

Thank you very much for your responses.  I presume one host at a time, and
the redeploy will take care of any configuration, with nothing further
being necessary?

Thank you.

Marco

On Fri, Dec 10, 2021 at 7:36 AM 胡玮文  wrote:

> On Fri, Dec 10, 2021 at 01:12:56AM +0100, Roman Steinhart wrote:
> > hi,
> >
> > recently I had to switch the other way around (from podman to docker).
> > I just...
> > - stopped all daemons on a host with "systemctl stop ceph-{uuid}@*"
> > - purged podman
> > - triggered a redeploy for every daemon with "ceph orch daemon redeploy
> > osd.{id}"
> >
> > ~ Roman
>
> We have switched to podman with similar process. Stop systemd units,
> install
> podman, redeploy, and done.
>
> Weiwen Hu
>
> > On Thu, 9 Dec 2021 at 16:27, Marco Pizzolo 
> wrote:
> >
> > > Hello Everyone,
> > >
> > > In an attempt to futureproof, I am beginning to look for information
> on how
> > > one would go about moving to podman from docker on a cephadm 16.2.6
> > > installation on ubuntu 20.04.3.
> > >
> > > I would be interested to know if anyone else has contemplated or
> performed
> > > something similar, and what their findings were.
> > >
> > > Appreciate any insight you can share.
> > >
> > > Thanks,
> > > ___
> > > ceph-users mailing list -- ceph-users@ceph.io
> > > To unsubscribe send an email to ceph-users-le...@ceph.io
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: 16.2.6 Convert Docker to Podman?

2021-12-10 Thread Marco Pizzolo
Forgot to confirm: was this process non-destructive in terms of the data in the
OSDs?
Thanks again,

On Fri, Dec 10, 2021 at 9:23 AM Marco Pizzolo 
wrote:

> Robert, Roman and Weiwen Hu,
>
> Thank you very much for your responses.  I presume one host at a time, and
> the redeploy will take care of any configuration, with nothing further
> being necessary?
>
> Thank you.
>
> Marco
>
> On Fri, Dec 10, 2021 at 7:36 AM 胡玮文  wrote:
>
>> On Fri, Dec 10, 2021 at 01:12:56AM +0100, Roman Steinhart wrote:
>> > hi,
>> >
>> > recently I had to switch the other way around (from podman to docker).
>> > I just...
>> > - stopped all daemons on a host with "systemctl stop ceph-{uuid}@*"
>> > - purged podman
>> > - triggered a redeploy for every daemon with "ceph orch daemon redeploy
>> > osd.{id}"
>> >
>> > ~ Roman
>>
>> We have switched to podman with similar process. Stop systemd units,
>> install
>> podman, redeploy, and done.
>>
>> Weiwen Hu
>>
>> > On Thu, 9 Dec 2021 at 16:27, Marco Pizzolo 
>> wrote:
>> >
>> > > Hello Everyone,
>> > >
>> > > In an attempt to futureproof, I am beginning to look for information
>> on how
>> > > one would go about moving to podman from docker on a cephadm 16.2.6
>> > > installation on ubuntu 20.04.3.
>> > >
>> > > I would be interested to know if anyone else has contemplated or
>> performed
>> > > something similar, and what their findings were.
>> > >
>> > > Appreciate any insight you can share.
>> > >
>> > > Thanks,
>> > > ___
>> > > ceph-users mailing list -- ceph-users@ceph.io
>> > > To unsubscribe send an email to ceph-users-le...@ceph.io
>> > >
>> > ___
>> > ceph-users mailing list -- ceph-users@ceph.io
>> > To unsubscribe send an email to ceph-users-le...@ceph.io
>>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Experience reducing size 3 to 2 on production cluster?

2021-12-10 Thread Marco Pizzolo
Hello,

As part of a migration process where we will be swinging Ceph hosts from
one cluster to another, we need to reduce the size from 3 to 2 in order to
shrink the footprint sufficiently to allow safe removal of an OSD/mon node.

The cluster has about 500M objects as per the dashboard, and is about 1.5PB in
size, comprised solely of small files served through CephFS to Samba.

Has anyone encountered a similar situation?  What (if any) problems did you
face?

Ceph 14.2.22 bare metal deployment on Centos.

Thanks in advance.

Marco
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cephfs kernel client + snapshots slowness

2021-12-10 Thread Sebastian Knust

Hi,

I also see this behaviour and can more or less reproduce it by running 
rsync or Bareos backup tasks (anything stat-intensive should do) on a 
specific directory. Unmounting and then remounting the filesystem fixes 
it, until it is triggered again by a stat-intensive task.


For me, I only saw two immediate solutions:
1) Use the fuse client, which does not seem to exhibit this issue but is in 
general approx. one order of magnitude slower for metadata ops than the 
kernel client.

2) Do not use snapshots.

I grudgingly chose the second option, since 1) is too slow for my 
nightly backups.


A ticket has been opened at https://tracker.ceph.com/issues/44100, 
showing a few people running into these issues. Maybe your insights 
regarding the cap limit and the possible mitigation of either lowering 
the limit or raising it (above the critical number for your specific use 
case, whichever that might be) can help others as a workaround and might 
point the developers to the cause in the kernel client.


I myself will experiment with raising mds_max_caps_per_client, as I only 
have a handful of clients, all of which are heavily used. So thanks for 
recommending looking into that config value.
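
For anyone wanting to poke at the same knob, the per-session cap counts and the
limit itself can be inspected and changed centrally; a small sketch, where the
MDS name and the new value are placeholders rather than recommendations:

```
ceph daemon mds.a session ls | grep num_caps         # caps currently held by each client session
ceph config get mds mds_max_caps_per_client          # current limit
ceph config set mds mds_max_caps_per_client 2097152  # example value only
```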


Cheers
Sebastian


On 10.12.21 17:33, Andras Pataki wrote:

Hi,

We've recently started using cephfs snapshots and are running into some 
issues with the kernel client.  It seems like traversing the file system 
and stat'ing files have become extremely slow.  Some (not all) stat 
operations went from microseconds to hundreds of milliseconds in 
duration.  The details are from a node that runs a file system scan:


The lstat calls of the file system scan take 200ms+, here is the strace 
-T -tt output:


07:05:28.909309 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", 
{st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", 
{st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", 
{st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", 
{st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>


The file system scanning process is constantly stuck in 'D' state with 
/proc//stack:


[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

One of the kernel threads is using close to 100% of a single core 
constantly:
    PID USER  PR  NI  VIRT  RES  SHR  S  %CPU  %MEM    TIME+  COMMAND
1350556 root  20   0     0    0    0  R 100.0   0.0  4:52.85  [kworker/18:0+ceph-msgr]


Dumping the kernel stacks (echo l > /proc/sysrq-trigger) typically shows 
threads waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of 
examples:


Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: 
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. 
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr 
ceph_con_workfn [libceph]

Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f 
b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 
ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 
1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:c9001b607c50 
EFLAGS: 0246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX:  RBX: 
88de44ede7e8 RCX: 
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0001 RSI: 
 RDI: 88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: 88de0a43ca00 R08: 
88dec0c63120 R09: 88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: a07e50e9 R11: 
88de36aab878 R12: 88de1f3cab00
Dec 10 07:36:56 popeye-mgr

[ceph-users] CephFS single file size limit and performance impact

2021-12-10 Thread huxia...@horebdata.cn
Dear Ceph experts,

I have a use case in which the size of a single file may go beyond 50TB, 
and would like to know whether CephFS can support a single file larger than 
50TB. Furthermore, if multiple clients, say 50, want to access (read/modify) 
this big file, should we expect any performance issues, e.g. something like a big 
lock on the whole file? I also wonder whether CephFS supports parallel access, i.e. 
multiple clients reading/writing different parts of the same big 
file...

Comments, suggestions, experience are highly appreciated,

Kind regards,

Samuel 



huxia...@horebdata.cn
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Cephalocon 2022 deadline extended?

2021-12-10 Thread Bobby
Hi all,

Has the CfP deadline for Cephalocon 2022 been extended to 19 December 2022?
Please confirm if anyone knows it...


Thanks
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephalocon 2022 deadline extended?

2021-12-10 Thread Matt Vandermeulen
It appears to have been, and we have an application that's pending an 
internal review before we can submit... so we're hopeful that it has 
been!




On 2021-12-10 15:21, Bobby wrote:

Hi all,

Has the CfP deadline for Cephalocon 2022 been extended to 19 December
2022? Please confirm if anyone knows it...

Thanks
___
Dev mailing list -- d...@ceph.io
To unsubscribe send an email to dev-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Cephalocon 2022 deadline extended?

2021-12-10 Thread Bobby
One typing mistake... I meant 19 December 2021.

On Fri, Dec 10, 2021 at 8:21 PM Bobby  wrote:

>
> Hi all,
>
> Has the CfP deadline for Cephalocon 2022 been extended to 19 December
> 2022? Please confirm if anyone knows it...
>
>
> Thanks
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Experience reducing size 3 to 2 on production cluster?

2021-12-10 Thread Wesley Dillingham
I would avoid doing this. Size 2 is not where you want to be. Maybe you can
give more details about your cluster size and shape and what you are trying
to accomplish, and another solution could be proposed. The output of "ceph
osd tree" and "ceph df" would help.
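
For completeness, the replica count being discussed is a per-pool setting, so
the mechanics would look roughly like the lines below (the pool name is a
placeholder, and the caution above still stands):

```
ceph osd pool get cephfs_data size       # current replica count
ceph osd pool get cephfs_data min_size   # minimum replicas that must be up for I/O to be served
ceph osd pool set cephfs_data size 2     # the change under discussion - advised against above
```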

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Dec 10, 2021 at 12:05 PM Marco Pizzolo 
wrote:

> Hello,
>
> As part of a migration process where we will be swinging Ceph hosts from
> one cluster to another we need to reduce the size from 3 to 2 in order to
> shrink the footprint sufficiently to allow safe removal of an OSD/Mon node.
>
> The cluster has about 500M objects as per dashboard, and is about 1.5PB in
> size comprised solely of small files served through CephFS to Samba.
>
> Has anyone encountered a similar situation?  What (if any) problems did you
> face?
>
> Ceph 14.2.22 bare metal deployment on Centos.
>
> Thanks in advance.
>
> Marco
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: CephFS single file size limit and performance impact

2021-12-10 Thread Yan, Zheng
On Sat, Dec 11, 2021 at 2:21 AM huxia...@horebdata.cn
 wrote:
>
> Dear Ceph experts,
>
> I encounter a use case wherein the size of a single file may go beyound 50TB, 
> and would like to know whether CephFS can support a single file with size 
> over 50TB? Furthermore, if multiple clients, say 50, want to access 
> (read/modify) this big file, do we expect any performance issues, e.g. 
> something like a big lock on the whole file. I wonder whether Cephfs supports 
> the so-called parallel feature like multiple clients can read/write different 
> parts of the same big file...
>
> Comments, suggestions, experience are highly appreciated,
>

The problem is file recovery. (If a client that has the file open in write
mode disconnects abnormally, the MDS needs to probe the file's objects to
recover the mtime and file size.) Operations such as stat(2) hang while the file
is in recovery. For a very large file, the recovery process may take a
long time.
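
One practical detail worth checking in addition to the recovery concern: CephFS
enforces a per-filesystem max_file_size (which defaults to 1 TiB), so it would
need to be raised before a 50 TB file could even be created. A sketch, assuming
the filesystem is named "cephfs":

```
ceph fs get cephfs | grep max_file_size          # show the current limit
ceph fs set cephfs max_file_size 70368744177664  # e.g. 64 TiB, in bytes (example value)
```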
> Kind regards,
>
> Samuel
>
>
>
> huxia...@horebdata.cn
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io