[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-18 Thread Ilya Dryomov
On Fri, Apr 15, 2022 at 3:10 AM David Galloway  wrote:
>
> For transparency and posterity's sake...
>
> I tried upgrading the LRC and the first two mgrs upgraded fine but
> reesi004 threw an error.
>
> Apr 14 22:54:36 reesi004 podman[2042265]: 2022-04-14 22:54:36.210874346
> + UTC m=+0.138897862 container create
> 3991bea0a86f55679f9892b3fbceeef558dd1edad94eb4bf73deebf6595bcc99
> (image=quay.ceph.io/ceph-ci/ceph@sha256:230120c6a429af7546b91180a3da39846e760787580d7b5193487
> Apr 14 22:54:36 reesi004 bash[2042070]: Error: OCI runtime error:
> writing file `pids.max`: Invalid argument
>
> Adam and I suspected we needed
> https://github.com/ceph/ceph/pull/45853#issue-1200032778 so I took the
> tip of quincy, cherry-picked that PR and pushed to dgalloway-quincy-fix
> in ceph-ci.git.  Then I waited for packages and a container to get built
> and attempted to upgrade the LRC to that container version.
>
> Same error though.  So I'm leaving it for the weekend.  We have two MGRs
> that *did* upgrade to the tip of quincy but the rest of the containers
> are still running 17.1.0-5-g8299cd4c.

I don't think https://github.com/ceph/ceph/pull/45853 would help.
The problem appears to be that --pids-limit=-1 just doesn't work on
older podman versions.  "-1" is not massaged there; it is written
as-is to /sys/fs/cgroup/pids/.../pids.max, which fails because the
pids.max file expects either a non-negative integer or "max" [1].
I don't understand how some of the other manager daemons managed to
upgrade, though, since the LRC nodes appear to be running Ubuntu 18.04
LTS with an older podman:

$ podman --version
podman version 3.0.1

This was reported in [2] and addressed in podman in [3] fairly
recently.  Their fix was to treat "-1" the same as "0", since older
podman versions insisted on "0" for unlimited and "-1" either never
worked or stopped working a long time ago.  docker seems to accept
both "-1" and "0" for unlimited.
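
For reference, the kernel-side behaviour is easy to reproduce by hand
(illustrative only -- the cgroup path below is a placeholder, the real
scope name depends on the container, and the writes need root):

    $ echo max > /sys/fs/cgroup/pids/<scope>/pids.max    # accepted: "max" means unlimited
    $ echo 4096 > /sys/fs/cgroup/pids/<scope>/pids.max   # accepted: any non-negative integer
    $ echo -1 > /sys/fs/cgroup/pids/<scope>/pids.max     # rejected with EINVAL -- effectively
                                                         # what old podman ends up writing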

The best course of action would probably be to drop [4] from quincy,
getting it back to the 17.1.0 state (i.e. no --pids-limit option in
sight), and amend the original --pids-limit change in master so that
it works for all versions of podman.  The podman version is already
checked in a couple of places (e.g. CGROUPS_SPLIT_PODMAN_VERSION), so
it should be easy enough; or we could just unconditionally pass "0",
even though it is no longer documented.
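
To make the version-check approach concrete, here is a rough sketch of
the gate, written as shell purely for illustration (in cephadm it would
live next to the existing version checks); the 3.4 cutoff is just my
guess at which podman release picked up [3] and would need verifying:

    # sketch only: pick the --pids-limit value based on the podman version
    podman_ver=$(podman --version | awk '{print $3}')
    if printf '%s\n' 3.4 "$podman_ver" | sort -C -V; then
        pids_limit_arg='--pids-limit=-1'   # newer podman treats -1 as unlimited
    else
        pids_limit_arg='--pids-limit=0'    # older podman only understands 0 for unlimited
    fi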

(The reason for backporting [4] to quincy was to fix containerized
iSCSI deployments, where bumping into the default PID limit is just a
matter of scaling the number of exported LUNs.  It's been that way
since the initial pacific release though, so taking it out for now is
completely acceptable.)

[1] https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt
[2] https://github.com/containers/podman/issues/11782
[3] https://github.com/containers/podman/pull/11794
[4] https://github.com/ceph/ceph/pull/45576

Thanks,

Ilya


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-18 Thread Adam King
A patch to revert the pids-limit change has been opened
https://github.com/ceph/ceph/pull/45932. We created a build with that patch
put on top of the current quincy branch and are upgrading the LRC to it
now. So far it seems to be going alright. All the mgr, mon and crash
daemons have been upgraded with no issue and it is currently upgrading the
osds so the patch seems to be working.

Additionally, there was some investigation this morning to get the
cluster back into a good state. The mgr daemons were redeployed with a
build that includes https://github.com/ceph/ceph/pull/45853. While we
aren't going with that patch for now, the important part is that it
stops us from deploying further mgr daemons with the pids-limit set.
From that point, modifying each mgr daemon's unit.run file to remove
the --pids-limit option, restarting the mgr daemons' systemd units, and
then upgrading the cluster fully to that patch's image got things back
into a stable position. This confirmed that the pids-limit was what was
causing the issue in the cluster. From that point we didn't touch the
cluster further until this new upgrade to the build with the reversion.
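
For anyone who wants to reproduce the manual step: it amounted to
roughly the following on each mgr host (the fsid and daemon name are
placeholders, and the exact form of the flag in unit.run may differ):

    $ sed -i 's/ --pids-limit=[^ ]*//' /var/lib/ceph/<fsid>/mgr.<name>/unit.run
    $ systemctl restart ceph-<fsid>@mgr.<name>.service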

To summarize, the reversion on top of the current quincy branch seems to be
working okay and we should be ready to make a new final build based on that.

Thanks,
  - Adam King

On Mon, Apr 18, 2022 at 9:36 AM Ilya Dryomov  wrote:

> [snip]


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-18 Thread David Galloway
The LRC is upgraded but the same mgr did crash during the upgrade.  It 
is running now despite the crash.  Adam suspects it's due to earlier 
breakage.


https://pastebin.com/NWzzsNgk

Shall I start the build after https://github.com/ceph/ceph/pull/45932 gets merged?


On 4/18/22 14:02, Adam King wrote:
[snip]

[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-18 Thread Ilya Dryomov
On Mon, Apr 18, 2022 at 9:04 PM David Galloway  wrote:
>
> The LRC is upgraded but the same mgr did crash during the upgrade.  It is 
> running now despite the crash.  Adam suspects it's due to earlier breakage.
>
> https://pastebin.com/NWzzsNgk

src/mgr/DaemonServer.cc: 2946: FAILED
ceph_assert(pending_service_map.epoch > service_map.epoch)

I don't see any relation between this crash (assert failure) and the
--pids-limit handling annoyance.  Neha, could you please double check?

Thanks,

Ilya


[ceph-users] Re: Is it normal Ceph reports "Degraded data redundancy" in normal use?

2022-04-18 Thread Wesley Dillingham
If you mark an osd "out" but not down (i.e. you don't stop the daemon),
do the PGs go remapped, or do they go degraded then as well?

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, Apr 14, 2022 at 5:15 AM Kai Stian Olstad 
wrote:

> On 29.03.2022 14:56, Sandor Zeestraten wrote:
> > I was wondering if you ever found out anything more about this issue.
>
> Unfortunately no, so I turned it off.
>
>
> > I am running into similar degradation issues while running rados
> > bench on a new 16.2.6 cluster.
> > In our case it's with a replicated pool, but the degradation
> > problems also go away when we turn off the balancer.
>
> So this goes a long way toward confirming there is something wrong
> with the balancer, since we now see it on two different installations.
>
>
> --
> Kai Stian Olstad


[ceph-users] Aggressive Bluestore Compression Mode for client data only?

2022-04-18 Thread Wesley Dillingham
I would like to use bluestore compression (probably zstd level 3) to
compress my clients' data unless the incompressible hint is set
(aggressive mode), but I do not want to expose myself to the bug
described in this CERN talk ("Ceph bug of the year",
https://www.youtube.com/watch?v=_4HUR00oCGo) where the osd maps get
compressed, even though I realize the bug is fixed in lz4 (and I'm
probably not going to use lz4).

So my question is: if the compression mode is aggressive but is set on
a per-pool basis (compression_mode vs bluestore_compression_mode), does
ceph only attempt to compress the client data and not "bluestore/cluster
internal data" like osd maps etc.? Thanks.


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-18 Thread Neha Ojha
On Mon, Apr 18, 2022 at 12:34 PM Ilya Dryomov  wrote:
>
> On Mon, Apr 18, 2022 at 9:04 PM David Galloway  wrote:
> >
> > The LRC is upgraded but the same mgr did crash during the upgrade.  It is 
> > running now despite the crash.  Adam suspects it's due to earlier breakage.
> >
> > https://pastebin.com/NWzzsNgk
>
> src/mgr/DaemonServer.cc: 2946: FAILED
> ceph_assert(pending_service_map.epoch > service_map.epoch)
>
> I don't see any relation between this crash (assert failure) and the
> --pids-limit handling annoyance.  Neha, could you please double check?

I agree, these are not related. This assertion seems to be another
occurrence of https://tracker.ceph.com/issues/51835
(https://tracker.ceph.com/issues/48022 was an earlier fix for this
issue). I don't think this is a regression in Quincy, given some of
the latest telemetry reports in the tracker. Also, the assert seems
transient, and the mgr in the LRC seems to be running fine since then.
Having said that, I will increase the priority of this issue to chase
this bug down, but IMO, it does not qualify as a blocker for Quincy.

David, the release notes PR https://github.com/ceph/ceph/pull/45048
needs final review/approval, so I'd suggest we finish reviews and the
build today, but do the final release tomorrow. Does that sound
reasonable to you?

Thanks,
Neha





[ceph-users] Re: quincy v17.2.0 QE Validation status

2022-04-18 Thread David Galloway



On 4/18/22 16:45, Neha Ojha wrote:

[snip]
David, the release notes PR https://github.com/ceph/ceph/pull/45048
needs final review/approval, so I'd suggest we finish reviews and the
build today, but do the final release tomorrow. Does that sound
reasonable to you?


Sure.  There is no harm in starting the build right now anyway.






[ceph-users] which cdn tool for rgw in production

2022-04-18 Thread norman.kern

Hi guys,

I want to add a CDN service in front of my RGWs and provide a URL that
does not require authentication.

I have tested openresty, but I'm not sure it is suitable for
production. Which tool do you use in production?

Thanks.



[ceph-users] Re: cephfs-top doesn't work

2022-04-18 Thread Xiubo Li



On 4/19/22 3:43 AM, Vladimir Brik wrote:
Does anybody know why cephfs-top may only display header lines (date, 
client types, metric names) but no actual data?


When I run it, cephfs-top consumes quite a bit of the CPU and 
generates quite a bit of network traffic, but it doesn't actually 
display the data.


I poked around in the source code and it seems like it might be curses 
issue, but I am not sure.



Is there any data from `ceph fs perf stats`?
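
I.e., does something like the following show non-empty client metrics
(just a quick sanity check, independent of the curses UI):

    $ ceph fs perf stats | python3 -m json.tool | head -20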

I hit the same issue before; it was caused by a curses window-size
issue, which we fixed a long time ago. You could try enlarging your
terminal and trying again.


If that still doesn't work, please try reverting some of the recent
cephfs-top commits to see whether it works for you. Some new features
have been added recently.


-- Xiubo








[ceph-users] Ceph Multisite Cloud Sync Module

2022-04-18 Thread Mark Selby
I am trying to get the Ceph Multisite Cloud Sync module working with Amazon S3.
The docs are not clear on how the sync module is actually configured. I just
want a POC of the simplest possible config. Can anyone share the config and
radosgw-admin commands that were invoked to create a simple sync setup? The
closest docs that I have seen are
https://croit.io/blog/setting-up-ceph-cloud-sync-module and, to be honest, they
do not make a lot of sense.
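
For concreteness, the kind of thing I am after is something like the
following -- note this is an untested sketch, and every zone name,
endpoint, flag and tier-config key in it is a guess to be checked
against the docs for your release:

    # sketch only, not a working config
    radosgw-admin zone create --rgw-zonegroup=default --rgw-zone=cloud-sync \
        --endpoints=http://<rgw-host>:8080 --tier-type=cloud
    radosgw-admin zone modify --rgw-zone=cloud-sync \
        --tier-config=connection.endpoint=https://s3.amazonaws.com,connection.access_key=<key>,connection.secret=<secret>,target_path=<prefix>
    radosgw-admin period update --commit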

 

Thanks!

 

-- 

Mark Selby

Sr Linux Administrator, The Voleon Group

mse...@voleon.com 

 




[ceph-users] Re: cephfs-top doesn't work

2022-04-18 Thread Jos Collin

Do you have mounted clients? How many clients do you have?

Please see: https://tracker.ceph.com/issues/55197

On 19/04/22 01:13, Vladimir Brik wrote:
Does anybody know why cephfs-top may only display header lines (date, 
client types, metric names) but no actual data?


When I run it, cephfs-top consumes quite a bit of the CPU and 
generates quite a bit of network traffic, but it doesn't actually 
display the data.


I poked around in the source code and it seems like it might be curses 
issue, but I am not sure.



Vlad


