One of the main limitations of using CephFS is the requirement to reduce the
number of active MDS daemons to one during upgrades. As far as I can tell, this
has been a known problem since Luminous (~2017). This issue essentially
requires downtime during upgrades for any CephFS cluster that needs more than
one active MDS.
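For anyone following along, the reduction boils down to something like this (the filesystem name and the restored max_mds value are placeholders):
ceph fs set cephfs max_mds 1    # drop to a single active MDS before upgrading the MDS daemons
ceph fs set cephfs max_mds 2    # restore the original value once the upgrade is done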
Thanks David! This looks good now. :)
> On Jul 8, 2021, at 6:28 PM, David Galloway wrote:
>
> Done!
>
> On 7/8/21 3:51 PM, Bryan Stillwell wrote:
>> There appear to be arm64 packages built for Ubuntu Bionic, but not for
>> Focal. Any chance Focal packages can be built as well?
I upgraded one of my clusters to v16.2.5 today and now I'm seeing these
messages from 'ceph -W cephadm':
2021-07-08T22:01:55.356953+0000 mgr.excalibur.kuumco [ERR] Failed to apply
alertmanager spec AlertManagerSpec({'placement': PlacementSpec(count=1),
'service_type': 'alertmanager', 'service_i
There appear to be arm64 packages built for Ubuntu Bionic, but not for Focal.
Any chance Focal packages can be built as well?
Thanks,
Bryan
> On Jul 8, 2021, at 12:20 PM, David Galloway wrote:
>
ly complete any upgrades after that, which means the global container
image name was never changed.
Bryan
On Jun 1, 2021, at 9:38 AM, Bryan Stillwell <bstillw...@godaddy.com> wrote:
This morning I tried adding a mon node to my home Ceph cluster with the
following command:
ceph orch daemon add mon ether
This morning I tried adding a mon node to my home Ceph cluster with the
following command:
ceph orch daemon add mon ether
This seemed to work at first, but then it decided to remove it fairly quickly,
which broke the cluster because the mon. keyring was also removed:
2021-06-01T14:16:11.523210
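For what it's worth, one way to keep cephadm from reconciling a hand-added mon away is to declare the host in the mon service spec first (hostnames other than 'ether' are made up):
ceph orch apply mon --placement="host1 host2 ether"   # cephadm will then deploy the mon on 'ether' itself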
,1,14,0,19,8]p8[8,17,4,1,14,0,19,8]p8
2021-05-11T22:41:11.332885+ 2021-05-11T22:41:11.332885+
I'm now considering using device classes and assigning the OSDs to either hdd1
or hdd2... Unless someone has another idea?
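A rough sketch of the device-class approach (OSD ids and the profile name are made up):
ceph osd crush rm-device-class osd.0 osd.1     # the existing class has to be cleared first
ceph osd crush set-device-class hdd1 osd.0 osd.1
ceph osd crush set-device-class hdd2 osd.2 osd.3
ceph osd erasure-code-profile set ec-hdd1 k=4 m=2 crush-device-class=hdd1 crush-failure-domain=host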
Thanks,
Bryan
> On May 14, 2021, at 12:35 PM, Bryan Stillwell wrote:
> step choose indep 0 type host
> step chooseleaf indep 1 type osd
> step emit
>
> J.
>
> ‐‐‐ Original Message ‐‐‐
>
> On Wednesday, May 12th, 2021 at 17:58, Bryan Stillwell
> wrote:
>
>> I'm trying to figure out a CRUSH rule that will spread data out across my
>> cluster as much as possible, but not more than 2 chunks per host.
I'm looking for help in figuring out why cephadm isn't making any progress
after I told it to redeploy an mds daemon with:
ceph orch daemon redeploy mds.cephfs.aladdin.kgokhr ceph/ceph:v15.2.12
The output from 'ceph -W cephadm' just says:
2021-05-14T16:24:46.628084+0000 mgr.paris.glbvov [INF]
1 harrahs
1 mirage
2 mandalaybay
2 paris
...
Hopefully someone else will find this useful.
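For reference, a sketch of the kind of loop that produces per-host counts like the ones above (the pg id is made up):
for osd in $(ceph pg map 13.6f --format=json | jq -r '.up[]'); do
  ceph osd find $osd | jq -r '.crush_location.host'
done | sort | uniq -c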
Bryan
> On May 12, 2021, at 9:58 AM, Bryan Stillwell wrote:
>
> I'm trying to figure out a CRUSH rule that will spread data out across my
> cluster as much as possible, but not more than 2 chunks per host.
I'm trying to figure out a CRUSH rule that will spread data out across my
cluster as much as possible, but not more than 2 chunks per host.
If I use the default rule with an osd failure domain like this:
step take default
step choose indep 0 type osd
step emit
I get clustering of 3-4 chunks on some hosts.
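For an 8-chunk profile (e.g. 6+2) spread over at least 4 hosts, one rule shape that caps placement at 2 chunks per host is (untested sketch):
step take default
step choose indep 4 type host
step chooseleaf indep 2 type osd
step emit
The first step picks 4 distinct hosts and the second picks 2 OSDs under each, so no host can end up with more than 2 chunks.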
I tried upgrading my home cluster to 15.2.7 (from 15.2.5) today and it appears
to be entering a loop when trying to match docker images for ceph:v15.2.7:
2020-12-01T16:47:26.761950-0700 mgr.aladdin.liknom [INF] Upgrade: Checking mgr
daemons...
2020-12-01T16:47:26.769581-0700 mgr.aladdin.liknom [
I have a cluster running Nautilus where the bucket instance (backups.190) has
gone missing:
# radosgw-admin metadata list bucket | grep 'backups.19[0-1]' | sort
"backups.190",
"backups.191",
# radosgw-admin metadata list bucket.instance | grep 'backups.19[0-1]' | sort
"backups.191:00
The last two days we've experienced a couple short outages shortly after
setting both 'noscrub' and 'nodeep-scrub' on one of our largest Ceph clusters
(~2,200 OSDs). This cluster is running Nautilus (14.2.6) and setting/unsetting
these flags has been done many times in the past without a problem.
On Mar 24, 2020, at 5:38 AM, Abhishek Lekshmanan wrote:
> #. Upgrade monitors by installing the new packages and restarting the
> monitor daemons. For example, on each monitor host,::
>
> # systemctl restart ceph-mon.target
>
> Once all monitors are up, verify that the monitor upgrade is complete.
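The check for that boils down to looking for the octopus release in the mon map; a sketch (output wording approximate):
ceph mon dump | grep min_mon_release
# should report: min_mon_release 15 (octopus)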
Great work! Thanks to everyone involved!
One minor thing I've noticed so far with the Ubuntu Bionic build is that it's
reporting the release as an RC instead of 'stable':
$ ceph versions | grep octopus
"ceph version 15.2.0 (dc6a0b5c3cbf6a5e1d6d4f20b5ad466d76b96247) octopus
(rc)": 1
Bryan
I just noticed that arm64 packages only exist for xenial. Is there a reason
why bionic packages aren't being built?
Thanks,
Bryan
> On Dec 20, 2019, at 4:22 PM, Bryan Stillwell wrote:
>
> I was going to try adding an OSD to my home cluster using one of the 4GB
> Raspberry Pis today, but it appears that the Ubuntu Bionic arm64 repo is
> missing a bunch of packages:
I was going to try adding an OSD to my home cluster using one of the 4GB
Raspberry Pis today, but it appears that the Ubuntu Bionic arm64 repo is
missing a bunch of packages:
$ sudo grep ^Package:
/var/lib/apt/lists/download.ceph.com_debian-nautilus_dists_bionic_main_binary-arm64_Packages
Packa
From: Bryan Stillwell <bstillw...@godaddy.com>
Sent: Wednesday, December 18, 2019 4:44:45 PM
To: Sage Weil <s...@newdream.net>
On Dec 18, 2019, at 11:58 AM, Sage Weil <s...@newdream.net> wrote:
On Wed, 18 Dec 2019, Bryan Stillwell wrote:
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm
seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H').
On Dec 18, 2019, at 1:48 PM, e...@lapsus.org wrote:
>
> That sounds very similar to what I described there:
> https://tracker.ceph.com/issues/43364
I would agree that they're quite similar if not the same thing! Now that you
mention it I see the thread is named mgr-fin in 'top -H' as well. I
After upgrading one of our clusters from Nautilus 14.2.2 to Nautilus 14.2.5 I'm
seeing 100% CPU usage by a single ceph-mgr thread (found using 'top -H').
Attaching to the thread with strace shows a lot of mmap and munmap calls.
Here's the distribution after watching it for a few minutes:
48.7
roblem.
Bryan
On Dec 14, 2019, at 10:27 AM, Sasha Litvak <alexander.v.lit...@gmail.com> wrote:
Bryan,
Were you able to resolve this? If yes, can you please share with the list?
On Fri, Dec 13, 2019 at 10:08 AM Bryan Stillwell
alFrameEx
0.55% [kernel] [k] _raw_spin_unlock_irqrestore
I increased mon debugging to 20 and nothing stuck out to me.
Bryan
> On Dec 12, 2019, at 4:46 PM, Bryan Stillwell wrote:
>
> On our test cluster after upgrading to 14.2.5 I'm having problems with the
> mons pegging a CPU core while moving data around.
On our test cluster after upgrading to 14.2.5 I'm having problems with the mons
pegging a CPU core while moving data around. I'm currently converting the OSDs
from FileStore to BlueStore by marking the OSDs out in multiple nodes,
destroying the OSDs, and then recreating them with ceph-volume lvm
Rich,
What's your failure domain (osd? host? chassis? rack?) and how big is each of
them?
For example I have a failure domain of type rack in one of my clusters with
mostly even rack sizes:
# ceph osd crush rule dump | jq -r '.[].steps'
[
{
"op": "take",
"item": -1,
"item_name":
On Nov 18, 2019, at 8:12 AM, Dan van der Ster wrote:
>
> On Fri, Nov 15, 2019 at 4:45 PM Joao Eduardo Luis wrote:
>>
>> On 19/11/14 11:04AM, Gregory Farnum wrote:
>>> On Thu, Nov 14, 2019 at 8:14 AM Dan van der Ster
>>> wrote:
Hi Joao,
I might have found the reason why s
e of a solution yet so I'll stick with disabled balancer
> for now since the current pg placement is fine.
>
> Regards,
> Eugen
>
>
> [1] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg56994.html
> [2] https://www.mail-archive.com/ceph-users@lists.ceph.com/msg5
On multiple clusters we are seeing the mgr hang frequently when the balancer is
enabled. It seems that the balancer is getting caught in some kind of infinite
loop, which chews up all the CPU for the mgr and causes problems with other
modules like prometheus (we don't have the devicehealth module enabled).
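The workaround mentioned elsewhere in the thread amounts to turning the module off until the underlying bug is fixed:
ceph balancer off
ceph balancer status   # confirm it reports active: false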
Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Tue, Nov 19, 2019 at 8:42 PM Bryan Stillwell
> wrote:
>>
>> Closing the loop here. I
as to track down, maybe a check should be added before
enabling msgr2 to make sure the require-osd-release is set to nautilus?
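A quick pre-flight along those lines would be:
ceph osd dump | grep require_osd_release   # should say nautilus before msgr2 is enabled
ceph osd require-osd-release nautilus      # set it if it is still on an older release
ceph mon enable-msgr2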
Bryan
> On Nov 18, 2019, at 5:41 PM, Bryan Stillwell wrote:
>
> I cranked up debug_ms to 20 on two of these clusters today and I'm still not
> underst
18 16:46:05.979 7f917becf700 1 -- 10.0.13.2:0/3084510 learned_addr
learned my addr 10.0.13.2:0/3084510 (peer_addr_for_me v1:10.0.13.2:0/0)
The learned address is v1:10.0.13.2:0/0. What else can I do to figure out why
it's deciding to use the legacy protocol only?
Thanks,
Bryan
> On Nov 15
I've upgraded 7 of our clusters to Nautilus (14.2.4) and noticed that on some
of the clusters (3 out of 7) the OSDs aren't using msgr2 at all. Here's the
output for osd.0 on 2 clusters of each type:
### Cluster 1 (v1 only):
# ceph osd find 0 | jq -r '.addrs'
{
"addrvec": [
{
"type":
There are some bad links to the mailing list subscribe/unsubscribe/archives on
this page that should get updated:
https://ceph.io/resources/
The subscribe/unsubscribe/archives links point to the old lists vger and
lists.ceph.com, and not the new lists on lists.ceph.io:
ceph-devel
subscribe
With FileStore you can get the number of OSD maps for an OSD by using a simple
find command:
# rpm -q ceph
ceph-12.2.12-0.el7.x86_64
# find /var/lib/ceph/osd/ceph-420/current/meta/ -name 'osdmap*' | wc -l
42486
Does anyone know of an equivalent command that can be used with BlueStore?
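One backend-independent way to get the same number is the OSD's admin socket, which reports the oldest and newest map epochs it holds (run on the host carrying osd.420):
ceph daemon osd.420 status
# newest_map minus oldest_map gives the number of OSD maps the OSD is storing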
Thanks,
Bryan
Thanks Casey!
Adding the following to my swiftclient put_object call caused it to start
compressing the data:
headers={'x-object-storage-class': 'STANDARD'}
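The command-line swift client can do the same thing with an explicit header (container and object names are made up):
swift upload --header 'x-object-storage-class: STANDARD' mybucket image.qcow2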
I appreciate the help!
Bryan
> On Nov 7, 2019, at 9:26 AM, Casey Bodley wrote:
>
> On 11/7/19 10
port to nautilus
> in https://tracker.ceph.com/issues/41981.
>
> On 11/6/19 5:54 PM, Bryan Stillwell wrote:
>> Today I tried enabling RGW compression on a Nautilus 14.2.4 test cluster and
>> found it wasn't doing any compression at all. I figure I must have missed
>> something in the docs, but I haven't been able to find out what that is and
>> could use some help.
Today I tried enabling RGW compression on a Nautilus 14.2.4 test cluster and
found it wasn't doing any compression at all. I figure I must have missed
something in the docs, but I haven't been able to find out what that is and
could use some help.
This is the command I used to enable zlib-based compression.
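For reference, the documented way to enable zlib compression on a placement target looks like this (zone and placement names assume the defaults, and this isn't necessarily the exact command from the thread):
radosgw-admin zone placement modify --rgw-zone=default --placement-id=default-placement --compression=zlib
# restart the radosgw daemons afterwards for the change to take effect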
Responding to myself to follow up with what I found.
While going over the release notes for 14.2.3/14.2.4 I found this was a known
problem that has already been fixed. Upgrading the cluster to 14.2.4 fixed the
issue.
Bryan
> On Oct 30, 2019, at 10:33 AM, Bryan Stillwell wrote:
>
This morning I noticed that on a new cluster the number of PGs for the
default.rgw.buckets.data pool was way too small (just 8 PGs), but when I try to
split the PGs the cluster doesn't do anything:
# ceph osd pool set default.rgw.buckets.data pg_num 16
set pool 13 pg_num to 16
It seems to set t
lass to be used for new object uploads -
> just note that some 'helpful' s3 clients will insert a
> 'x-amz-storage-class: STANDARD' header to requests that don't specify
> one, and the presence of this header will override the user's default
> storage class.
3 7f0e16363700 0 mgr[dashboard]
> [29/Oct/2019:17:37:56] ENGINE Error in HTTPServer.tick
> Traceback (most recent call last):
> File
> "/usr/lib/python2.7/dist-packages/cherrypy/wsgiserver/__init__.py", line
> 2021, in start
>self.tick()
> File
> "/usr/li
On Oct 29, 2019, at 9:44 AM, Thomas Schneider <74cmo...@gmail.com> wrote:
> in my unhealthy cluster I cannot run several ceph osd command because
> they hang, e.g.
> ceph osd df
> ceph osd pg dump
>
> Also, ceph balancer status hangs.
>
> How can I fix this issue?
Check the status of your ceph-mgr daemons. Commands like 'ceph osd df', 'ceph
pg dump', and 'ceph balancer status' are served by the mgr, so they hang when
no mgr is active.
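A couple of commands for that check (the unit name assumes a standard systemd deployment):
ceph -s                              # the mgr: line under 'services' shows whether a mgr is active
systemctl restart ceph-mgr.target    # on the mgr host, if it is stuck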
I'm wondering if it's possible to enable compression on existing RGW buckets?
The cluster is running Luminous 12.2.12 with FileStore as the backend (no
BlueStore compression then).
We have a cluster that recently started to rapidly fill up with compressible
content (qcow2 images) and I would like to enable compression on those buckets.
* Setting the NOUP flag
* Taking the fragile OSD out
* Restarting the "fragile" OSDs
* Checking their logs to make sure everything is OK
* Taking off the NOUP flag
* Taking a coffee and waiting until all the data has drained (the corresponding commands are sketched below)
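In command form, that procedure is roughly (the OSD id is made up):
ceph osd set noup               # keep restarted OSDs from rejoining until their logs look clean
ceph osd out 12
systemctl restart ceph-osd@12
ceph osd unset noup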
[]'s
Arthur (aKa Guilherme Geronimo)
On 04/09/2019 15:32, Bryan Stillwell wrote:
We are
Sep 4, 2019, at 11:55 AM, Guilherme Geronimo <guilherme.geron...@gmail.com> wrote:
Hey Bryan,
I suppose all nodes are using jumboframes (mtu 9000), right?
I would suggest checking the OSD->MON communication.
Can you send the ou
Our test cluster is seeing a problem where peering is going incredibly slow
shortly after upgrading it to Nautilus (14.2.2) from Luminous (12.2.12).
From what I can tell it seems to be caused by "wait for new map" taking a long
time. When looking at dump_historic_slow_ops on pretty much any OSD
We've run into a problem this afternoon on our test cluster, which is running
Nautilus (14.2.2). It seems that any time PGs move on the cluster (from
marking an OSD down, setting the primary-affinity to 0, or by using the
balancer), a large number of the OSDs in the cluster peg the CPU cores they're
running on.