Thanks for your patience.
host ceph06 isn't referenced in the config database. I think I've
finally purged it. I also reset the dashboard API host address from
ceph08 to dell02. But since prometheus isn't running on dell02 either,
there's no gain there.
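(For what it's worth, assuming the dashboard is the piece pointed at the
wrong host, its Prometheus/Alertmanager targets can be adjusted directly;
the URLs below are placeholders for wherever those services actually run:

  ceph dashboard set-prometheus-api-host 'http://dell02.mousetech.com:9095'
  ceph dashboard set-alertmanager-api-host 'http://dell02.mousetech.com:9093'

Of course that only helps once something is actually listening there.)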
I did clear some of that lint out via
you try to deploy
prometheus again. So my recommendation is to get into HEALTH_OK first.
And btw, "TOO_MANY_PGS: too many PGs per OSD (648 > max 560)" is
serious: you can end up with inactive PGs during recovery, so I'd also
consider checking the pools and their PGs.
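A few standard commands make it easy to see where the PGs are piling up
(pool names will obviously differ on your cluster):

  ceph osd pool ls detail          # pg_num/pgp_num per pool
  ceph osd pool autoscale-status   # what the autoscaler thinks pg_num should be
  ceph osd df                      # PGs per OSD in the PGS column

If one pool was created with an oversized pg_num, shrinking it (or letting
the autoscaler do it) is usually enough to clear the warning.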
Quote from Tim
I don't think Linus is only concerned with public data, no.
The United States Government has had in place for many years effective
means of preserving their data. Some of those systems may be old and
creaky, granted, and not always the most efficient, but they suffice.
The problem is that th
That's quite a large number of storage units per machine.
My suspicion is that since you apparently have an unusually high number
of LVs coming online at boot, the time it takes to linearly activate
them is long enough to overlap with the point in time that ceph starts
bringing up its storage-
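One rough way to test that theory is to compare the LVM activation times
against the OSD start times in the journal (unit names vary a bit by
distro, so treat these as a sketch):

  systemd-analyze blame | grep -i lvm     # how long the LVM units took at boot
  journalctl -b | grep -i 'lvm\|osd'      # when LVs came up vs. when OSDs started
  lvs | wc -l                             # how many LVs have to be activated

If the OSD units start before the last LVs are active, that would match
the failures you're seeing.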
t; parameter, I started to wonder. Since it's the only
"extra" paramter in your spec file, I must assume that it's the root
cause of the failures. That would probably easy to test... if you
still have the patience. ;-)
Quote from Tim Holloway:
Let me close this out by
We're glad to have been of help.
There is no One Size Fits All solution. For you, it seems that speed is
more important than high availability. For me, it's HA+redundancy.
Ceph has 3 ways to deliver data to remote clients:
1. As a direct ceph mount on the client. From experience, this is a pa
I'm going to chime in and vote with the majority. You do not sound like
an ideal user for ceph.
Ceph provides high availability by being able to lose both drives /and
servers/ in a transparent manner. With only one server, you have a
single point of failure, and that's not considered "high availabi
blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/pybind/mgr/cephadm/utils.py#L141
or
https://github.com/ceph/ceph/blob/be5dba538167f282c4ec74ea3cae958c8bd79830/src/python-common/ceph/deployment/utils.py#L58
where one tries a "_dns_lookup" and the other a "resolve_ip"
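For what it's worth, the difference is easy to poke at from the shell on
the affected host (the hostname is just the one from the earlier error
message; substitute your own):

  getent hosts ceph06.internal.mousetech.com   # resolver view: /etc/hosts + DNS
  dig +short A ceph06.internal.mousetech.com   # DNS-only view

If one of those succeeds and the other doesn't, that would explain why one
code path works while the other throws gaierror(-2, 'Name or service not
known').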
version again, considering
that's where the problem started. I also didn't attempt to try an "orch
apply" that omitted any running service host, for fear it wouldn't
remove the omitted host.
So the problem is fixed, but it took a lot of banging and hammering to
mak
As Eugen has noted, cephadm/containers were already available in
Octopus. In fact, thanks to the somewhat scrambled nature of the
documents, I had a mix of both containerized and legacy OSDs under
Octopus and for the most part had no issues attributable to that.
My bigger problem with Octopus
Well, I just upgraded Pacific to Reef 18.2.4 and the only problems I
ran into had been problems previously seen in Pacific.
Your Mileage May Vary, as the only part of Ceph I put a strain on is
deploying stuff and adding/removing OSDs, but for the rest, I'd look to
recent Reef complaints on the lis
Yeah, Ceph in its current form doesn't seem like a good fit.
I think that what we need to support the world's knowledge in the face
of enstupidification is some sort of distributed holographic datastore.
So, like Ceph's PG replication, a torrent-like ability to pull from
multiple unreliable so
Hi Alex,
I think one of the scariest things about your setup is that there are
only 4 nodes (I'm assuming that means Ceph hosts carrying OSDs). I've
been bouncing around different configurations lately between some of my
deployment issues and cranky old hardware and I presently am down to 4
h
I just checked an OSD and the "block" entry is indeed linked to storage
using a /dev/mapper uuid LV, not a /dev/device. When ceph builds an
LV-based OSD, it creates a VG whose name is "ceph-<uuid>", where "<uuid>"
is a UUID, and an LV named "osd-block-<uuid>", where "<uuid>" is also a
uuid. So althoug
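If anyone wants to see that mapping for themselves, something along these
lines works on an OSD host (paths differ between legacy and cephadm
layouts):

  ls -l /var/lib/ceph/osd/ceph-*/block    # legacy layout
  ls -l /var/lib/ceph/*/osd.*/block       # cephadm layout, keyed by fsid
  lvs -o lv_name,vg_name,lv_tags          # ceph-volume stores the osd id/fsid as LV tags

The LV tags are what ceph-volume reads to find and activate the right OSD
on that device.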
topics and the like.
I considered OIDs as an alternative, but LDAP names are more
human-friendly and easier to add sub-domains to without petitioning a
master registrar. Also there's a better option for adding attributes to
the entry description.
On 4/7/25 09:39, Tim Holloway wrote:
Ye
Peter,
I don't think udev factors in based on the original question. Firstly,
because I'm not sure udev deals with permanently-attached devices (it's
more for hot-swap items). Secondly, because the original complaint
mentioned LVM specifically.
I agree that the hosts seem overloaded, by the
Sounds like a discussion for a discord server. Or BlueSky or something
that's very definitely NOT what used to be known as twitter.
My viewpoint is a little different. I really didn't consider HIPAA
stuff, although since technically that is info that shouldn't be
accessible to anyone but autho
hester Institute of Technology
________
From: Tim Holloway
Sent: Saturday, April 12, 2025 1:13:05 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: nodes with high density of OSDs
When I first migrated to Ceph, my servers were all running CentOS 7,
which I (wrongly) thought could not
cated on its host, but it
seems like it should be possible to carry a copy on the actual device.
Tim
On 4/11/25 16:23, Anthony D'Atri wrote:
Filestore, pre-ceph-volume may have been entirely different. IIRC LVM is used
these days to exploit persistent metadata tags.
On Apr 11, 2025, a
LVs, after all.
Tim
On 4/12/25 10:25, Gregory Orange wrote:
On 12/4/25 20:56, Tim Holloway wrote:
Which brings up something I've wondered about for some time. Shouldn't
it be possible for OSDs to be portable? That is, if a box goes bad, in
theory I should be able to remove the drive a
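If the LVM tags really do carry the metadata, then moving the drive and
re-activating it on the new box should be roughly this (hedged sketch; the
host name is a placeholder, and the cluster fsid/keyrings have to match):

  ceph cephadm osd activate <newhost>   # cephadm scans the host and activates existing OSDs
  # or, on a non-cephadm host:
  ceph-volume lvm activate --all        # activates any OSD LVs it finds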
4/11/25 16:23, Anthony D'Atri wrote:
Filestore, pre-ceph-volume may have been entirely different. IIRC LVM is used
these days to exploit persistent metadata tags.
On Apr 11, 2025, at 4:03 PM, Tim Holloway wrote:
I just checked an OSD and the "block" entry is indeed linked to sto
d, therefore creating a false sense of confidence.
You might start by having a look at the Prometheus and/or Alertmanager web UIs, or
checking their logs.
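The mgr will tell you what it thinks those endpoints are, which makes a
quick sanity check easy (the curl target is whatever the first command
reports):

  ceph mgr services                             # URLs the mgr has registered (dashboard, prometheus)
  ceph orch ps --daemon-type prometheus         # where the prometheus daemon actually runs
  curl -s http://<reported-host>:9095/-/ready   # prometheus readiness endpoint, if it's up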
Kind Regards,
Ernesto
On Tue, Apr 15, 2025 at 7:28 PM Tim Holloway wrote:
Although I've had this problem since at least Pacific, I
There are likely simpler answers if you want to tier entire buckets, but
it sounds like you are hosting a filesystem(s) on NetApp and want to
tier them. It would be nice to have NetApp running Ceph as a block
store, but I don't think crush is sophisticated enough to migrate
components of a file
Although I've had this problem since at least Pacific, I'm still seeing
it on Reef.
After much pain and suffering (covered elsewhere), I got my Prometheus
services deployed as intended, Ceph health OK, green across the board.
However, over the weekend, the dreaded
"CephMgrPrometheusModuleInactive
Hi Alex,
"Cost concerns" is the fig leaf that is being used in many cases, but
often a closer look indicates political motivations.
The current US administration is actively engaged in the destruction of
anything that would conflict with their view of the world. That includes
health practic
I haven't had the need for capacity or speed that many ceph users do,
but I AM insistent on reliability, and ceph has never failed me on that
point even when I've made a wreck of my hardware and/or configuration.
I don't think that it was explicitly stated, but I'm pretty sure that
Ceph doesn't (a
nesto Puerta wrote:
You could check Alertmanager container logs
<https://docs.ceph.com/en/quincy/cephadm/operations/#example-of-logging-to-journald>
.
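For a cephadm-managed alertmanager that usually boils down to something
like this (the daemon name is a placeholder; the first command shows the
real one, and cephadm logs has to be run on the host carrying the daemon):

  ceph orch ps --daemon-type alertmanager
  cephadm logs --name alertmanager.<host>   # journald logs for that container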
Kind Regards,
Ernesto
On Wed, Apr 16, 2025 at 4:54 PM Tim Holloway wrote:
I'm thinking it's more some sort of latency error.
I h
.
Anyway, thanks all for the help!
Tim
On 4/21/25 09:46, Tim Holloway wrote:
Thanks, but all I'm getting is the following every 10 minutes from the
prometheus nodes:
Apr 21 09:29:32 dell02.mousetech.com podman[997331]: 2025-04-21
09:29:32.252358201 -040
I'm coming in late so I don't know the whole story here, but that
name's indicative of a Managed (containerized) resource.
You can't manually construct, delete or change the systemd services for
such items. I learned that the hard way. The service
declaration/control files are dynamically created
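The practical upshot is that managed daemons get handled through the
orchestrator rather than systemctl; roughly (daemon names are examples,
use whatever "ceph orch ps" reports):

  ceph orch ps                                   # list managed daemons and their state
  ceph orch daemon restart prometheus.dell02     # restart one managed daemon
  ceph orch daemon rm prometheus.dell02 --force  # remove one; cephadm regenerates the unit files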
reboot and in the process flush out any other
issues that might have arisen.
On Thu, 2025-02-27 at 15:47 -0500, Anthony D'Atri wrote:
>
>
> > On Feb 27, 2025, at 8:14 AM, Tim Holloway
> > wrote:
> >
> > System is now stable. The rebalancing was doing what it shou
ince I not only maintain Ceph, but
every other service on the farm, including appservers, LDAP, NFS, DNS,
and much more, I haven't had the luxury to dig into Ceph as deeply as
I'd like, so the fact that it works so well under such shoddy
administration is also a point in its favor.
then looking
> into/editing/removing ceph-config keys like 'mgr/cephadm/inventory'
> and 'mgr/cephadm/host.ceph07.internal.mousetech.com' that 'ceph
> config-key dump' output shows might help.
>
> Regards,
> Frédéric.
>
> - On 25 Feb 25,
Ack. Another fine mess.
I was trying to clean things up, and the process of tossing around OSDs
kept getting me reports of slow responses and hanging PG operations.
This is Ceph Pacific, by the way.
I found a deprecated server that claimed to have an OSD even though it
didn't show in either "cep
le or set of OSDs that it seemed to hang on, I just picked a server
with the most OSDs reported and rebooted that one. I suspect, however,
that any server would have done.
Thanks,
Tim
On Thu, 2025-02-27 at 08:28 +0100, Frédéric Nass wrote:
>
>
> - On 26 Feb 25, at 16:40,
Only the stuff that defines the rgw daemon on dell02.
On 3/28/25 19:23, Eugen Block wrote:
Do you find anything related to dell02 in config dump?
ceph config dump | grep -C2 dell02
Quote from Tim Holloway:
I'm guessing that the configuration issues come from the dashboard
wantin
other services.
Quote from Tim Holloway:
Thanks for the info on removing stubborn dead OSDs. The actual syntax
required was:
cephadm rm-daemon --name osd.2 --fsid --force
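For the record, the fsid to pass there is just the cluster id:

  ceph fsid    # prints the cluster fsid to paste after --fsid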
On the "too many pgs", that's because I'm down 2 OSDs. I've got new
drives, but they were wait
rvice
not known'); too many PGs per OSD (648 > max 560)
[ERR] MGR_MODULE_ERROR: Module 'prometheus' has failed: gaierror(-2,
'Name or service not known')
Module 'prometheus' has failed: gaierror(-2, 'Name or service not
known')
[WRN] TOO_MANY_PGS:
server VMs, though!
On 3/28/25 16:53, Anthony D'Atri wrote:
On Mar 28, 2025, at 4:38 PM, Tim Holloway wrote:
We're glad to have been of help.
There is no One Size Fits All solution. For you, it seems that speed is more
important than high availability. For me, it's HA+redundan
Almost forgot to say. I switched out disks and got rid of the OSD
errors. I actually found a third independent location, so it should be a
lot more failure resistant now.
So now it's only the prometheus stuff that's still complaining.
Everything else is happy.
mgr mgr/prometheus/server_port 9095
Those should be:
ceph config get mgr mgr/prometheus/server_addr
0.0.0.0
ceph config get mgr mgr/prometheus/server_port
9283
I assume that's why the module is still failing. Can you give that a
try and report back?
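For reference, putting the module back to its defaults and bouncing it
would look roughly like this (these are the stock values; adjust if you
changed them deliberately):

  ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
  ceph config set mgr mgr/prometheus/server_port 9283
  ceph mgr module disable prometheus
  ceph mgr module enable prometheus   # restart the module so it rebinds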
Quote from Tim Holloway:
OK. I didn'
theus node.
On 3/29/25 05:13, Eugen Block wrote:
How about this:
ceph config-key dump | grep -v history
Can you spot any key regarding dell02 that doesn't belong there?
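Grepping for the host directly keeps the noise down, and a stale key can
be removed once you're sure it's bogus (the key below is only an example
of the shape such entries take):

  ceph config-key dump | grep dell02
  ceph config-key rm mgr/cephadm/host.dell02.mousetech.com   # only if it's genuinely stale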
Quote from Tim Holloway:
Only the stuff that defines the rgw daemon on dell02.
On 3/28/25 19:23, Eugen Block wrote:
D
t 1 more
[ERR] MGR_MODULE_ERROR: 2 mgr modules have failed
Module 'cephadm' has failed: 'ceph06.internal.mousetech.com'
Module 'prometheus' has failed: gaierror(-2, 'Name or service not
known')
[WRN] TOO_MANY_PGS: too many PGs per OSD (648 > m
y other service (mon, mgr, etc.). To get the daemon
logs you need to provide the daemon name (prometheus.ceph02 and so on),
not just the service name (prometheus).
Can you run the cephadm command I provided? It should show something
like I pasted in the previous message.
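In other words, something like this (host names are whatever "ceph orch
ps" shows on your cluster):

  ceph orch ps --daemon-type prometheus   # lists the full daemon names
  cephadm logs --name prometheus.dell02   # run on the host carrying that daemon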
Quote from Tim Hollo
I think that information is located in the ceph configuration
database, which is edited via the "ceph config set" command, should be
readable via the "ceph config get" command, and can probably also be
browsed via the config viewer in the Ceph dashboard.
As I mentioned earlier, /etc/ceph doesn't car
I use Puppet for my complex servers, but my Ceph machines are
lightweight, and Puppet, for all its virtues, does require a Puppet agent
to be installed on each target and have a corresponding node manifest.
For the Ceph machines I just use Ansible. Since all my /etc/ceph files
are identical on
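Since the files are identical, even a plain shell loop does the same job
as an Ansible play (host names below are made up; substitute your own):

  for h in ceph01 ceph02 ceph03; do
      rsync -a /etc/ceph/ceph.conf /etc/ceph/ceph.client.admin.keyring "root@$h:/etc/ceph/"
  done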
h wrote:
On 10/06/2025 at 09:56:34-0400, Tim Holloway wrote
Hi,
I use Puppet for my complex servers, but my Ceph machines are lightweight,
and Puppet, for all its virtues, does require a Puppet agent to be installed
on each target and have a corresponding node manifest.
For the Ceph machines
I think you're a bit confused. Then again, when it comes to LDAP, I'm
usually more than a bit confused myself.
Generally, there are 2 ways to authenticate to LDAP:
1. Connect via a binddn and do an LDAP lookup
2. Connect via a user search to test for found/not-found
Option 1 requires a "unive
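As a concrete illustration of the two approaches (server, base DN and
user below are all placeholders):

  # 1. bind with a dedicated binddn, then look the user's entry up
  ldapsearch -x -H ldap://ldap.example.com -D "cn=svc,dc=example,dc=com" -w secret \
      -b "ou=people,dc=example,dc=com" "(uid=jdoe)"

  # 2. plain search that only tests whether the user exists (found / not found)
  ldapsearch -x -H ldap://ldap.example.com \
      -b "ou=people,dc=example,dc=com" "(uid=jdoe)" dn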