[ceph-users] 18.2.4 regression: 'diskprediction_local' has failed: No module named 'sklearn'
Upgraded to 18.2.4 yesterday. The cluster reported healthy a few minutes after the upgrade completed. Next morning, this:

# ceph health detail
HEALTH_ERR Module 'diskprediction_local' has failed: No module named 'sklearn'
[ERR] MGR_MODULE_ERROR: Module 'diskprediction_local' has failed: No module named 'sklearn'
    Module 'diskprediction_local' has failed: No module named 'sklearn'

Searching shows this was a problem several years ago, then resolved, and now it has returned. Any ideas?

Harry Coin
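A possible stopgap until the container image is fixed (it does nothing for the missing package itself) is to disable the failing module, or mute the alert it raises; both commands below use only the module name and health code shown in the output above:

    # stop loading the broken module so HEALTH_ERR clears
    ceph mgr module disable diskprediction_local

    # or keep the module enabled and silence just this alert
    ceph health mute MGR_MODULE_ERROR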
[ceph-users] Re: 18.2.4 regression: 'diskprediction_local' has failed: No module named 'sklearn'
On 7/26/24 11:45, Rouven Seifert wrote:
> Hello,
>
> On 2024-07-25 16:39, Harry G Coin wrote:
>> Upgraded to 18.2.4 yesterday. The cluster reported healthy a few minutes after the upgrade completed. Next morning, this:
>>
>> # ceph health detail
>> HEALTH_ERR Module 'diskprediction_local' has failed: No module named 'sklearn'
>> [ERR] MGR_MODULE_ERROR: Module 'diskprediction_local' has failed: No module named 'sklearn'
>>     Module 'diskprediction_local' has failed: No module named 'sklearn'
>>
>> Searching shows this was a problem several years ago, then resolved, and now it has returned.
>
> We encountered the same problem after an upgrade on our cluster and I dug into this a bit. It appears that [0] was the fix for the missing sklearn package back in 2021. That fix was apparently tied specifically to CentOS 8. Now that the container images are built on CentOS 9, the relevant Dockerfile no longer includes the fix, because it checks the OS version for CentOS 8. I wonder a bit why it was done this way.
>
> The problem as it relates to CentOS 9 seems to be known to the ceph-container maintainers; see for example [1].
>
> [0] https://github.com/ceph/ceph-container/pull/1821/files
> [1] https://github.com/ceph/ceph-container/blob/main/ceph-releases/ALL/centos/9/daemon-base/README.tmp
>
> Best regards,
> Rouven

Thanks! I think there's a further issue as well. The diskprediction_local code appears to be hard-coded to a specific version, scikit-learn==0.19.2: something to do with classes included in 0.19.2 that are no longer part of later releases. I tried to compile that version on RHEL/CentOS 9, but I couldn't get the required version of mkl_rt to compile. Whoever maintains diskprediction_local has a little work to do to adapt it to the latest scikit-learn release.

Best Regards,

Harry Coin
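For anyone who wants to confirm what their running mgr image actually ships, one hedged check (the daemon name below is only an example; list yours with 'ceph orch ps') is to enter the mgr container on its host and try the import directly:

    ceph orch ps

    # run on the host where that mgr daemon lives
    cephadm enter --name mgr.noc1.jtteqg
    python3 -c 'import sklearn; print(sklearn.__version__)'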
[ceph-users] Switch docker image?
This has got to be ceph/docker "101", but I can't find the answer in the docs and need help.

The latest docker octopus images support using the ntpsec time daemon. The default stable octopus image doesn't as yet. I want to add a mon to a cluster that needs to use ntpsec (just go with it..), so I need the ceph/daemon-base:octopus-latest docker image.

Could someone offer the [cephadm? ceph orch?] command sequence necessary to add the mon to an existing cluster using a specific docker image that's not the one used elsewhere?

Thanks!
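A hedged sketch, untested here, assuming cephadm honors a per-daemon-type container_image override (the host name and IP are placeholders; the image tag is the one named above):

    # point only the mon daemon type at the alternate image
    ceph config set mon container_image ceph/daemon-base:octopus-latest

    # add the new mon with an explicit host and address
    ceph orch daemon add mon newhost:10.1.2.50

    # afterwards, clear the override so future daemons use the cluster default
    ceph config rm mon container_image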
[ceph-users] diskprediction_local to be retired or fixed or??
Any idea whether 'diskprediction_local' will ever work in containers? I'm running 15.2.7, which contains a dependency on scikit-learn v0.19.2 that isn't in the container. It's been throwing that error for a year now on all the octopus container versions I've tried.

On the bare-metal versions,

    pip3 install --upgrade scikit-learn==0.19.2

used to fix it, but that seems to break the mgr container.

Thanks

Harry
[ceph-users] orch apply mon assigns wrong IP address?
Is there a way to force '... orch apply *' to limit IP address selection to addresses matching the hostname in DNS or /etc/hosts, or to a specific address given at 'host add' time?

I've hit a bothersome problem: on v15, 'ceph orch apply mon ...' appears not to use the DNS IP or /etc/hosts when installing a monitor, but instead appears to select one from the current list of interfaces up on the host. 99 times out of 100 this is correct, as the host has but one address on the public cluster subnet. However, this or that host might have a temporarily added IP interface up on that subnet (whether for one-off diagnostic purposes or for some HA-assigned movable address like a time-of-day server). Occasionally that second 'unlisted' address is up when the orchestrator decides it would be a good time to add a mon to that host. Ceph occasionally prefers this 'unlisted but up' interface address, using it permanently as the host address for the monitor docker image, even though that secondary interface is not in DNS for the host, nor in /etc/hosts anywhere.

Is this a known issue? Is there a way to direct the orchestrator to use 'DNS-resolvable only' or 'preferred' addresses?

Thanks

Harry Coin
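One hedged option, assuming the orchestrator respects the address stored for the host (it uses it at least for its ssh connections), is to record the intended address explicitly at 'host add' time, or correct it afterward; names and addresses below are placeholders:

    # record the intended address when the host is added
    ceph orch host add noc5 192.0.2.45

    # or fix up a host that was added by name alone
    ceph orch host set-addr noc5 192.0.2.45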
[ceph-users] Re: orch apply mon assigns wrong IP address?
On 5/21/21 9:49 AM, Eugen Block wrote:
> You can define the public_network [1]:
>
> ceph config set mon public_network <subnet>
>
> For example:
>
> ceph config set mon public_network 10.1.2.0/24
>
> Or is that already defined and it happens anyway?

The public network is defined, and it happens anyway (the temporary, unlisted interface address is in the public network, but is not forward or reverse resolvable to the host in DNS or /etc/hosts).

> [1] https://docs.ceph.com/en/latest/cephadm/mon/#designating-a-particular-subnet-for-monitors
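Since the subnet filter alone doesn't prevent this, a workaround documented for cephadm is to take mon placement away from the scheduler entirely and pin each monitor to an explicit IP (host name and address below are placeholders):

    # stop the orchestrator from (re)placing mons on its own
    ceph orch apply mon --unmanaged

    # then add each mon bound to exactly the address you want
    ceph orch daemon add mon noc5:10.1.2.45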
[ceph-users] mons assigned via orch label 'committing suicide' upon reboot.
FYI, I'm getting monitors assigned via '... apply label:mon', with current and valid 'mon' tags, 'committing suicide' after surprise reboots in the Pacific 16.2.4 release. The tag indicating a monitor should be assigned to that host is present and never changed. Deleting the mon tag, waiting a minute, then re-adding the 'mon' tag to the host causes the monitor to redeploy and run properly.

I have 5 monitors assigned via the orchestrator's 'label:mon', all in docker containers. Upon reboot that drops to 4 monitors deployed. On the offending host I see this in the logs:

May 28 11:06:59 noc4 bash[10563]: debug 2021-05-28T16:06:59.771+0000 7f7a029bf700  0 using public_addr v2:[fc00:1002:c7::44]:0/0 -> [v2:[fc00:1002:c7::44]:3300/0,v1:[fc00:1002:c7::44]:6789/0]
May 28 11:06:59 noc4 bash[10563]: debug 2021-05-28T16:06:59.771+0000 7f7a029bf700  0 starting mon.noc4 rank -1 at public addrs [v2:[fc00:1002:c7::44]:3300/0,v1:[fc00:1002:c7::44]:6789/0] at bind addrs [v2:[fc00:1002:c7::44]:3300/0,v1:[fc00:1002:c7::44]:6789/0] mon_data /var/lib/ceph/mon/ceph-noc4 fsid 4067126d-01cb-40af-824a-881c130140f8
May 28 11:06:59 noc4 bash[10563]: debug 2021-05-28T16:06:59.775+0000 7f7a029bf700  1 mon.noc4@-1(???) e40 preinit fsid 4067126d-01cb-40af-824a-x
May 28 11:06:59 noc4 bash[10563]: debug 2021-05-28T16:06:59.775+0000 7f7a029bf700 -1 mon.noc4@-1(???) e40 not in monmap and have been in a quorum before; must have been removed
May 28 11:06:59 noc4 bash[10563]: debug 2021-05-28T16:06:59.775+0000 7f7a029bf700 -1 mon.noc4@-1(???) e40 commit suicide!
May 28 11:06:59 noc4 bash[10563]: debug 2021-05-28T16:06:59.775+0000 7f7a029bf700 -1 failed to initialize

Seems odd. And, as debug comments go, 'commit suicide!' appears to have an 'extra coffee that day' aspect.

HC
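For anyone else hitting this, the label-cycling workaround described above amounts to (host name taken from the log excerpt; adjust to the affected host):

    ceph orch host label rm noc4 mon
    # give the scheduler a minute to notice and clean up
    sleep 60
    ceph orch host label add noc4 mon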
[ceph-users] Re: Why you might want packages not containers for Ceph deployments
On 6/2/21 2:28 PM, Phil Regnauld wrote:
> Dave Hall (kdhall) writes:
>> But the developers aren't out in the field with their deployments when something weird impacts a cluster and the standard approaches don't resolve it. And let's face it: Ceph is a marvelously robust solution for large scale storage, but it is also an amazingly intricate matrix of layered interdependent processes, and you haven't got all of the bugs worked out yet.
>
> I think you hit a very important point here: the concern with containerized deployments is that they may be a barrier to efficient troubleshooting and bug reporting by traditional methods (strace et al) -- unless a well documented debugging and analysis toolset/methodology is provided.
>
> Paradoxically, containerized deployments certainly sound like they'd free up lots of cycles on the developer side of things (no more building packages for N distributions, as was pointed out, and easier upgrade and regression testing), but it might make it more difficult initially for the community to contribute (well, at least for us dinosaurs who weren't born with docker brains).
>
> Cheers,
> Phil

I think there's great value in ceph devs doing QA and testing docker images, releasing them as a 'known good thing'. Why? Doing that avoids the fragility that dependency hell induces -- fragility I've experienced in other multi-host / multi-master packages, wherein one distro's maintainer decides some new rev ought to be pushed out as a 'security update', another distro's maintainer decides it's a feature change, another calls it a backport, and so on. There's no way to QA 'upgrades' across so many grains of shifting sand.

While the devs and the rest of the bleeding-edge folks should enjoy the benefits that come with tolerating and managing dependency hell, having the orchestrator upgrade in a known good sequence from a known base to a known release reduces fragility.

Thanks for ceph!

Harry
[ceph-users] Cephfs root/boot?
Has anyone added the 'conf.d' modules (and, in the CentOS/RHEL/Fedora world, done the SELinux work) so that initramfs/dracut can 'direct kernel boot' cephfs as a guest image's root file system? It took some work for the NFS folks to manage being the root filesystem.

Harry
[ceph-users] In theory - would 'cephfs root' out-perform 'rbd root'?
On any given properly sized ceph setup (for other than database end use), shouldn't a cephfs root theoretically out-perform any fs atop a rados block device root?

It seems to me like it ought to: moving only the 'interesting' bits of files over the so-called 'public' network should take fewer, smaller packets than the overhead associated with whole blocks that hold some fraction of the 'interesting' bits.

Does it work that way in practice?
[ceph-users] Re: In theory - would 'cephfs root' out-perform 'rbd root'?
On 6/12/21 4:39 PM, Nathan Fish wrote:
> I doubt it. The problem is that the CephFS MDS must perform distributed metadata transactions with ordering and locking, whereas a filesystem on rbd runs locally and doesn't have to worry about other computers writing to the same block device. Our bottleneck in production is usually the MDS CPU load.

Perhaps if an 'exclusive write mount' option existed, the MDS could delegate most of what it does to the client. Moving 4K-at-least blocks around a network, even with 'jumbo frames' on local segments, has got to take more processing and network bandwidth than just the 'known interesting' parts of files.

> On Fri, Jun 11, 2021 at 12:31 PM Harry G. Coin wrote:
>> On any given properly sized ceph setup (for other than database end use), shouldn't a cephfs root theoretically out-perform any fs atop a rados block device root?
>>
>> It seems to me like it ought to: moving only the 'interesting' bits of files over the so-called 'public' network should take fewer, smaller packets than the overhead associated with whole blocks that hold some fraction of the 'interesting' bits.
>>
>> Does it work that way in practice?
[ceph-users] Re: CephFS design
On 6/11/21 3:52 AM, Szabo, Istvan (Agoda) wrote:
> Hi,
>
> Can you suggest me what is a good cephfs design? I've never used it, only rgw and rbd we have, but want to give it a try. However, in the mailing list I saw a huge amount of issues with cephfs, so I would like to go with some, let's say, bulletproof best practices.

You've read many practical answers to your question so far. My contribution is: cephfs has to 'win' over the long term, because moving 'known interesting' data over a network will always take less time than having a client 'file system' move whole storage blocks over the fiber or wire and then sort out the bits the application actually wants. The only way that doesn't happen is if the 'wires' are dramatically faster than the hosts and lightly loaded -- not what's expected.

So, long term, cephfs has the logical ability to out-perform other block-backed (rbd/iscsi) choices. But not today. The thing that makes it 'seem slow' now is dealing with the multi-user file/record-level contention that block devices don't have to face. Over time I expect directory trees might be shared with a 'one user' flag that could allow the client to interact with the mons/osds directly and require very little MDS traffic. That will win over rbd+fs designs because of 'more of what the user wants per network packet'.

So, eventually (a year? years? decades?) I think RadosGW and cephfs will bear most of the ceph traffic. But for today -- if a host is the sole user of a directory tree -- rbd + xfs (ymmv).

HC
[ceph-users] Why does 'mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w' expire in less than a day?
Is this happening to anyone else? After this command:

    ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w

the dashboard shows 'Health OK'; then after a few hours (perhaps a mon leadership change) it's back to 'degraded' and 'AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim'.

Pacific, 16.2.4, all in docker containers.
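Two hedged things to try: a mute can be made sticky so it survives health-map churn, and the warning goes away for good once every client is patched and insecure reclaim is disallowed (only do the second once no unpatched clients remain):

    # re-issue the mute so it persists even if the alert briefly clears and returns
    ceph health mute AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED 2w --sticky

    # the permanent fix, once all clients have been updated
    ceph config set mon auth_allow_insecure_global_id_reclaim false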
[ceph-users] Question re: replacing failed boot/os drive in cephadm / pacific cluster
Hi

In a Pacific/container/cephadm setup, when a server's boot/OS drive fails (unrelated to any OSD actual storage): can the boot/OS drive be replaced with a fresh OS install, followed by simply setting up the same network addressing and ssh keys (assuming the necessary docker/non-ceph packages are installed)?

Up to now I've been backing up and restoring boot/OS drives, but that seems like it ought not be necessary any longer. I have some hope that setting up a basic OS with the correct ssh keys and network address should make it possible to, sort of, 'stand well back and watch' while cephadm installs the necessary containers to restore the OSDs and whatever else. It would be nice to have a stack of 'ready to go' boot/OS drives where all that's needed is a network, hostname and 'authorized_keys' tweak.

Or is it 'better' to continue going to the trouble of backing up and restoring server boot drives?

Thanks

Harry Coin
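A hedged sketch of the 'fresh install' path, assuming the rebuilt host keeps its old hostname and address and the OSD data drives are intact (run from a node with an admin keyring; the host name is a placeholder):

    # push the cluster's ssh key to the rebuilt host
    ceph cephadm get-pub-key > ~/ceph.pub
    ssh-copy-id -f -i ~/ceph.pub root@noc5

    # (re)register the host with the orchestrator
    ceph orch host add noc5

    # ask cephadm to adopt the still-intact OSDs it finds there
    ceph cephadm osd activate noc5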
[ceph-users] Re: name alertmanager/node-exporter already in use with v16.2.5
Same problem here. Hundreds of lines like:

    Updating node-exporter deployment (+4 -4 -> 5) (0s)
        []

And, similar to yours:

...
2021-07-10T16:26:30.432487-0500 mgr.noc4.tvhgac [ERR] Failed to apply node-exporter spec MonitoringSpec({'placement': PlacementSpec(host_pattern='*'), 'service_type': 'node-exporter', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'port': None}): name node-exporter.noc4 already in use
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
    if self._apply_service(spec):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
    rank_generation=slot.rank_generation,
  File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
    f'name {daemon_type}.{name} already in use')
orchestrator._interface.OrchestratorValidationError: name node-exporter.noc4 already in use
...

On 7/8/21 5:06 PM, Bryan Stillwell wrote:
> I upgraded one of my clusters to v16.2.5 today and now I'm seeing these messages from 'ceph -W cephadm':
>
> 2021-07-08T22:01:55.356953+0000 mgr.excalibur.kuumco [ERR] Failed to apply alertmanager spec AlertManagerSpec({'placement': PlacementSpec(count=1), 'service_type': 'alertmanager', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'user_data': {}, 'port': None}): name alertmanager.aladdin already in use
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
>     if self._apply_service(spec):
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
>     rank_generation=slot.rank_generation,
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
>     f'name {daemon_type}.{name} already in use')
> orchestrator._interface.OrchestratorValidationError: name alertmanager.aladdin already in use
> 2021-07-08T22:01:55.372569+0000 mgr.excalibur.kuumco [ERR] Failed to apply node-exporter spec MonitoringSpec({'placement': PlacementSpec(host_pattern='*'), 'service_type': 'node-exporter', 'service_id': None, 'unmanaged': False, 'preview_only': False, 'networks': [], 'config': None, 'port': None}): name node-exporter.aladdin already in use
> Traceback (most recent call last):
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 582, in _apply_all_services
>     if self._apply_service(spec):
>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 743, in _apply_service
>     rank_generation=slot.rank_generation,
>   File "/usr/share/ceph/mgr/cephadm/module.py", line 613, in get_unique_name
>     f'name {daemon_type}.{name} already in use')
> orchestrator._interface.OrchestratorValidationError: name node-exporter.aladdin already in use
>
> Also my 'ceph -s' output keeps getting longer and longer (currently 517 lines) with messages like these:
>
>     Updating node-exporter deployment (+6 -6 -> 13) (0s)
>         []
>     Updating alertmanager deployment (+1 -1 -> 1) (0s)
>         []
>
> What's the best way to go about fixing this? I've tried using 'ceph orch daemon redeploy alertmanager.aladdin' and the same for node-exporter, but it doesn't seem to help.
>
> Thanks,
> Bryan
[ceph-users] Re: name alertmanager/node-exporter already in use with v16.2.5
On 7/8/21 5:06 PM, Bryan Stillwell wrote:
> I upgraded one of my clusters to v16.2.5 today and now I'm seeing these messages from 'ceph -W cephadm':
>
> [...]
>
> What's the best way to go about fixing this? I've tried using 'ceph orch daemon redeploy alertmanager.aladdin' and the same for node-exporter, but it doesn't seem to help.

Workaround (caution: temporarily disruptive), assuming this is the only reported problem remaining once the upgrade otherwise completes:

1. ceph orch rm node-exporter

   Wait 30+ seconds.

2. Stop all managers.

3. Start all managers.

4. ceph orch apply node-exporter '*'

> Thanks,
> Bryan
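Expressed as commands, the workaround above looks roughly like this; the mgr stop/start step is per-host systemctl against the cephadm-managed units, and the fsid and daemon name shown are the examples from this thread's cluster, so substitute your own (list yours with: systemctl list-units 'ceph-*@mgr.*'):

    ceph orch rm node-exporter
    sleep 60

    # on each mgr host, stop the mgr container...
    systemctl stop ceph-4067126d-01cb-40af-824a-881c130140f8@mgr.noc4.tvhgac.service
    # ...and once all mgrs are down, start them again
    systemctl start ceph-4067126d-01cb-40af-824a-881c130140f8@mgr.noc4.tvhgac.service

    ceph orch apply node-exporter '*'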
[ceph-users] Did standby dashboards stop redirecting to the active one?
Somewhere between Nautilus and Pacific, the hosts running standby managers, which previously would redirect browsers to the currently active mgr/dashboard, seem to have stopped doing that. Is there a switch somewhere? Or was I just happily using an undocumented feature?

Thanks

Harry Coin
[ceph-users] Re: Did standby dashboards stop redirecting to the active one?
On 7/26/21 12:02 PM, Ernesto Puerta wrote:
> Hi Harry,
>
> No, that feature is still there. There's been a recent thread in this mailing list (please see "Pacific 16.2.5 Dashboard minor regression" <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LWQKSRTO5HOAABFZDA26WGF3TL2JHLVI/>) about an unrelated change in cephadm that might impact this failover mechanism.
>
> What URL are you getting redirected to now? Are you using a reverse proxy/load balancer in front of the Dashboard <https://docs.ceph.com/en/latest/mgr/dashboard/#disable-the-redirection> (e.g.: HAProxy)?

No redirection, nothing. Just a timeout on every manager other than the active one. Adding HAProxy would be easily done, but seems redundant to a ceph internal capability -- one that at one time worked, anyhow.

> Kind Regards,
> Ernesto
>
> On Mon, Jul 26, 2021 at 4:06 PM Harry G. Coin <hgc...@gmail.com> wrote:
>
>     Somewhere between Nautilus and Pacific, the hosts running standby managers, which previously would redirect browsers to the currently active mgr/dashboard, seem to have stopped doing that. Is there a switch somewhere? Or was I just happily using an undocumented feature?
>
>     Thanks
>
>     Harry Coin
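One hedged thing to check: the redirect is governed by a dashboard setting, so if it somehow ended up set to 'error' the standbys will stop answering with a redirect:

    # see what the standbys are configured to do
    ceph config get mgr mgr/dashboard/standby_behaviour

    # 'redirect' is the default; restore it if the value reads 'error'
    ceph config set mgr mgr/dashboard/standby_behaviour redirect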
[ceph-users] Docker container snapshots accumulate until disk full failure?
Does ceph remove the container subvolumes holding previous revisions of daemon images after upgrades? I have a couple of servers using btrfs to hold the containers. The number of docker-related subvolumes just keeps growing, way beyond the number of daemons running. If I ignore this, I'll get disk-full related system failures.

Is there a command to 'erase all non-live docker image subvolumes'? Or a way to at least get a list of what I need to delete manually (!!)?

Thanks

Harry Coin
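A hedged answer from the docker side: those btrfs subvolumes are the image and container layers docker's btrfs storage driver creates, so it's generally docker, not ceph, that has to clean them up once old image revisions are no longer referenced:

    # see how much space docker thinks its images and layers use
    docker system df

    # remove images not referenced by any container (old ceph image revisions included)
    docker image prune -a

    # for comparison, the subvolumes docker created
    btrfs subvolume list /var/lib/docker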
[ceph-users] Bigger picture 'ceph web calculator', was Re: SATA vs SAS
This topic comes up often enough, maybe it's time for one of those 'web calculators': one that lets a user who knows their goals but not ceph-fu enter the importance of various factors (my suggested factors: read freq/stored TB, write freq/stored TB, unreplicated TB needed, minimum target days between first failure and cluster failure). Then the handy calculator spits out a few ceph configs showing an 'optimized' layout for their goal, plus what it would look like if each of their factors were 'a little more' and 'a little less'.

The calculator would spit out 'x SSDs of size x, y 7200rpm drives of MTBF q, z 5400rpm, SAS xx, using aa hosts with not less than Y GB and P cores of not less than so-many-GHz single-threaded performance per core; ceph configured as mirrors/erasure etc.', with a target expected cost.

That would be a service folks would pay for, I think. It would be the answer to the question 'what would it take for ceph to deliver X?' Folks would notice quickly whether they really need one cluster, or two with very different performance goals, etc.

On 8/21/21 12:46 PM, Roland Giesler wrote:
> Hi all,
>
> (I asked this on the Proxmox forums, but I think it may be more appropriate here.)
>
> In your practical experience, when I choose new hardware for a cluster, is there any noticeable difference between using SATA or SAS drives? I know SAS drives can have a 12Gb/s interface and I think SATA can only do 6Gb/s, but in my experience the drives themselves can't write at 12Gb/s anyway, so it makes little if any difference.
>
> I use a combination of SSDs and SAS drives in my current cluster (in different ceph pools), but I suspect that if I choose SATA enterprise class drives for this project, it will get the same level of performance.
>
> I think with ceph the hard error rate of drives becomes less relevant than if I had used some level of RAID.
>
> Also, if I go with SATA, I can use AMD Epyc processors (and I don't want to use a different supplier), which gives me a lot of extra cores per unit at a lesser price, which of course all adds up to a better deal in the end.
>
> I'd like to specifically hear from you what your experience is in this regard.
[ceph-users] after upgrade: HEALTH ERR ...'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes
A cluster reporting no errors on 16.2.5 features, immediately after upgrade to 16.2.6, what seems to be an entirely bug-related, dramatic 'HEALTH_ERR' on the dashboard:

    Module 'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes

Looking at the bug tracker, others reported this upon upgrade to 16.2.5, and for them it 'went away' on upgrade to 16.2.6. Echoes of Bilbo passing the ring to Frodo? It would really be nice not to have to explain a dramatic, scary dashboard feature for the months between .6 and .7. Any help?

Thanks

Harry Coin
[ceph-users] "Remaining time" under-estimates by 100x....
Is there a way to re-calibrate the various 'global recovery event' and related 'remaining time' estimators? For the last three days I've been assured that a 19h event will be over in under 3 hours...

Previously, I think, Microsoft held the record for the most incorrect 'please wait' progress indicators. Ceph may take that crown this year, unless...

Thanks

Harry
[ceph-users] Is this really an 'error'? "pg_autoscaler... has overlapping roots"
Is there anything to be done about groups of log messages like:

    [pg_autoscaler ERROR root] pool has overlapping roots

The cluster reports it is healthy, and yet this is reported as an error. So: is it an error that ought to have been reported, or is it not an error?

Thanks

Harry Coin
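For anyone trying to see what the autoscaler is objecting to, a few hedged read-only checks (nothing is changed) that show which CRUSH roots -- including the per-device-class 'shadow' roots -- each pool's rule maps to:

    # the autoscaler's view of every pool and its rule
    ceph osd pool autoscale-status

    # the CRUSH hierarchy including shadow roots such as default~ssd / default~hdd
    ceph osd crush tree --show-shadow

    # the rules themselves, to see which root and class each one takes
    ceph osd crush rule dump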
[ceph-users] Set some but not all drives as 'autoreplace'?
Hi all,

I know Ceph offers a way to 'automatically' cause blank drives it detects to be spun up into OSDs, but I think that's an 'all or nothing' situation if I read the docs properly. Is there a way to specify which slots, or, even better, a way to exclude specific slots?

It sure would be nice to tell ceph it owns these drive slots on those hosts, and if ever there is a blank drive in one, assume the previous drive (if any) is never coming back to that slot and the new one should be auto-installed as an OSD -- without having to type or click anything. The reason is that some hosts boot off the same controller the OSDs use, and I don't want a replacement boot drive getting spun into an OSD.

Thanks

Harry
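A hedged sketch using an OSD service spec (drive groups): instead of 'ceph orch apply osd --all-available-devices', a spec can limit which devices each host contributes, for example by stable device path, so a replacement boot drive never matches. The paths and host pattern below are placeholders; --dry-run previews the result before anything is created:

    cat > osd-slots.yml <<'EOF'
    service_type: osd
    service_id: data_slots_only
    placement:
      host_pattern: 'noc*'
    spec:
      data_devices:
        # only these bays ever become OSDs; the boot drive's path is simply not listed
        paths:
          - /dev/disk/by-path/pci-0000:3b:00.0-sas-phy2-lun-0
          - /dev/disk/by-path/pci-0000:3b:00.0-sas-phy3-lun-0
    EOF

    ceph orch apply -i osd-slots.yml --dry-run
    # re-run without --dry-run once the preview looks right
    ceph orch apply -i osd-slots.yml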
[ceph-users] Re: Trying to understand what overlapped roots means in pg_autoscale's scale-down mode
I asked as well, it seems nobody on the list knows so far.

On 9/30/21 10:34 AM, Andrew Gunnerson wrote:

Hello,

I'm trying to figure out what overlapping roots entails with the default scale-down autoscaling profile in Ceph Pacific. My test setup involves a CRUSH map that looks like:

    ID=-1  | root=default
    ID=-58 |     rack=rack1
    ID=-70 |         host=ssd-1
    ID=-61 |     rack=rack2
    ID=-55 |         host=ssd-2
    ID=-62 |     rack=rack3
    ID=-52 |         host=ssd-3
    ID=-63 |     rack=rack4
    ID=-19 |         host=hdd-1
           |         <15 more hosts>

The CRUSH rules I created are:

    # Rack failure domain for SSDs
    ceph osd crush rule create-replicated replicated_ssd default rack ssd

    # Host failure domain for HDDs
    ceph osd crush rule create-replicated replicated_hdd default host hdd

    ceph osd erasure-code-profile set erasure_hdd ruleset k=3 m=2 crush-device-class=hdd crush-failure-domain=host

and the pools are:

    Pool                       | CRUSH rule/profile | Overlapped roots error
    ---------------------------|--------------------|------------------------
    device_health_metrics      | replicated_rule    | -1 (root=default)
    cephfs_metadata            | replicated_ssd     | -51 (root=default~ssd)
    cephfs_data_replicated_ssd | replicated_ssd     | -51 (root=default~ssd)
    cephfs_data_replicated_hdd | replicated_hdd     | -2 (root=default~hdd)
    cephfs_data_erasure_hdd    | erasure_hdd        | -1 (root=default)

With this setup, the autoscaler is getting disabled in every pool with the following error:

    [pg_autoscaler WARNING root] pool contains an overlapping root -... skipping scaling

There doesn't seem to be much documentation about overlapped roots. I think I'm fundamentally not understanding what it means. Does it mean that the autoscaler can't handle two different pools using OSDs under the same (shadow?) root in the CRUSH map? Is this setup something that's not possible using the scale-down autoscaler profile? It seems that the scale-up profile doesn't have a concept of overlapped roots.

Thank you,
Andrew
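A hedged note on what usually clears this: the autoscaler objects when some pools' rules descend from the bare 'default' root while others use the device-class shadow roots (default~ssd / default~hdd), so giving every pool -- including device_health_metrics and the EC pool -- a device-class-constrained rule removes the overlap. For the table above that might look like the following (rule and profile names are the ones created above; check 'ceph osd pool autoscale-status' afterward):

    # move the catch-all pool off the class-less default root
    ceph osd pool set device_health_metrics crush_rule replicated_ssd

    # the EC pool needs a class-constrained rule too; build one from the hdd profile and assign it
    ceph osd crush rule create-erasure erasure_hdd_rule erasure_hdd
    ceph osd pool set cephfs_data_erasure_hdd crush_rule erasure_hdd_rule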
[ceph-users] How to get ceph bug 'non-errors' off the dashboard?
I need help getting two 'non-errors' off the ceph dashboard, so it stops falsely scaring people with the dramatic "HEALTH_ERR" -- and masking what could be actual errors of immediate importance.

The first is a bug where the devs try to do date arithmetic between incompatible variables. The second is a bug where the information collector reports a link-local IPv6 address as 'the interface address' when the same interface has a site-local address. (It's random which address the interface-listing system lists first; the info gatherer needs to ignore the fe80:: address.) These two bugs in the latest Pacific lead to:

* MGR_MODULE_ERROR: Module 'devicehealth' has failed: can't subtract offset-naive and offset-aware datetimes

* CEPHADM_CHECK_NETWORK_MISSING: Public/cluster network defined, but can not be found on any host

So I suppose they will get fixed in due course. Meanwhile, though, I need a way to clear those off the dashboard so it will report all is well unless there is an actual error like an OSD down, or something 'real'.

Any help?

Thanks

Harry Coin
[ceph-users] Re: How to get ceph bug 'non-errors' off the dashboard?
Worked very well! Thank you.

Harry Coin

On 10/2/21 11:23 PM, 胡 玮文 wrote:
> Hi Harry,
>
> Please try these commands in CLI:
>
> ceph health mute MGR_MODULE_ERROR
> ceph health mute CEPHADM_CHECK_NETWORK_MISSING
>
> Weiwen Hu
>
>> On 2021-10-03, at 05:37, Harry G. Coin wrote:
>>
>> I need help getting two 'non-errors' off the ceph dashboard, so it stops falsely scaring people with the dramatic "HEALTH_ERR" -- and masking what could be actual errors of immediate importance.
>> [...]
[ceph-users] Permanent KeyError: 'TYPE' ->17.2.7: return self.blkid_api['TYPE'] == 'part'
These errors repeat for every host, and only after upgrading from the previous Quincy release to 17.2.7. As a result, the cluster is always warned and never indicates healthy.

root@noc1:~# ceph health detail
HEALTH_WARN failed to probe daemons or devices
[WRN] CEPHADM_REFRESH_FAILED: failed to probe daemons or devices
    host sysmon1 `cephadm ceph-volume` failed: cephadm exited with an error code: 1, stderr: Inferring config /var/lib/ceph/4067126d-01cb-40af-824a-881c130140f8/mon.sysmon1/config
Non-zero exit code 1 from /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --ulimit nofile=1048576 --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:92e8fa7d8ca17a7a5bbfde6e596fdfecc8e165fcb94d86493f4e6c7b1f326e4e -e NODE_NAME=sysmon1 -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/4067126d-01cb-40af-824a-881c130140f8:/var/run/ceph:z -v /var/log/ceph/4067126d-01cb-40af-824a-881c130140f8:/var/log/ceph:z -v /var/lib/ceph/4067126d-01cb-40af-824a-881c130140f8/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /tmp/ceph-tmpl1e27bun:/etc/ceph/ceph.conf:z quay.io/ceph/ceph@sha256:92e8fa7d8ca17a7a5bbfde6e596fdfecc8e165fcb94d86493f4e6c7b1f326e4e inventory --format=json-pretty --filter-for-batch
/usr/bin/docker: stderr  stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
/usr/bin/docker: stderr  stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
/usr/bin/docker: stderr  stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
/usr/bin/docker: stderr  stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
/usr/bin/docker: stderr  stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
/usr/bin/docker: stderr  stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
/usr/bin/docker: stderr Traceback (most recent call last):
/usr/bin/docker: stderr   File "/usr/sbin/ceph-volume", line 11, in <module>
/usr/bin/docker: stderr     load_entry_point('ceph-volume==1.0.0', 'console_scripts', 'ceph-volume')()
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 41, in __init__
/usr/bin/docker: stderr     self.main(self.argv)
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/decorators.py", line 59, in newfunc
/usr/bin/docker: stderr     return f(*a, **kw)
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 153, in main
/usr/bin/docker: stderr     terminal.dispatch(self.mapper, subcommand_args)
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/terminal.py", line 194, in dispatch
/usr/bin/docker: stderr     instance.main()
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/inventory/main.py", line 60, in main
/usr/bin/docker: stderr     list_all=self.args.list_all))
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 50, in __init__
/usr/bin/docker: stderr     sys_info.devices.keys()]
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 49, in <listcomp>
/usr/bin/docker: stderr     all_devices_vgs=all_devices_vgs) for k in
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 147, in __init__
/usr/bin/docker: stderr     self.available_lvm, self.rejected_reasons_lvm = self._check_lvm_reject_reasons()
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 646, in _check_lvm_reject_reasons
/usr/bin/docker: stderr     rejected.extend(self._check_generic_reject_reasons())
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 601, in _check_generic_reject_reasons
/usr/bin/docker: stderr     if self.is_acceptable_device:
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 502, in is_acceptable_device
/usr/bin/docker: stderr     return self.is_device or self.is_partition or self.is_lv
/usr/bin/docker: stderr   File "/usr/lib/python3.6/site-packages/ceph_volume/util/device.py", line 482, in is_partition
/usr/bin/docker: stderr     return self.blkid_api['TYPE'] == 'part'
/usr/bin/docker: stderr KeyError: 'TYPE'
Traceback (most recent call last):
  File "/var/lib/ceph/4067126d-01cb-40af-824a-881c130140f8/cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b", line 9679, in <module>
    main()
  File "/var/lib/ceph/4067126d-01cb-40af-824a-881c130140f8/cephadm.8b92cafd937eb89681ee011f9e70f85937fd09c4bd61ed4a59981d275a1f255b", line 9667, in main
    r = ctx.func(ctx)
  File
[ceph-users] Howto: 'one line patch' in deployed cluster?
Is there a 'Howto' or workflow to implement a one-line patch in a running cluster, with full understanding it will be gone on the next upgrade, and hopefully without having to set up an entire packaging/development environment?

Thanks!

To implement:

    Subject: Re: Permanent KeyError: 'TYPE' ->17.2.7: return self.blkid_api['TYPE'] == 'part'
    From: Sascha Lucas

    Problem found: in my case this is caused by DRBD secondary block devices, which can not be read until promoted to primary.

    ceph_volume/util/disk.py runs in blkid():

    $ blkid -c /dev/null -p /dev/drbd4
    blkid: error: /dev/drbd4: Wrong medium type

    but does not care about its return code. A quick fix is to use the get() method to automatically fall back to None for non-existing keys:

    --- a/ceph_volume/util/device.py    2023-11-10 07:00:01.552497107 +0000
    +++ b/ceph_volume/util/device.py    2023-11-10 08:54:40.320718690 +0000
    @@ -476,13 +476,13 @@
         @property
         def is_partition(self):
             self.load_blkid_api()
             if self.disk_api:
                 return self.disk_api['TYPE'] == 'part'
             elif self.blkid_api:
    -            return self.blkid_api['TYPE'] == 'part'
    +            return self.blkid_api.get('TYPE') == 'part'
             return False

    Don't know why this is triggered in 17.2.7.
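One hedged way to carry a local one-liner like this in a cephadm cluster without a build environment is to layer the patched file onto the stock image and point the orchestrator at the result; the registry name and tag below are placeholders, the in-container path is the one from the tracebacks earlier in this digest, and the change is understood to vanish at the next upgrade to an unpatched image:

    # Dockerfile: start from the image the cluster already runs and overwrite one file
    cat > Dockerfile <<'EOF'
    FROM quay.io/ceph/ceph:v17.2.7
    COPY device.py /usr/lib/python3.6/site-packages/ceph_volume/util/device.py
    EOF

    docker build -t myregistry.local:5000/ceph:v17.2.7-typefix .
    docker push myregistry.local:5000/ceph:v17.2.7-typefix

    # roll the cluster onto the patched image
    ceph orch upgrade start --image myregistry.local:5000/ceph:v17.2.7-typefix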
[ceph-users] 18.2.2 dashboard really messed up.
Looking at ceph -s, all is well. Looking at the dashboard, 85% of my capacity is 'warned' and 95% is 'in danger'. There is no hint given as to the nature of the danger or the reason for the warning. Though apparently, with merely 5% of my ceph world 'normal', the cluster reports 'ok' -- which, you know, seems contradictory. I've used just under 40% of capacity.

Further down the dashboard, all the subsections of 'Cluster Utilization' read '1' and '0.5', with nothing whatever in the graphics area.

Previous versions of ceph presented a normal dashboard. It's just a little half rack: 5 hosts, a few physical drives each, running ceph for a couple of years now. The orchestrator is cephadm. It's just about as plain vanilla as it gets. I've had to mute one alert, because the cephadm refresh aborts when it finds drives on any host that have nothing to do with ceph and don't have a blkid_api 'TYPE' key. That seems unrelated to a totally messed up dashboard. (The tracker for that is here: https://tracker.ceph.com/issues/63502 )

Any idea what the steps are to get useful stuff back on the dashboard? Any idea where I can learn what my 85% danger and 95% warning are about? (You'd think 'danger' (the volcano is blowing up now!) would be worse than 'warning' (the volcano might blow up soon), so how can warning+danger > 100%, or, if not additive, how can warning < danger?)

Here's a bit of detail:

root@noc1:~# ceph -s
  cluster:
    id:     4067126d-01cb-40af-824a-881c130140f8
    health: HEALTH_OK
            (muted: CEPHADM_REFRESH_FAILED)

  services:
    mon: 5 daemons, quorum noc4,noc2,noc1,noc3,sysmon1 (age 70m)
    mgr: noc2.yhyuxd(active, since 82m), standbys: noc4.tvhgac, noc3.sybsfb, noc1.jtteqg
    mds: 1/1 daemons up, 3 standby
    osd: 27 osds: 27 up (since 20m), 27 in (since 2d)

  data:
    volumes: 1/1 healthy
    pools:   16 pools, 1809 pgs
    objects: 12.29M objects, 17 TiB
    usage:   44 TiB used, 67 TiB / 111 TiB avail
    pgs:     1793 active+clean
             9    active+clean+scrubbing
             7    active+clean+scrubbing+deep

  io:
    client:   5.6 MiB/s rd, 273 KiB/s wr, 41 op/s rd, 58 op/s wr
[ceph-users] Re: 18.2.2 dashboard really messed up.
Thanks! Oddly, all the dashboard checks you suggest appear normal, yet the result remains broken. Even before applying your instruction about the dashboard setting, I already had this result:

root@noc3:~# ceph dashboard get-prometheus-api-host
http://noc3.1.quietfountain.com:9095
root@noc3:~# netstat -6nlp | grep 9095
tcp6       0      0 :::9095       :::*       LISTEN      80963/prometheus
root@noc3:~#

To check it, I tried setting it to something random; the browser aimed at the dashboard site then reported no connection, and the error message ended when I restored the value above. But the graphs remain empty, with the numbers 1 and 0.5 on each.

Regarding the used storage, notice the overall usage is 43.6 of 111 TiB. That seems quite a distance from the trigger points of 85% and 95%? The default values are in use. All the OSDs are between 37% and 42% usage.

What am I missing?

Thanks!

On 3/12/24 02:07, Nizamudeen A wrote:
> Hi,
>
> The warning and danger indicators in the capacity chart point to the nearfull and full ratios set for the cluster, and the default values for them are 85% and 95% respectively. You can do a `ceph osd dump | grep ratio` and see them. When this got introduced, there was a blog post (https://ceph.io/en/news/blog/2023/landing-page/#capacity-card) explaining how this is mapped in the chart.
>
> But when your used storage crosses that 85% mark, the chart is colored yellow to indicate it to the user, and when it crosses 95% (or the full ratio) the chart is colored red. That doesn't mean the cluster is in bad shape, but it's a visual indicator to tell you you are running out of storage.
>
> Regarding the Cluster Utilization chart, it gets metrics directly from prometheus so that it can show time-series data in the UI rather than the metrics at the current point in time (which was used before). So if you have prometheus configured in the dashboard and its url is provided in the dashboard settings, `ceph dashboard set-prometheus-api-host <url>`, then you should be able to see the metrics. In case you need to read more about the new page, you can check here: https://docs.ceph.com/en/latest/mgr/dashboard/#overview-of-the-dashboard-landing-page
>
> Regards,
> Nizam
>
> On Mon, Mar 11, 2024 at 11:47 PM Harry G Coin wrote:
>> Looking at ceph -s, all is well. Looking at the dashboard, 85% of my capacity is 'warned' and 95% is 'in danger'. There is no hint given as to the nature of the danger or the reason for the warning.
>> [...]
[ceph-users] 17.2.6 fs 'ls' ok, but 'cat' 'operation not permitted' puzzle
In 17.2.6, is there a security requirement that pool names supporting a cephfs filesystem match the filesystem name ('name.data' for the data pool and 'name.meta' for the associated metadata pool)? (Multiple file systems are enabled.)

I have filesystems from older versions with the data pool name matching the filesystem and '_metadata' appended for the metadata pool, and even older filesystems with pool names such as 'library' and 'library_metadata' supporting a filesystem called 'libraryfs'. The pools all have the cephfs tag. But using the documented

    ceph fs authorize libraryfs client.basicuser / rw

command allows the root user to mount and browse the library directory tree, but fails with 'operation not permitted' when even reading any file. However, changing the client.basicuser osd cap to 'allow rw' instead of 'allow rw tag...' allows normal operations. So:

    [client.basicuser]
        key = ==
        caps mds = "allow rw fsname=libraryfs"
        caps mon = "allow r fsname=libraryfs"
        caps osd = "allow rw"

works, but the same with

        caps osd = "allow rw tag cephfs data=libraryfs"

leads to 'operation not permitted' on read, write or any actual access.

It remains a puzzle. Help appreciated! Were there upgrade instructions about this? Any help pointing me to them?

Thanks

Harry Coin

Rock Stable Systems
[ceph-users] Re: 17.2.6 fs 'ls' ok, but 'cat' 'operation not permitted' puzzle
This problem of file systems being inaccessible post-upgrade to all but client.admin dates back to v14 and carries on through v17. It also applies to any case of specifying other than the default pool names for new file systems. Solved, because Curt remembered a link on this list. (Thanks Curt!)

Here's what the official ceph docs ought to have provided, for others who hit this. YMMV:

IF you have ceph file systems whose data and metadata pools were specified explicitly in the 'ceph fs new' command (meaning not left to the defaults, which create them for you), OR you have an existing ceph file system and are upgrading to a new major version of ceph, THEN for the documented 'ceph fs authorize ...' commands to do as documented (and avoid strange 'operation not permitted' errors when doing file I/O, or similar security-related problems, for all but such as the client.admin user), you must first run:

    ceph osd pool application set <your metadata pool name> cephfs metadata <your ceph fs filesystem name>

and

    ceph osd pool application set <your data pool name> cephfs data <your ceph fs filesystem name>

Otherwise, when the OSDs get a request to read or write data (not the directory info, but file data), they won't know which ceph file system name to look up -- never mind the names you may have chosen for the pools, as the 'defaults' themselves changed across major releases, from data pool=<fsname>, metadata pool=<fsname>_metadata to data pool=<fsname>.data and metadata pool=<fsname>.meta as the ceph revisions came and went.

Any setup that just used client.admin for all mounts didn't see the problem, as the admin key gave blanket permission. A temporary 'fix' is to change mount requests to client.admin and its associated key. A less drastic but still half-fix is to change the osd cap for your user to just 'caps osd = "allow rw"', deleting the 'tag cephfs data=<fs name>' part.

The only documentation I could find for this upgrade security-related ceph-ending catastrophe was in the NFS, not cephfs, docs: https://docs.ceph.com/en/latest/cephfs/nfs/ -- plus the genius-level, much appreciated pointer from Curt here:

On 5/2/23 14:21, Curt wrote:
> This thread might be of use, it's an older version of ceph 14, but might still apply:
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/23FDDSYBCDVMYGCUTALACPFAJYITLOHJ/ ?
>
> On Tue, May 2, 2023 at 11:06 PM Harry G Coin wrote:
>> In 17.2.6, is there a security requirement that pool names supporting a cephfs filesystem match the filesystem name ('name.data' for the data pool and 'name.meta' for the associated metadata pool)? (Multiple file systems are enabled.)
>> [...]
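For the concrete names used earlier in this thread, the repair would presumably look like this (pool and filesystem names are the ones from the original post; substitute your own):

    ceph osd pool application set library cephfs data libraryfs
    ceph osd pool application set library_metadata cephfs metadata libraryfs

    # after which the documented caps work again, e.g.:
    ceph fs authorize libraryfs client.basicuser / rw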
[ceph-users] ls: cannot access '/cephfs': Stale file handle
I have two autofs entries that mount the same cephfs file system to two different mountpoints. Accessing the first of the two fails with 'stale file handle'. The second works normally. Other than the name of the mount point, the lines in autofs are identical. No amount of 'umount -f' or restarting autofs resolves it.

Any ideas?
[ceph-users] RHEL / CephFS / Pacific / SELinux unavoidable "relabel inode" error?
Hi! No matter what I try, using the latest cephfs on an all-Pacific ceph setup, I've not been able to avoid this error message, always similar to this, on RHEL-family clients:

    SELinux: inode=1099954719159 on dev=ceph was found to have an invalid context=system_u:object_r:unlabeled_t:s0. This indicates you may need to relabel the inode or the filesystem in question.

What's the answer?

Thanks

Harry Coin
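Not a definitive answer, but one hedged thing to try is forcing a label at mount time so the kernel never sees unlabeled_t inodes; the label below is the cephfs_t one used elsewhere in these threads, the mon address and mountpoint are placeholders, and whether a blanket mount-wide label is acceptable depends on the client's policy:

    mount -t ceph 192.0.2.41:6789:/ /mnt/test \
        -o name=cephuser,secretfile=/etc/ceph/secret1.key,context="system_u:object_r:cephfs_t:s0"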
[ceph-users] Puzzle re 'ceph: mds0 session blocklisted"
Can anyone help me understand seemingly contradictory cephfs error messages? I have a RHEL ceph client that mounts a cephfs file system via autofs. Very typical. After boot, when a user first uses the mount (for example 'ls /mountpoint'), all appears normal to the user. But on the system console I get, every time after first boot:

...
[  412.762310] Key type dns_resolver registered
[  412.912107] Key type ceph registered
[  412.925268] libceph: loaded (mon/osd proto 15/24)
[  413.110488] ceph: loaded (mds proto 32)
[  413.124870] libceph: mon3 (2)[fc00:1002:c7::44]:3300 session established
[  413.128298] libceph: client56471655 fsid 4067126d-01cb-40af-824a-881c130140f8
[  413.355716] ceph: mds0 session blocklisted
...

The autofs line is:

    /mountpoint -fstype=ceph,fs=cephlibraryfs,fsid=really-big-number,name=cephuser,secretfile=/etc/ceph/secret1.key,ms_mode=crc,relatime,recover_session=clean,mount_timeout=15,fscontext="system_u:object_r:cephfs_t:s0" [fc00:1002:c7::41]:3300,[fc00:1002:c7::42]:3300,[fc00:1002:c7::43]:3300,[fc00:1002:c7::44]:3300:/

'Blocklisting' is, well, 'bad'... but there's no obvious user effect. Is there an 'unobvious' problem? What am I missing? Ceph Pacific everywhere, latest.

Thanks

Harry Coin
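One hedged way to tell whether the cluster is actually holding a blocklist entry against this client, as opposed to the kernel merely logging the cleanup of a stale pre-reboot session at mount time:

    # list current blocklist entries and look for the client's address
    ceph osd blocklist ls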
[ceph-users] libcephfs init hangs, is there a 'timeout' argument?
Libcephfs's 'init' call hangs when passed arguments that once worked normally but later refer to a cluster that's either broken, on its way out of service, has too few mons, etc. At least the python libcephfs wrapper hangs on init. Of course mount and session timeouts work, but is there a way to error out a failed init call rather than just hang the client?

Thanks!
[ceph-users] A middle ground between containers and 'lts distros'?
I sense that the concern about ceph distribution via containers generally has to do with what you might call a feeling of 'opaqueness'. The feeling is amplified because most folks who choose open source solutions prize being able to promptly address the particular concerns affecting them, without having to wait for 'the next (might as well be opaque) release'.

An easy way forward might be for the ceph devs to document an approved set of steps that builds on the current ability to 'ssh in' to a container to make on-the-fly changes: a cephadm command to 'save the current state of a container' in a format that appends a '.lcl.1', '.lcl.2', plus a smoothed command-line process allowing the cephadm upgrade machinery to use those saved local images, with the local changes, as targets, so as to automate pushing the changes out to the rest.

??

Harry Coin
[ceph-users] "Just works" no-typing drive placement howto?
There's got to be some obvious way I haven't found for this common ceph use case, that happens at least once every couple weeks. I hope someone on this list knows and can give a link. The scenario goes like this, on a server with a drive providing boot capability, the rest osds: 1. First, some capable person, probably off-site, identifies a drive owned by an OSD that needs replacing. Maybe the drive totally failed, maybe it fails intermittently, or maybe it's working but just past the safe number of hours. The person sends a text to a 'good with tools, but not software' person on site to replace a specific drive. 2. Second, the on-site person clicks nothing, types nothing, pulls the correct drive, does the screwdriver thing to put a new disk in the caddy, pops the drive back into the server. 3. Third, and this is the step that needs to go away: An off-site person does a bunch of typing to bring that drive online. Ceph has a way to 'make any disk it finds into an OSD', which is great until the drive that got replaced isn't meant for ceph such as the boot drive(s). I feel this has been solved but I can't find it. Any help? Thanks Harry Coin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
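For reference, the closest thing I've found so far is an OSD service spec that only claims drives matching a filter, so a swapped-in boot drive never gets grabbed. A sketch only, with made-up host pattern and sizes:

service_type: osd
service_id: default_hdd
placement:
  host_pattern: 'noc*'
spec:
  data_devices:
    rotational: 1
    size: '4TB:'

Applied once with 'ceph orch apply -i osd-spec.yaml', cephadm is supposed to turn any matching blank drive it finds into an OSD on its own, which would cover step 3 -- if someone can confirm that's the intended 'just works' path I'd be grateful.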
[ceph-users] How to avoid 'bad port / jabber flood' = ceph killer?
I would really appreciate advice, because I bet many of you have 'seen this before' but I can't find a recipe. There must be a 'better way' to respond to this situation: It starts with a well-working small ceph cluster of 5 servers which, with no apparent change to the workflow, suddenly starts reporting lagging ops on three or four OSDs and a mon. They wave in and out. Then a file system is 'degraded' owing to many PGs stuck 'peering', in the 'throttle' stage, often for hours. In short, the whole composite platform becomes effectively useless. The dashboard works, and command line ops on all the hosts still work. Strangely, the first call of 'dd if=/dev/sdX of=/dev/null bs=4096 count=100' can take 15 seconds, regardless of the drive or host. Lots of free space, both memory and storage. No hardware-related drive or controller issues in any of the logs. The problem was resolved almost immediately, and all functions returned to normal, after detaching a cable linked to a wifi access point from the 'front side' commercial-grade 32-port switch all the hosts also connect to. The wifi access point was lightly loaded with clients, with no immediately obvious new devices or 'wardrivers'. The problem appears to be not an abruptly failing, but a slowly failing, ethernet port and/or cable and/or "IOT" device. 1: What's a better way, at 'mid-failure diagnosis time', to know directly which cable to pull instead of 'one by one until the offender is found'? 2: Related, in the same spirit as ceph's 'devicehealth', is there a way to profile 'usual and customary' traffic and then alert when a 'known connection' exceeds its baseline? Thanks in advance, I bet a good answer will help many people. Harry Coin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
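For question 1, about the only thing I know of short of pulling cables one by one is watching the per-port error counters climb, on the switch if it exposes them and on each host with something like (interface name is a placeholder):

# ip -s link show
# ethtool -S eth0 | egrep -i 'err|drop|crc'

A port fed by a flaky cable or a jabbering device usually shows rx errors/CRCs ticking upward while the healthy ports stay flat.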
[ceph-users] Re: v17.2.0 Quincy released
Great news! Any notion when the many pending bug fixes will show up in Pacific? It's been a while. On 4/19/22 20:36, David Galloway wrote: We're very happy to announce the first stable release of the Quincy series. We encourage you to read the full release notes at https://ceph.io/en/news/blog/2022/v17-2-0-quincy-released/ Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at https://download.ceph.com/tarballs/ceph-17.2.0.tar.gz * Containers at https://quay.io/repository/ceph/ceph * For packages, see https://docs.ceph.com/docs/master/install/get-packages/ * Release git sha1: 43e2e60a7559d3f46c9d53f1ca875fd499a1e35e ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] How to make ceph syslog items approximate ceph -w ?
Using Quincy I'm getting a much worse lag owing to ceph syslog message volume, though without obvious system errors. In the usual case of no current/active hardware errors and no software crashes: what config settings can I pick so that what appears in syslog is as close to what would appear in ceph -w as is possible for the subsystem type? Thanks! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
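In case the answer helps someone else too, the knobs I've been experimenting with are the cluster-log ones, since the cluster log is what feeds ceph -w. A sketch only, and the option names should be double-checked against the Quincy docs:

# ceph config set mon mon_cluster_log_to_syslog true
# ceph config set mon mon_cluster_log_to_syslog_level info
# ceph config set global log_to_syslog false
# ceph config set global log_to_stderr false

The last two are about cutting down the per-daemon debug chatter that containerized daemons otherwise pour into journald/syslog.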
[ceph-users] [progress WARNING root] complete: ev ... does not exist, oh my!
I tried searching for the meaning of a ceph Quincy all caps WARNING message, and failed. So I need help. Ceph tells me my cluster is 'healthy', yet emits a bunch of 'progress WARNING root] comlete ev' ... messages. Which I score right up there with the helpful dmesg "yama, becoming mindful", Should I care, and if I should, what is to be done? Here's the log snip: May 6 07:48:51 noc3 bash[3206]: cluster 2022-05-06T12:48:49.294641+ mgr.noc3.sybsfb (mgr.14574839) 20656 : cluster [DBG] pgmap v19338: 1809 pgs: 2 active+clean+scrubbing+deep, 1807 active+clean; 16 TiB data, 41 TiB used, 29 TiB / 70 TiB avail; 469 KiB/s rd, 4.7 KiB/s wr, 2 op/s May 6 07:48:51 noc3 bash[3206]: audit 2022-05-06T12:48:49.313491+ mon.noc1 (mon.3) 336 : audit [DBG] from='mgr.14574839 [fc00:1002:c7::43]:0/501702592' entity='mgr.noc3.sybsfb' cmd=[{"prefix": "config dump", "format": "json"}]: dispatch May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+ 7f2e20629700 0 [progress WARNING root] complete: ev dc5810d7-7a30-4c8f-bafa-3158423c49f3 does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+ 7f2e20629700 0 [progress WARNING root] complete: ev c81b591e-6498-41bd-98bb-edbf80c690f8 does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+ 7f2e20629700 0 [progress WARNING root] complete: ev a9632817-10e7-4a60-ae5c-a4220d7ca00b does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.224+ 7f2e20629700 0 [progress WARNING root] complete: ev 29a7ca4d-6e2a-423a-9530-3f61c0dcdbfe does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+ 7f2e20629700 0 [progress WARNING root] complete: ev 68de11a0-92a4-48b6-8420-752bcdd79182 does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+ 7f2e20629700 0 [progress WARNING root] complete: ev a9437122-8ff8-4de9-a048-8a3c0262b02c does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+ 7f2e20629700 0 [progress WARNING root] complete: ev f15c0540-9089-4a96-884e-d75668f84796 does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+ 7f2e20629700 0 [progress WARNING root] complete: ev eeaf605a-9c55-44c9-9c69-8c7c35ca7591 does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+ 7f2e20629700 0 [progress WARNING root] complete: ev ba0ff860-4fc5-4c84-b337-1c8c616b5fbd does not exist May 6 07:48:52 noc3 bash[3203]: debug 2022-05-06T12:48:52.228+ 7f2e20629700 0 [progress WARNING root] complete: ev 656fcf28-3ce1-4d6d-8ec2-eac5b6f0a233 does not exist May 6 07:48:52 noc3 bash[3203]: :::10.12.112.66 - - [06/May/2022:12:48:52] "GET /metrics HTTP/1.1" 200 421310 "" "Prometheus/2.33.4" May 6 07:48:53 noc3 bash[3206]: audit 2022-05-06T12:48:51.273954+ mon.noc1 (mon.3) 337 : audit [INF] from='mgr.14574839 [fc00:1002:c7::43]:0/501702592' entity='mgr.noc3.sybsfb' cmd=[{"prefix": "config rm", "format": "json", "who": "client", "name": "mon_cluster_log_file_level"}]: dispatch ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
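For anyone landing here from a search: the messages come from the mgr 'progress' module, and if they are only stale leftovers the module can, as far as I can tell, be told to forget them without harm:

# ceph progress clear
# ceph progress off      # blunter: stop tracking progress events entirely

Neither should touch data; they only affect the mgr's bookkeeping of long-running-operation events.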
[ceph-users] Re: reinstalled node with OSD
bbk, It did help! Thank you. Here's a slightly more 'with the osd-fsid details filled in' procedure for moving a 'dockerized' / container-run OSD set of drives to a replacement server/motherboard (or the same server with a blank/new/freshly reinstalled OS). For occasions when the 'new setup' will have the same hostname as the retired/replaced one. Also for when you'd rather not just wait for redundancy procedures to use other copies to refill fresh or freshly wiped drives. 1. Get the new or new-OS server entirely current, up and running, including validating the host is 'ceph ready' with the same hostname as the old: cephadm prepare-host Make sure the ceph public key is in /root/.ssh/authorized_keys: ceph cephadm get-pub-key > ~/ceph.pub ssh-copy-id -f -i ~/ceph.pub root@TargetHost Be sure you can 'ssh in' from a few other ceph cluster hosts. If you previously had mons, mds, mgrs, etc. in your ceph config to run on that host, you should notice after a couple of minutes that ceph has got them back into the cluster. Not that it's a good idea to have a bunch of such running on the same host as OSDs, but just in case. To gain confidence this will 'work', don't do the further steps until everything checks out and the only thing left to do is restore the OSDs. (ps axu; see the related mon, mgr, mds or other containers, if any, running). 2. Install the OSD drives. Reboot (there will be lvm pv/vgs on the OSD drives, but no ceph containers attached to them). 3. Run ceph config generate-minimal-conf then, from its output, make a template file that looks like this: osd.X.json: { "config": "# minimal ceph.conf for 4067126d-or-whatever\n[global]\n\tfsid = 4067126d-or-whatever\n\tmon_host = [v2:[fc00:..etcetcetc]\n", "keyring": "[osd.X]\n\tkey = X\n" } Note the parsers for the above are really, really picky about spaces, so get it exactly right. 4. cephadm ceph-volume lvm list You should see the list of OSDs plugged into the system. What you want is to copy the osd-fsid (later on). 5. For each osd without a running container, do: ceph auth get osd.[ID] 6. cp osd.X.json osd.[ID].json 7. Edit osd.[ID].json, change the key to the result of step 5 and the X in [osd.X] to the osd number. 8. Copy the osd-fsid for the correct volume from step 4. 9. Fix up this command to match your situation: cephadm deploy --name osd.X --fsid like-4067126d-whatever --osd-fsid FOR-THAT-SPECIFIC_OSD_X_from-step-4 --config-json osd.X.json changing the fsid, osd.X, osd-fsid and osd.X.json to match your situation. That will create a container with the OSD code in it, and restore it to the cluster. HTH Harry On 12/10/21 04:05, bbk wrote: Hi, i like to answer to myself :-) I finally found the rest of my documentation... So after reinstalling the OS also the osd config must be created. 
Here is what i have done, maybe this helps someone: -- Get the informations: ``` cephadm ceph-volume lvm list ceph config generate-minimal-conf ceph auth get osd.[ID] ``` Now create a minimal osd config: ``` vi osd.[ID].json ``` ``` { "config": "# minimal ceph.conf for 6d0ecf22-9155-4684-971a-2f6cde8628c8\n[global]\n\tfsid = 6d0ecf22-9155-4684-971a-2f6cde8628c8\n\tmon_host = [v2:192.168.6.21:3300/0,v1:192.168.6.21:6789/0] [v2:192.168.6.22:3300/0,v1:192.168.6.22:6789/0] [v2:192.168.6.23:3300/0,v1:192.168.6.23:6789/0] [v2:192.168.6.24:3300/0,v1:192.168.6.24:6789/0] [v2:192.168.6.25:3300/0,v1:192.168.6.25:6789/0]\n", "keyring": "[osd.XXX]\n\tkey = \n" } ``` Deploy the OSD daemon: ``` cephadm deploy --fsid 6d0ecf22-9155-4684-971a-2f6cde8628c8 --osd-fsid [ID] --name osd.[ID] --config-json osd.[ID].json ``` Yours, bbk On Thu, 2021-12-09 at 18:35 +0100, bbk wrote: After reading my mail it may not be clear that i reinstalled the OS of a node with OSDs. On Thu, 2021-12-09 at 18:10 +0100, bbk wrote: Hi, the last time i have reinstalled a node with OSDs, i added the disks with the following command. But unfortunatly this time i ran into a error. It seems like this time the command doesn't create the container, i am able to run `cephadm shell`, and other daemons (mon,mgr,mds) are running. I don't know if that is the right way to do it? ~# cephadm deploy --fsid 6d0ecf22-9155-4684-971a-2f6cde8628c8 --osd- fsid 941c6cb6-6898-4aa2-a33a-cec3b6a95cf1 --name osd.9 Non-zero exit code 125 from /usr/bin/podman container inspect -- format {{.State.Status}} ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8- osd-9 /usr/bin/podman: stderr Error: error inspecting object: no such container ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8-osd-9 Non-zero exit code 125 from /usr/bin/podman container inspect -- format {{.State.Status}} ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8- osd.9 /usr/bin/podman: stderr Error: error inspecting object: no such container ceph-6d0ecf22-9155-4684-971a-2f6cde8628c8-osd.9 Deploy daemon osd.9 ... Non-zero exit c
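To tie the procedure above together, here's a rough copy-paste sketch for a single OSD. The fsid, osd id and osd-fsid are placeholders you fill in from the listed commands; try it on one OSD before trusting it:

# sketch: re-deploy one OSD's container after an OS reinstall
FSID=4067126d-xxxx-xxxx-xxxx-xxxxxxxxxxxx        # cluster fsid
ID=3                                             # osd number
OSD_FSID=941c6cb6-xxxx-xxxx-xxxx-xxxxxxxxxxxx    # 'osd fsid' from cephadm ceph-volume lvm list
KEY=$(ceph auth get-key osd.$ID)
# join the minimal conf into one line with literal \n and \t escapes for the json
CONF=$(ceph config generate-minimal-conf | sed ':a;N;$!ba;s/\t/\\t/g;s/\n/\\n/g')
cat > osd.$ID.json <<EOF
{
  "config": "$CONF\n",
  "keyring": "[osd.$ID]\n\tkey = $KEY\n"
}
EOF
cephadm deploy --name osd.$ID --fsid $FSID --osd-fsid $OSD_FSID --config-json osd.$ID.json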
[ceph-users] The last 15 'degraded' items take as many hours as the first 15K?
Might someone explain why the count of degraded items can drop thousands, sometimes tens of thousands in the same number of hours it takes to go from 10 to 0? For example, when an OSD or a host with a few OSD's goes offline for a while, reboots. Sitting at one complete and entire degraded object out of millions for longer than it took to write this post. Seems the fewer the number of degraded objects, the less interested the cluster is in fixing it! HC ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
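For anyone wanting to see exactly which PGs are carrying the last stragglers while they wait, something along these lines narrows it down (state names per the pg docs):

# ceph pg ls degraded
# ceph pg ls recovering
# ceph health detail

That at least shows whether the remaining objects all live in one PG on one slow OSD.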
[ceph-users] Re: The last 15 'degraded' items take as many hours as the first 15K?
It's a little four host, 4 OSD/host HDD cluster with a 5th doing the non-osd work. Nearly entirely cephfs load. On 5/11/22 17:47, Josh Baergen wrote: Is this on SSD or HDD? RGW index, RBD, or ...? Those all change the math on single-object recovery time. Having said that...if the object is not huge and is not RGW index omap, that slow of a single-object recovery would have me checking whether I have a bad disk that's presenting itself as significantly underperforming. Josh On Wed, May 11, 2022 at 4:03 PM Harry G. Coin wrote: Might someone explain why the count of degraded items can drop thousands, sometimes tens of thousands in the same number of hours it takes to go from 10 to 0? For example, when an OSD or a host with a few OSD's goes offline for a while, reboots. Sitting at one complete and entire degraded object out of millions for longer than it took to write this post. Seems the fewer the number of degraded objects, the less interested the cluster is in fixing it! HC ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: The last 15 'degraded' items take as many hours as the first 15K?
On 5/12/22 02:05, Janne Johansson wrote: Den tors 12 maj 2022 kl 00:03 skrev Harry G. Coin : Might someone explain why the count of degraded items can drop thousands, sometimes tens of thousands in the same number of hours it takes to go from 10 to 0? For example, when an OSD or a host with a few OSD's goes offline for a while, reboots. Sitting at one complete and entire degraded object out of millions for longer than it took to write this post. Seems the fewer the number of degraded objects, the less interested the cluster is in fixing it! If (which is likely) different PGs take a different amount of time/IO to recover based on size, or amount of metadata attached to it and so on, then it would probably be the case that some of the PGs you see early on as part of the "35 PGs are backfilling" contain the slow ones but also the faster ones too, where the faster ones are replaced over as they finish. When all the easy work is done, only the slow ones remain, making it look like it waited until the end and then "don't want to work as hard on those as the first ones" when in fact the sum of work was always going to take a long time. (we had SMR drives on gig-eth boxes, when one of those crashed it took .. ges to fix). It's just that the easy parts pass by very fast due to the parallelism in the repairs, leaving you to see the hard parts but they were never equal to begin with. Thanks Janne and all for the insights! The reason why I half-jokingly suggested the cluster 'lost interest' in those last few fixes is that the recovery statistics' included in ceph -s reported near to zero activity for so long. After a long while those last few 'were fixed' --- but if the cluster was moving metadata around to fix the 'holdout repairs' that traffic wasn't in the stats. Those last few objects/pgs to be repaired seemingly got fixed 'by magic that didn't include moving data counted in the ceph -s stats'. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Grafana host overview -- "no data"?
I've a 'healthy' cluster with a dashboard where Grafana correctly reports the number of osds on a host and the correct raw capacity -- and 'no data' for any time period, for any of the osd's (dockerized Quincy). Meanwhile the top level dashboard cluster reports reasonable client throughput read/write Mi/B and Iops. What setup steps have I missed? Help! Thanks Harry Coin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
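The checks I know of for the metrics path, in case someone can point at which link is broken (daemon names will differ per cluster):

# ceph mgr module ls | grep -i prometheus
# ceph orch ps | egrep 'prometheus|node-exporter|grafana'
# ceph dashboard get-grafana-api-url

If node-exporter isn't running on the OSD hosts, or Prometheus isn't scraping the mgr, per-host panels like these tend to be the ones that come up empty.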
[ceph-users] Recovery throughput inversely linked with rbd_cache_xyz?
Hello, A couple days ago I increased the rbd cache size from the default to 256MB/osd on a small 4 node, 6 osd/node setup in a test/lab setting. The rbd volumes are all vm images with writeback cache parameters and steady if only a few mb/sec writes going on. Logging mostly. I noticed the recovery throughput went down 10x - 50x . Using Ceph nautilus. Am I seeing a coincidence or should recovery throughput tank when rbd cache sizes go up? The underlying pools are mirrored on three disks each on a different nodes. Thanks! Harry Coin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
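A sanity check I'd run while comparing, to make sure the recovery throttles themselves didn't change at the same time (option names per the OSD config reference):

# ceph config get osd osd_max_backfills
# ceph config get osd osd_recovery_max_active
# ceph config get osd osd_recovery_sleep_hdd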
[ceph-users] layout help: need chassis local io to minimize net links
Hi I have a few servers each with 6 or more disks, with a storage workload that's around 80% done entirely within each server. From a work-to-be-done perspective there's no need for 80% of the load to traverse network interfaces, the rest needs what ceph is all about. So I cooked up a set of crush maps and pools, one map/pool for each server and one map/pool for the whole. Skipping the long story, the performance remains network link speed bound and has got to change. "Chassis local" io is too slow. I even tried putting a mon within each server. I'd like to avoid having to revert to some other HA filesystem per server with ceph at the chassis layer if I can help it. Any notions that would allow 'chassis local' rbd traffic to avoid or mostly avoid leaving the box? Thanks! ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
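For concreteness, the per-server pools were built with rules of roughly this shape (bucket and pool names here are placeholders, not the real ones):

# ceph osd crush rule create-replicated local-noc1 noc1 osd
# ceph osd pool create noc1-local 64 64 replicated local-noc1

i.e. root the rule at the host bucket and use osd as the failure domain, so all replicas stay inside the chassis.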
[ceph-users] Re layout help: need chassis local io to minimize net links
I need exactly what ceph is for a whole lot of work, that work just doesn't represent a large fraction of the total local traffic. Ceph is the right choice. Plainly ceph has tremendous support for replication within a chassis, among chassis and among racks. I just need intra-chassis traffic to not hit the net much. Seems not such an unreasonable thing given the intra-chassis crush rules and all. After all.. ceph's name wasn't chosen for where it can't go On 6/29/20 1:57 PM, Marc Roos wrote: > I wonder if you should not have chosen a different product? Ceph is > meant to distribute data across nodes, racks, data centers etc. For a > nail use a hammer, for a screw use a screw driver. > > > -Original Message- > To: ceph-users@ceph.io > Subject: *SPAM* [ceph-users] layout help: need chassis local io > to minimize net links > > Hi > > I have a few servers each with 6 or more disks, with a storage workload > that's around 80% done entirely within each server. From a > work-to-be-done perspective there's no need for 80% of the load to > traverse network interfaces, the rest needs what ceph is all about. So > I cooked up a set of crush maps and pools, one map/pool for each server > and one map/pool for the whole. Skipping the long story, the > performance remains network link speed bound and has got to change. > "Chassis local" io is too slow. I even tried putting a mon within each > server. I'd like to avoid having to revert to some other HA > filesystem per server with ceph at the chassis layer if I can help > it. > > Any notions that would allow 'chassis local' rbd traffic to avoid or > mostly avoid leaving the box? > > Thanks! > > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Re layout help: need chassis local io to minimize net links
Thanks for the thinking. By 'traffic' I mean: when a user space rbd write has as a destination three replica osds in the same chassis, does the whole write get shipped out to the mon and then back, or just the write metadata to the mon, with the actual write data content not having to cross a physical ethernet cable but directly to the chassis-local osds via the 'virtual' internal switch? I thought when I read the layout of how ceph works only the control traffic goes to the mons, the data directly from the generator to the osds. Did I get that wrong? All the 'usual suspects' like lossy ethernets and miswirings, etc. have been checked. It's actually painful to sit and wait while 'update-initramfs' can take over a minute when the vm is chassis-local to the osds getting the write info. On 6/29/20 9:55 PM, Anthony D'Atri wrote: > What does “traffic” mean? Reads? Writes will have to hit the net > regardless of any machinations. > >> On Jun 29, 2020, at 7:31 PM, Harry G. Coin wrote: >> >> I need exactly what ceph is for a whole lot of work, that work just >> doesn't represent a large fraction of the total local traffic. Ceph is >> the right choice. Plainly ceph has tremendous support for replication >> within a chassis, among chassis and among racks. I just need >> intra-chassis traffic to not hit the net much. Seems not such an >> unreasonable thing given the intra-chassis crush rules and all. After >> all.. ceph's name wasn't chosen for where it can't go >> >>>>> On 6/29/20 1:57 PM, Marc Roos wrote: >>> I wonder if you should not have chosen a different product? Ceph is >>> meant to distribute data across nodes, racks, data centers etc. For a >>> nail use a hammer, for a screw use a screw driver. >>> -Original Message- >>> To: ceph-users@ceph.io >>> Subject: *SPAM* [ceph-users] layout help: need chassis local io >>> to minimize net links >>> Hi >>> I have a few servers each with 6 or more disks, with a storage workload >>> that's around 80% done entirely within each server. From a >>> work-to-be-done perspective there's no need for 80% of the load to >>> traverse network interfaces, the rest needs what ceph is all about. So >>> I cooked up a set of crush maps and pools, one map/pool for each server >>> and one map/pool for the whole. Skipping the long story, the >>> performance remains network link speed bound and has got to change. >>> "Chassis local" io is too slow. I even tried putting a mon within each >>> server.I'd like to avoid having to revert to some other HA >>> filesystem per server with ceph at the chassis layer if I can help >>> it. >>> Any notions that would allow 'chassis local' rbd traffic to avoid or >>> mostly avoid leaving the box? >>> Thanks! >>> ___ >>> ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an >>> email to ceph-users-le...@ceph.io >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Re layout help: need chassis local io to minimize net links
Anthony asked about the 'use case'. Well, I haven't gone into details because I worried it wouldn't help much. From a 'ceph' perspective, the sandbox layout goes like this: 4 pretty much identical old servers, each with 6 drives, and a smaller server just running a mon to break ties. Usual front-side lan, separate back-side networking setup. Each of the servers is running a few vms, all more or less identical for the test case. Each of the vms is supported by a rbd via user space libvirt (not kernel mapped). Each rbd belongs to a pool that is entirely local to the chassis, presently a replica on 3 of the osds. One of the littler vms runs a mon+mgr per chassis. Of course what's important is there's a pool that spans the chassis and does all the usual things for userland ceph is good at. But for these tests I just unplugged all that. So, do any process that involves a bunch of little writes -- like installing a package or updating a initramfs and be ready to sit for a long time. All the drives are 7200 rpm SATA spinners. CPU's are not overloaded (fewer vms than cores), no swapping, memory left over. All write-back caching, virtio drives. Ceph octopus latest, though it's no better than nautilus performance wise in this case. Ubuntu LTS/focal/20.04 I think. Checked all the networking stats, no dropped packets, no overflow buffers and anyhow there shouldn't be any important traffic on the front side and only ceph owns the back end. No ceph problems reported, all pgs active, nothing misplaced, no erasure coded pools. So, there's a tiny novel, thanks for sticking with it! On 6/29/20 11:12 PM, Anthony D'Atri wrote: >> Thanks for the thinking. By 'traffic' I mean: when a user space rbd >> write has as a destination three replica osds in the same chassis > eek. > >> does the whole write get shipped out to the mon and then back > Mons are control-plane only. > >> All the 'usual suspects' like lossy ethernets and miswirings, etc. have >> been checked. It's actually painful to sit and wait while >> 'update-initramfs' can take over a minute when the vm is chassis-local >> to the osds getting the write info. > You have shared almost none of your hardware or use-case. We know that > you’re doing convergence, with unspecified CPU, memory, drives. We also > don’t know how heavy your colocated compute workload is. Since you mention > update-initramfs, I’ll guess that your workload is VMs with RBD volumes > attached to libvirt/QEMU? With unspecified RBD cache configuration. We also > know nothing of your network setup and saturation. > > I have to suspect that either you’re doing something fundamentally wrong, or > should just set up a RAID6 volume and carve out LVMs. > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Are there 'tuned profiles' for various ceph scenarios?
Hi Are there any 'official' or even 'works for us' pointers to 'tuned profiles' for such common uses as 'ceph baremetal osd host' 'ceph osd + libvirt host' 'ceph mon/mgr' 'guest vm based on a kernel-mounted rbd' 'guest vm based on a direct virtio->rados link' I suppose there are a few other common configurations, but you get the idea. If you haven't used or know of 'tuned'-- it's a nice way to collect a great whole lot of sysctl and other low level configuration options in one spot. https://tuned-project.org/ Thanks Harry Coin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
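For anyone unfamiliar with the format, a profile is just a directory under /etc/tuned containing a tuned.conf. The sketch below shows the shape, with a couple of commonly suggested sysctls standing in for whatever the 'official' values ought to be:

[main]
summary=ceph baremetal osd host (sketch)
include=throughput-performance

[sysctl]
kernel.pid_max=4194304
vm.swappiness=10

Saved as /etc/tuned/ceph-osd-host/tuned.conf and activated with 'tuned-adm profile ceph-osd-host'. What I'm after is the blessed set of values to put in the [sysctl] section for each of the roles above.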
[ceph-users] Re: *****SPAM***** Are there 'tuned profiles' for various ceph scenarios?
Marc: Here's a template that works here. You'll need to do some steps to create the 'secret' and make the block devs and so on: Glad I could contribute something. Sure would appreciate leads for the suggested sysctls/etc either apart or as tuned profiles. Harry On 7/1/20 2:44 PM, Marc Roos wrote: > > Just curious, how does the libvirt xml part look like of a 'direct > virtio->rados link' and 'kernel-mounted rbd' > > > > > > -Original Message- > To: ceph-users@ceph.io > Subject: *SPAM* [ceph-users] Are there 'tuned profiles' for > various ceph scenarios? > > Hi > > Are there any 'official' or even 'works for us' pointers to 'tuned > profiles' for such common uses as > > 'ceph baremetal osd host' > > 'ceph osd + libvirt host' > > 'ceph mon/mgr' > > 'guest vm based on a kernel-mounted rbd' > > 'guest vm based on a direct virtio->rados link' > > I suppose there are a few other common configurations, but you get the > idea. > > If you haven't used or know of 'tuned'-- it's a nice way to collect a > great whole lot of sysctl and other low level configuration options in > one spot. https://tuned-project.org/ > > Thanks > > Harry Coin > > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Are there 'tuned profiles' for various ceph scenarios?
[Resent to correct title] Marc: Here's a template that works here. You'll need to do some steps to create the 'secret' and make the block devs and so on: Glad I could contribute something. Sure would appreciate leads for the suggested sysctls/etc either apart or as tuned profiles. Harry On 7/1/20 2:44 PM, Marc Roos wrote: > > Just curious, how does the libvirt xml part look like of a 'direct > virtio->rados link' and 'kernel-mounted rbd' > > > > > > -Original Message- > To: ceph-users@ceph.io > Subject: *SPAM* [ceph-users] Are there 'tuned profiles' for > various ceph scenarios? > > Hi > > Are there any 'official' or even 'works for us' pointers to 'tuned > profiles' for such common uses as > > 'ceph baremetal osd host' > > 'ceph osd + libvirt host' > > 'ceph mon/mgr' > > 'guest vm based on a kernel-mounted rbd' > > 'guest vm based on a direct virtio->rados link' > > I suppose there are a few other common configurations, but you get the > idea. > > If you haven't used or know of 'tuned'-- it's a nice way to collect a > great whole lot of sysctl and other low level configuration options in > one spot. https://tuned-project.org/ > > Thanks > > Harry Coin > > > > > ___ > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an > email to ceph-users-le...@ceph.io > > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
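For the archive, since inline XML tends to get eaten: the 'direct virtio->rados link' is the standard librbd network disk stanza, roughly like this (pool/image, secret uuid and mon host are placeholders, not the exact template referred to above):

<disk type='network' device='disk'>
  <driver name='qemu' type='raw' cache='writeback'/>
  <auth username='libvirt'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <source protocol='rbd' name='vmpool/guest1-disk0'>
    <host name='mon1.example.net' port='6789'/>
  </source>
  <target dev='vda' bus='virtio'/>
</disk>

The 'kernel-mounted rbd' variant is just a plain <disk type='block'> pointing at the mapped /dev/rbd device instead.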
[ceph-users] Re: v14.2.3 Nautilus released
Does anyone know if the change to disable spdk by default (so as to remove the corei7 dependency when running on intel platforms) made it in to 14.2.3? The spdk version only required core2 in 14.2.1, the change to require corei7 in 14.2.2 killed all the osds on older systems flat. On 9/4/19 8:45 AM, Abhishek Lekshmanan wrote: This is the third bug fix release of Ceph Nautilus release series. This release fixes a security issue. We recommend all Nautilus users upgrade to this release. For upgrading from older releases of ceph, general guidelines for upgrade to nautilus must be followed Notable Changes --- * CVE-2019-10222 - Fixed a denial of service vulnerability where an unauthenticated client of Ceph Object Gateway could trigger a crash from an uncaught exception * Nautilus-based librbd clients can now open images on Jewel clusters. * The RGW `num_rados_handles` has been removed. If you were using a value of `num_rados_handles` greater than 1, multiply your current `objecter_inflight_ops` and `objecter_inflight_op_bytes` parameters by the old `num_rados_handles` to get the same throttle behavior. * The secure mode of Messenger v2 protocol is no longer experimental with this release. This mode is now the preferred mode of connection for monitors. * "osd_deep_scrub_large_omap_object_key_threshold" has been lowered to detect an object with large number of omap keys more easily. For a detailed changelog please refer to the official release notes entry at the ceph blog: https://ceph.io/releases/v14-2-3-nautilus-released/ Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://download.ceph.com/tarballs/ceph-14.2.3.tar.gz * For packages, see http://docs.ceph.com/docs/master/install/get-packages/ * Release git sha1: 0f776cf838a1ae3130b2b73dc26be9c95c6ccc39 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"
I need help to remove a useless "HEALTH ERR" in 19.2.0 on a fully dual stack docker setup with ceph using ip v6, public and private nets separated, with a few servers. After upgrading from an error free v18 rev, I can't get rid of the 'health err' owing to the report that all osds are unreachable. Meanwhile ceph -s reports all osds up and in and the cluster otherwise operates normally. I don't care if it's 'a real fix' I just need to remove the false error report. Any ideas? Thanks Harry Coin ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"
Hi Frédéric All was normal in v18, after 19.2 the problem remains even though the addresses are different: cluster_network global: fc00:1000:0:b00::/64 public_network global: fc00:1002:c7::/64 Also, after rebooting everything in sequence, it only complains that the 27 osd that are both up, in and working normally remain also "not reachable". ~# ceph -s cluster: id: ... health: HEALTH_ERR 27 osds(s) are not reachable services: ... osd: 27 osds: 27 up (since 6m), 27 in (since 12d) On 10/16/24 03:44, Frédéric Nass wrote: Hi Harry, Do you have a 'cluster_network' set to the same subnet as the 'public_network' like in the issue [1]? Doesn't make much sens setting up a cluster_network when it's not different than the public_network. Maybe that's what triggers the OSD_UNREACHABLE recently coded here [2] (even though it seems the code only considers IPv4 addresses, which seems odd, btw.) I suggest removing the cluster_network and restart a single OSD to see if the counter decreases. Regards, Frédéric. [1]https://tracker.ceph.com/issues/67517 [2]https://github.com/ceph/ceph/commit/5b70a6b92079f9e9d5d899eceebc1a62dae72997 - Le 16 Oct 24, à 3:02, Harry G coinhgc...@gmail.com a écrit : Thanks for the notion! I did that, the result was no change to the problem, but with the added ceph -s complaint "Public/cluster network defined, but can not be found on any host" -- with otherwise totally normal cluster operations. Go figure. How can ceph -s be so totally wrong, the dashboard reporting critical problems -- except there are none. Makes me really wonder whether any actual testing on ipv6 is ever done before releases are marked 'stable'. HC On 10/14/24 21:04, Anthony D'Atri wrote: Try failing over to a standby mgr On Oct 14, 2024, at 9:33 PM, Harry G Coin wrote: I need help to remove a useless "HEALTH ERR" in 19.2.0 on a fully dual stack docker setup with ceph using ip v6, public and private nets separated, with a few servers. After upgrading from an error free v18 rev, I can't get rid of the 'health err' owing to the report that all osds are unreachable. Meanwhile ceph -s reports all osds up and in and the cluster otherwise operates normally. I don't care if it's 'a real fix' I just need to remove the false error report. Any ideas? Thanks Harry Coin ___ ceph-users mailing list --ceph-users@ceph.io To unsubscribe send an emailtoceph-users-le...@ceph.io ___ ceph-users mailing list --ceph-users@ceph.io To unsubscribe send an email toceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: v19 & IPv6: unable to convert chosen address to string
Same errors as below on latest ceph / latest Ubuntu LTS / noble when updating from reef to squid. The same 'ceph -s' that reports all osd's are 'up' and 'in' also reports all of them are 'unreachable'. I hate it when that happens. All OSD/mon/mgr hosts are dual stack, but ceph uses just ip6 with different subnets/interfaces for public & ceph internals. On 10/1/24 09:43, Sascha Frey wrote: Hi, after upgrading our Ceph cluster from 18.2.2 to 19.2.0, I get the following error messages: 'ceph status' shows HEALTH_ERR 504 osds(s) are not reachable but, luckily, everything works fine. I see these related messages in the monitor logs (these messages show up for each OSD): 2024-10-01T16:26:36.185+0200 7ffbd5600640 -1 log_channel(cluster) log [ERR] : osd.101's public address is not in '2001:638:504:2011::/64' subnet ... 2024-10-01T16:34:01.011+0200 7ffbdba00640 -1 unable to convert chosen address to string: 2001:638:504:2011:9:4:1:1 ... IPv6 is enabled in ceph.conf: ms bind ipv4 = false ms bind ipv6 = true public network = 2001:638:504:2011::/64 Is this a bug? Thanks, Sascha ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Help with "27 osd(s) are not reachable" when also "27 osds: 27 up.. 27 in"
Thanks for the notion! I did that, the result was no change to the problem, but with the added ceph -s complaint "Public/cluster network defined, but can not be found on any host" -- with otherwise totally normal cluster operations. Go figure. How can ceph -s be so totally wrong, the dashboard reporting critical problems -- except there are none. Makes me really wonder whether any actual testing on ipv6 is ever done before releases are marked 'stable'. HC On 10/14/24 21:04, Anthony D'Atri wrote: Try failing over to a standby mgr On Oct 14, 2024, at 9:33 PM, Harry G Coin wrote: I need help to remove a useless "HEALTH ERR" in 19.2.0 on a fully dual stack docker setup with ceph using ip v6, public and private nets separated, with a few servers. After upgrading from an error free v18 rev, I can't get rid of the 'health err' owing to the report that all osds are unreachable. Meanwhile ceph -s reports all osds up and in and the cluster otherwise operates normally. I don't care if it's 'a real fix' I just need to remove the false error report. Any ideas? Thanks Harry Coin ___ ceph-users mailing list --ceph-users@ceph.io To unsubscribe send an email toceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] 19.2.1: HEALTH_ERR 27 osds(s) are not reachable. (Yet working normally...)
19.2.1 complains of all osd's being unreachable, as their public address isn't in the public subnet. However, they all are within the subnet, and are working normally as well. It's embarrassing for the dashboard to glow red of a totally crippled osd roster --- while all is working normally. This existed in the previous, but was working prior to 19. Detail: Notice, for osd.0, the dashboard lists public_addr [fc00:1002:c7::44]:6807/4160993080 But, we have in the logs: 7/2/25 03:35 PM[ERR] osd.0's public address is not in 'fc00:1002:c7::/64' subnet 7/2/25 03:35 PM[ERR][ERR] OSD_UNREACHABLE: 27 osds(s) are not reachable 7/2/25 03:35 PM[ERR]Health detail: HEALTH_ERR 27 osds(s) are not reachable However, as per the osd.0 attributes, the public address for osd.0 is well inside the stated public subnet. All the osd's are similarly configured, working, and held to be unreachable at the same time, for the same reason. Tell me there's a way to fix this without waiting a further half year ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Squid 19.2.1 dashboard javascript error
In the same code area: If all the alerts are silenced, nevertheless the dashboard will not show 'green', but red or yellow depending on the nature of the silenced alerts. On 2/10/25 04:18, Nizamudeen A wrote: Thank you Chris, I was able to reproduce this. We will look into it and send out a fix. Regards, Nizam On Fri, Feb 7, 2025 at 10:35 PM Chris Palmer wrote: Firstly thank you so much for the 19.2.1 release. Initial testing suggests that the blockers that we had in 19.2.0 have all been resolved, so we are proceeding with further testing. We have noticed one small problem in 19.2.1 that was not present in 19.2.0 though. We use the older-style dashboard (mgr/dashboard/FEATURE_TOGGLE_DASHBOARD false). The problem happens on the Dashboard screen when health changes to WARN. If you click on WARN you get a small empty dropdown instead of the list of warnings. A javascript error is logged, and using browser inspection there is the additional bit of info that it happens in polyfill: 2025-02-07T15:59:00.970+ 7f1d63877640 0 [dashboard ERROR frontend.error] (https://:8443/#/dashboard): NG0901 Error: NG0901 at d.find (https:// :8443/main.7869bccdd1b73f3c.js:3:3342365) at le.ngDoCheck (https://:8443/main.7869bccdd1b73f3c.js:3:3173112) at Qe (https://:8443/main.7869bccdd1b73f3c.js:3:3225586) at bt (https://:8443/main.7869bccdd1b73f3c.js:3:3225341) at cs (https://:8443/main.7869bccdd1b73f3c.js:3:3225051) at $m (https://:8443/main.7869bccdd1b73f3c.js:3:3259043) at jf (https://:8443/main.7869bccdd1b73f3c.js:3:3266563) at S1 (https://:8443/main.7869bccdd1b73f3c.js:3:3259790) at $m (https://:8443/main.7869bccdd1b73f3c.js:3:3259801) at fg (https://:8443/main.7869bccdd1b73f3c.js:3:3267248) Also, after this happens, no dropdowns work again until the page is forcibly refreshed. Environment is RPM install on Centos 9 Stream. I've created issue [0]. Thanks, Chris [0] https://tracker.ceph.com/issues/69867 < https://tracker.ceph.com/issues/69867?next_issue_id=69865&prev_issue_id=90 ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: 19.2.1: HEALTH_ERR 27 osds(s) are not reachable. (Yet working normally...)
Hi Frédéric, Another half year added to the previous half year wait for basic IP6 clusters then. If only 'ceph health mute' accomplished the goal as a workaround. Notice even when all complaints are 'suppressed' -- the dashboard continues to offer the 'flashing red warning dot', and the ! Cluster critical advice. I think that bug has two levels, first: even when other warnings/errors are suppressed, the error that complains of being in a heath error for more than 5 minutes remains. Second, even when the 'things have been bad for 5 minutes' warning is also silenced, the ! Critical advice remains and the flashing red 'ceph is broken' dot. This while under 'observability' the Alerts shows all is well. Ceph is good in the engine room, but the steering wheel and dashboard needs some work to match the advertising and quality of the rest! Harry On 2/7/25 16:24, Frédéric Nass wrote: Hi Harry, It's a inoffensive bug [1] related to IPv6 clusters. It will be fixed in v19.2.2. The workaround is to mute the error with 'ceph health mute ...'. It's all you can do for now. Regards, . ---- *De :* Harry G Coin *Envoyé :* vendredi 7 février 2025 22:52 *À :* ceph-users *Objet :* [ceph-users] 19.2.1: HEALTH_ERR 27 osds(s) are not reachable. (Yet working normally...) 19.2.1 complains of all osd's being unreachable, as their public address isn't in the public subnet. However, they all are within the subnet, and are working normally as well. It's embarrassing for the dashboard to glow red of a totally crippled osd roster --- while all is working normally. This existed in the previous, but was working prior to 19. Detail: Notice, for osd.0, the dashboard lists public_addr [fc00:1002:c7::44]:6807/4160993080 But, we have in the logs: 7/2/25 03:35 PM[ERR] osd.0's public address is not in 'fc00:1002:c7::/64' subnet 7/2/25 03:35 PM[ERR][ERR] OSD_UNREACHABLE: 27 osds(s) are not reachable 7/2/25 03:35 PM[ERR]Health detail: HEALTH_ERR 27 osds(s) are not reachable However, as per the osd.0 attributes, the public address for osd.0 is well inside the stated public subnet. All the osd's are similarly configured, working, and held to be unreachable at the same time, for the same reason. Tell me there's a way to fix this without waiting a further half year ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Squid 19.2.1 dashboard javascript error
Hi Nizam Answers interposed below. On 2/10/25 11:56, Nizamudeen A wrote: Hey Harry, Do you see that for every alert or for some of them? If some, what are those? I just tried a couple of them locally and saw the dashboard went to a happy state. My sanbox/dev array has three chronic 'warnings/errors'. The first is a PG imbalance I'm aware of. The second is that all 27 osds are unreachable. The third is that the array has been in an error state for more than 5 minutes. Silencing/suppressing all of them still gives the 'red flashing broken dot' on the dashboard, the !Cluster status, notice of Alerts listing the previously suppressed errors/warnings. Under 'observability' we see no indications of errors/warnings under the 'alerts' menu option -- so you got that one right. Can you tell me how the ceph health or ceph health detail looks like after the muted alert? And also does ceph -s reports HEALTH_OK? root@noc1:~# ceph -s cluster: id: 40671140f8 health: HEALTH_ERR 27 osds(s) are not reachable services: mon: 5 daemons, quorum noc4,noc2,noc1,noc3,sysmon1 (age 10m) mgr: noc1.j(active, since 37m), standbys: noc2.yhx, noc3.b, noc4.tc mds: 1/1 daemons up, 3 standby osd: 27 osds: 27 up (since 14m), 27 in (since 5w) Ceph's actual core operations are otherwise normal. It's hard to sell ceph as a concept when showing all the storage is at once unreachable and up and in as well. Not a big confidence builder. Regards, Nizam On Mon, Feb 10, 2025 at 9:00 PM Harry G Coin wrote: In the same code area: If all the alerts are silenced, nevertheless the dashboard will not show 'green', but red or yellow depending on the nature of the silenced alerts. On 2/10/25 04:18, Nizamudeen A wrote: > Thank you Chris, > > I was able to reproduce this. We will look into it and send out a fix. > > Regards, > Nizam > > On Fri, Feb 7, 2025 at 10:35 PM Chris Palmer wrote: > >> Firstly thank you so much for the 19.2.1 release. Initial testing >> suggests that the blockers that we had in 19.2.0 have all been resolved, >> so we are proceeding with further testing. >> >> We have noticed one small problem in 19.2.1 that was not present in >> 19.2.0 though. We use the older-style dashboard >> (mgr/dashboard/FEATURE_TOGGLE_DASHBOARD false). The problem happens on >> the Dashboard screen when health changes to WARN. If you click on WARN >> you get a small empty dropdown instead of the list of warnings. A >> javascript error is logged, and using browser inspection there is the >> additional bit of info that it happens in polyfill: >> >> 2025-02-07T15:59:00.970+ 7f1d63877640 0 [dashboard ERROR >> frontend.error] (https://:8443/#/dashboard): NG0901 >> Error: NG0901 >> at d.find (https:// >> :8443/main.7869bccdd1b73f3c.js:3:3342365) >> at le.ngDoCheck >> (https://:8443/main.7869bccdd1b73f3c.js:3:3173112) >> at Qe (https://:8443/main.7869bccdd1b73f3c.js:3:3225586) >> at bt (https://:8443/main.7869bccdd1b73f3c.js:3:3225341) >> at cs (https://:8443/main.7869bccdd1b73f3c.js:3:3225051) >> at $m (https://:8443/main.7869bccdd1b73f3c.js:3:3259043) >> at jf (https://:8443/main.7869bccdd1b73f3c.js:3:3266563) >> at S1 (https://:8443/main.7869bccdd1b73f3c.js:3:3259790) >> at $m (https://:8443/main.7869bccdd1b73f3c.js:3:3259801) >> at fg (https://:8443/main.7869bccdd1b73f3c.js:3:3267248) >> >> Also, after this happens, no dropdowns work again until the page is >> forcibly refreshed. >> >> Environment is RPM install on Centos 9 Stream. >> >> I've created issue [0]. 
>> >> Thanks, Chris >> >> [0] https://tracker.ceph.com/issues/69867 >> < >> https://tracker.ceph.com/issues/69867?next_issue_id=69865&prev_issue_id=90 <https://tracker.ceph.com/issues/69867?next_issue_id=69865&prev_issue_id=90> >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Squid 19.2.1 dashboard javascript error
Yes, all the errors and warnings list as 'suppressed'. Doesn't affect the bug as reported below. Of some interest, "OSD_UNREACHABLE" is not listed on the dashboard alert roster of problems, but is in the command line health detail. But really, when all the errors list as 'suppressed', whatever they are, then the dashboard should show green. Instead it flashes red, along with !Critical as detailed below. I suspect what's really going on is the detection method for showing the 'red / yellow / green' decision and !Critical decision, is different than whether the length of unsilenced errors is >0. Even allowing for the possibility that many errors exist which could trigger HEALTH_ERR for which no entry in the roster of alerts exists. I wish I knew whether mon/mgr host changes erases all the mutings. I have just now given the command "ceph health mute OSD_UNREACHABLE 180d' -- for the second time this week, and now the board shows green. I scrolled back through the command list to very I did this. Indeed it was there. Is there a command line that lists active mutings -- one that's not used by the dashboard apparently? On 2/10/25 14:00, Eugen Block wrote: Hi, did you also mute the osd_unreachable warning? ceph health mute OSD_UNREACHABLE 10w Should bring the cluster back to HEALTH_OK for 10 weeks. Zitat von Harry G Coin : Hi Nizam Answers interposed below. On 2/10/25 11:56, Nizamudeen A wrote: Hey Harry, Do you see that for every alert or for some of them? If some, what are those? I just tried a couple of them locally and saw the dashboard went to a happy state. My sanbox/dev array has three chronic 'warnings/errors'. The first is a PG imbalance I'm aware of. The second is that all 27 osds are unreachable. The third is that the array has been in an error state for more than 5 minutes. Silencing/suppressing all of them still gives the 'red flashing broken dot' on the dashboard, the !Cluster status, notice of Alerts listing the previously suppressed errors/warnings. Under 'observability' we see no indications of errors/warnings under the 'alerts' menu option -- so you got that one right. Can you tell me how the ceph health or ceph health detail looks like after the muted alert? And also does ceph -s reports HEALTH_OK? root@noc1:~# ceph -s cluster: id: 40671140f8 health: HEALTH_ERR 27 osds(s) are not reachable services: mon: 5 daemons, quorum noc4,noc2,noc1,noc3,sysmon1 (age 10m) mgr: noc1.j(active, since 37m), standbys: noc2.yhx, noc3.b, noc4.tc mds: 1/1 daemons up, 3 standby osd: 27 osds: 27 up (since 14m), 27 in (since 5w) Ceph's actual core operations are otherwise normal. It's hard to sell ceph as a concept when showing all the storage is at once unreachable and up and in as well. Not a big confidence builder. Regards, Nizam On Mon, Feb 10, 2025 at 9:00 PM Harry G Coin wrote: In the same code area: If all the alerts are silenced, nevertheless the dashboard will not show 'green', but red or yellow depending on the nature of the silenced alerts. On 2/10/25 04:18, Nizamudeen A wrote: > Thank you Chris, > > I was able to reproduce this. We will look into it and send out a fix. > > Regards, > Nizam > > On Fri, Feb 7, 2025 at 10:35 PM Chris Palmer wrote: > >> Firstly thank you so much for the 19.2.1 release. Initial testing >> suggests that the blockers that we had in 19.2.0 have all been resolved, >> so we are proceeding with further testing. >> >> We have noticed one small problem in 19.2.1 that was not present in >> 19.2.0 though. 
We use the older-style dashboard >> (mgr/dashboard/FEATURE_TOGGLE_DASHBOARD false). The problem happens on >> the Dashboard screen when health changes to WARN. If you click on WARN >> you get a small empty dropdown instead of the list of warnings. A >> javascript error is logged, and using browser inspection there is the >> additional bit of info that it happens in polyfill: >> >> 2025-02-07T15:59:00.970+ 7f1d63877640 0 [dashboard ERROR >> frontend.error] (https://:8443/#/dashboard): NG0901 >> Error: NG0901 >> at d.find (https:// >> :8443/main.7869bccdd1b73f3c.js:3:3342365) >> at le.ngDoCheck >> (https://:8443/main.7869bccdd1b73f3c.js:3:3173112) >> at Qe (https://:8443/main.7869bccdd1b73f3c.js:3:3225586) >> at bt (https://:8443/main.7869bccdd1b73f3c.js:3:3225341) >> at cs (https://:8443/main.7869bccdd1b73f3c.js:3:3225051) >> at $m (https://:8443/main.7869bcc
[ceph-users] Re: squid 19.2.1 RC QE validation status
Any chance for this one or one that fixes 'all osd's unreachable' when ipv6 in use? https://github.com/ceph/ceph/pull/60881 On 12/18/24 11:35, Ilya Dryomov wrote: On Mon, Dec 16, 2024 at 6:27 PM Yuri Weinstein wrote: Details of this release are summarized here: https://tracker.ceph.com/issues/69234#note-1 Release Notes - TBD LRC upgrade - TBD Gibba upgrade -TBD Please provide tracks for failures so we avoid duplicates. Seeking approvals/reviews for: rados - Radek, Laura rgw - Eric, Adam E fs - Venky orch - Adam King rbd, krbd - Ilya Hi Yuri, Consider rbd and krbd approved, but if there is a respin I'd like to include https://github.com/ceph/ceph/pull/61095 and a backport of https://github.com/ceph/ceph/pull/61129 (yet to merge). Thanks, Ilya ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] 19.2.1 dashboard OSD column sorts do nothing?
Has anyone else tried to change the sort order of columns in the cluster/osd display on 19.2.1? While the header changes to indicate 'increasing/descending' and 'selected', the rows stay fixed on an ascending order by ID. ??? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: v19.2.2 Squid released
19.2.2 Installed! # ceph -s cluster: id: ,,, health: HEALTH_ERR 27 osds(s) are not reachable ... osd: 27 osds: 27 up (since 32m), 27 in (since 5w) ... It's such a 'bad look' something so visible, in such an often given command. 10/4/25 06:00 PM[ERR]osd.27's public address is not in 'fc00:1002:c7::/64' subnet But # ceph config get osd.27 .. global basic public_network fc00:1002:c7::/64 ... ifconfig of osd.27 ... inet6 fc00:1002:c7::43/64 scope global valid_lft forever preferred_lft forever ... ..similar for all the other osds, although of course on different hosts. On 4/10/25 15:08, Yuri Weinstein wrote: We're happy to announce the 2nd backport release in the Squid series. https://ceph.io/en/news/blog/2025/v19-2-2-squid-released/ Notable Changes --- - This hotfix release resolves an RGW data loss bug when CopyObject is used to copy an object onto itself. S3 clients typically do this when they want to change the metadata of an existing object. Due to a regression caused by an earlier fix for https://tracker.ceph.com/issues/66286, any tail objects associated with such objects are erroneously marked for garbage collection. RGW deployments on Squid are encouraged to upgrade as soon as possible to minimize the damage. The experimental rgw-gap-list tool can help to identify damaged objects. Getting Ceph * Git atgit://github.com/ceph/ceph.git * Tarball athttps://download.ceph.com/tarballs/ceph-19.2.2.tar.gz * Containers athttps://quay.io/repository/ceph/ceph * For packages, seehttps://docs.ceph.com/en/latest/install/get-packages/ * Release git sha1: 0eceb0defba60152a8182f7bd87d164b639885b8 ___ ceph-users mailing list --ceph-users@ceph.io To unsubscribe send an email toceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Upgrade from 18.2.4 to 19.x or even 20
Frédéric's policy below I think is very good advice. The only reason to upgrade sooner than his advice is when you need a missing feature or fear hitting a fixed bug -- or just like living on the edge. On 4/19/25 16:45, Frédéric Nass wrote: Hi e3gh75 :-) There's no dumb questions. Here's a sum up of what I've learned: - You can skip point releases in the same major version (18.2.2 -> 18.2.4). - You can skip minor releases in the same major version (18.0.1 -> 18.2.2). - You can skip one major release (at max) while also skipping minor releases and point releases (17.1.1 -> 19.2.2) but it's always better to upgrade to the latest major release (major+minor+point) before upgrading to the latest up-to-date major release (17.1.1 -> 17.2.8 -> 19.2.2). Regarding major releases, you can go from Pacific (16) to Reef (18) or Quincy (17) to Squid (19) without concerns. Choosing a stable release (x.2.z) is always preferable to x.0.z and x.1.z. Waiting 3-4 weeks to jump to the latest point release is also good practice. Regards, . - Le 18 Avr 25, à 18:21, e3g...@gmail.com a écrit : Hello, I have dumb question but hopefully simple question. In the cephadm documentation it states you can upgrade from a point release to another point release without problem, 15.2.2 to 15.2.3. What about jumping up a version or two, like from 18.2.4 to 19.2.2? What have others experience been? Any advice? Thank you for your time. ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
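And for the mechanics: with cephadm, each hop in that sequence is a single command plus patience, e.g. (the image tag is whichever release you're targeting):

# ceph orch upgrade start --image quay.io/ceph/ceph:v19.2.2
# ceph orch upgrade status

checking 'ceph -s' and the upgrade status until it reports done before starting the next hop.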
[ceph-users] Re: 19.2.2: Warning, Smartctl has received an unknown argument (error code -22)
Anthony, Thanks. Yes to all. We see the same on all host servers, identical to below. Notice smartctl appears normal from the command line within the osd container.

On the bare metal:
# apt list smartmontools
Listing... Done
smartmontools/noble,now 7.4-2build1 amd64 [installed]
root@noc3

And then in the shell:
# cephadm shell --name osd.1
root@noc3:/# dnf list smartmontools
Installed Packages
smartmontools.x86_64 1:7.2-9.el9 @System
root@noc3:/# ls -l /dev/sd?
brw-rw 1 root disk 8, 0 May 30 14:03 /dev/sda
brw-rw 1 root disk 8, 16 May 30 14:04 /dev/sdb
brw-rw 1 root disk 8, 32 May 30 14:04 /dev/sdc
brw-rw 1 root disk 8, 48 May 30 14:04 /dev/sdd
brw-rw 1 root disk 8, 64 May 30 14:04 /dev/sde
brw-rw 1 root disk 8, 80 May 30 14:04 /dev/sdf
brw-rw 1 root disk 8, 96 May 30 14:04 /dev/sdg
brw-rw 1 root disk 8, 112 May 30 14:04 /dev/sdh
root@noc3:/# smartctl -j /dev/sdg
{
  "json_format_version": [ 1, 0 ],
  "smartctl": {
    "version": [ 7, 2 ],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-6.8.0-60-generic",
    "build_info": "(local build)",
    "argv": [ "smartctl", "-j", "/dev/sdg" ],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/sdg",
    "info_name": "/dev/sdg [SAT]",
    "type": "sat",
    "protocol": "ATA"
  }
}
root@noc3:/#

So, it's a puzzle.

On 5/30/25 19:12, Anthony D'Atri wrote:
Do you have 7.0+? That's when JSON output was added for Ceph. Are your drives natively visible to the kernel, not hidden behind a RAID HBA?

On May 30, 2025, at 6:53 PM, Harry G Coin wrote:
Using 19.2.2, we notice under cluster/osds/'device health' on the dashboard, for all osds no matter the server: Warning Smartctl has received an unknown argument (error code -22). You may be using an incompatible version of smartmontools. Version >= 7.0 of smartmontools is required to successfully retrieve data. That error code resolves to 'unknown attribute' in the smartctl docs. However, the same result occurs whether the drive is HGST, Seagate, or Western Digital. "State of Health" is always "Stale", and "Life Expectancy" is always "n/a> 6 weeks". Of course the 'diskprediction_local' module has been broken for over a year, as it requires a no-longer-distributed rev of a sub-package, but that shouldn't stop the smartctl command from operating normally. Any ideas? Thanks! Harry Coin

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
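If anyone wants to poke at the same thing, this is roughly how I'd compare what the mgr's device-health scraping gets versus a by-hand smartctl run. A sketch only; osd.1, /dev/sdg and the <devid> placeholder stand in for whatever your own cluster reports:

# Trigger a SMART scrape through a specific OSD and look at what was stored
ceph device scrape-daemon-health-metrics osd.1
ceph device ls
ceph device get-health-metrics <devid>
# For comparison, run smartctl with JSON output by hand inside that OSD's container
cephadm shell --name osd.1 -- smartctl -x --json=o /dev/sdg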
[ceph-users] 19.2.2: Warning, Smartctl has received an unknown argument (error code -22)
Using 19.2.2, we notice under cluster/osds/'device health' on the dashboard, for all osds no matter the server:

Warning Smartctl has received an unknown argument (error code -22). You may be using an incompatible version of smartmontools. Version >= 7.0 of smartmontools is required to successfully retrieve data.

That error code resolves to 'unknown attribute' in the smartctl docs. However, the same result occurs whether the drive is HGST, Seagate, or Western Digital. "State of Health" is always "Stale", and "Life Expectancy" is always "n/a> 6 weeks". Of course the 'diskprediction_local' module has been broken for over a year, as it requires a no-longer-distributed rev of a sub-package, but that shouldn't stop the smartctl command from operating normally. Any ideas? Thanks! Harry Coin

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: squid 19.2.3 QE validation status
Do the qualification tests check whether ipv6 osds are now 'found' by the dashboard/healthchecks? Or are they still reported as all missing while nevertheless working normally?

On 7/7/25 09:31, Yuri Weinstein wrote:
Seeking approvals/reviews for:
rados - Radek, Laura
rgw - Adam Emerson
fs - Venky
orch - Adam King approved
rbd, krbd - Ilya approved
quincy-x, reef-x - Laura, Neha (can we make it less noisy?)
crimson-rados - N/A
ceph-volume - Guillaume

On Thu, Jul 3, 2025 at 7:56 AM Yuri Weinstein wrote:
Corrected the subject line

On Thu, Jul 3, 2025 at 7:36 AM Yuri Weinstein wrote:
Details of this release are summarized here: https://tracker.ceph.com/issues/71912#note-1
Release Notes - TBD
LRC upgrade - TBD
Seeking approvals/reviews for:
rados - Radek, Laura
rgw - Adam Emerson
fs - Venky
orch - Adam King
rbd, krbd - Ilya
quincy-x, reef-x - Laura, Neha (can we make it less noisy?)
crimson-rados - N/A
ceph-volume - Guillaume
Pls let me know if any tests were missed from this list.

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: CephFS with Ldap
To get LDAP working, we had to set up Samba to manage the shares (it can do LDAP auth, connecting the SMB accounts to the Linux ownership/permission space). It would be a very nice help if Ceph included a native, secondary LDAP option, if only for anything doing file or block device sharing.

On 6/30/25 11:26, gagan tiwari wrote:
Hi Guys, We have an LDAP server with all users' login details. We have to mount data stored in Ceph on several client nodes via CephFS so that users can access that data and start using it in their processes. But we need to grant permissions / ownership to users so they can access that data, like chown user:group /dirs (on Linux). How will CephFS recognize users and groups that are in LDAP? Will I need to set up LDAP authentication on all nodes in the Ceph cluster for this purpose (ceph mgr, ceph mons, ceph mds and all osd nodes)? Please advise. Thanks, Gagan

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
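For what it's worth, CephFS itself stores only numeric uid/gid values; name-to-id resolution happens on the client, so the usual pattern is to point the client nodes' NSS/PAM stack (sssd or nslcd) at LDAP and leave the mon/mgr/mds/osd hosts out of it. A minimal sketch on a client node, assuming CephFS is already mounted at /mnt/cephfs and 'alice'/'research' are placeholder LDAP names:

# Confirm the client resolves LDAP users and groups through NSS (e.g. via sssd)
getent passwd alice
getent group research
# CephFS stores plain numeric uid:gid, so once names resolve, ordinary chown works
chown alice:research /mnt/cephfs/projects
# -n shows the numeric ids that are actually stored by CephFS
ls -ln /mnt/cephfs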
[ceph-users] Upgrade from 19.2.2 to .3 pauses on 'phantom' duplicate osd?
Need a clue about what appears to be a phantom duplicate osd, automagically created/discovered by the upgrade process -- which blocks the upgrade. The upgrade of a known-good 19.2.2 to 19.2.3 proceeded normally through the mgrs and mons. It upgraded most of the osds, then stopped with the complaint "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.1 on host noc3 failed."

The roster in the "Daemon Versions" table on the dashboard looks normal except: there are two entries for 'osd.1'. One of them has the correct version number, 19.2.2; the other is blank. The upgrade appears 'stuck'. An attempt to 'resume' resulted in the same error. The cluster operations are normal with all osds up and in. The cluster is ipv6. Oddly ceph -s reports:

root@noc1:~# ceph -s
  cluster:
    id: 406xxx0f8
    health: HEALTH_WARN
            Public/cluster network defined, but can not be found on any host
            Upgrading daemon osd.1 on host noc3 failed.
  services:
    mon: 5 daemons, quorum noc4,noc2,noc1,noc3,sysmon1 (age 39m)
    mgr: noc2.yhyuxd(active, since 4h), standbys: noc3.sybsfb, noc4.tvhgac, noc1.jtteqg
    mds: 1/1 daemons up, 3 standby
    osd: 27 osds: 27 up (since 3h), 27 in (since 10d)
  data:
    volumes: 1/1 healthy
    pools: 16 pools, 1809 pgs
    objects: 14.77M objects, 20 TiB
    usage: 52 TiB used, 58 TiB / 111 TiB avail
    pgs: 1808 active+clean
         1 active+clean+scrubbing
  io:
    client: 835 KiB/s rd, 1.0 MiB/s wr, 24 op/s rd, 105 op/s wr
  progress:
    Upgrade to 19.2.3 (4h)
      [] (remaining: 4h)

Related log entry:
29/7/25 02:40 PM [ERR] cephadm exited with an error code: 1, stderr: Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-4067126d-01cb-40af-824a-881c130140f8-osd-1
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error response from daemon: No such container: ceph-4067126dXXX40f8-osd-1
Non-zero exit code 1 from /usr/bin/docker container inspect --format {{.State.Status}} ceph-4067126dXXX40f8-osd.1
/usr/bin/docker: stdout
/usr/bin/docker: stderr Error response from daemon: No such container: ceph-4067126dXXX40f8-osd.1
Reconfig daemon osd.1 ...
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 5581, in
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 5569, in main
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 3051, in command_deploy_from
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 3086, in _common_deploy
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 3106, in _deploy_daemon_container
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 1077, in deploy_daemon
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/__main__.py", line 765, in create_daemon_dirs
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/var/lib/ceph/4067126dXXX40f8/cephadm.1a8853661a9c1798390b8e8d13c27688c1b1327a075745af2ee40ac466f0ac36/cephadmlib/file_utils.py", line 52, in write_new
IsADirectoryError: [Errno 21] Is a directory: '/var/lib/ceph/4067126dXXX40f8/osd.1/config.new' -> '/var/lib/ceph/4067126dXXX40f8/osd.1/config'

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1145, in _check_daemons
    self.mgr._daemon_action(daemon_spec, action=action)
  File "/usr/share/ceph/mgr/cephadm/module.py", line 2545, in _daemon_action
    return self.wait_async(
  File "/usr/share/ceph/mgr/cephadm/module.py", line 815, in wait_async
    return self.event_loop.get_result(coro, timeout)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 136, in get_result
    return future.result(timeout)
  File "/lib64/python3.9/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1381, in _create_daemon
    out, err, code = await self._run_cephadm(
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1724, in _run_cephadm
    raise OrchestratorError(
orchestrator._interface.OrchestratorError: cephadm exited
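No fix to report yet, but given the IsADirectoryError on osd.1's config path, these are the first things I'm checking. A sketch only, with <fsid> standing in for the cluster fsid that is redacted in the log above:

# On host noc3: is something unexpectedly a directory where cephadm wants to write a file?
ls -ld /var/lib/ceph/<fsid>/osd.1/config /var/lib/ceph/<fsid>/osd.1/config.new
# What cephadm thinks is deployed locally on this host
cephadm ls | grep -B2 -A8 '"osd.1"'
# What the orchestrator thinks is running on noc3 cluster-wide
ceph orch ps noc3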