Hej, 
I've been struggling with Ceph, kerberized NFSv4, and HA for a while, but now I 
seem to have a working solution with Ceph 18.2.4 ... currently in the stress 
testing phase. 
The ganesha servers are set up manually - no cephadm, no containers. HA is handled 
by keepalived. 
The current solution runs only one ganesha server at a time: keepalived switches 
the IP addresses and starts a new ganesha server on another host in case of a host 
or ganesha service failure. Failover and switchback work within seconds. 
Setup was a bit cumbersome and involved some trial and error: compiling matching 
versions of Ceph and Ganesha (6.5), and writing scripts for the keepalived state 
logic that monitor and start/stop the ganesha service. 
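To give a rough idea, the keepalived side boils down to something like the sketch 
below - illustrative only; the interface name, VIP, script paths and the 
nfs-ganesha unit name are placeholders, not my exact config: 

——— 
# /etc/keepalived/keepalived.conf (sketch) 
vrrp_script chk_ganesha { 
    script "/usr/local/bin/check_ganesha.sh" 
    interval 2 
    fall 2 
    rise 2 
} 

vrrp_instance NFS_VIP { 
    state BACKUP 
    interface eth0 
    virtual_router_id 51 
    priority 100 
    advert_int 1 
    virtual_ipaddress { 
        192.0.2.10/24 
    } 
    track_script { 
        chk_ganesha 
    } 
    # state-transition scripts make sure only one ganesha runs at a time 
    notify_master "/usr/local/bin/ganesha-notify.sh master" 
    notify_backup "/usr/local/bin/ganesha-notify.sh backup" 
    notify_fault  "/usr/local/bin/ganesha-notify.sh fault" 
} 

# /usr/local/bin/ganesha-notify.sh (sketch) 
#!/bin/sh 
# Called by keepalived on VRRP state transitions; $1 is the new state. 
case "$1" in 
  master) systemctl start nfs-ganesha ;; 
  backup|fault) systemctl stop nfs-ganesha ;; 
esac 

# /usr/local/bin/check_ganesha.sh (sketch) 
#!/bin/sh 
# Nodes without the VIP always pass; the VIP holder must have a running 
# ganesha, otherwise keepalived gives up mastership and fails over. 
ip -4 addr show dev eth0 | grep -q "192.0.2.10/" || exit 0 
systemctl is-active --quiet nfs-ganesha 
——— 

The notify script is what actually starts/stops ganesha, and the check script is 
what turns a dead ganesha on the active node into a failover. 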
This could probably be done much more elegantly ... work in progress ... 
Cheers, toBias 



From: "Devin A. Bougie" <devin.bou...@cornell.edu> 
To: "ceph-users" <ceph-users@ceph.io> 
Sent: Wednesday, 23 April, 2025 00:19:01 
Subject: [ceph-users] Help with HA NFS 

Hello, 

We’ve found that if we lose one of the nfs.cephfs service daemons in our 
cephadm 19.2.2 cluster, all NFS traffic is blocked until either: 
- the down nfs.cephfs daemon is restarted 
- or we reconfigure the placement of the nfs.cephfs service so that it no longer 
uses the affected host; after this, the ingress.nfs.cephfs service is 
automatically reconfigured and everything resumes. 

Our current setup follows the “HIGH-AVAILABILITY NFS” documentation, which 
gives us an ingress.nfs.cephfs service with the haproxy and keepalived daemons, 
and an nfs.cephfs service for the actual NFS daemons. These services were 
deployed using: 
ceph nfs cluster create cephfs "label:_admin" --ingress --virtual_ip virtual_ip 

And then we updated the ingress.nfs.cephfs service to deploy only a single 
instance (which, in this case, results in two daemons, haproxy and keepalived, 
on a single host). 

This gives us the following: 
——— 
[root@cephman1 ~]# ceph orch ls --service_name=ingress.nfs.cephfs --export 
service_type: ingress 
service_id: nfs.cephfs 
service_name: ingress.nfs.cephfs 
placement: 
  count: 1 
  label: _admin 
spec: 
  backend_service: nfs.cephfs 
  first_virtual_router_id: 50 
  frontend_port: 2049 
  monitor_port: 9049 
  virtual_ip: 128.84.45.48/22 

[root@cephman1 ~]# ceph orch ls --service_name=nfs.cephfs --export 
service_type: nfs 
service_id: cephfs 
service_name: nfs.cephfs 
placement: 
  label: _admin 
spec: 
  port: 12049 
——— 
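For reference, a placement change like the one above can be applied by exporting 
the spec, editing it, and re-applying it (the file name here is just 
illustrative): 

——— 
ceph orch ls --service_name=ingress.nfs.cephfs --export > ingress.nfs.cephfs.yaml 
# edit the placement section as needed, then re-apply: 
ceph orch apply -i ingress.nfs.cephfs.yaml 
——— 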

Can anyone show us the config for a true “HA” nfs service where they can lose 
any single host without impacting access to the NFS export from clients? I 
would expect to be able to lose the host running the ingress.nfs.cephfs 
service, and have it automatically restarted on a different host. Likewise, I 
would expect to be able to lose an nfs.cephfs daemon without impacting access to 
the export. 

Or should we take a completely different approach and move our NFS service 
out of Ceph and into our pacemaker / corosync cluster? 

Sorry if this sounds redundant with questions I’ve previously asked, but we’ve 
reconfigured things a little and it feels like we’re getting closer with each 
attempt. 

Many thanks, 
Devin 
