Hi, I've been struggling with Ceph and kerberized NFSv4 with HA for a while, but now seem to have a working solution with Ceph 18.2.4 ... right now in the stress-testing phase.

The ganesha servers are set up manually - no cephadm, no containers. HA is done with keepalived. The current solution works with only one ganesha server running at a time: keepalived switches IP addresses and starts a new ganesha server on another host in case of host or ganesha service failure. Failover and switchback work within seconds.

Setup was a bit cumbersome and trial-and-error: compiling matching versions of Ceph and Ganesha (6.5), writing scripts for the keepalived state logic to monitor and start/stop the ganesha services. It could probably be done much more elegantly ... work in progress ...

Cheers, toBias
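For the curious, the keepalived side of a setup like this can be sketched roughly as below. This is a simplified stand-in, not my actual config: the interface, VIP, password, and the check script path are placeholders, and /etc/keepalived/check_ganesha.sh is a hypothetical helper standing in for the state-logic scripts mentioned above.

```
# /etc/keepalived/keepalived.conf (sketch - interface, VIP, auth are placeholders)

vrrp_script chk_ganesha {
    # Hypothetical helper: must exit 0 if ganesha is healthy on this node,
    # OR if this node does not currently hold the VIP (so standby nodes,
    # where ganesha is intentionally stopped, don't go into FAULT).
    script "/etc/keepalived/check_ganesha.sh"
    interval 2
    fall 2
    rise 2
}

vrrp_instance NFS_VIP {
    state BACKUP              # all nodes start as BACKUP; priority elects the master
    interface eth0
    virtual_router_id 51
    priority 100              # set lower on the standby nodes
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass changeme
    }
    virtual_ipaddress {
        192.0.2.10/24
    }
    track_script {
        chk_ganesha
    }
    # Start/stop ganesha as this node gains/loses the VIP, so only
    # one ganesha instance runs at a time.
    notify_master "/usr/bin/systemctl start nfs-ganesha"
    notify_backup "/usr/bin/systemctl stop nfs-ganesha"
    notify_fault  "/usr/bin/systemctl stop nfs-ganesha"
}
```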
From: "Devin A. Bougie" <devin.bou...@cornell.edu>
To: "ceph-users" <ceph-users@ceph.io>
Sent: Wednesday, 23 April, 2025 00:19:01
Subject: [ceph-users] Help with HA NFS

Hello,

We've found that if we lose one of the nfs.cephfs service daemons in our cephadm 19.2.2 cluster, all NFS traffic is blocked until either:
- the down nfs.cephfs daemon is restarted
- or we reconfigure the placement of the nfs.cephfs service to not use the affected host.

After this, the ingress.nfs.cephfs service is automatically reconfigured and everything resumes.

Our current setup follows the "HIGH-AVAILABILITY NFS" documentation, which gives us an ingress.nfs.cephfs service with the haproxy and keepalived daemons and an nfs.cephfs service for the actual NFS daemons. This service was deployed using:

ceph nfs cluster create cephfs "label:_admin" --ingress --virtual_ip virtual_ip

And then we updated the ingress.nfs.cephfs service to only deploy a single service (which in this case results in two daemons on a single host). This gives us the following:

———
[root@cephman1 ~]# ceph orch ls --service_name=ingress.nfs.cephfs --export
service_type: ingress
service_id: nfs.cephfs
service_name: ingress.nfs.cephfs
placement:
  count: 1
  label: _admin
spec:
  backend_service: nfs.cephfs
  first_virtual_router_id: 50
  frontend_port: 2049
  monitor_port: 9049
  virtual_ip: 128.84.45.48/22

[root@cephman1 ~]# ceph orch ls --service_name=nfs.cephfs --export
service_type: nfs
service_id: cephfs
service_name: nfs.cephfs
placement:
  label: _admin
spec:
  port: 12049
———

Can anyone show us the config for a true "HA" NFS service where they can lose any single host without impacting access to the NFS export from clients? I would expect to be able to lose the host running the ingress.nfs.cephfs service and have it automatically restarted on a different host. Likewise, I would expect to be able to lose an nfs.cephfs daemon without impacting access to the export.
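For reference, before we pinned the ingress service to a single host, its spec looked roughly like the following (values illustrative; count: 2 is what cephadm deploys for ingress by default, putting haproxy/keepalived pairs on two hosts so the VIP can fail over between them):

```yaml
service_type: ingress
service_id: nfs.cephfs
placement:
  count: 2          # keepalived/haproxy on two hosts; VIP fails over via VRRP
  label: _admin
spec:
  backend_service: nfs.cephfs
  first_virtual_router_id: 50
  frontend_port: 2049
  monitor_port: 9049
  virtual_ip: 128.84.45.48/22
```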
Or should we be taking a completely different approach and move our NFS service out of Ceph and into our pacemaker/corosync cluster? Sorry if this sounds redundant to questions I've previously asked, but we've reconfigured things a little, and it feels like we're getting closer with each attempt.

Many thanks,
Devin