> On Oct 22, 2015, at 10:19 PM, John-Paul Robinson <j...@uab.edu> wrote:
> 
> A few clarifications on our experience:
> 
> * We have 200+ rbd images mounted on our RBD-NFS gateway.  (There's
> nothing easier for a user to understand than "your disk is full".)

Same here, and agreed. It sounds like our situations are similar except for my 
blocking on an apparently healthy cluster issue. 

> * I'd expect more contention potential with a single shared RBD back
> end, but with many distinct and presumably isolated backend RBD images,
> I've always been surprised that *all* the nfsd tasks hang.  This leads me
> to think it's an nfsd issue rather than an rbd issue.  (I realize this
> is an rbd list, looking for shared experience. ;) )

It's definitely possible. I've experienced exactly the behavior you're seeing. 
My guess is that when an nfsd thread blocks and goes dark, the affected clients 
(even if there's only one) retransmit their requests thinking there's a network 
issue, causing more nfsds to go dark until all the server threads are stuck 
(that could be hogwash, but it fits the behavior). Or perhaps there are enough 
individual clients writing to the affected NFS volume that they consume all the 
available nfsd threads (I'm not sure about your client-to-filesystem and 
nfsd-thread ratio, but that is plausible in my situation). I think some testing 
with xfs_freeze and a non-critical NFS server and clients is called for. 
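A quick way to test that hypothesis without risking production: freeze one non-critical XFS-backed export and watch whether nfsd threads pile up in uninterruptible sleep. A rough sketch (the export path is a placeholder, and the xfs_freeze lines need root, so they're shown commented out):

```shell
#!/bin/sh
# Sketch: simulate a blocked backend by freezing a NON-critical export.
EXPORT=/srv/nfs/testvol   # placeholder path; use a throwaway XFS export

# As root, freeze the filesystem, then have an NFS client write to it:
#   xfs_freeze -f "$EXPORT"

# Watch nfsd threads enter uninterruptible (D) sleep on the server;
# if the retransmit theory is right, the D-state count should climb
# toward the total nfsd thread count as clients retry.
HUNG=$(ps -eo stat,comm | awk '$1 ~ /D/ && $2 == "nfsd"' | wc -l)
echo "nfsd threads in D state: $HUNG"

# Thaw when done:
#   xfs_freeze -u "$EXPORT"
```

If all the threads go dark after a single frozen filesystem, that points at nfsd thread exhaustion rather than anything RBD-specific.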

I don't think this part is related to ceph except that it happens to be 
providing the underlying storage. I'm fairly certain that my problem with an 
apparently healthy cluster blocking writes is a ceph problem, but I haven't 
figured out what the source of it is. 
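For that case, it may be worth ruling out slow or blocked requests that the summary status can hide. A sketch of the checks I'd run (osd.0 is a placeholder; guarded so it's harmless on a machine without ceph installed):

```shell
#!/bin/sh
# Sketch: look for slow/blocked ops on an "apparently healthy" cluster.
if command -v ceph >/dev/null 2>&1; then
    # Overall status; look for "slow requests" in the output:
    ceph -s 2>/dev/null || echo "no reachable cluster"
    # Names the specific OSDs with blocked requests, if any:
    ceph health detail 2>/dev/null || true
    # Per-OSD view of ops stuck in flight (osd.0 is a placeholder):
    #   ceph daemon osd.0 dump_ops_in_flight
    result="checked"
else
    result="ceph not installed"
fi
echo "$result"
```

If `ceph health detail` is clean while writes are still blocked, that at least narrows it to the client/RBD side rather than the OSDs.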

> * I haven't seen any difference between reads and writes.  Any access to
> any backing RBD store from the NFS client hangs.

All NFS clients are hung, but in my situation it's usually only 1-3 local file 
systems that stop accepting writes. NFS is completely unresponsive, but local 
and remote (Samba) operations on the unaffected file systems are totally happy. 

I don't have a solution to the NFS issue, but I've seen it all too often. I 
wonder whether setting a huge number of threads and/or playing with client 
retransmit times would help, but I suspect this problem is just intrinsic to 
Linux NFS servers. 
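For the record, the two knobs I mean are roughly these (values are illustrative, the paths are the RHEL/CentOS-era defaults, and I can't promise either actually helps):

```
# Server: /etc/sysconfig/nfs -- raise the nfsd thread count well above
# the old default of 8 (128 here is illustrative, not a recommendation):
RPCNFSDCOUNT=128
# (or adjust at runtime: rpc.nfsd 128)

# Client: /etc/fstab -- lengthen the timeout before retransmit so a short
# stall doesn't snowball into a retry storm (timeo is in tenths of a
# second; these are placeholder values):
# server:/export  /mnt/export  nfs  hard,timeo=600,retrans=3  0 0
```

More threads just raises the ceiling before exhaustion, of course; it doesn't fix whatever blocks the first thread.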

Ryan
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com