Jianan Wang created HDFS-14555:
----------------------------------

             Summary: HDFS NFS gateway read Input/output error
                 Key: HDFS-14555
                 URL: https://issues.apache.org/jira/browse/HDFS-14555
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Jianan Wang


I have enabled the HDFS NFS gateway on our HDFS cluster by following the
official documentation. Everything works well except on one Ubuntu 16.04
server machine. The kernel version, `mount` output, and `sysctl -a` output
for that machine are shown below.

```
root@Linux:~$ uname -a
Linux xxx-server-001 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

root@Linux:~$ mount | grep hdfs
10.30.200.100:/ on /hdfs type nfs (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.30.200.100,mountvers=3,mountport=4242,mountproto=tcp,local_lock=all,addr=10.30.200.100)

root@Linux:~$ sysctl -a | grep nfs
fs.nfs.idmap_cache_timeout = 2
fs.nfs.nfs_callback_tcpport = 0
fs.nfs.nfs_congestion_kb = 259136
fs.nfs.nfs_mountpoint_timeout = 500
fs.nfs.nlm_grace_period = 0
fs.nfs.nlm_tcpport = 0
fs.nfs.nlm_timeout = 10
fs.nfs.nlm_udpport = 0
fs.nfs.nsm_local_state = 3
fs.nfs.nsm_use_hostnames = 0
sunrpc.nfs_debug = 0xffff
sunrpc.nfsd_debug = 0x0000
```
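
Since the same export mounts fine on other clients, it may help to compare the effective mount options machine by machine. The following is only a minimal sketch: the function names `nfs_mount_opts` and `diff_mount_opts` are made up for illustration, and it assumes the standard `/proc/mounts` layout (device, mount point, type, options).

```shell
#!/bin/sh
# Print the effective options of an NFS mount, one per line, sorted.
# $1: mount point (e.g. /hdfs)
# $2: mounts table to read (defaults to /proc/mounts)
nfs_mount_opts() {
    awk -v mp="$1" '$2 == mp && $3 ~ /^nfs/ { print $4 }' "${2:-/proc/mounts}" |
        tr ',' '\n' | LC_ALL=C sort
}

# Show the options that differ between two saved option lists.
# Column 1: only in the first file; column 2 (tab-indented): only in the second.
diff_mount_opts() {
    LC_ALL=C comm -3 "$1" "$2"
}
```

One possible use is to run `nfs_mount_opts /hdfs > opts.txt` on the failing server and on a working client, copy both files to one place, and pass them to `diff_mount_opts`.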

The symptoms are the following:

1. Running `ls` on folders under `/hdfs` that contain very few files works
without error, but it fails with `Input/output error` when the folder contains
many files (more than 100 or so).

2. After enabling NFS debugging through `sudo rpcdebug -m nfs -c all` on the
machine, I observed the following messages in `dmesg` whenever `ls` hit the
`Input/output error`. I have checked the source code [here][1], and it looks
like a buffer overflow issue. Does that mean it is an NFS bug in the kernel?
```
[2538707.003904] NFS: dentry_delete(1232344325/sss.123.txt, 4808cc)
[2538707.003907] NFS: decode_fattr3 prematurely hit the end of our receive buffer. Remaining buffer length is 0 words.
[2538707.003914] NFS: readdir(b200/095900) returns -5
```

3. Mounting the HDFS NFS gateway from other laptops or servers with
`sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync 10.30.200.100:/ /hdfs`
works without any issue, so the problem is probably not on the NFS gateway
server itself. I also tried installing the `4.15.0-46-generic` kernel on my
own laptop, but I could not reproduce the issue there.

4. The issue is not consistently reproducible: sometimes the listing succeeds
on the second or third retry right after the gateway is mounted. However, the
failure rate is above 90%, so the mount is still not usable.
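
To put a number on the failure rate mentioned in symptom 4, a loop along these lines could be used. This is a rough sketch rather than the exact procedure I followed: the function name `count_ls_failures` and the default of 20 attempts are arbitrary choices, and client-side caching of directory listings may make repeated attempts less independent than they look.

```shell
#!/bin/sh
# Run `ls` on a directory several times and report how many attempts failed.
# $1: directory to list (e.g. a large folder under /hdfs)
# $2: number of attempts (defaults to 20)
count_ls_failures() {
    dir=$1
    attempts=${2:-20}
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Any non-zero exit status from ls (including an I/O error) counts as a failure.
        ls "$dir" > /dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures/$attempts listings failed"
}
```

Running it once against a small folder and once against a folder with more than 100 files would show whether the failures track directory size as described in symptom 1.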

Please let me know if there is any direction in which I could debug this
weird situation. Thanks in advance!


 [1]: https://android.googlesource.com/kernel/msm/+/android-wear-5.1.1_r0.6/fs/nfs/nfs3xdr.c?autodive=0%2F%2F#125



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
