make number of IPC accepts configurable
---------------------------------------

                 Key: HADOOP-6308
                 URL: https://issues.apache.org/jira/browse/HADOOP-6308
             Project: Hadoop Common
          Issue Type: Improvement
          Components: ipc
    Affects Versions: 0.20.0
         Environment: Linux, running Yahoo-based 0.20
            Reporter: Andrew Ryan


We were recently seeing issues in our environment where HDFS clients would 
get TCP RSTs from the NN when making RPC calls to get file info, which would 
cause the task to fail fatally. After some debugging we traced this to the 
IPC server listen queue -- ipc.server.listen.queue.size -- being far too low. 
We had been using the default value of 128 and found we needed to raise it to 
10240 before the resets went away (although this value is a bit suspect, as I 
will explain later in the issue).
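
For reference, the listen queue depth comes straight from this parameter: the 
0.20 Server.java reads ipc.server.listen.queue.size in the Listener and passes 
it as the backlog argument when binding the accept socket. A simplified sketch 
(paraphrased from memory, not the exact code):

  import java.net.InetSocketAddress;
  import java.nio.channels.ServerSocketChannel;
  import org.apache.hadoop.conf.Configuration;

  // Paraphrase of org.apache.hadoop.ipc.Server.Listener setup (0.20-era):
  Configuration conf = new Configuration();
  int backlogLength = conf.getInt("ipc.server.listen.queue.size", 128);
  ServerSocketChannel acceptChannel = ServerSocketChannel.open();
  acceptChannel.configureBlocking(false);
  // The backlog is only a request: on Linux, listen(2) silently truncates
  // it to net.core.somaxconn (see item 3 below).
  acceptChannel.socket().bind(new InetSocketAddress(8020), backlogLength);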

When a large map job starts, many clients begin issuing RPC requests to the 
namenode almost simultaneously, and the listen queue fills up because clients 
are opening connections faster than Hadoop's RPC server can accept them. We 
went back to our 0.17 cluster, instrumented it with tcpdump, and found that we 
had been sending RSTs there for a long time as well; but retry handling was 
implemented differently back in 0.17, so a single TCP failure wasn't 
task-fatal.

In our environment the TCP stack is set to explicitly send resets when the 
listen queue overflows (sysctl net.ipv4.tcp_abort_on_overflow = 1); the 
default Linux behavior is to drop SYN packets and let the client retransmit. 
Other people may be hitting this issue without noticing it, because under the 
default behavior the NN silently drops packets on the floor and clients 
retransmit.
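
To see the two failure modes side by side, here is a small standalone Java 
demo (our own illustration, not Hadoop code). It listens with a tiny backlog, 
never calls accept(), and opens client connections until the listen queue 
overflows. With tcp_abort_on_overflow = 1 the overflowing connects fail fast 
with a reset; with the default of 0 they sit in SYN retransmission until the 
connect timeout fires. Exact counts depend on kernel internals, so treat it 
purely as an illustration:

  import java.net.ConnectException;
  import java.net.InetSocketAddress;
  import java.net.ServerSocket;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  public class BacklogDemo {
    public static void main(String[] args) throws Exception {
      // Tiny backlog, and we never call accept(), so connections pile up
      // in the kernel's listen queue until it overflows.
      ServerSocket server = new ServerSocket(0, 1);
      int port = server.getLocalPort();

      for (int i = 1; i <= 20; i++) {
        Socket s = new Socket();
        try {
          // Short timeout: if the kernel silently drops our SYN
          // (tcp_abort_on_overflow = 0), we time out here.
          s.connect(new InetSocketAddress("127.0.0.1", port), 500);
          System.out.println("connect " + i + ": queued by kernel");
        } catch (ConnectException e) {
          // tcp_abort_on_overflow = 1: the kernel answered with RST.
          System.out.println("connect " + i + ": reset (" + e.getMessage() + ")");
        } catch (SocketTimeoutException e) {
          System.out.println("connect " + i + ": SYN dropped, timed out");
        }
      }
      server.close();
    }
  }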

So we've identified (at least) 3 improvements that can be made here:
1) In src/core/org/apache/hadoop/ipc/Server.java, Listener.doAccept() is 
currently hardcoded to do 10 accept()'s at a time before it starts reading. It 
would be better to let the server be configured to do more than 10 accepts per 
pass via a configurable parameter, keeping 10 as the default (see the sketch 
after this list).
2) Increase the default value of ipc.server.listen.queue.size from 128, or at 
least document that people with larger clusters starting thousands of mappers 
at once should increase this value. I suspect a lot of people running larger 
clusters are dropping packets and don't realize it because TCP is covering it 
up. On one hand, yay TCP; on the other hand, those are needless delays and 
retries when the server could handle more connections.
3) Document that ipc.server.listen.queue.size may be capped by SOMAXCONN 
(Linux sysctl net.core.somaxconn; default 4096 on our systems). On Linux, 
listen(2) silently truncates the requested backlog to net.core.somaxconn, so 
values above that limit have no effect unless somaxconn is raised as well -- 
which is probably why our working value of 10240 is suspect: the effective 
backlog was likely just our somaxconn of 4096. The Java docs are not 
completely clear about this, and it's difficult to test because you can't 
query the backlog of a listening socket. We were under some time pressure and 
tried 1024, which was not enough, and 10240, which worked, so we stuck with 
that.
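
For item 1, here is a minimal sketch of the proposed change to 
Listener.doAccept(). The config key ipc.server.max.accepts.per.select below is 
a made-up name used purely for illustration, and the surrounding Server class 
(with its imports) is elided:

  // Hypothetical: set once in the Server constructor, e.g.
  //   maxAcceptsPerSelect =
  //       conf.getInt("ipc.server.max.accepts.per.select", 10);
  private int maxAcceptsPerSelect;

  void doAccept(SelectionKey key) throws IOException {
    ServerSocketChannel server = (ServerSocketChannel) key.channel();
    // Was a hardcoded "i < 10"; now bounded by the configured value,
    // with a default of 10 so existing behavior is unchanged.
    for (int i = 0; i < maxAcceptsPerSelect; i++) {
      SocketChannel channel = server.accept();
      if (channel == null) {
        return; // listen queue drained
      }
      channel.configureBlocking(false);
      // ... hand the connection off to a reader thread, as the
      // current code does ...
    }
  }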

