Haoze Wu created HDFS-15869:
-------------------------------
Summary: Network issue while FSEditLogAsync is executing
RpcEdit.logSyncNotify can cause the namenode to hang
Key: HDFS-15869
URL: https://issues.apache.org/jira/browse/HDFS-15869
Project: Hadoop HDFS
Issue Type: Bug
Components: fs async, namenode
Affects Versions: 3.2.2
Reporter: Haoze Wu
*Description*
We were doing some testing of the latest Hadoop stable release 3.2.2 and
found some network issue can cause the namenode to hang even with the async
edit logging (FSEditLogAsync). We found that this issue is similar to
HDFS-15486, but we have a more comprehensive study here.
The workflow of the FSEditLogAsync thread is basically:
# get EditLog from a queue (line 229)
# do the transaction (line 232)
# sync the log if doSync (line 243)
# do logSyncNotify (line 248)
```
//hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
@Override
public void run() {
try {
while (true) {
boolean doSync;
Edit edit = dequeueEdit(); //
line 229
if (edit != null) {
// sync if requested by edit log.
doSync = edit.logEdit(); //
line 232
syncWaitQ.add(edit);
} else {
// sync when editq runs dry, but have edits pending a sync.
doSync = !syncWaitQ.isEmpty();
}
if (doSync) {
// normally edit log exceptions cause the NN to terminate, but tests
// relying on ExitUtil.terminate need to see the exception.
RuntimeException syncEx = null;
try {
logSync(getLastWrittenTxId()); // line
243
} catch (RuntimeException ex) {
syncEx = ex;
}
while ((edit = syncWaitQ.poll()) != null) {
edit.logSyncNotify(syncEx); //
line 248
}
}
}
} catch (InterruptedException ie) {
LOG.info(Thread.currentThread().getName() + " was interrupted, exiting");
} catch (Throwable t) {
terminate(t);
}
}
```
In terms of the step 4, FSEditLogAsync$RpcEdit.logSyncNotify is essentially
doing some network write (line 365).
```
//hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java
private static class RpcEdit extends Edit {
//…
@Override
public void logSyncNotify(RuntimeException syncEx) {
try {
if (syncEx == null) {
call.sendResponse(); // line 365
} else {
call.abortResponse(syncEx);
}
} catch (Exception e) {} // don't care if not sent.
}
//...
}
```
If the sendResponse operation in line 365 gets stuck, then the whole
FSEditLogAsync thread is not able to proceed. In this case, the critical
logSync (line 243) can’t be executed, for the incoming transactions. Then the
namenode hangs. This is undesirable because FSEditLogAsync’s key feature is
asynchronous edit logging that is supposed to tolerate slow I/O.
To see why the sendResponse operation in line 365 may get stuck, here is
the stack trace:
```
'(org.apache.hadoop.ipc.Server,channelWrite,3593)',
'(org.apache.hadoop.ipc.Server,access$1700,139)',
'(org.apache.hadoop.ipc.Server$Responder,processResponse,1657)',
'(org.apache.hadoop.ipc.Server$Responder,doRespond,1727)',
'(org.apache.hadoop.ipc.Server$Connection,sendResponse,2828)',
'(org.apache.hadoop.ipc.Server$Connection,access$300,1799)',
'(org.apache.hadoop.ipc.Server$RpcCall,doResponse,1111)',
'(org.apache.hadoop.ipc.Server$Call,doResponse,903)',
'(org.apache.hadoop.ipc.Server$Call,sendResponse,889)',
'(org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync$RpcEdit,logSyncNotify,365)',
'(org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync,run,248)',
'(java.lang.Thread,run,748)'
```
The `channelWrite` function is defined as follows:
```
//hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
private int channelWrite(WritableByteChannel channel,
ByteBuffer buffer) throws IOException {
int count = (buffer.remaining() <= NIO_BUFFER_LIMIT) ?
channel.write(buffer) : channelIO(null, channel, buffer);
// line 3594
if (count > 0) {
rpcMetrics.incrSentBytes(count);
}
return count;
}
```
The `channel.write(buffer)` operation in line 3594 may be slow. Although
for this specific stack trace, the channel is initialized in the non-blocking
mode, there is still a chance of being slow depending on native write
implementation in the OS (e.g., a kernel issue). Furthermore, the channelIO
invocation in line 3594 may also get stuck, since it waits until the buffer is
drained:
```
//hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Server.java
private static int channelIO(...) throws IOException {
...
while (buf.remaining() > 0) {
// line 3637
try {
int ioSize = Math.min(buf.remaining(), NIO_BUFFER_LIMIT);
buf.limit(buf.position() + ioSize);
ret = (readCh == null) ? writeCh.write(buf) : readCh.read(buf); //
line 3642
if (ret < ioSize) {
break;
}
} finally {
buf.limit(originalLimit);
}
}
int nBytes = initialRemaining - buf.remaining();
return (nBytes > 0) ? nBytes : ret;
}
```
For example, if the payload is split in two batches, the second batch will
have to wait for the first batch to be sent out, which may encounter high
packet loss rate and thus slow I/O.
In summary, FSEditLogAsync is a critical service, but the potential delay
occurring in `edit.logSyncNotify(syncEx)` can block the FSEditLogAsync thread.
According to the comment
([https://github.com/apache/hadoop/blob/rel/release-3.2.2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSEditLogAsync.java#L369]),
“don't care if not sent”, FSEditLogAsync thread in fact does not really need
to be concerned with the sync notification.
*Reproduction*
To show that the potential delay introduced by
`FSEditLogAsync$RpcEdit.logSyncNotify` can cause the namenode to hang, we
provide the scripts to reproduce the bug.
The script basically blocks the FSEditLogAsync thread when it’s invoking
`call.sendResponse()` in `FSEditLogAsync$RpcEdit#logSyncNotify`. Our
reproduction scripts guarantee that the delay is injected once and only once.
The reproduction script is provided in a gist ...
*Fix*
Since the logSyncNotify is not a critical operation for `FSEditLogAsync`,
we propose to put the `FSEditLogAsync$RpcEdit.logSyncNotify` invocation to a
separate thread in FSEditLogAsync. In this way, even if the notifications get
stuck, they will not affect the edit logging.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]