I have a Spark Streaming app (1.5.2 compile for hadoop 2.6) that occasionally
has problem writing its checkpoint file. This is a YARN (yarn cluster) app
running as user mapred.
What I see in my streaming app logs are:
App log
App log
15/12/15 00:14:08 server51:
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
00:14:08 INFO hdfs.DFSClient: Could not complete
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00022-attempt-0
retrying...
15/12/15 00:14:20 server51:
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
00:14:20 INFO hdfs.DFSClient: Could not complete
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00025-attempt-0
retrying...
15/12/15 00:14:24 server51:
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
00:14:24 INFO hdfs.DFSClient: Could not complete
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00025-attempt-0
retrying...
15/12/15 00:14:50 server51:
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
00:14:50 INFO hdfs.DFSClient: Could not complete
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00046-attempt-0
retrying...
15/12/15 00:14:55 server51:
/mnt/hdfs/hdfs01/data4/yarn-log/application_1450104983109_0273/container_1450104983109_0273_01_000004/stderr:15/12/15
00:14:55 INFO hdfs.DFSClient: Could not complete
/user/hadoop/yarn-app/session_svc/checkpoint/420fb269-490f-41fe-94a9-7c60be7a0931/rdd-665/.part-00046-attempt-0
retrying…
server51 = xx.xx.xx.199
In the name node logs during that same time, I see a bunch of these messages:
[2015-12-15T00:14:08.433Z,INFO,org.apache.hadoop.hdfs.server.namenode.FSNamesystem:,BLOCK*
checkFileProgress: blk_2161610531_1087887542{blockUCState=COMMITTED,
truncateBlock=null, primaryNodeIndex=-1,
replicas=[ReplicaUnderConstruction[[DISK]DS-45109dfc-c3e5-4fd7-a6de-9cec1db8e301:NORMAL:xx.xx.xx.138:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-61c84f08-e038-442d-8f64-238a9bbd8803:NORMAL:xx.xx.xx.199:50010|RBW],
ReplicaUnderConstruction[[DISK]DS-bc8b67eb-2715-4300-bc44-7ba5b54e80c8:NORMAL:xx.xx.xx.119:50010|RBW]]}
has not reached minimal replication 1]
If I filter those messages out, the remaining messages during that time frame
from xx.xx.xx.199 are:
[2015-12-15T00:00:08.449Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler
17 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease
from xx.xx.xx.199:46602 Call#2746 Retry#0]
[2015-12-15T00:10:29.609Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler
47 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from
xx.xx.xx.199:59576 Call#38454 Retry#0:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=yarn, access=EXECUTE, inode="/app-logs/mapred/logs":mapred:hdfs:drwxrwx---]
[2015-12-15T00:13:24.556Z,INFO,org.apache.hadoop.ipc.Server:,Socket Reader #1
for port 8020: readAndProcess from client xx.xx.xx.199 threw exception
[java.io.IOException: Connection reset by peer]]
[2015-12-15T00:13:24.673Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler
75 on 8020: skipped org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo
from xx.xx.xx.199:59984 Call#4765 Retry#0]
[2015-12-15T00:13:24.683Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler
70 on 8020: skipped
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from
xx.xx.xx.199:59984 Call#4766 Retry#0]
[2015-12-15T00:13:24.699Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler
64 on 8020: skipped
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from
xx.xx.xx.199:59984 Call#4767 Retry#0]
[2015-12-15T00:13:24.721Z,INFO,org.apache.hadoop.ipc.Server:,IPC Server handler
57 on 8020: skipped
org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from
xx.xx.xx.199:59984 Call#4768 Retry#0]
[2015-12-15T00:13:42.641Z,INFO,org.apache.hadoop.ipc.Server:,Socket Reader #1
for port 8020: readAndProcess from client xx.xx.xx.199 threw exception
[java.io.IOException: Connection reset by peer]]
The user mapred can see the directory - but it is an empty directory (e.g.,
hdfs dfs -ls /app–logs/mapred/logs/application_1450104983109_0273/ returns
nothing)
The message about execute permission denied seems off, I am wondering if some
configuration is dialed down too low and maybe the permission error is just a
red herring.
Has anyone else run into a problem where the checkpoint has IOExceptions
running as a YARN app?
Thanks,
Robert