I have been running into this as well, but I am using S3 for checkpointing, so I chalked it up to network partitioning and the fact that my storage location is S3 rather than HDFS. Since you appear to be on HDFS, though, I wonder if there is another underlying issue.
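For context, here is a minimal sketch of how the checkpoint directory is typically set for a Spark Streaming job; the paths and app name below are placeholders, not values from this thread:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("checkpoint-location-example")
val ssc = new StreamingContext(conf, Seconds(10))

// HDFS-backed checkpointing (what it sounds like you are using):
ssc.checkpoint("hdfs:///user/example/checkpoints")

// S3-backed checkpointing (my setup); S3's consistency and rename semantics
// differ from HDFS, which is why I first suspected the storage layer.
// ssc.checkpoint("s3a://example-bucket/checkpoints")
```

Note that the missing file in your trace is a local shuffle file under the NodeManager's usercache, not a checkpoint file, so the checkpoint store may not be the culprit in either of our cases.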
On Wed, Mar 28, 2018 at 8:21 AM, Jone Zhang <joyoungzh...@gmail.com> wrote:
> The Spark Streaming job ran for a few days, then failed as below.
> What is the possible reason?
>
> 18/03/25 07:58:37 ERROR yarn.ApplicationMaster: User class threw
> exception: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 16 in stage 80018.0 failed 4 times, most recent failure: Lost
> task 16.3 in stage 80018.0 (TID 7318859, 10.196.155.153):
> java.io.FileNotFoundException:
> /data/hadoop_tmp/nm-local-dir/usercache/mqq/appcache/application_1521712903594_6152/blockmgr-7aa2fb13-25d8-4145-a704-7861adfae4ec/22/shuffle_40009_16_0.data.574b45e8-bafd-437d-8fbf-deb6e3a1d001
> (No such file or directory)
>
> Thanks!

--
Lucas Kacher
Senior Engineer - vsco.co <https://www.vsco.co/>
New York, NY
818.512.5239