And if it helps, I'm running on Flink 1.2.1. I saw this ticket: https://issues.apache.org/jira/browse/FLINK-5828

The failure only started happening when I ran all 50 flows at the same time. However, it looks like the problem is not with creating the cache directory but with running out of space there? But what's in there is also tiny:
bash-4.1$ hdfs dfs -du -h hdfs://d191291/user/delp/.flink/application_1510733430616_2098853
1.1 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/5c71e4b6-2567-4d34-98dc-73b29c502736-taskmanager-conf.yaml
1.4 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-conf.yaml
93.5 M   hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/flink-dist_2.10-1.2.1.jar
264.8 M  hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/lib
1.9 K    hdfs://d191291/user/delp/.flink/application_1510733430616_2098853/log4j.properties

From: Chan, Regina [Tech]
Sent: Tuesday, December 12, 2017 1:56 AM
To: 'user@flink.apache.org'
Subject: ProgramInvocationException: Could not upload the jar files to the job manager / No space left on device

Hi,

I'm currently submitting 50 separate jobs to a setup with 50 TaskManagers of 1 slot each. Each job has a parallelism of 1. There's plenty of space left in my cluster and on that node, so it's not clear to me what's happening. Any pointers?

On the client side, when I try to execute, I see the following:

org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Could not upload the jar files to the job manager.
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
    at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
    at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
    at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
    at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.execute(FlowData.java:143)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.flowPartialIngestionHalf(FlowData.java:107)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:72)
    at com.gs.ep.da.lake.refinerlib.flink.FlowData.call(FlowData.java:39)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Could not upload the jar files to the job manager.
    at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:150)
    at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.IOException: Could not retrieve the JobManager's blob port.
    at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:745)
    at org.apache.flink.runtime.jobgraph.JobGraph.uploadUserJars(JobGraph.java:565)
    at org.apache.flink.runtime.client.JobSubmissionClientActor$1.call(JobSubmissionClientActor.java:148)
    ... 9 more
Caused by: java.io.IOException: PUT operation failed: Connection reset
    at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:512)
    at org.apache.flink.runtime.blob.BlobClient.put(BlobClient.java:374)
    at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:771)
    at org.apache.flink.runtime.blob.BlobClient.uploadJarFiles(BlobClient.java:740)
    ... 11 more
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.flink.runtime.blob.BlobClient.putInputStream(BlobClient.java:499)
    ... 14 more

On the job manager logs I see this:

2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.apache.flink.runtime.blob.BlobServerConnection.put(BlobServerConnection.java:314)
    at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:113)
2017-12-12 01:42:47,608 ERROR org.apache.flink.runtime.blob.BlobServerConnection - PUT operation failed
java.io.IOException: No space left on device

Regina Chan
Goldman Sachs - Enterprise Platforms, Data Architecture
30 Hudson Street, 37th floor | Jersey City, NY 07302
(212) 902-5697
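
The job manager trace above fails inside java.io.FileOutputStream, which writes to the local filesystem of the JobManager host, not to HDFS, so the hdfs dfs -du listing may not be measuring the directory that filled up. Below is a minimal sketch of a check that could be run on the JobManager node. It assumes the blob server stores uploads in the directory set by blob.storage.directory, or in java.io.tmpdir when that key is unset; that should be verified for this deployment, since on YARN the temp directory may be a container-local one. The class name, the method name, and the 100 MB jar size are hypothetical placeholders, not values from this thread.

```java
import java.io.File;

public class BlobDirSpaceCheck {

    /** True if usableBytes can hold `uploads` files of `bytesPerUpload` bytes each. */
    static boolean hasRoomFor(long usableBytes, int uploads, long bytesPerUpload) {
        long needed = Math.multiplyExact((long) uploads, bytesPerUpload);
        return usableBytes >= needed;
    }

    public static void main(String[] args) {
        // Assumption: pass the blob storage directory as the first argument;
        // otherwise fall back to the JVM temp directory.
        String path = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        File dir = new File(path);

        // getUsableSpace() reports bytes available to this JVM on the partition
        // holding `dir`; it returns 0 if the path does not name a partition.
        long usable = dir.getUsableSpace();

        int uploads = 50;                          // one upload per submitted job
        long bytesPerUpload = 100L * 1024 * 1024;  // hypothetical 100 MB user jar

        System.out.println("Directory:    " + dir.getAbsolutePath());
        System.out.println("Usable bytes: " + usable);
        System.out.println("Room for " + uploads + " uploads: "
                + hasRoomFor(usable, uploads, bytesPerUpload));
    }
}
```

This only looks at free bytes. "No space left on device" is also what the OS reports when a partition runs out of inodes, which this sketch does not detect; df -i on the same directory would cover that case.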