It would also help if you could send us the DEBUG logs of the run, Mark, including the logs from the client, because they contain information about which timestamp is used for the upload. One more question that could help pinpoint the problem: does the problem start occurring with Flink 1.10.0? My suspicion is that we might have broken something with the second PR for FLINK-8801 [1]. It looks like we no longer try to set the local timestamp via FileSystem.setTimes if we cannot fetch the remote timestamp. However, this should only be a problem for eventually consistent filesystems.

[1] https://issues.apache.org/jira/browse/FLINK-8801
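To make the suspicion concrete, here is a rough sketch of the alignment logic I mean (a hypothetical helper, not the actual Flink code; FileSystem.getFileStatus and FileSystem.setTimes are the real Hadoop APIs involved):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: after uploading a job resource, the client must
// register a modification timestamp with YARN that matches what the
// NodeManager will later observe on the remote filesystem.
static long resolveRemoteTimestamp(FileSystem fs, Path remote, long localMtime)
    throws IOException {
  try {
    // Preferred path: read the timestamp back from the remote filesystem
    // and register that value with YARN.
    return fs.getFileStatus(remote).getModificationTime();
  } catch (IOException e) {
    // Fallback (the behavior that may have been lost): stamp the remote
    // file with the locally known timestamp via FileSystem.setTimes, so
    // the value registered with YARN matches the remote state.
    fs.setTimes(remote, localMtime, -1); // atime -1 leaves access time unchanged
    return localMtime;
  }
}

If the fallback branch was dropped, the timestamp registered with YARN could differ from what the NodeManager later observes, which would produce exactly the mismatch reported below.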
Cheers,
Till

On Mon, Jan 18, 2021 at 11:04 AM Xintong Song <tonysong...@gmail.com> wrote:

> Hi Mark,
>
> Two quick questions that might help us understand what's going on.
> - Does this error happen for every one of your DataSet jobs? For a
> problematic job, does it happen for every container?
> - What is the `jobs.jar`? Is it under `lib/` or `opt/` of your client-side
> filesystem, or is it specified via `yarn.ship-files`, `yarn.ship-archives`
> or `yarn.provided.lib.dirs`? This helps us locate the code path that this
> file went through.
>
> Thank you~
>
> Xintong Song
>
>
> On Sun, Jan 17, 2021 at 10:32 PM Mark Davis <moda...@protonmail.com>
> wrote:
>
>> Hi all,
>>
>> I am upgrading my DataSet jobs from Flink 1.8 to 1.12.
>> After the upgrade I started to receive errors like this one:
>>
>> 14:12:57,441 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager - Worker container_e120_1608377880203_0751_01_000112 is terminated.
>> Diagnostics: Resource hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar changed on src filesystem (expected 1610892446439, was 1610892446971
>> java.io.IOException: Resource hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar changed on src filesystem (expected 1610892446439, was 1610892446971
>>     at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:257)
>>     at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>>     at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>>     at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>>     at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:228)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:221)
>>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:209)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>>
>> I understand it is somehow related to FLINK-12195, but this time the
>> error comes from the Hadoop code. I am running a very old version of
>> the HDP platform, v2.6.5, so it might be the one to blame.
>> But the code was working perfectly fine before the upgrade, so I am
>> confused. Could you please advise?
>>
>> Thank you!
>> Mark
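For context, the check that throws this exception lives in YARN's resource localization (the FSDownload frames in the stack trace above). A simplified paraphrase of that check, not the verbatim Hadoop source:

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;

// Paraphrased sketch of YARN's localization check: `resource` carries the
// timestamp the client registered at submission time, `sCopy` is the remote
// file being localized for the container.
static void verifyUnchanged(FileSystem sourceFs, Path sCopy, LocalResource resource)
    throws IOException {
  FileStatus sStat = sourceFs.getFileStatus(sCopy);
  if (sStat.getModificationTime() != resource.getTimestamp()) {
    throw new IOException("Resource " + sCopy
        + " changed on src filesystem (expected " + resource.getTimestamp()
        + ", was " + sStat.getModificationTime());
  }
}

In other words, the NodeManager rejects the download whenever the modification time it sees on HDFS does not exactly match the timestamp the client recorded, which is why the timestamp handling at upload time matters.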