It would also help if you could send us the DEBUG logs of the run, Mark,
including the logs from the client, because they contain information about
which timestamp is used for the upload. One more question which could help
pinpoint the problem: does the problem start occurring with Flink
1.10.0? My suspicion is that we might have broken something with the second
PR for FLINK-8801 [1]. It looks like we no longer try to set the
local timestamp via FileSystem.setTimes if we cannot fetch the remote
timestamp. However, this should only be a problem for eventually consistent
filesystems.

[1] https://issues.apache.org/jira/browse/FLINK-8801
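For context, the check that produces the "changed on src filesystem" error comes from YARN's resource localization: the modification time recorded when the resource was registered with the application must exactly match the one the NodeManager observes when it downloads the file. A minimal, self-contained sketch of that comparison (hypothetical illustration, not Hadoop's actual FSDownload code):

```java
import java.io.IOException;

public class ResourceCheckSketch {
    // Hypothetical sketch of the verification YARN's FSDownload performs
    // during container localization: the timestamp recorded at submission
    // ("expected") must exactly equal the modification time currently seen
    // on the source filesystem ("actual"); any mismatch aborts the download.
    static void verifyTimestamp(String resource, long expected, long actual)
            throws IOException {
        if (expected != actual) {
            throw new IOException("Resource " + resource
                    + " changed on src filesystem (expected " + expected
                    + ", was " + actual + ")");
        }
    }
}
```

This is why a client that fails to stamp the uploaded file with the timestamp it registers (e.g. because reading the remote timestamp back failed) can trigger the error even though the jar contents never changed.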

Cheers,
Till

On Mon, Jan 18, 2021 at 11:04 AM Xintong Song <tonysong...@gmail.com> wrote:

> Hi Mark,
>
> Two quick questions that might help us understand what's going on.
> - Does this error happen for every one of your DataSet jobs? For a problematic
> job, does it happen for every container?
> - What is the `jobs.jar`? Is it under `lib/` or `opt/` on your client-side
> filesystem, or specified via `yarn.ship-files`, `yarn.ship-archives` or
> `yarn.provided.lib.dirs`? This helps us locate the code path that this
> file went through.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Sun, Jan 17, 2021 at 10:32 PM Mark Davis <moda...@protonmail.com>
> wrote:
>
>> Hi all,
>> I am upgrading my DataSet jobs from Flink 1.8 to 1.12.
>> After the upgrade I started to receive the errors like this one:
>>
>> 14:12:57,441 INFO
>> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager  -
>> Worker container_e120_1608377880203_0751_01_000112 is terminated.
>> Diagnostics: Resource
>> hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar
>> changed on src filesystem (expected 1610892446439, was 1610892446971
>> java.io.IOException: Resource
>> hdfs://bigdata/user/hadoop/.flink/application_1608377880203_0751/jobs.jar
>> changed on src filesystem (expected 1610892446439, was 1610892446971
>>         at
>> org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:257)
>>         at
>> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:63)
>>         at
>> org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:361)
>>         at
>> org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:422)
>>         at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>>         at
>> org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:359)
>>         at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:228)
>>         at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:221)
>>         at
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:209)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>>
>> I understand it is somehow related to FLINK-12195, but this time the error
>> comes from the Hadoop code. I am running a very old version of the HDP
>> platform (v2.6.5), so it might be the one to blame.
>> But the code was working perfectly fine before the upgrade, so I am
>> confused.
>> Could you please advise?
>>
>> Thank you!
>>   Mark
>>
>