[ https://issues.apache.org/jira/browse/FLINK-9144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16435876#comment-16435876 ]
ASF GitHub Bot commented on FLINK-9144:
---------------------------------------

GitHub user NicoK opened a pull request:

    https://github.com/apache/flink/pull/5842

    [FLINK-9144][network] fix SpillableSubpartition causing jobs to hang when spilling

    ## What is the purpose of the change

    This should fix various scenarios where the backlog accounting in the `SpillableSubpartition` was wrong during spilling buffers and where empty buffers have been spilled (unnecessarily).

    ## Brief change log

    - improve logging in `SpillableSubpartition` and a minor optimisation when getting a `MemorySegment`
    - make sure that backlog accounting is right in `SpillableSubpartition#spillFinishedBufferConsumers()` in various cases:
    -- empty buffers
    -- buffers being spilled multiple times (currently code led to buffers being spilled twice, once empty, and then the final contents)
    - do not spill empty buffers
    - always spill all buffers when finishing a stream

    ## Verifying this change

    This change added tests and can be verified as follows:

    - adapt `SpillableSubpartitionTest` to cover more spilling scenarios, also with partial buffers and with the same pattern that the `RecordWriter` is actually using it
    - adapt the batch e2e tests to cover the scenario for this fix (it was blocking)

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): **no**
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no**
    - The serializers: **no**
    - The runtime per-record code paths (performance sensitive): **no** (per-buffer only)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no**
    - The S3 file system connector: **no**

    ## Documentation

    - Does this pull request introduce a new feature? **no**
    - If yes, how is the feature documented? **JavaDocs**

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/NicoK/flink flink-9144

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5842.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5842

----
commit 9463553d7aae31fb54099cf91234472af3eff3db
Author: Nico Kruber <nico@...>
Date:   2018-04-06T17:35:44Z

    [hotfix][checkstyle] fix warnings in LocalBufferPool

commit 01c27d091715a4d2672d430d42cd79947458d999
Author: Nico Kruber <nico@...>
Date:   2018-04-06T17:36:15Z

    [hotfix][network] extend logging message in SpillableSubpartition

commit 8f6bc82722076718a75dc5234545bc7fbd917dbf
Author: Nico Kruber <nico@...>
Date:   2018-04-06T17:36:33Z

    [hotfix][network] minor optimisation in LocalBufferPool

commit b759456b8f16d17b7918da4df7398c28205793ab
Author: Nico Kruber <nico@...>
Date:   2018-04-06T17:34:44Z

    [FLINK-9144][network] fix SpillableSubpartition causing jobs to hang when spilling

----

> Spilling batch job hangs
> ------------------------
>
>                 Key: FLINK-9144
>                 URL: https://issues.apache.org/jira/browse/FLINK-9144
>             Project: Flink
>          Issue Type: Bug
>          Components: Network
>   Affects Versions: 1.5.0, 1.6.0
>            Reporter: Nico Kruber
>            Assignee: Nico Kruber
>            Priority: Blocker
>             Fix For: 1.5.0
>
>
> A user on the mailing list reported that his batch job stops to run with
> Flink 1.5 RC1:
> https://lists.apache.org/thread.html/43721934405019e7255fda627afb7c9c4ed0d04fb47f1c8f346d4194@%3Cdev.flink.apache.org%3E

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)