[ https://issues.apache.org/jira/browse/FLINK-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377099#comment-16377099 ]
ASF GitHub Bot commented on FLINK-8755: --------------------------------------- GitHub user NicoK opened a pull request: https://github.com/apache/flink/pull/5581 [FLINK-8755][FLINK-8786][network] fix two bugs in spilled and spillable subpartition views ## What is the purpose of the change 1) `SpilledSubpartitionView#getNextBuffer()` relies on the backlog to signal further data availability. However, if there are only events left in the buffer queue, their buffers are not included in the backlog counting and therefore, `isMoreAvailable` will be wrongly false here. 2) When processing the last in-memory buffer in `SpillableSubpartitionView#getNextBuffer`, we always set the `isMoreAvailable` flag of the returned `BufferAndBacklog` to false irrespective of what may be in the spill writer. This PR fixes both issues and heavily extends the unit tests in this regard, hence the two were combined in a single PR. Please also note that this PR is built upon #5549, #5550, #5551, and #5572 to reduce possible merge conflicts - everything starting after FLINK-8694 is new. ## Brief change log - rename `RecordWriter#closeBufferConsumer()` to `closeBufferBuilder()` (internal method, we switched to buffer builders a while ago) - make `AwaitableBufferAvailablityListener` (used by tests only) thread-safe - fix `SpilledSubpartitionView#getNextBuffer()` to not only rely on the backlog - fix `SpillableSubpartitionView#getNextBuffer()` returning wrong `isMoreAvailable` when processing the last in-memory buffer - extended overall subpartition tests to also verify several other flags that were added in the past but not covered appropriately, e.g. `BufferAndBacklog#isMoreAvailable()` or `ResultSubpartitionView#isAvailable()` - some minor code and documentation improvements (more details in the individual commits) ## Verifying this change This change added tests and can be verified as follows: - added several checks to `PipelinedSubpartitionTest` and `SpillableSubpartitionTest` via helper methods in `SubpartitionTestBase` ## Does this pull request potentially affect one of the following parts: - Dependencies (does it add or upgrade a dependency): **no** - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: **no** - The serializers: **no** - The runtime per-record code paths (performance sensitive): **no** - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: **no** - The S3 file system connector: **no** ## Documentation - Does this pull request introduce a new feature? (yes / no) - If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) You can merge this pull request into a Git repository by running: $ git pull https://github.com/NicoK/flink flink-8786 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/5581.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5581 ---- commit 40e18c85563e1ef45ce89709fa3aa7613439e12d Author: Nico Kruber <nico@...> Date: 2018-02-20T17:04:12Z [FLINK-8733][network] fix SpillableSubpartition#spillFinishedBufferConsumers() not counting spilled bytes commit 2721cabce0dd2be4bb4da4097ff4e6c7749498c1 Author: Nico Kruber <nico@...> Date: 2018-02-20T17:05:54Z [FLINK-8734][network] fix partition bytes counting and re-enable in tests commit 375de6118a5d84d21b40b7c23438d09204ad664b Author: Nico Kruber <nico@...> Date: 2018-02-20T17:06:41Z [hotfix][network] remove PowerMockRunner from RecordWriterTest commit d0d3c7b026d5af3a57c47892501ab0e74e7172b2 Author: Nico Kruber <nico@...> Date: 2018-02-20T17:07:02Z [hotfix][network] various minor improvements commit dbcfe73c41618c70e16884f2c723fc9a6a9dca4f Author: Nico Kruber <nico@...> Date: 2018-02-21T15:30:53Z [hotfix][network] initialize SingleInputGate#enqueuedInputChannelsWithData with the right size commit 13f2e09240b9efb8163bb93dad52486fc2af65ac Author: Nico Kruber <nico@...> Date: 2018-02-21T16:09:31Z [FLINK-8736][network] fix memory segment offsets for slices of slices being wrong commit f8363154ef2e03b99a471f627028ad50fc1271ab Author: Nico Kruber <nico@...> Date: 2018-02-23T17:07:31Z fixup! [hotfix][network] various minor improvements commit 8cf861f06ddc9c79fc61407ebe426213d1740ef7 Author: Piotr Nowojski <piotr.nowojski@...> Date: 2018-02-23T10:20:21Z [FLINK-8760][runtime] Correctly propagate moreAvailable flag through SingleInputGate Previously if we SingleInputGate was re-eqnqueuing an input channel, isMoreAvailable might incorrectly return false. This might caused some dead locks. commit c7cda5463e7bba1d2f3f62006f6e4a71246efccb Author: Piotr Nowojski <piotr.nowojski@...> Date: 2018-02-23T10:37:37Z [hotfix][tests] Deduplicate code in SingleInputGateTest commit 954c019a70cf881de7410762977e033e2768c11e Author: Piotr Nowojski <piotr.nowojski@...> Date: 2018-02-23T11:11:14Z [hotfix][runtime] Remove duplicated check commit 55832a9ac1355e48fc9665967074ebc262da2f65 Author: Piotr Nowojski <piotr.nowojski@...> Date: 2018-02-23T10:27:54Z [hotfixu][tests] Do not hide original exception in SuccessAfterNetworkBuffersFailureITCase commit b6d98e99dd5cbf7cc0554cd83b81f3d2621e0057 Author: Piotr Nowojski <piotr.nowojski@...> Date: 2018-02-23T10:28:20Z [FLINK-8694][runtime] Fix notifyDataAvailable race condition Before there was a race condition that might resulted in igonoring some notifyDataAvailable calls. This fixes the problem by moving buffersAvailable handling to Supartitions and adds stress test for flushAlways (without this fix this test is dead locking). commit d65ffe94414e8bc4a6457955c8ca89bf68f537cb Author: Nico Kruber <nico@...> Date: 2018-02-26T14:07:49Z fixup! [FLINK-8694][runtime] Fix notifyDataAvailable race condition commit 26508abe76d7436c13b8415cd0411560cb31f4d4 Author: Nico Kruber <nico@...> Date: 2018-02-26T15:39:01Z fixup! [FLINK-8694][runtime] Fix notifyDataAvailable race condition commit 00e9b252d218932dda3daf28f95cc8fd2e34cac8 Author: Nico Kruber <nico@...> Date: 2018-02-22T13:11:13Z [hotfix][network] rename RecordWriter#closeBufferConsumer() to closeBufferBuilder() commit 5ecd6f7215e164e6971cd305f16974639fccc7a3 Author: Nico Kruber <nico@...> Date: 2018-02-22T13:17:06Z [hotfix][network] various minor improvements commit e3cff13a7785fe61236a735bb37a0fd4345fe13b Author: Nico Kruber <nico@...> Date: 2018-02-23T09:35:41Z [hotfix][network][tests] make AwaitableBufferAvailablityListener thread-safe This is called asynchronously by the spill writer and thus may need synchronization on incrementing the counter but definately had visibility issues with the counter. Using an AtomicLong fixes that. commit 8a8470a6225bbdf8e0c962ba0d9bbbd75354d309 Author: Nico Kruber <nico@...> Date: 2018-02-23T09:19:58Z [FLINK-8755][network] fix SpilledSubpartitionView relying on the backlog for determining whether more data is available Fix SpilledSubpartitionView#getNextBuffer() to not only rely on the backlog: instead it is sufficient to also return true if the next buffer is an event since either there is a real buffer enqueued (reflected by the backlog) or at least one event. commit 96dd3ffe5cbb6581c2bc80c86853c7249973c15e Author: Nico Kruber <nico@...> Date: 2018-02-23T11:13:20Z [FLINK-8755][FLINK-8786][network] add and improve subpartition tests + also improve the subpartition tests in general to reduce some duplication commit 603b1562a6ad257376057a1a3fc507604d83ffcd Author: Nico Kruber <nico@...> Date: 2018-02-26T15:27:44Z [FLINK-8786][network] fix SpillableSubpartitionView#getNextBuffer returning wrong isMoreAvailable when processing last in-memory buffer When processing the last in-memory buffer in SpillableSubpartitionView#getNextBuffer while the rest of the buffers are spilled, need to rely on the spilled view's isAvailable instead of always setting the isMoreAvailable flag of the returned BufferAndBacklog to false. ---- > SpilledSubpartitionView wrongly relys on the backlog for determining whether > more data is available > --------------------------------------------------------------------------------------------------- > > Key: FLINK-8755 > URL: https://issues.apache.org/jira/browse/FLINK-8755 > Project: Flink > Issue Type: Sub-task > Components: Network > Reporter: Nico Kruber > Assignee: Nico Kruber > Priority: Blocker > Fix For: 1.5.0 > > > {code} > public BufferAndBacklog getNextBuffer() throws IOException, > InterruptedException { > //... > int newBacklog = parent.decreaseBuffersInBacklog(current); > return new BufferAndBacklog(current, newBacklog > 0, newBacklog, > nextBufferIsEvent); > {code} > relies on the backlog to signal further data availability. However, if there > are only events left in the buffer queue, their buffers are not included in > the backlog counting and therefore, {{isMoreAvailable}} will be wrongly > {{false}} here. -- This message was sent by Atlassian JIRA (v7.6.3#76005)