Hi,
looks like I iterated to a solution [1]. The change should be minimal; there seem to be no (relevant) changes needed in core. Almost everything is located in the code of the FlinkRunner. There is still something weird going on, which probably signals a bug somewhere. Without this statement [2] the test fails with the already mentioned exception about a "PCollection being consumed but never produced".
Could anyone help with both reviewing and possibly suggesting what could
be causing the exception?
Jan
[1] https://github.com/apache/beam/pull/15370
[2]
https://github.com/apache/beam/pull/15370/files#diff-b1ec58edff6c096481ff336f6fc96e7ba5bcb740dff56c72606ff4f8f0bf85f3R111
On 8/20/21 5:27 PM, Jan Lukavský wrote:
Hi,
I've tried to build a better understanding of what is really happening and how; could someone validate my line of thinking?
a) under normal circumstances an ExecutableStage has two pieces - a runner side and an SDK side; data is passed between the two over a gRPC channel. The runner side is not supposed to understand the data, and therefore runners-core-construction-java replaces the coders used for passing data between the SDK harness and the runner with LengthPrefixCoder(ByteArrayCoder) - that means the data is passed as opaque bytes (see the small sketch after this list)
b) the proto representation of the Pipeline contains the actual coders, without specifying how the data should be passed between the SDK harness and the runner (which seems correct - only the runner knows the environment, and it is therefore the runner's duty to build the actual wire coders)
c) because of that, there are utility classes that inject
LengthPrefixCoder where appropriate - most of the code is in
WireCoders, but unfortunately ProcessBundleDescriptors does some work
in this regard as well
d) the problem arises when a runner decides to inline a PTransform that was supposed to be an ExecutableStage and run it within the context of the runner - which is the case of Flink's primitive Read. In that case the coder the runner uses to encode the PCollection on the output of Read and the coder used when it is then consumed by a (non-inlined) ExecutableStage do not match.
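To illustrate point a), here is a minimal self-contained sketch (plain SDK coder classes only, nothing from the PR) showing why the runner can treat the payload as opaque bytes while the SDK harness uses the real element coder:

import org.apache.beam.sdk.coders.ByteArrayCoder;
import org.apache.beam.sdk.coders.LengthPrefixCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.util.CoderUtils;

public class OpaqueWireCoderSketch {
  public static void main(String[] args) throws Exception {
    // SDK-harness side: knows the element type, uses LengthPrefixCoder(StringUtf8Coder).
    byte[] onTheWire =
        CoderUtils.encodeToByteArray(
            LengthPrefixCoder.of(StringUtf8Coder.of()), "hello");

    // Runner side: does not know the element coder, decodes the very same bytes
    // with LengthPrefixCoder(ByteArrayCoder) and gets the payload as opaque bytes
    // it can route without understanding them.
    byte[] opaque =
        CoderUtils.decodeFromByteArray(
            LengthPrefixCoder.of(ByteArrayCoder.of()), onTheWire);

    System.out.println(
        new String(opaque, java.nio.charset.StandardCharsets.UTF_8)); // prints "hello"
  }
}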
I tried to modify ModelCoders, or to patch LengthPrefixUnknownCoders.addLengthPrefixedCoder, so that it can handle the case when both ends (SDK and runner) are Java, but I always hit an issue somewhere. I think that is because the decision of which "wire coder" to use is in this case no longer a function of the pair (coderId, SDK-side or runner-side), but of the triple (coderId, producer side, consumer side). That is, if the collection is both produced and consumed in the runner environment, the coder should be different from the one used when it is produced in the runner and consumed in the SDK harness.
Another option seems to be that when a PCollection is produced directly in the runner, the runner should wrap its coder using LengthPrefixCoder (unless the coder used is already a model coder), which is what I'll try next. I'd be grateful if someone verified that I understand the problem correctly and that the solution with LengthPrefixCoder on the output of Read should work. The solution is somewhat suboptimal regarding performance, because it wraps the coder with LengthPrefixCoder even in the case where all coders on the way are known and the length prefix therefore should not be needed. But I think we could live with that for now, at least until some finer control of the input/output coders of ExecutableStage is introduced.
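To make the idea concrete, a rough sketch of what I want to try (the helper name is mine, and I'm quoting ModelCoders.urns() and LengthPrefixUnknownCoders.addLengthPrefixedCoder from memory, so please treat the exact signatures as an assumption rather than the final patch):

import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.core.construction.ModelCoders;
import org.apache.beam.runners.core.construction.graph.LengthPrefixUnknownCoders;

/**
 * Sketch: pick the coder id to use for a PCollection produced directly by the
 * runner (e.g. the output of primitive Read), wrapping it in LengthPrefixCoder
 * unless it already is a model coder.
 */
static String wireCoderForRunnerProducedPCollection(
    String coderId, RunnerApi.Components.Builder components) {
  String urn = components.getCodersOrThrow(coderId).getSpec().getUrn();
  if (ModelCoders.urns().contains(urn)) {
    // Model coders are understood on both sides; no prefixing needed.
    return coderId;
  }
  // Keep the original component coder ('false' = do not replace it with
  // ByteArrayCoder) and only add the length prefix, so that a downstream
  // (non-inlined) ExecutableStage can treat the payload as opaque bytes.
  return LengthPrefixUnknownCoders.addLengthPrefixedCoder(coderId, components, false);
}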
Thanks for any thoughts on this!
Jan
On 8/1/21 8:33 PM, Jan Lukavský wrote:
Hi,
I have figured out another way of fixing the problem without
modifying ModelCoders. It consists of creating a
JavaSDKCoderTranslatorRegistrar [1] and fixing
LengthPrefixUnknownCoders [2]. Would this be a better approach?
Jan
[1]
https://github.com/apache/beam/pull/15181/files#diff-e4df94a4e799e14a76ada42506aacb8cb7567c84349acacd6126c64ed03de062R27
[2]
https://github.com/apache/beam/pull/15181/files#diff-64103a1eabf2872230e5df56cf02d535c4146f5a3f67c51c261433e4caa9a972R63
On 7/29/21 7:54 PM, Jan Lukavský wrote:
On 7/29/21 6:45 PM, Robert Bradshaw wrote:
On Thu, Jul 29, 2021 at 3:04 AM Jan Lukavský <je...@seznam.cz> wrote:
Hi,
I'd like to move the discussion of this topic further. Because it seems that fixing portable SDF is a larger piece of work, I think there are two options:
+1
a) extend the definition of model coders to include SDK coders of the language that implements the model (that would mean that the definition of a model coder is not "language-agnostic coder" but "coder that a given SDK can instantiate"), or
b) make the model coders extensible so that a runner can modify them - that would make it possible for each runner to have a slightly different definition of these model coders
I'm strongly in favor of a), but I can live with b) as well.
We should probably just rename "ModelCoders" to
"JavaCoders[Registrar]" and stick everything there. ModelCoders is not
understood or used by anything but Java. (That or we just discard the
whole ModelCoders thing and just let Coders define their own portable
representations, possibly with a registration system.)
Coders must be Serializable, so it seems to me that all Java Coders are quite easily serialized and a registration is not strictly needed for that. Renaming ModelCoders to Java(Portable)Coders looks good to me.
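For illustration, this is roughly what I mean by "easily serialized" - a sketch using SerializableUtils from the Java SDK (the roundTrip helper is just for the example):

import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.util.SerializableUtils;

class CoderSerializationSketch {
  // Any Java Coder (including user-defined ones) can travel as plain bytes,
  // because Coder implements Serializable; no per-coder registration is needed.
  static Coder<?> roundTrip(Coder<?> original) {
    byte[] payload = SerializableUtils.serializeToByteArray(original);
    return (Coder<?>) SerializableUtils.deserializeFromByteArray(payload, "coder");
  }

  public static void main(String[] args) {
    Coder<?> rehydrated = roundTrip(KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of()));
    System.out.println(rehydrated); // an equivalent KvCoder instance
  }
}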
Thanks in advance for any comments on this.
Jan
On 7/25/21 8:59 PM, Jan Lukavský wrote:
I didn't want to say that Flink should not support SDF. I just don't see any benefit of it for a native streaming source - like Kafka - without the ability to use dynamic splitting. The potential benefits of composability and extensibility do not apply here. Yes, it would be good to have as low a number of source transforms as possible. And another yes, there probably isn't anything that would fundamentally prevent Flink from correctly supporting SDF. On the other hand, the current state is such that we cannot use KafkaIO in Flink. I think we should fix this by the shortest possible path, because the technically correct solution is currently unknown (at least to me - if anyone can give pointers about how to fix the SDF, I'd be grateful).
I still think that enabling a runner to support Read natively,
when appropriate, has value by itself. And it requires SDK Coders
to be 'known' to the runner, at least that was the result of my
tests.
On 7/25/21 8:31 PM, Chamikara Jayalath wrote:
On Sun, Jul 25, 2021 at 11:09 AM Jan Lukavský <je...@seznam.cz>
wrote:
In general, language-neutral APIs and protocols are a key feature
of portable Beam.
Yes, sure, that is well understood. But - language-neutral APIs require a language-neutral environment. That is why the portable Pipeline representation is built around protocol buffers and gRPC. That is truly language-neutral. Once we implement something around that - as in the case of ModelCoders.java - we use a specific language for it, and the language-neutral part is already gone. The decision to include same-language SDK coders in such a language-specific object plays no role in the fact that it already is language-specific.
Not all runners are implemented using Java. For example, the
portable DirectRunner (FnAPI runner) is implemented using Python
and Dataflow is implemented using C++. Such runners will not be
able to do this.
Yes, I'm aware of that, and that is why I said "any Java native runner". It is true that non-Java runners *must* (as long as we don't include Read in the SDK harness) resort to expanding it to SDF. That is why use_deprecated_read is an invalid setting for such a runner and should be handled accordingly.
Similarly, I think there were previous discussions related to
using SDF as the source framework for portable runners.
Don't get me wrong, I'm not trying to revoke this decision. On the other hand I still think that the decision whether or not to use the SDF implementation of Read should be left to the runner.
I understand that there are some bugs related to SDF and portable
Flink currently. How much work do you think is needed here ? Will
it be better to focus our efforts on fixing remaining issues for
SDF and portable runners instead of supporting
"use_deprecated_read" for that path ?
I'm not sure. I don't know portability and the SDK harness well enough to be able to answer this. But we should really know why we do that. What exactly does SDF bring to the Flink runner (and let's leave Flink aside here - what does it bring to runners that cannot make use of dynamic splitting, admittedly a very cool feature)? Yes, supporting Java Read makes it impossible to implement it in Python. But practically, I think that most Pipelines will use x-lang for that. It makes very much sense to offload IOs to a more performant environment.
A bit old, but please see the following for the benefits of SDF
and the motivation for it.
https://beam.apache.org/blog/splittable-do-fn/
https://s.apache.org/splittable-do-fn
Thanks,
Cham
Jan
On 7/25/21 6:54 PM, Chamikara Jayalath wrote:
On Sun, Jul 25, 2021 at 6:33 AM Jan Lukavský <je...@seznam.cz>
wrote:
I'll start from the end.
I don't think we should be breaking language agnostic API layers
(for example, definition of model coders) just to support
"use_deprecated_read".
"Breaking" and "fixing" can only be a matter of the definition
of the object at hand. I don't think, that Coder can be totally
language agnostic - yes, the mapping between serialized form and
deserialized form can be _defined_ in a language agnostic way,
but must be_implemented_ in a specific language. If we choose
the implementing language, what makes us treat SDK-specific
coders defined by the SDK of the same language as "unknown"? It
is only our decision, that seems to have no practical benefits.
In general, language-neutral APIs and protocols are a key feature
of portable Beam. See here:
https://beam.apache.org/roadmap/portability/
(I did not look into all the old discussions and votes related to
this but I'm sure they are there)
Moreover, including SDK-specific coders in the supported coders of the SDK's runner-construction counterpart (that is, runners-core-construction-java for the Java SDK) is a necessary prerequisite for unifying "classical" and "portable" runners, because the runner needs to understand *all* SDK coders so that it can _inline_ the complete Pipeline (if the Pipeline SDK has the same language as the runner) instead of running it through the SDK harness. This need is therefore not specific to supporting use_deprecated_read; it is a generic requirement, which merely has its first manifestation in the support of a transform that the SDK harness does not support.
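To make "understand" concrete: a small sketch of the prerequisite for inlining, assuming I remember the RehydratedComponents API correctly (the helper name is mine):

import java.io.IOException;
import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.core.construction.RehydratedComponents;
import org.apache.beam.sdk.coders.Coder;

// The runner can only inline a transform if this succeeds for every PCollection
// the transform touches, i.e. if it can instantiate the actual Java SDK coder
// instead of treating the elements as opaque bytes behind a gRPC channel.
static Coder<?> instantiateSdkCoder(RunnerApi.Pipeline pipeline, String coderId)
    throws IOException {
  return RehydratedComponents.forComponents(pipeline.getComponents()).getCoder(coderId);
}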
I think "use_deprecated_read" should be considered a stop-gap
measure for Flink (and Spark ?) till we have proper support for
SDF. In fact I don't think an arbitrary portable runner can
support "use_deprecated_read" due to the following.
There seems to be nothing special about Flink regarding the
support of primitive Read. I think any Java native runner can
implement it pretty much the same way as Flink does. The
question is if any other runner might want to do that. The
problem with Flink is that
Not all runners are implemented using Java. For example, the
portable DirectRunner (FnAPI runner) is implemented using Python
and Dataflow is implemented using C++. Such runners will not be
able to do this.
1) portable SDF seems not to work [1]
2) even the classical Flink runner still has issues with SDF - there are reports of the watermark being stuck when reading data via SDF, which gets resolved by using use_deprecated_read
3) Flink actually gains nothing from SDF, because it cannot make use of dynamic splitting, so this actually brings only implementation burden without any practical benefit
Similarly, I think there were previous discussions related to
using SDF as the source framework for portable runners.
I understand that there are some bugs related to SDF and portable
Flink currently. How much work do you think is needed here ? Will
it be better to focus our efforts on fixing remaining issues for
SDF and portable runners instead of supporting
"use_deprecated_read" for that path ? Note that I'm fine with
fixing any issues related to "use_deprecated_read" for classic
(non-portable) Flink but I think you are trying to use x-lang
hence probably need portable Flink.
Thanks,
Cham
I think we should revisit the decision to deprecate Read - if we can implement it via SDF, what is the reason to forbid a runner from making use of a simpler implementation? The expansion of Read might be runner-dependent; that is something we do all the time, or am I missing something?
Jan
[1] https://issues.apache.org/jira/browse/BEAM-10940
On 7/25/21 1:38 AM, Chamikara Jayalath wrote:
I think we might be going down a bit of a rabbit hole with the
support for "use_deprecated_read" for portable Flink :)
I think "use_deprecated_read" should be considered a stop-gap
measure for Flink (and Spark ?) till we have proper support for
SDF. In fact I don't think an arbitrary portable runner can
support "use_deprecated_read" due to the following.
(1) The SDK Harness is not aware of BoundedSource/UnboundedSource. The only source framework the SDK Harness is aware of is SDF.
(2) Invoking BoundedSource/UnboundedSource is not a part of the Fn API.
(3) A non-Java Beam portable runner will probably not be able to directly invoke legacy Read transforms the way Flink does today.
I don't think we should be breaking language agnostic API layers
(for example, definition of model coders) just to support
"use_deprecated_read".
Thanks,
Cham
On Sat, Jul 24, 2021 at 11:50 AM Jan Lukavský <je...@seznam.cz>
wrote:
On 7/24/21 12:34 AM, Robert Bradshaw wrote:
On Thu, Jul 22, 2021 at 10:20 AM Jan Lukavský
<je...@seznam.cz> wrote:
Hi,
this was a ride. But I managed to get it working. I'd like to discuss two points, though:
a) I had to push Java coders to ModelCoders for Java (which makes sense to me, but is that correct?). See [1]. It is needed so that the Read transform (executed directly in the TaskManager) can correctly communicate with the Java SDK harness using custom coders (which is tested here [2]).
I think the intent was that ModelCoders represent the set of language-agnostic coders in the model, though I have to admit I've always been a bit fuzzy on when a coder must or must not be in that list.
I think that this definition works as long as the runner does not itself interfere with the Pipeline. Once the runner starts producing data by itself (not via SdkHarnessClient), it becomes part of the environment, and therefore it should understand its own Coders. I'd propose the definition of "model coders" to be the Coders that the SDK is able to understand, which then works naturally for the ModelCoders located in "core-construction-java": it should understand Java SDK Coders.
b) I'd strongly prefer if we moved the handling of
use_deprecated_read from outside of the Read PTransform
directly into expand method, see [3]. Though this is not
needed for the Read on Flink to work, it seems cleaner.
WDYT?
The default value of use_deprecated_read should depend on the runner (e.g. some runners don't work well with it, others require it). As such, it should not be visible to the PTransform's expand.
I think we should know what the expected outcome is. If a runner does not support primitive Read (and therefore use_deprecated_read), what should we do if such an experiment is set? Should the Pipeline fail, or should it be silently ignored? I think that we should fail, because the user expects something that cannot be fulfilled. Therefore, we have two options - handling the experiment explicitly in runners that do not support it, or handling it explicitly in all cases (both supported and unsupported). The latter case is when we force runners to call an explicit conversion method (convertPrimitiveRead....). Every runner that does not support primitive Read must handle the experiment either way, because otherwise the experiment would simply be silently ignored, which is not exactly user-friendly.
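To be concrete, the explicit handling I have in mind could be as simple as this sketch (validateUseDeprecatedRead and the runnerSupportsPrimitiveRead flag are placeholders for the example, not existing code):

import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;

static void validateUseDeprecatedRead(
    PipelineOptions options, boolean runnerSupportsPrimitiveRead) {
  if (ExperimentalOptions.hasExperiment(options, "use_deprecated_read")
      && !runnerSupportsPrimitiveRead) {
    // Fail loudly instead of silently ignoring what the user asked for.
    throw new IllegalArgumentException(
        "This runner cannot execute primitive Read; remove the use_deprecated_read experiment.");
  }
}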
Jan
[1]
https://github.com/apache/beam/pull/15181/commits/394ddc3fdbaacc805d8f7ce02ad2698953f34375
[2]
https://github.com/apache/beam/pull/15181/files#diff-b1ec58edff6c096481ff336f6fc96e7ba5bcb740dff56c72606ff4f8f0bf85f3R201
[3]
https://github.com/apache/beam/pull/15181/commits/f1d3fd0217e5513995a72e92f68fe3d1d665c5bb
On 7/18/21 6:29 PM, Jan Lukavský wrote:
Hi,
I was debugging the issue and it relates to pipeline fusion - it seems that the primitive Read transform gets fused and is then 'missing' as a source. I'm a little lost in the code, but the strangest parts are that:
a) I tried to reject fusion of the primitive Read by adding GreedyPCollectionFusers::cannotFuse for PTransformTranslation.READ_TRANSFORM_URN to GreedyPCollectionFusers.URN_FUSIBILITY_CHECKERS, but that didn't change the exception
b) I tried adding Reshuffle.viaRandomKey between Read and PAssert, but that didn't change it either
c) when I run the portable Pipeline with use_deprecated_read on Flink, it actually runs (though it fails once it actually reads any data; if the input is empty, the job runs), so it does not hit the same issue, which is a mystery to me
If anyone has any pointers that I can investigate, I'd be
really grateful.
Thanks in advance,
Jan
On 7/16/21 2:00 PM, Jan Lukavský wrote:
Hi,
I hit another issue with the portable Flink runner. Long story short - reading from Kafka is not working in portable Flink. I first had to solve issues with the expansion service configuration (the ability to add the use_deprecated_read option), because the portable Flink runner has issues with SDF [1], [2]. After being able to inject use_deprecated_read into the expansion service I was able to get an execution DAG that has the UnboundedSource, but then more and more issues appeared (probably related to a missing LengthPrefixCoder somewhere - maybe at the output of the primitive Read). I wanted to create a test for it and found out that there actually is a ReadSourcePortableTest in FlinkRunner, but _it tests nothing_. The problem is that Read is transformed to SDF, so this test tests the SDF, not the Read transform - and as a result, the fact that the Read transform does not work goes unnoticed.
I tried using convertReadBasedSplittableDoFnsToPrimitiveReads so that I could make the test fail and debug it, but I ran into
java.lang.IllegalArgumentException: PCollectionNodes
[PCollectionNode{id=PAssert$0/GroupGlobally/ParDo(ToSingletonIterables)/ParMultiDo(ToSingletonIterables).output,
PCollection=unique_name:
"PAssert$0/GroupGlobally/ParDo(ToSingletonIterables)/ParMultiDo(ToSingletonIterables).output"
coder_id: "IterableCoder"
is_bounded: BOUNDED
windowing_strategy_id: "WindowingStrategy(GlobalWindows)"
}] were consumed but never produced
which gave me the last knock-out. :)
My current impression is that starting from Beam 2.25.0,
portable FlinkRunner is not able to read from Kafka. Could
someone give me a hint about what is wrong with using
convertReadBasedSplittableDoFnsToPrimitiveReads in the test [3]?
Jan
[1] https://issues.apache.org/jira/browse/BEAM-11991
[2] https://issues.apache.org/jira/browse/BEAM-11998
[3] https://github.com/apache/beam/pull/15181