Hi all,

I also agree with what's been said above.
+1, I think the Table API delegation is a good suggestion: it essentially
allows a connector to get Python support for free. We've seen that the
Table/SQL and Python APIs complement each other well and are ideal for
data scientists; a rough sketch of what the delegation could look like
is below.

With respect to unaligned functionality, I think that concern also holds
between other APIs, e.g. DataStream and Table/SQL, since some
functionality is not natural to represent as configuration/SQL.
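For concreteness, a minimal sketch of that delegation as it can already
be written today (the Kafka table options are only an example and assume
the flink-sql-connector-kafka jar is on the classpath):

    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import StreamTableEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(env)

    # The connector is declared through the Table API; only key/value
    # options cross the language boundary, not a hand-written wrapper.
    t_env.execute_sql("""
        CREATE TABLE orders (
            order_id STRING,
            amount DOUBLE
        ) WITH (
            'connector' = 'kafka',
            'topic' = 'orders',
            'properties.bootstrap.servers' = 'localhost:9092',
            'properties.group.id' = 'demo',
            'scan.startup.mode' = 'earliest-offset',
            'format' = 'json'
        )
    """)

    # Bridge into the Python DataStream API where DataStream semantics
    # are needed.
    ds = t_env.to_data_stream(t_env.from_path("orders"))
    ds.print()
    env.execute("table-api-delegation-sketch")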
Best,
Mason

On Wed, Jul 5, 2023 at 10:14 PM Dian Fu <dian0511...@gmail.com> wrote:

> Hi Chesnay,
>
> >> The wrapping of connectors is a bit of a maintenance nightmare and
> >> doesn't really work with external/custom connectors.
>
> I couldn't agree with you more.
>
> >> Has there ever been any thought about changing flink-python's
> >> connector setup to use the Table API connectors underneath?
>
> I'm still not sure whether this is feasible for all connectors;
> however, it may be a good idea. The concern is that the DataStream API
> connector functionality may be unaligned between the Java and Python
> connectors. Besides, there are still a few connectors which only have
> DataStream API connectors, e.g. Google PubSub, RabbitMQ, Cassandra,
> Pulsar, Hybrid Source, etc. Moreover, PyFlink already supports the
> Table API connectors, so if we take this route, maybe we could just
> tell users to use the Table API connectors directly.
>
> Another option I had in mind is to provide an API that allows
> configuring the behavior via key/value pairs in both the Java and
> Python DataStream API connectors.
>
> Regards,
> Dian
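For concreteness, Dian's key/value option might look roughly like this on
the Python side (a sketch for discussion only: KeyValueSource and its
option keys are invented here, not an existing PyFlink API):

    from pyflink.common import WatermarkStrategy
    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # Hypothetical: the connector is selected and configured purely by
    # string options, so the same key/value map could drive both the
    # Java and Python DataStream APIs without a per-connector wrapper.
    source = KeyValueSource(  # invented class, for illustration only
        identifier="kafka",
        options={
            "topic": "orders",
            "properties.bootstrap.servers": "localhost:9092",
            "scan.startup.mode": "earliest-offset",
            "value.format": "json",
        },
    )
    ds = env.from_source(source, WatermarkStrategy.no_watermarks(), "orders")

The appeal is that one option map could drive both languages; the cost is
giving up the typed builder methods of today's Java connectors.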
> On Wed, Jul 5, 2023 at 6:34 PM Chesnay Schepler <ches...@apache.org>
> wrote:
> >
> > Has there ever been any thought about changing flink-python's
> > connector setup to use the Table API connectors underneath?
> >
> > The wrapping of connectors is a bit of a maintenance nightmare and
> > doesn't really work with external/custom connectors.
> >
> > On 04/07/2023 13:35, Dian Fu wrote:
> > > Thanks Ran Tao for proposing this discussion and Martijn for sharing
> > > your thoughts.
> > >
> > >> While flink-python now fails the CI, it shouldn't actually depend
> > >> on the externalized connectors. I'm not sure what PyFlink does with
> > >> it, but if it belongs to the connector code,
> > >
> > > For each DataStream connector, there is a corresponding Python
> > > wrapper and also some test cases in PyFlink. In theory, we should
> > > move each wrapper into its connector repository. We did not do that
> > > when externalizing the connectors because it would add release
> > > overhead: we would have to publish each connector to PyPI
> > > separately.
> > >
> > > To resolve this problem, I guess we can move the connector support
> > > in PyFlink into the external connector repositories.
> > >
> > > Regards,
> > > Dian
> > >
> > >
> > > On Mon, Jul 3, 2023 at 11:08 PM Ran Tao <chucheng...@gmail.com> wrote:
> > >> @Martijn
> > >> Thanks for the clear explanation.
> > >>
> > >> If we follow the line you specified (connectors shouldn't rely on
> > >> dependencies that may or may not be available in Flink itself), it
> > >> seems that we should declare any dependency we need (such as
> > >> commons-io or commons-collections) explicitly in the connector pom,
> > >> and bundle it in the sql-connector uber jar.
> > >>
> > >> Then only one thing is left: we need to make the flink-python tests
> > >> not depend on the released connectors. Maybe we should check that
> > >> out and decouple it as you suggested.
> > >>
> > >> Best Regards,
> > >> Ran Tao
> > >> https://github.com/chucheng92
> > >>
> > >>
> > >> On Mon, Jul 3, 2023 at 10:06 PM Martijn Visser
> > >> <martijnvis...@apache.org> wrote:
> > >>
> > >>> Hi Ran Tao,
> > >>>
> > >>> Thanks for opening this topic. I think there are a couple of
> > >>> things at hand:
> > >>> 1. Connectors shouldn't rely on dependencies that may or may not
> > >>> be available in Flink itself, as we've seen with flink-shaded;
> > >>> otherwise we get a tight coupling between Flink and the
> > >>> connectors, which is exactly what we try to avoid.
> > >>> 2. Following that line, the same applies to things like
> > >>> commons-collections and commons-io: if a connector wants to use
> > >>> them, it should make sure that it bundles those artifacts itself.
> > >>> 3. While flink-python now fails the CI, it shouldn't actually
> > >>> depend on the externalized connectors. I'm not sure what PyFlink
> > >>> does with it, but if it belongs to the connector code, that code
> > >>> should also be moved to the individual connector repo. If it's
> > >>> just a generic test, we could consider creating a generic test
> > >>> against released connector versions to determine compatibility.
> > >>>
> > >>> I'm curious about the opinions of others as well.
> > >>>
> > >>> Best regards,
> > >>>
> > >>> Martijn
> > >>>
> > >>> On Mon, Jul 3, 2023 at 3:37 PM Ran Tao <chucheng...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> I have an issue that requires upgrading commons-collections[1]
> > >>>> (as one example), but the PR CI fails because the flink-python
> > >>>> test cases depend on flink-sql-connector-kafka, and the Kafka SQL
> > >>>> connector is a thin jar that does not include this dependency, so
> > >>>> the Flink CI throws an exception[2]. My current solution is [3].
> > >>>> But even once this PR is done, upgrading Flink still requires a
> > >>>> new kafka-connector release.
> > >>>>
> > >>>> This issue points to a deeper problem. Although the connectors
> > >>>> have been externalized, many UTs of flink-python depend on these
> > >>>> connectors, and a basic agreement for externalized connectors is
> > >>>> that extra dependencies must not be introduced explicitly, which
> > >>>> means the externalized connectors use the dependencies inherited
> > >>>> from Flink. As a result, when the Flink main repo upgrades some
> > >>>> dependencies, the flink-python test cases easily fail, because
> > >>>> Flink no longer ships the class and the connector does not
> > >>>> contain it either. It's a circular problem.
> > >>>>
> > >>>> Unless, that is, the connector bundles all of its dependencies
> > >>>> itself, which is hard to control (only a few connectors include
> > >>>> all jars in the shade phase).
> > >>>>
> > >>>> In short, the flink-python module's current dependencies on the
> > >>>> connectors mean the externalization and decoupling are
> > >>>> incomplete, and this leads to circular dependencies whenever
> > >>>> Flink upgrades or changes some dependencies.
> > >>>>
> > >>>> I don't know if I've made this clear. I hope to get everyone's
> > >>>> opinions on what better solutions we should adopt for similar
> > >>>> problems in the future.
> > >>>>
> > >>>> [1] https://issues.apache.org/jira/browse/FLINK-30274
> > >>>> [2]
> > >>>> https://user-images.githubusercontent.com/11287509/250120404-d12b60f4-7ff3-457e-a2c4-8cd415bb5ca2.png
> > >>>> https://user-images.githubusercontent.com/11287509/250120522-6b096a4f-83f0-4287-b7ad-d46b9371de4c.png
> > >>>> [3] https://github.com/apache/flink-connector-kafka/pull/38
> > >>>>
> > >>>> Best Regards,
> > >>>> Ran Tao
> > >>>> https://github.com/chucheng92
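As background for the maintenance concern Chesnay raises above: each
DataStream connector exposed in flink-python today is wrapped by hand,
roughly in the shape sketched below (the package and class names here
are invented; the real wrappers live under pyflink.datastream.connectors).
Every change to the Java builder has to be mirrored manually, which is
also why external/custom connectors are hard to support this way:

    from pyflink.java_gateway import get_gateway

    class ExampleSourceBuilder:
        """Illustrative only: the typical shape of a PyFlink DataStream
        connector wrapper, mirroring a Java builder method by method."""

        def __init__(self):
            # Obtain the Java builder through the Py4J gateway.
            jvm = get_gateway().jvm
            self._j_builder = jvm.org.example.connector.ExampleSource.builder()

        def set_topic(self, topic: str) -> "ExampleSourceBuilder":
            # Each Java setter needs a hand-written Python twin; any
            # change or addition on the Java side must be replicated here.
            self._j_builder = self._j_builder.setTopic(topic)
            return self

        def build(self):
            # Returns the underlying Java source object, which the Python
            # DataStream API hands back to the JVM at job submission.
            return self._j_builder.build()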