I'm not so sure about the web tool solution, though. My concern with this
approach is that the final generated distribution is somewhat
non-deterministic. We might generate too many different combinations once
users try to package different types of connectors, formats, and maybe even
Hadoop releases. As far as I can tell, most open source and Apache projects
only release a few pre-defined distributions, which most users are already
familiar with and which are therefore hard to change, IMO. I have also seen
cases where users re-distribute the release package because of the unstable
network connection to the Apache website from China. With the web tool
solution, I don't think this kind of re-distribution would be possible
anymore.
In the meantime, I also have the concern that we will fall into our own trap
again if we try to offer this smart and flexible solution, because it
requires users to cooperate with such a mechanism. It's exactly the
situation we currently find ourselves in:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything works as expected if users follow the right instructions.

In reality, I suspect not all users will do the second step correctly. And
for new users who are only trying to get a quick first experience with
Flink, I would bet most will get it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, plus a
distribution that includes some popular built-in jars.
2. Only provide a distribution that includes some popular built-in jars.

If we are trying to reduce the number of distributions we release, I would
prefer 2 over 1.

Best,
Kurt

On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

> I think what Chesnay and Dawid proposed would be the ideal solution.
> Ideally, we would also have a nice web tool for the website which
> generates the corresponding distribution for download.
>
> To get things started, the script could initially only support
> downloading/creating the "fat" version. The fat version would then
> consist of the slim distribution plus whatever we deem important for new
> users to get started.
>
> Cheers,
> Till
>
> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
> <dwysakow...@apache.org> wrote:
>
> > Hi all,
> >
> > A few points from my side:
> >
> > 1. I like the idea of simplifying the experience for first-time users.
> > As for production use cases, I share Jark's opinion that there I would
> > expect users to assemble their distribution manually. In such
> > scenarios it is important to understand the interconnections.
> > Personally, I'd expect the slimmest possible distribution that I can
> > extend with whatever I need for my production scenario.
> >
> > 2. I think there is also the problem that the matrix of possibly
> > useful combinations is already big. Do we want to have a distribution
> > for:
> >
> > SQL users: which connectors should we include? Should we include Hive?
> > Which other catalogs?
> >
> > DataStream users: which connectors should we include?
> >
> > For both of the above, should we include YARN/Kubernetes?
> >
> > I would opt for providing only the "slim" distribution as a release
> > artifact.
> >
> > 3. However, as I said, I think it's worth investigating how we can
> > improve the user experience. What do you think of providing a tool,
> > e.g. a shell script, that constructs a distribution based on the
> > user's choice? I think that is also what Chesnay referred to as
> > "tooling to assemble custom distributions". In the end, the way I see
> > the difference between a slim and a fat distribution is simply which
> > jars we put into lib, right? The tool could have a few "screens":
> >
> > 1. Which API are you interested in?
> >    a. SQL API
> >    b. DataStream API
> >
> > 2. [SQL] Which connectors do you want to use? [multichoice]:
> >    a. Kafka
> >    b. Elasticsearch
> >    ...
> >
> > 3. [SQL] Which catalog do you want to use?
> >
> > ...
> >
> > Such a tool would download all the dependencies from Maven and put
> > them into the correct folder. In the future we could extend it with
> > additional rules, e.g. that kafka-0.9 cannot be chosen at the same
> > time as kafka-universal, etc.
> >
> > The benefit would be that the distribution we release could remain
> > "slim", or we could even make it slimmer. I might be missing something
> > here, though.
> >
> > Best,
> >
> > Dawid
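A minimal sketch of what the assembly tool Dawid describes could look like,
in Python rather than shell only for brevity. The script name, the
connector-to-Maven-coordinate mapping, and the artifact IDs/versions are
illustrative assumptions, not the real release coordinates:

    #!/usr/bin/env python3
    """assemble_dist.py -- hypothetical sketch of the assembly tool
    described above: ask the user which connectors they want and download
    the matching jars from Maven Central into the distribution's lib/
    folder."""
    import pathlib
    import urllib.request

    MAVEN = "https://repo1.maven.org/maven2"
    FLINK_VERSION = "1.10.0"

    # Illustrative mapping of choices to artifacts; the real artifact IDs
    # and versions would have to be taken from the Flink release.
    CONNECTORS = {
        "kafka": "flink-sql-connector-kafka_2.12",
        "elasticsearch": "flink-sql-connector-elasticsearch7_2.12",
    }

    def download(artifact, version, target_dir):
        jar = f"{artifact}-{version}.jar"
        url = f"{MAVEN}/org/apache/flink/{artifact}/{version}/{jar}"
        print(f"Fetching {url}")
        urllib.request.urlretrieve(url, str(target_dir / jar))

    if __name__ == "__main__":
        lib = pathlib.Path("flink-dist/lib")
        lib.mkdir(parents=True, exist_ok=True)
        # "Screen 2": which connectors do you want to use?
        chosen = input("Connectors (comma-separated, e.g. kafka,elasticsearch): ")
        for name in (c.strip() for c in chosen.split(",") if c.strip()):
            # Mutual-exclusion rules (kafka-0.9 vs. kafka-universal) would go here.
            download(CONNECTORS[name], FLINK_VERSION, lib)

A shell variant would do the same with curl against repo1.maven.org; the
interesting part is only the choice screens plus the mapping table, which
is also where the exclusion rules would live.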
> > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > > I want to reinforce my opinion from earlier: this is about improving
> > > the situation both for first-time users and for experienced users who
> > > want to use a Flink dist in production. The current Flink dist is too
> > > "thin" for first-time SQL users and too "fat" for production users,
> > > so we are serving no one properly with the current middle ground.
> > > That's why I think introducing those specialized "spins" of the Flink
> > > dist would be good.
> > >
> > > By the way, at some point in the future production users might not
> > > even need to get a Flink dist anymore. They should be able to have
> > > Flink as a dependency of their project (including the runtime) and
> > > then build an image from this for Kubernetes or a fat jar for YARN.
> > >
> > > Aljoscha
> > >
> > > On 15.04.20 18:14, wenlong.lwl wrote:
> > >> Hi all,
> > >>
> > >> Regarding slim and fat distributions, I think different kinds of
> > >> jobs may prefer different types of distribution:
> > >>
> > >> For DataStream jobs, I don't think we want a fat distribution
> > >> containing connectors, because users always need to depend on the
> > >> connector in user code anyway, and it is easy to include the
> > >> connector jar in the user lib. Fewer jars in lib means fewer class
> > >> conflicts and problems.
> > >>
> > >> For SQL jobs, I think we are trying to encourage users to use pure
> > >> SQL (DDL + DML) to construct their jobs. To improve the user
> > >> experience, it may be important for Flink not only to provide as
> > >> many connector jars in the distribution as possible (especially the
> > >> connectors and formats we have documented well), but also to
> > >> provide a mechanism that loads connectors according to the DDL.
> > >>
> > >> So I think it could be good to place connector/format jars in a
> > >> directory like opt/connector, which would not affect jobs by
> > >> default, and to introduce a mechanism of dynamic discovery for SQL
> > >> (a sketch of such a mechanism follows below).
> > >>
> > >> Best,
> > >> Wenlong
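A minimal sketch of the dynamic-discovery idea above, done as client-side
tooling rather than a runtime change. The opt/connector layout, the jar
naming scheme, and the regex over the DDL are all assumptions made for
illustration:

    #!/usr/bin/env python3
    """Hypothetical sketch: scan a SQL script for the connectors its DDL
    declares and pick the matching jars out of opt/connector."""
    import pathlib
    import re

    OPT_CONNECTOR = pathlib.Path("opt/connector")  # assumed layout, see above

    def connectors_in(sql_text):
        # Matches e.g. 'connector' = 'kafka' (or the older
        # 'connector.type') inside CREATE TABLE ... WITH (...) clauses.
        return set(re.findall(r"'connector(?:\.type)?'\s*=\s*'([\w-]+)'",
                              sql_text))

    def jars_for(script_path):
        wanted = connectors_in(script_path.read_text())
        # Assumes jars are named after the connector, e.g.
        # flink-sql-connector-kafka-*.jar.
        return [jar for jar in OPT_CONNECTOR.glob("*.jar")
                if any(name in jar.name for name in wanted)]

    if __name__ == "__main__":
        jars = jars_for(pathlib.Path("job.sql"))
        # The resulting list could then be handed to the SQL client,
        # e.g. via repeated -j/--jar options.
        print("\n".join(str(j) for j in jars))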
> > >> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am thinking about both "improve the first experience" and
> > >>> "improve the production experience".
> > >>>
> > >>> I'm thinking about what the common modes of using Flink are:
> > >>> streaming jobs using Kafka? Batch jobs using Hive?
> > >>>
> > >>> The Hive 1.2.1 dependencies are compatible with most Hive server
> > >>> versions, which is why Spark and Presto ship a built-in Hive 1.2.1
> > >>> dependency. Flink is currently mainly used for streaming, so let's
> > >>> leave Hive aside.
> > >>>
> > >>> For streaming jobs, the jobs I have in mind are (in terms of
> > >>> connectors):
> > >>> - ETL jobs: Kafka -> Kafka
> > >>> - Join jobs: Kafka -> DimJDBC -> Kafka
> > >>> - Aggregation jobs: Kafka -> JDBCSink
> > >>> So Kafka and JDBC are probably the most commonly used, along with
> > >>> the CSV and JSON formats.
> > >>> So suppose we provide a fat distribution:
> > >>> - with CSV and JSON,
> > >>> - with flink-kafka-universal and its Kafka dependencies,
> > >>> - with flink-jdbc.
> > >>> Using this fat distribution, most users can run their jobs well.
> > >>> (A JDBC driver jar is still required, but that is very natural to
> > >>> add.)
> > >>> Can these dependencies lead to conflicts? Only Kafka might, but if
> > >>> our goal is for kafka-universal to support all Kafka versions, we
> > >>> can hope to cover the vast majority of users.
> > >>>
> > >>> We don't want to put every jar into the fat distribution, only the
> > >>> common ones with few conflicts. Of course, which jars go into the
> > >>> fat distribution is a matter for discussion.
> > >>> We have the opportunity to help the majority of users while still
> > >>> leaving room for customization.
> > >>>
> > >>> Best,
> > >>> Jingsong Lee
> > >>>
> > >>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> I think we should first reach a consensus on "what problem do we
> > >>>> want to solve?":
> > >>>> (1) improve the first experience, or (2) improve the production
> > >>>> experience?
> > >>>>
> > >>>> From the discussion above, I think what we want to solve is the
> > >>>> "first experience".
> > >>>> And I think the slim distribution is still the best for
> > >>>> production, because assembling jars is easier than excluding jars
> > >>>> and avoids potential class conflicts.
> > >>>>
> > >>>> If we want to improve the "first experience", I think it makes
> > >>>> sense to have a fat distribution that gives users a smoother
> > >>>> start.
> > >>>> But I would like to call it a "playground distribution" or
> > >>>> something like that, to explicitly set it apart from the slim,
> > >>>> production-purpose distribution.
> > >>>> The "playground distribution" can contain some widely used jars,
> > >>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> > >>>> avro, json, csv, etc.
> > >>>> We could even provide a playground Docker image which contains
> > >>>> the fat distribution, Python 3, and Hive.
> > >>>>
> > >>>> Best,
> > >>>> Jark
> > >>>>
> > >>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler
> > >>>> <ches...@apache.org> wrote:
> > >>>>
> > >>>>> I don't see a lot of value in having multiple distributions.
> > >>>>>
> > >>>>> The simple reality is that no fat distribution we could provide
> > >>>>> would satisfy all use cases, so why even try?
> > >>>>> If users commonly run into issues with certain jars, then maybe
> > >>>>> those should be added to the current distribution.
> > >>>>>
> > >>>>> Personally, though, I still believe we should only distribute a
> > >>>>> slim version. I'd rather have users always add required jars to
> > >>>>> the distribution than only when they go outside our "expected"
> > >>>>> use cases.
> > >>>>> Then we might finally address this issue properly, i.e., tooling
> > >>>>> to assemble custom distributions and/or better error messages if
> > >>>>> Flink-provided extensions cannot be found.
> > >>>>>
> > >>>>> On 15/04/2020 15:23, Kurt Young wrote:
> > >>>>>> Regarding the specific solution, I'm not sure about the "fat"
> > >>>>>> and "slim" split, though. I get the idea that we can make the
> > >>>>>> slim one even more lightweight than the current distribution,
> > >>>>>> but what about the "fat" one? Do you mean that we would package
> > >>>>>> all connectors and formats into it? I'm not sure that is
> > >>>>>> feasible.
> > >>>>>> For example, we can't put all versions of the Kafka and Hive
> > >>>>>> connector jars into the lib directory, and we might also need
> > >>>>>> Hadoop jars when using the filesystem connector to access data
> > >>>>>> on HDFS.
> > >>>>>>
> > >>>>>> So my guess would be that we hand-pick some of the most
> > >>>>>> frequently used connectors and formats for our "lib" directory,
> > >>>>>> like the Kafka, CSV, and JSON ones mentioned above, and still
> > >>>>>> leave other connectors out.
> > >>>>>> If that is the case, then why not just provide this one
> > >>>>>> distribution to users? I'm not sure I see the benefit of
> > >>>>>> providing another super "slim" distribution (we would have to
> > >>>>>> pay some cost to maintain another suite of distributions).
> > >>>>>>
> > >>>>>> What do you think?
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Kurt
> > >>>>>>
> > >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
> > >>>>>> <jingsongl...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Big +1.
> > >>>>>>>
> > >>>>>>> I like "fat" and "slim".
> > >>>>>>>
> > >>>>>>> For CSV and JSON, as Jark said, they are quite small and don't
> > >>>>>>> have other dependencies. They are important to the Kafka
> > >>>>>>> connector, and important to the upcoming filesystem connector
> > >>>>>>> too.
> > >>>>>>> So can we include them in both "fat" and "slim"? They're that
> > >>>>>>> important, and they're that lightweight.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Jingsong Lee
> > >>>>>>>
> > >>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he
> > >>>>>>> <godfre...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Big +1.
> > >>>>>>>> This will improve the user experience (especially for new
> > >>>>>>>> Flink users).
> > >>>>>>>> We have answered so many questions about "class not found".
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Godfrey
> > >>>>>>>>
> > >>>>>>>> On Wed, Apr 15, 2020 at 4:30 PM Dian Fu
> > >>>>>>>> <dian0511...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> +1 to this proposal.
> > >>>>>>>>>
> > >>>>>>>>> Missing connector jars is also a big problem for PyFlink
> > >>>>>>>>> users. Currently, after Python users have installed PyFlink
> > >>>>>>>>> using `pip`, they have to manually copy the connector fat
> > >>>>>>>>> jars into the PyFlink installation directory if they want
> > >>>>>>>>> the connectors to be usable when running jobs locally (a
> > >>>>>>>>> sketch of this manual step is shown below). This process is
> > >>>>>>>>> very confusing for users and hurts the experience a lot.
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Dian
> > >>>>>>>>>
> > >>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> +1 to the proposal. I also found the "download additional
> > >>>>>>>>>> jars" step really cumbersome when preparing webinars.
> > >>>>>>>>>>
> > >>>>>>>>>> At the very least, I think flink-csv and flink-json should
> > >>>>>>>>>> be in the distribution; they are quite small and have no
> > >>>>>>>>>> other dependencies.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Jark
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Aljoscha,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan
> > >>>>>>>>>>> to put these connectors: opt or lib?
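The manual fix-up Dian describes boils down to something like the following
sketch. It assumes the connector jar was downloaded beforehand and that a
pip-installed PyFlink keeps its jars in a lib/ folder next to the package
sources, which is worth verifying for the version at hand:

    #!/usr/bin/env python3
    """Sketch of the manual step for PyFlink users: copy a downloaded
    connector jar into the lib/ directory of a pip-installed PyFlink."""
    import pathlib
    import shutil

    import pyflink

    # Assumption: the pip-installed distribution keeps its jars under
    # site-packages/pyflink/lib.
    pyflink_lib = pathlib.Path(pyflink.__file__).parent / "lib"

    # Assumed to have been downloaded beforehand; name is illustrative.
    jar = pathlib.Path("flink-sql-connector-kafka_2.12-1.10.0.jar")
    shutil.copy(jar, pyflink_lib / jar.name)
    print(f"Copied {jar.name} into {pyflink_lib}")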
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 3:30 PM Aljoscha Krettek
> > >>>>>>>>>>> <aljos...@apache.org> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi Everyone,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
> > >>>>>>>>>>>> distribution. The motivation is that there is friction
> > >>>>>>>>>>>> for SQL/Table API users who want to use Table connectors
> > >>>>>>>>>>>> that are not in the current Flink distribution. For these
> > >>>>>>>>>>>> users the workflow is currently roughly:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - download the Flink dist
> > >>>>>>>>>>>> - configure the csv/Kafka/json connectors in the
> > >>>>>>>>>>>>   configuration
> > >>>>>>>>>>>> - run the SQL client or program
> > >>>>>>>>>>>> - decipher the error message and research a solution
> > >>>>>>>>>>>> - download additional connector jars
> > >>>>>>>>>>>> - the program works correctly
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I realize that this can be made to work, but if every SQL
> > >>>>>>>>>>>> user has this as their first experience, that doesn't
> > >>>>>>>>>>>> seem good to me.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> My proposal is to provide two versions of the Flink
> > >>>>>>>>>>>> distribution in the future, "fat" and "slim" (names to be
> > >>>>>>>>>>>> discussed):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - slim would be even trimmer than today's distribution
> > >>>>>>>>>>>> - fat would contain a lot of convenience connectors
> > >>>>>>>>>>>>   (which ones is yet to be determined)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> And yes, I realize that there are already more dimensions
> > >>>>>>>>>>>> of Flink releases (Scala version and Java version).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For background, our current Flink dist has these in the
> > >>>>>>>>>>>> opt directory:
> > >>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
> > >>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
> > >>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The current Flink dist is 267M. If we removed everything
> > >>>>>>>>>>>> from opt we would go down to 126M. I would recommend
> > >>>>>>>>>>>> this, because the large majority of the files in opt are
> > >>>>>>>>>>>> probably unused.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What do you think?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best Regards
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jeff Zhang
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Best, Jingsong Lee
> > >>>
> > >>> --
> > >>> Best, Jingsong Lee
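For anyone who wants to check Aljoscha's 267M/126M numbers against an
unpacked distribution, the breakdown is easy to reproduce (a sketch; the
directory name is whatever the unpacked dist is called locally):

    #!/usr/bin/env python3
    """Sketch: size of an unpacked Flink distribution with and without
    opt/, to reproduce numbers like "267M total, 126M without opt"."""
    import pathlib

    def size_mb(path):
        # Sum the sizes of all regular files below the given directory.
        return sum(f.stat().st_size
                   for f in path.rglob("*") if f.is_file()) / 1024**2

    dist = pathlib.Path("flink-1.10.0")  # path to the unpacked distribution
    total = size_mb(dist)
    opt = size_mb(dist / "opt")
    print(f"total: {total:.0f}M, opt/: {opt:.0f}M, "
          f"without opt/: {total - opt:.0f}M")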