+1 to include them for sql-client by default; +0 to put them into lib and expose them to all kinds of jobs, including DataStream.
Danny Chan <yuzhao....@gmail.com> wrote on Fri, Jun 5, 2020 at 2:31 PM:

+1. At least we should keep an out-of-the-box SQL CLI; it's a very poor experience to make SQL users add such required format jars.

Best,
Danny Chan

On Jun 5, 2020 at 11:14 AM +0800, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi all,

Considering that 1.11 will be released soon, what about my previous proposal? Put flink-csv, flink-json and flink-avro under lib. These three formats are very small, have no third-party dependencies, and are widely used by table users.

Best,
Jingsong Lee

On Tue, May 12, 2020 at 4:19 PM Jingsong Li <jingsongl...@gmail.com> wrote:

Thanks for your discussion.

Sorry to start discussing another thing: the biggest problem I see is the variety of problems caused by users' missing format dependencies. As Aljoscha said, these three formats are very small, have no third-party dependencies, and are widely used by table users. Actually, we don't have any other built-in table formats now... 151K in total:

73K flink-avro-1.10.0.jar
36K flink-csv-1.10.0.jar
42K flink-json-1.10.0.jar

So, can we just put them into "lib/" or flink-table-uber? It doesn't solve all problems, and maybe it is independent of "fat" and "slim", but it would improve usability. What do you think? Any objections?

Best,
Jingsong Lee

On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ches...@apache.org> wrote:

One downside would be that we're shipping more stuff when running on YARN, for example, since the entire plugins directory is shipped by default.

On 17/04/2020 16:38, Stephan Ewen wrote:

@Aljoscha I think that is an interesting line of thinking. The swift-fs may be rarely enough used to move it to an optional download.

I would still drop two more thoughts:

(1) Now that we have plugins support, is there a reason to have a metrics reporter or file system in /opt instead of /plugins? They don't spoil the class path any more.

(2) I can imagine there still being a desire to have a "minimal" docker file, for users that want to keep the container images as small as possible, to speed up deployment. It is fine if that would not be the default, though.

On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:

I think having such tools and/or tailor-made distributions can be nice, but I also think the discussion is missing the main point: the initial observation/motivation is that apparently a lot of users (Kurt and I talked about this) on the Chinese DingTalk support groups and other support channels have problems when first using the SQL client because of these missing connectors/formats. For them, having additional tools would not solve anything, because they would not take that extra step either. I think that even tiny friction should be avoided, because the annoyance from it accumulates over the (hopefully) many users that we want to have.

Maybe we should take a step back from discussing the "fat"/"slim" idea and instead think about the composition of the current dist. As mentioned, we have these jars in opt/:

17M flink-azure-fs-hadoop-1.10.0.jar
52K flink-cep-scala_2.11-1.10.0.jar
180K flink-cep_2.11-1.10.0.jar
746K flink-gelly-scala_2.11-1.10.0.jar
626K flink-gelly_2.11-1.10.0.jar
512K flink-metrics-datadog-1.10.0.jar
159K flink-metrics-graphite-1.10.0.jar
1.0M flink-metrics-influxdb-1.10.0.jar
102K flink-metrics-prometheus-1.10.0.jar
10K flink-metrics-slf4j-1.10.0.jar
12K flink-metrics-statsd-1.10.0.jar
36M flink-oss-fs-hadoop-1.10.0.jar
28M flink-python_2.11-1.10.0.jar
22K flink-queryable-state-runtime_2.11-1.10.0.jar
18M flink-s3-fs-hadoop-1.10.0.jar
31M flink-s3-fs-presto-1.10.0.jar
196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
518K flink-sql-client_2.11-1.10.0.jar
99K flink-state-processor-api_2.11-1.10.0.jar
25M flink-swift-fs-hadoop-1.10.0.jar
160M opt

The "filesystem" connectors are the heavy hitters there.

I downloaded most of the SQL connectors/formats and this is what I got:

73K flink-avro-1.10.0.jar
36K flink-csv-1.10.0.jar
55K flink-hbase_2.11-1.10.0.jar
88K flink-jdbc_2.11-1.10.0.jar
42K flink-json-1.10.0.jar
20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
24M sql-connectors-formats

We could add these to the Flink distribution without blowing it up by much. We could drop any of the existing "filesystem" connectors from opt, add the SQL connectors/formats, and not change the size of Flink dist. So maybe we should do that instead?
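Until the dist composition changes, the fix users apply is manual: download each small format jar from Maven Central into the distribution. A minimal sketch of that step, assuming the standard Maven repository layout and a 1.10.0 version pin (the helper name is illustrative, not an existing script):

```shell
#!/usr/bin/env sh
# Sketch of the manual step the thread wants to eliminate: fetching the
# small format jars into Flink's lib/. Version pin and helper name are
# assumptions for illustration.
FLINK_VERSION="1.10.0"
MAVEN_REPO="https://repo.maven.apache.org/maven2"

# Compose the standard Maven-layout URL for an org.apache.flink artifact.
format_jar_url() {
    echo "${MAVEN_REPO}/org/apache/flink/$1/${FLINK_VERSION}/$1-${FLINK_VERSION}.jar"
}

for fmt in flink-csv flink-json flink-avro; do
    echo "would fetch into lib/: $(format_jar_url "$fmt")"
    # real usage would be something like:
    # curl -sSfL -o "lib/${fmt}-${FLINK_VERSION}.jar" "$(format_jar_url "$fmt")"
done
```

Small as this is, it is exactly the "decrypt error message, research, download jar" loop the proposal argues first-time users should never see.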
We would need some tooling for the sql-client shell script to pick the connectors/formats up from opt/, because we don't want to add them to lib/. We're already doing that for finding the flink-sql-client jar, which is also not in lib/.

What do you think?

Best,
Aljoscha

On 17.04.20 05:22, Jark Wu wrote:

Hi,

I like the idea of a web tool to assemble a fat distribution, and https://code.quarkus.io/ looks very nice. All users need to do is select what they need (I think this step can't be omitted anyway). We could also provide a default fat distribution on the web which pre-selects some popular connectors.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com> wrote:

As a reference for a nice first experience I had, take a look at https://code.quarkus.io/. You reach this page after you click "Start Coding" on the project homepage.

Rafi

On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> wrote:

I'm not saying pre-bundling some jars will make this problem go away, and you're right that it only hides the problem for some users. But what if this solution can hide the problem for 90% of users? Wouldn't that be good enough for us to try?

Regarding whether users following instructions would really be such a big problem: I'm afraid yes.
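The tooling Aljoscha mentions, having the sql-client script pick connector/format jars up from opt/, could be little more than a glob when the client classpath is built. A rough sketch, assuming a flat opt/ layout and these jar name patterns (this is not the actual sql-client.sh logic):

```shell
#!/usr/bin/env sh
# Sketch: collect SQL connector/format jars from opt/ into a classpath,
# similar in spirit to how sql-client.sh locates the flink-sql-client jar.
# The directory layout and name patterns are assumptions.
FLINK_OPT_DIR="${FLINK_OPT_DIR:-./opt}"

collect_sql_jars() {
    cp=""
    for jar in "$FLINK_OPT_DIR"/flink-sql-connector-*.jar \
               "$FLINK_OPT_DIR"/flink-csv-*.jar \
               "$FLINK_OPT_DIR"/flink-json-*.jar \
               "$FLINK_OPT_DIR"/flink-avro-*.jar; do
        [ -e "$jar" ] || continue        # skip patterns that matched nothing
        cp="${cp:+$cp:}$jar"
    done
    echo "$cp"
}

# The client would then append this to its own classpath, e.g.:
# exec java -cp "lib/*:$(collect_sql_jars)" ... SqlClient "$@"
```

The appeal of this approach is that the jars stay out of lib/, so DataStream jobs and the cluster classpath are unaffected.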
Otherwise I wouldn't have answered such questions at least a dozen times, and I wouldn't keep seeing them come up from time to time. During some periods I even saw such questions every day.

Best,
Kurt

On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org> wrote:

The problem with having a distribution with "popular" stuff is that it doesn't really *solve* a problem, it just hides it for users who fall into these particular use-cases. Move out of them and you once again run into the exact same problems outlined. This is exactly why I like the tooling approach; you have to deal with it from the start, and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem? I would expect that users generally know *what* they need, just not necessarily how it is assembled correctly (where to get which jar, which directory to put it in). It seems like these are exactly the problems this would solve? I just don't see how moving a jar corresponding to some feature from opt to some directory (lib/plugins) is less error-prone than selecting the feature and having the tool handle the rest.

As for re-distribution, it depends on the form that the tool would take. It could be an application that runs locally and works against Maven Central (note: not necessarily *using* Maven); this should work in China, no?

A web tool would of course be fancy, but I don't know how feasible that is with the ASF infrastructure. You wouldn't be able to mirror the distribution, so the load can't be distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case-oriented distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution though. The concern I have with this approach is that the final generated distribution is kind of non-deterministic. We might generate too many different combinations when users package different types of connectors, formats, and maybe even Hadoop releases. As far as I can tell, most open source and Apache projects only release a few pre-defined distributions, which most users are already familiar with and which are thus hard to change IMO. I have also seen cases where users re-distribute the release package because of the unstable network to the Apache website from China.
With the web tool solution, I don't think this kind of re-distribution would be possible anymore.

In the meantime, I also have a concern that we will fall into our trap again if we try to offer this smart & flexible solution, because it needs users to cooperate with the mechanism. It's exactly the situation we currently fell into:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything will work as expected if users followed the right instructions.

In reality, I suspect not all users will do the second step correctly. And for new users who are only trying to have a quick experience with Flink, I would bet most will do it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, plus a distribution with some popular built-in jars.
2. Only provide a distribution with some popular built-in jars.

If we are trying to reduce the distributions we release, I would prefer 2 over 1.
Best,
Kurt

On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution. Ideally, we would also have a nice web tool for the website which generates the corresponding distribution for download.

To get things started, we could begin with only supporting downloading/creating the "fat" version with the script. The fat version would then consist of the slim distribution plus whatever we deem important for new users to get started.

Cheers,
Till

On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi all,

Few points from my side:

1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to combine their distribution manually; in such scenarios it is important to understand the interconnections. Personally, I'd expect the slimmest possible distribution that I can extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible useful combinations is already big. Do we want to have a distribution for:

SQL users: which connectors should we include? Should we include Hive? Which other catalog?

DataStream users: which connectors should we include?

For both of the above, should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices? I think that is also what Chesnay meant by "tooling to assemble custom distributions". In the end, the difference between a slim and a fat distribution is which jars we put into lib, right? It could have a few "screens":

1. Which API are you interested in?
   a. SQL API
   b. DataStream API

2. [SQL] Which connectors do you want to use? [multichoice]:
   a. Kafka
   b. Elasticsearch
   ...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them into the correct folder. In the future we could extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal.

The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here, though.

Best,
Dawid

On 16/04/2020 11:02, Aljoscha Krettek wrote:

I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users; with the current middle ground we serve no-one properly. That's why I think introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes, or a fat jar for YARN.
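At its core, the assembly tool Dawid describes would resolve the chosen connector/format artifacts and drop them into lib/ of a slim base distribution. A toy sketch working from a local artifact cache rather than a real Maven download; the function name, cache layout, and selection list are all hypothetical:

```shell
#!/usr/bin/env sh
# Toy sketch of the proposed distribution-assembly step: copy user-selected
# connector/format jars from a local cache into the distribution's lib/.
# assemble_dist and the cache layout are hypothetical, not an existing tool.
assemble_dist() {
    dist_dir="$1"; cache_dir="$2"; shift 2
    mkdir -p "$dist_dir/lib"
    for artifact in "$@"; do
        found=""
        # Match any version of the selected artifact in the cache.
        for jar in "$cache_dir/$artifact"-*.jar; do
            [ -e "$jar" ] && { cp "$jar" "$dist_dir/lib/"; found="yes"; }
        done
        if [ -z "$found" ]; then
            echo "error: no jar for '$artifact' in $cache_dir" >&2
            return 1
        fi
    done
}

# Usage sketch: assemble_dist ./flink-dist ./cache flink-csv flink-json
```

A real tool would also need the compatibility rules mentioned above (e.g. mutually exclusive Kafka connector versions) before copying anything.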
Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:

Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution.

For DataStream jobs, we may not want a fat distribution containing connectors, because the user always needs to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, we are trying to encourage users to construct their jobs with pure SQL (DDL + DML). To improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible (especially the connectors and formats we have documented well), but also to provide a mechanism to load connectors according to the DDLs.

So I think it could be good to place connector/format jars in a directory like opt/connector, which would not affect jobs by default, and introduce a mechanism of dynamic discovery for SQL.

Best,
Wenlong

On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi,

I am thinking about both "improve the first experience" and "improve the production experience".

I'm thinking about what the common mode of Flink is: streaming jobs use Kafka? Batch jobs use Hive?

Hive 1.2.1 dependencies are compatible with most Hive server versions, which is why Spark and Presto have a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs in my mind are (as related to connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink

So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats. So what if we provide a fat distribution:
- with CSV and JSON;
- with flink-kafka-universal and its Kafka dependencies;
- with flink-jdbc.

Using this fat distribution, most users can run their jobs well (a JDBC driver jar is still required, but that is very natural to add). Can these dependencies lead to conflicts?
Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, it should cover the vast majority of users.

We don't want to put every jar into the fat distribution, only common ones with few conflicts. Of course, which jars go into the fat distribution is a matter for consideration. We have the opportunity to help the majority of users while still leaving room for customization.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want to solve?": (1) improve the first experience, or (2) improve the production experience?

As far as I can see from the above discussion, what we want to solve is the "first experience". And I think the slim jar is still the best distribution for production, because assembling jars is easier than excluding jars and avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to have a fat distribution to give users a smoother first experience.
But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the "slim production-purpose distribution". The "playground distribution" could contain some widely used jars, like the universal-kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc. We could even provide a playground Docker image which contains the fat distribution, Python 3, and Hive.

Best,
Jark

On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:

I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try? If users commonly run into issues for certain jars, then maybe those should be added to the current distribution.

Personally though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not sure about the "fat" and "slim" approach though. I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we also might need Hadoop jars when using the filesystem connector to access data from HDFS.

So my guess would be that we hand-pick some of the most frequently used connectors and formats into our "lib" directory, like kafka, csv, and json mentioned above, and still leave some other connectors out. If that is the case, then why not just provide this distribution to users? I'm not sure I see the benefit of providing another super "slim" jar (we would have to pay some cost to maintain another suite of distributions).

What do you think?

Best,
Kurt

On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:

Big +1.

I like "fat" and "slim".

For csv and json, as Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming file system connector too. So can we put them in both "fat" and "slim"? They're so important, and they're so lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:

Big +1.

This will improve the user experience (especially for new Flink users).
We answered so many questions about "class not found".

Best,
Godfrey

Dian Fu <dian0511...@gmail.com> wrote on Wed, Apr 15, 2020 at 4:30 PM:

+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable in locally run jobs. This process is very confusing for users and hurts the experience a lot.

Regards,
Dian

On Apr 15, 2020 at 3:51 PM, Jark Wu <imj...@gmail.com> wrote:

+1 to the proposal. I also found the "download additional jar" step really verbose when preparing webinars.
At the very least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?

Aljoscha Krettek <aljos...@apache.org> wrote on Wednesday, April 15, 2020 at 3:30 PM:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users who want to use Table connectors that are not in the current Flink distribution.
For these users the workflow is currently roughly:

- download Flink dist
- configure the csv/Kafka/json connectors
- run the SQL client or program
- decipher the error message and research a solution
- download the additional connector jars
- program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.

My proposal is to provide two versions of the Flink distribution in the future, "fat" and "slim" (names to be discussed):

- slim would be even trimmer than today's distribution
- fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions to Flink releases (Scala version and Java version).
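The "download additional connector jars" step in the workflow above can be sketched as follows. This is an illustrative example only, assuming the standard Maven Central layout for Flink artifacts and version 1.10.0; the format jar versions must match the distribution version:

```shell
# Build the Maven Central download URLs for the small format jars that
# users currently have to fetch by hand and drop into Flink's lib/ directory.
FLINK_VERSION=1.10.0
BASE=https://repo1.maven.org/maven2/org/apache/flink

for fmt in flink-csv flink-json flink-avro; do
  url="${BASE}/${fmt}/${FLINK_VERSION}/${fmt}-${FLINK_VERSION}.jar"
  echo "$url"
  # From inside flink-${FLINK_VERSION}/lib/ one would then run:
  #   curl -O "$url"
done
```

With these three jars in lib/, a csv/json/avro table definition works out of the box instead of triggering the trial-and-error loop described above.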
For background, our current Flink dist has these in the opt directory:

- flink-azure-fs-hadoop-1.10.0.jar
- flink-cep-scala_2.12-1.10.0.jar
- flink-cep_2.12-1.10.0.jar
- flink-gelly-scala_2.12-1.10.0.jar
- flink-gelly_2.12-1.10.0.jar
- flink-metrics-datadog-1.10.0.jar
- flink-metrics-graphite-1.10.0.jar
- flink-metrics-influxdb-1.10.0.jar
- flink-metrics-prometheus-1.10.0.jar
- flink-metrics-slf4j-1.10.0.jar
- flink-metrics-statsd-1.10.0.jar
- flink-oss-fs-hadoop-1.10.0.jar
- flink-python_2.12-1.10.0.jar
- flink-queryable-state-runtime_2.12-1.10.0.jar
- flink-s3-fs-hadoop-1.10.0.jar
- flink-s3-fs-presto-1.10.0.jar
- flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
- flink-sql-client_2.12-1.10.0.jar
- flink-state-processor-api_2.12-1.10.0.jar
- flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267M. If we removed everything from opt we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?
Best,
Aljoscha

--
Best Regards

Jeff Zhang

--
Best, Benchao Li