+1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro under
the lib/ directory.
I have heard many SQL users (mostly newbies) complain about the out-of-the-box
experience on the mailing list.

Best,
Leonard Xu


> On Jun 5, 2020, at 14:39, Benchao Li <libenc...@gmail.com> wrote:
> 
> +1 to include them for sql-client by default;
> +0 to put them into lib and expose them to all kinds of jobs, including DataStream.
> 
> Danny Chan <yuzhao....@gmail.com> wrote on Fri, Jun 5, 2020 at 2:31 PM:
> 
>> +1. At the very least, we should keep an out-of-the-box SQL CLI; it's a very
>> poor experience to make SQL users add the required format jars themselves.
>> 
>> Best,
>> Danny Chan
>> On Jun 5, 2020 at 11:14 AM +0800, Jingsong Li <jingsongl...@gmail.com> wrote:
>>> Hi all,
>>> 
>>> Considering that 1.11 will be released soon, what about my previous
>>> proposal? Put flink-csv, flink-json and flink-avro under lib.
>>> These three formats are very small, have no third-party dependencies, and
>>> are widely used by table users.
>>> 
>>> Best,
>>> Jingsong Lee
>>> 
>>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>> 
>>>> Thanks for your discussion.
>>>> 
>>>> Sorry to start discussing another thing:
>>>> 
>>>> The biggest problem I see is the variety of problems caused by users'
>>>> missing format dependencies.
>>>> As Aljoscha said, these three formats are very small, have no third-party
>>>> dependencies, and are widely used by table users.
>>>> Actually, we don't have any other built-in table formats now... 151K in
>>>> total...
>>>> 
>>>> 73K flink-avro-1.10.0.jar
>>>> 36K flink-csv-1.10.0.jar
>>>> 42K flink-json-1.10.0.jar
>>>> 
>>>> So, can we just put them into "lib/" or flink-table-uber?
>>>> It does not solve all problems, and maybe it is independent of the "fat"
>>>> and "slim" question, but it would still improve usability.
>>>> What do you think? Any objections?
>>>> 
>>>> Best,
>>>> Jingsong Lee
>>>> 
>>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ches...@apache.org>
>>>> wrote:
>>>> 
>>>>> One downside would be that we're shipping more stuff when running on
>>>>> YARN, for example, since the entire plugins directory is shipped by
>>>>> default.
>>>>> 
>>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
>>>>>> @Aljoscha I think that is an interesting line of thinking. The swift-fs
>>>>>> may be rarely used enough to move it to an optional download.
>>>>>> 
>>>>>> I would still drop two more thoughts:
>>>>>> 
>>>>>> (1) Now that we have plugins support, is there a reason to have a
>>>>>> metrics reporter or file system in /opt instead of /plugins? They don't
>>>>>> spoil the class path any more.
>>>>>> 
>>>>>> (2) I can imagine there still being a desire to have a "minimal"
>>>>>> Dockerfile, for users that want to keep the container images as small
>>>>>> as possible to speed up deployment. It is fine if that would not be the
>>>>>> default, though.
>>>>>> 
>>>>>> 
>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>>> 
>>>>>>> I think having such tools and/or tailor-made distributions can be
>>>>>>> nice, but I also think the discussion is missing the main point: the
>>>>>>> initial observation/motivation is that apparently a lot of users (Kurt
>>>>>>> and I talked about this) on the Chinese DingTalk support groups and
>>>>>>> other support channels have problems when first using the SQL client
>>>>>>> because of these missing connectors/formats. For these users, having
>>>>>>> additional tools would not solve anything, because they would also not
>>>>>>> take that extra step. I think that even tiny friction should be
>>>>>>> avoided, because the annoyance from it accumulates over the
>>>>>>> (hopefully) many users that we want to have.
>>>>>>> 
>>>>>>> Maybe we should take a step back from discussing the "fat"/"slim" idea
>>>>>>> and instead think about the composition of the current dist. As
>>>>>>> mentioned, we have these jars in opt/:
>>>>>>> 
>>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
>>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
>>>>>>> 180K flink-cep_2.11-1.10.0.jar
>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
>>>>>>> 12K flink-metrics-statsd-1.10.0.jar
>>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
>>>>>>> 28M flink-python_2.11-1.10.0.jar
>>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
>>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
>>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
>>>>>>> 160M opt
>>>>>>> 
>>>>>>> The "filesystem" connectors are the heavy hitters there.
>>>>>>> 
>>>>>>> I downloaded most of the SQL connectors/formats and this is what I got:
>>>>>>> 
>>>>>>> 73K flink-avro-1.10.0.jar
>>>>>>> 36K flink-csv-1.10.0.jar
>>>>>>> 55K flink-hbase_2.11-1.10.0.jar
>>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
>>>>>>> 42K flink-json-1.10.0.jar
>>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>>>>> 24M sql-connectors-formats
>>>>>>> 
>>>>>>> We could just add these to the Flink distribution without blowing it
>>>>>>> up by much. We could drop any of the existing "filesystem" connectors
>>>>>>> from opt and add the SQL connectors/formats without changing the size
>>>>>>> of the Flink dist. So maybe we should do that instead?
>>>>>>> 
>>>>>>> We would need some tooling for the sql-client shell script to pick the
>>>>>>> connectors/formats up from opt/, because we don't want to add them to
>>>>>>> lib/. We're already doing that for finding the flink-sql-client jar,
>>>>>>> which is also not in lib/.
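The pickup logic described above could look roughly like this as a shell sketch (the directory layout, variable names, and demo jar names are assumptions for illustration, not the actual Flink scripts):

```shell
# Hedged sketch: collect connector/format jars from an opt/ directory onto
# a classpath string, similar to how the sql-client script already locates
# flink-sql-client*.jar. OPT_DIR and the jar names are demo assumptions.
OPT_DIR="${OPT_DIR:-$(mktemp -d)}"
# Demo fixtures so the sketch is runnable stand-alone.
touch "$OPT_DIR/flink-csv-1.10.0.jar" "$OPT_DIR/flink-json-1.10.0.jar"

CC_CLASSPATH=""
for jar in "$OPT_DIR"/flink-*.jar; do
  [ -e "$jar" ] || continue                       # glob may match nothing
  CC_CLASSPATH="${CC_CLASSPATH:+$CC_CLASSPATH:}$jar"
done
echo "$CC_CLASSPATH"
```

The resulting string would be appended to the classpath the launcher already builds from lib/.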
>>>>>>> 
>>>>>>> What do you think?
>>>>>>> 
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>> 
>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I like the idea of a web tool to assemble a fat distribution, and
>>>>>>>> https://code.quarkus.io/ looks very nice.
>>>>>>>> All users need to do is select what they need (I think this step
>>>>>>>> can't be omitted anyway).
>>>>>>>> We can also provide a default fat distribution on the web which
>>>>>>>> pre-selects some popular connectors.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>> 
>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> As a reference for a nice first experience I had, take a look at
>>>>>>>>> https://code.quarkus.io/
>>>>>>>>> You reach this page after you click "Start Coding" on the project
>>>>>>>>> homepage.
>>>>>>>>> Rafi
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go
>>>>>>>>>> away, and you're right that it only hides the problem for some
>>>>>>>>>> users. But what if this solution can hide the problem for 90% of
>>>>>>>>>> users? Wouldn't that be good enough for us to try?
>>>>>>>>>> 
>>>>>>>>>> Regarding "would users following instructions really be such a big
>>>>>>>>>> problem?": I'm afraid yes. Otherwise I wouldn't have answered such
>>>>>>>>>> questions at least a dozen times, and I wouldn't keep seeing them
>>>>>>>>>> come up from time to time. During some periods, I even saw such
>>>>>>>>>> questions every day.
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Kurt
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> The problem with having a distribution with "popular" stuff is
>>>>>>>>>>> that it doesn't really *solve* a problem, it just hides it for
>>>>>>>>>>> users who fall into these particular use-cases.
>>>>>>>>>>> Step outside of them and you once again run into the exact same
>>>>>>>>>>> problems outlined before.
>>>>>>>>>>> This is exactly why I like the tooling approach; you have to deal
>>>>>>>>>>> with it from the start, and transitioning to a custom use-case is
>>>>>>>>>>> easier.
>>>>>>>>>>> 
>>>>>>>>>>> Would users following instructions really be such a big problem?
>>>>>>>>>>> I would expect that users generally know *what* they need, just
>>>>>>>>>>> not necessarily how it is assembled correctly (where to get which
>>>>>>>>>>> jar, which directory to put it in).
>>>>>>>>>>> It seems like these are exactly the problems this would solve?
>>>>>>>>>>> I just don't see how moving a jar corresponding to some feature
>>>>>>>>>>> from opt to some directory (lib/plugins) is less error-prone than
>>>>>>>>>>> just selecting the feature and having the tool handle the rest.
>>>>>>>>>>> 
>>>>>>>>>>> As for re-distributions, it depends on the form the tool would
>>>>>>>>>>> take.
>>>>>>>>>>> It could be an application that runs locally and works against
>>>>>>>>>>> Maven Central (note: not necessarily *using* Maven); this should
>>>>>>>>>>> work in China, no?
>>>>>>>>>>> 
>>>>>>>>>>> A web tool would of course be fancy, but I don't know how feasible
>>>>>>>>>>> this is with the ASF infrastructure.
>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load can't
>>>>>>>>>>> be distributed. I doubt INFRA would like this.
>>>>>>>>>>> 
>>>>>>>>>>> Note that third parties could also start distributing use-case
>>>>>>>>>>> oriented distributions, which would be perfectly fine as far as
>>>>>>>>>>> I'm concerned.
>>>>>>>>>>> 
>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I'm not so sure about the web tool solution though. The concern I
>>>>>>>>>>> have with this approach is that the final generated distribution
>>>>>>>>>>> is kind of non-deterministic. We might generate too many different
>>>>>>>>>>> combinations when users try to package different connectors,
>>>>>>>>>>> formats, and maybe even Hadoop releases. As far as I can tell,
>>>>>>>>>>> most open source and Apache projects only release a few
>>>>>>>>>>> pre-defined distributions, which most users are already familiar
>>>>>>>>>>> with and which are thus hard to change IMO. I have also seen cases
>>>>>>>>>>> where users re-distribute the release package because of the
>>>>>>>>>>> unstable network to the Apache website from China. With a web
>>>>>>>>>>> tool, I don't think this kind of re-distribution would be possible
>>>>>>>>>>> anymore.
>>>>>>>>>>> 
>>>>>>>>>>> In the meantime, I am also concerned that we will fall into our
>>>>>>>>>>> trap again if we try to offer this smart & flexible solution,
>>>>>>>>>>> because it needs users to cooperate with the mechanism. It's
>>>>>>>>>>> exactly the situation we currently find ourselves in:
>>>>>>>>>>> 1. We offered a smart solution.
>>>>>>>>>>> 2. We hoped users would follow the correct instructions.
>>>>>>>>>>> 3. Everything works as expected if users follow the right
>>>>>>>>>>> instructions.
>>>>>>>>>>> 
>>>>>>>>>>> In reality, I suspect not all users will do the second step
>>>>>>>>>>> correctly. And for new users who are only trying to get a quick
>>>>>>>>>>> first experience with Flink, I would bet most will get it wrong.
>>>>>>>>>>> 
>>>>>>>>>>> So, my proposal would be one of the following two options:
>>>>>>>>>>> 1. Provide a slim distribution for advanced production users, plus
>>>>>>>>>>> a distribution with some popular built-in jars.
>>>>>>>>>>> 2. Only provide a distribution with some popular built-in jars.
>>>>>>>>>>> If we are trying to reduce the number of distributions we release,
>>>>>>>>>>> I would prefer 2 over 1.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
>>>>>>>>>>> solution. Ideally, we would also have a nice web tool on the
>>>>>>>>>>> website which generates the corresponding distribution for
>>>>>>>>>>> download.
>>>>>>>>>>> 
>>>>>>>>>>> To get things started, we could begin by only supporting
>>>>>>>>>>> downloading/creating the "fat" version with the script. The fat
>>>>>>>>>>> version would then consist of the slim distribution plus whatever
>>>>>>>>>>> we deem important for new users to get started.
>>>>>>>>>>> 
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> A few points from my side:
>>>>>>>>>>> 
>>>>>>>>>>> 1. I like the idea of simplifying the experience for first-time
>>>>>>>>>>> users. As for production use cases, I share Jark's opinion that
>>>>>>>>>>> there I would expect users to assemble their distribution
>>>>>>>>>>> manually. I think in such scenarios it is important to understand
>>>>>>>>>>> the interconnections. Personally, I'd expect the slimmest possible
>>>>>>>>>>> distribution that I can extend with what I need in my production
>>>>>>>>>>> scenario.
>>>>>>>>>>> 
>>>>>>>>>>> 2. I think there is also the problem that the matrix of possibly
>>>>>>>>>>> useful combinations is already big. Do we want to have a
>>>>>>>>>>> distribution for:
>>>>>>>>>>> 
>>>>>>>>>>> SQL users: which connectors should we include? Should we include
>>>>>>>>>>> Hive? Which other catalogs?
>>>>>>>>>>> 
>>>>>>>>>>> DataStream users: which connectors should we include?
>>>>>>>>>>> 
>>>>>>>>>>> For both of the above, should we include YARN/Kubernetes?
>>>>>>>>>>> 
>>>>>>>>>>> I would opt for providing only the "slim" distribution as a
>>>>>>>>>>> release artifact.
>>>>>>>>>>> 
>>>>>>>>>>> 3. However, as I said, I think it's worth investigating how we can
>>>>>>>>>>> improve the user experience. What do you think of providing a
>>>>>>>>>>> tool, e.g. a shell script, that constructs a distribution based on
>>>>>>>>>>> the user's choices? I think that is also what Chesnay mentioned as
>>>>>>>>>>> "tooling to assemble custom distributions". In the end, the
>>>>>>>>>>> difference I see between a slim and a fat distribution is which
>>>>>>>>>>> jars we put into lib, right? The tool could have a few "screens":
>>>>>>>>>>> 
>>>>>>>>>>> 1. Which API are you interested in:
>>>>>>>>>>> a. SQL API
>>>>>>>>>>> b. DataStream API
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multi-choice]:
>>>>>>>>>>> a. Kafka
>>>>>>>>>>> b. Elasticsearch
>>>>>>>>>>> ...
>>>>>>>>>>> 
>>>>>>>>>>> 3. [SQL] Which catalog do you want to use?
>>>>>>>>>>> 
>>>>>>>>>>> ...
>>>>>>>>>>> 
>>>>>>>>>>> Such a tool would download all the dependencies from Maven and put
>>>>>>>>>>> them into the correct folders. In the future we can extend it with
>>>>>>>>>>> additional rules, e.g. kafka-0.9 cannot be chosen at the same time
>>>>>>>>>>> as kafka-universal, etc.
>>>>>>>>>>> 
>>>>>>>>>>> The benefit would be that the distribution we release could remain
>>>>>>>>>>> "slim", or we could even make it slimmer. I might be missing
>>>>>>>>>>> something here, though.
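As a rough illustration of the assembly idea above, such a script could map connector/format choices to Maven Central download URLs (the repository layout and the exact artifact names here are assumptions for illustration, not a definitive tool):

```shell
# Hedged sketch: turn a user's connector/format choices into Maven Central
# download URLs that the tool would fetch into lib/. Version and artifact
# names are illustrative assumptions.
FLINK_VERSION="1.10.0"
REPO="https://repo1.maven.org/maven2/org/apache/flink"
CHOICES="flink-csv flink-json flink-sql-connector-kafka_2.11"

URLS=""
for artifact in $CHOICES; do
  # Maven layout: <repo>/<artifactId>/<version>/<artifactId>-<version>.jar
  URLS="${URLS:+$URLS }$REPO/$artifact/$FLINK_VERSION/$artifact-$FLINK_VERSION.jar"
done
echo "$URLS"
```

A real tool would then download each URL (e.g. with curl) into the lib/ directory of the assembled distribution.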
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> 
>>>>>>>>>>> Dawid
>>>>>>>>>>> 
>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I want to reinforce my opinion from earlier: this is about
>>>>>>>>>>> improving the situation both for first-time users and for
>>>>>>>>>>> experienced users that want to use a Flink dist in production. The
>>>>>>>>>>> current Flink dist is too "thin" for first-time SQL users and too
>>>>>>>>>>> "fat" for production users, so we are serving no one properly with
>>>>>>>>>>> the current middle ground. That's why I think introducing those
>>>>>>>>>>> specialized "spins" of Flink dist would be good.
>>>>>>>>>>> 
>>>>>>>>>>> By the way, at some point in the future production users might not
>>>>>>>>>>> even need a Flink dist anymore. They should be able to have Flink
>>>>>>>>>>> as a dependency of their project (including the runtime) and then
>>>>>>>>>>> build an image from it for Kubernetes, or a fat jar for YARN.
>>>>>>>>>>> 
>>>>>>>>>>> Aljoscha
>>>>>>>>>>> 
>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi all,
>>>>>>>>>>> 
>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds of
>>>>>>>>>>> jobs may prefer different types of distribution:
>>>>>>>>>>> 
>>>>>>>>>>> For DataStream jobs, I think we may not want a fat distribution
>>>>>>>>>>> containing connectors, because users always need to depend on the
>>>>>>>>>>> connector in user code anyway, and it is easy to include the
>>>>>>>>>>> connector jar in the user lib. Fewer jars in lib means fewer class
>>>>>>>>>>> conflicts and problems.
>>>>>>>>>>> 
>>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>>>>>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
>>>>>>>>>>> user experience, it may be important for Flink not only to provide
>>>>>>>>>>> as many connector jars in the distribution as possible (especially
>>>>>>>>>>> the connectors and formats we have well documented), but also to
>>>>>>>>>>> provide a mechanism to load connectors according to the DDLs.
>>>>>>>>>>> 
>>>>>>>>>>> So I think it could be good to place connector/format jars in a
>>>>>>>>>>> directory like opt/connector, which would not affect jobs by
>>>>>>>>>>> default, and introduce a mechanism of dynamic discovery for SQL.
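A minimal sketch of such a discovery step (the opt/connector layout, the naming convention, and the demo jar name are all assumptions, not an existing Flink mechanism):

```shell
# Hedged sketch: given a connector type from a DDL's WITH clause, look for
# a matching jar under an opt/connector-style directory before failing.
CONNECTOR_DIR="${CONNECTOR_DIR:-$(mktemp -d)}"
touch "$CONNECTOR_DIR/flink-sql-connector-kafka_2.11-1.10.0.jar"  # demo fixture

wanted="kafka"            # e.g. parsed from 'connector.type' = 'kafka'
FOUND=""
for jar in "$CONNECTOR_DIR"/flink-sql-connector-"$wanted"*.jar; do
  [ -e "$jar" ] && FOUND="$jar"
done
echo "${FOUND:-no jar found for connector '$wanted'}"
```

Only the jar matching the DDL would be added to the job classpath, so unused connectors never affect other jobs.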
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Wenlong
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I am thinking about both "improving the first experience" and
>>>>>>>>>>> "improving the production experience".
>>>>>>>>>>> 
>>>>>>>>>>> I'm thinking about what the common usage modes of Flink are:
>>>>>>>>>>> streaming jobs using Kafka? Batch jobs using Hive?
>>>>>>>>>>> 
>>>>>>>>>>> Hive 1.2.1 dependencies are compatible with most Hive server
>>>>>>>>>>> versions, which is why Spark and Presto ship a built-in Hive 1.2.1
>>>>>>>>>>> dependency. Flink is currently mainly used for streaming, so let's
>>>>>>>>>>> not talk about Hive.
>>>>>>>>>>> 
>>>>>>>>>>> For streaming jobs, the jobs I have in mind (in terms of
>>>>>>>>>>> connectors) are:
>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used, along with
>>>>>>>>>>> the CSV and JSON formats.
>>>>>>>>>>> So suppose we provide a fat distribution:
>>>>>>>>>>> - with CSV and JSON;
>>>>>>>>>>> - with flink-kafka-universal and its Kafka dependencies;
>>>>>>>>>>> - with flink-jdbc.
>>>>>>>>>>> Using this fat distribution, most users can run their jobs well.
>>>>>>>>>>> (A JDBC driver jar is still required, but adding that is very
>>>>>>>>>>> natural.)
>>>>>>>>>>> Can these dependencies lead to conflicts? Only Kafka might, but if
>>>>>>>>>>> our goal is to use kafka-universal to support all Kafka versions,
>>>>>>>>>>> we can hope to cover the vast majority of users.
>>>>>>>>>>> 
>>>>>>>>>>> We don't want to put all jars into the fat distribution, only
>>>>>>>>>>> common ones with few conflicts. Of course, which jars go into the
>>>>>>>>>>> fat distribution is a matter of judgment.
>>>>>>>>>>> We have the opportunity to serve the majority of users while still
>>>>>>>>>>> leaving room for customization.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi,
>>>>>>>>>>> 
>>>>>>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>>>>>>> want to solve?":
>>>>>>>>>>> (1) improve the first experience, or (2) improve the production
>>>>>>>>>>> experience?
>>>>>>>>>>> 
>>>>>>>>>>> As far as I can see from the above discussion, what we want to
>>>>>>>>>>> solve is the "first experience".
>>>>>>>>>>> And I think the slim distribution is still the best for
>>>>>>>>>>> production, because assembling jars is easier than excluding them
>>>>>>>>>>> and avoids potential class conflicts.
>>>>>>>>>>> 
>>>>>>>>>>> If we want to improve the "first experience", I think it makes
>>>>>>>>>>> sense to have a fat distribution to give users a smoother start.
>>>>>>>>>>> But I would like to call it a "playground distribution" or
>>>>>>>>>>> something like that, to explicitly distinguish it from the "slim
>>>>>>>>>>> production-purpose distribution".
>>>>>>>>>>> 
>>>>>>>>>>> The "playground distribution" can contain some widely used jars,
>>>>>>>>>>> like the universal kafka sql connector, the elasticsearch7 sql
>>>>>>>>>>> connector, avro, json, csv, etc.
>>>>>>>>>>> We could even provide a playground Docker image which contains the
>>>>>>>>>>> fat distribution, python3, and Hive.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>>>>>> 
>>>>>>>>>>> The simple reality is that no fat distribution we could provide
>>>>>>>>>>> would satisfy all use-cases, so why even try?
>>>>>>>>>>> If users commonly run into issues with certain jars, then maybe
>>>>>>>>>>> those should be added to the current distribution.
>>>>>>>>>>> 
>>>>>>>>>>> Personally, though, I still believe we should only distribute a
>>>>>>>>>>> slim version. I'd rather have users always add the required jars
>>>>>>>>>>> to the distribution than only when they go outside our "expected"
>>>>>>>>>>> use-cases.
>>>>>>>>>>> 
>>>>>>>>>>> Then we might finally address this issue properly, i.e., with
>>>>>>>>>>> tooling to assemble custom distributions and/or better error
>>>>>>>>>>> messages when Flink-provided extensions cannot be found.
>>>>>>>>>>> 
>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
>>>>>>>>>>> "slim" proposal though. I get the idea that we can make the slim
>>>>>>>>>>> one even more lightweight than the current distribution, but what
>>>>>>>>>>> about the "fat" one? Do you mean we would package all connectors
>>>>>>>>>>> and formats into it? I'm not sure that is feasible. For example,
>>>>>>>>>>> we can't put all versions of the Kafka and Hive connector jars
>>>>>>>>>>> into the lib directory, and we also might need Hadoop jars when
>>>>>>>>>>> using the filesystem connector to access data from HDFS.
>>>>>>>>>>> So my guess would be that we hand-pick some of the most frequently
>>>>>>>>>>> used connectors and formats for our "lib" directory, like the
>>>>>>>>>>> kafka, csv, and json ones mentioned above, and still leave other
>>>>>>>>>>> connectors out.
>>>>>>>>>>> If that is the case, then why not just provide this one
>>>>>>>>>>> distribution to users? I'm not sure I see the benefit of providing
>>>>>>>>>>> another super "slim" distribution (we would have to pay some cost
>>>>>>>>>>> to maintain another suite of distributions).
>>>>>>>>>>> 
>>>>>>>>>>> What do you think?
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Big +1.
>>>>>>>>>>> 
>>>>>>>>>>> I like "fat" and "slim".
>>>>>>>>>>> 
>>>>>>>>>>> For csv and json, like Jark said, they are quite small and don't
>>>>>>>>>>> have other dependencies. They are important to the Kafka
>>>>>>>>>>> connector, and to the upcoming file system connector too.
>>>>>>>>>>> So can we put them into both "fat" and "slim"? They're so
>>>>>>>>>>> important, and so lightweight.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Big +1.
>>>>>>>>>>> This will improve the user experience (especially for new Flink
>>>>>>>>>>> users). We have answered so many questions about "class not found".
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Godfrey
>>>>>>>>>>> 
>>>>>>>>>>> Dian Fu <dian0511...@gmail.com> wrote on Wed, Apr 15, 2020 at 4:30 PM:
>>>>>>>>>>> 
>>>>>>>>>>> +1 to this proposal.
>>>>>>>>>>> 
>>>>>>>>>>> Missing connector jars are also a big problem for PyFlink users.
>>>>>>>>>>> Currently, after a Python user has installed PyFlink using `pip`,
>>>>>>>>>>> they have to manually copy the connector fat jars into the PyFlink
>>>>>>>>>>> installation directory for the connectors to be usable when
>>>>>>>>>>> running jobs locally. This process is very confusing for users and
>>>>>>>>>>> hurts the experience a lot.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dian
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> +1 to the proposal. I also found the "download additional jars"
>>>>>>>>>>> step really tedious when preparing webinars.
>>>>>>>>>>> 
>>>>>>>>>>> At the least, I think flink-csv and flink-json should be in the
>>>>>>>>>>> distribution; they are quite small and don't have other
>>>>>>>>>>> dependencies.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>> 
>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
>>>>>>>>>>> these connectors, opt or lib?
>>>>>>>>>>> 
>>>>>>>>>>> Aljoscha Krettek <aljos...@apache.org> wrote on Wed, Apr 15, 2020 at 3:30 PM:
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>> 
>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
>>>>>>>>>>> distribution. The motivation is that there is friction for
>>>>>>>>>>> SQL/Table API users who want to use Table connectors that are not
>>>>>>>>>>> in the current Flink distribution. For these users the workflow is
>>>>>>>>>>> currently roughly:
>>>>>>>>>>> 
>>>>>>>>>>> - download Flink dist
>>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
>>>>>>>>>>> - run SQL client or program
>>>>>>>>>>> - decrypt error message and research the solution
>>>>>>>>>>> - download additional connector jars
>>>>>>>>>>> - program works correctly
>>>>>>>>>>> 
>>>>>>>>>>> I realize that this can be made to work, but if every SQL user has
>>>>>>>>>>> this as their first experience, that doesn't seem good to me.
>>>>>>>>>>> 
>>>>>>>>>>> My proposal is to provide two versions of the Flink distribution
>>>>>>>>>>> in the future, "fat" and "slim" (names to be discussed):
>>>>>>>>>>> 
>>>>>>>>>>> - slim would be even trimmer than today's distribution
>>>>>>>>>>> - fat would contain a lot of convenience connectors (which ones is
>>>>>>>>>>> yet to be determined)
>>>>>>>>>>> 
>>>>>>>>>>> And yes, I realize that there are already more dimensions of Flink
>>>>>>>>>>> releases (Scala version and Java version).
>>>>>>>>>>> 
>>>>>>>>>>> For background, our current Flink dist has these in the opt
>>>>>>>>>>> directory:
>>>>>>>>>>> 
>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>>>> 
>>>>>>>>>>> The current Flink dist is 267M. If we removed everything from opt,
>>>>>>>>>>> we would go down to 126M. I would recommend this, because the
>>>>>>>>>>> large majority of the files in opt are probably unused.
>>>>>>>>>>> 
>>>>>>>>>>> What do you think?
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Aljoscha
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>> 
>>>>>>>>>>> Jeff Zhang
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> Best, Jingsong Lee
>>>> 
>>> 
>>> 
>>> --
>>> Best, Jingsong Lee
>> 
> 
> 
> -- 
> 
> Best,
> Benchao Li
