I'm not so sure about the web tool solution, though. My concern with this
approach is that the final generated distribution is somewhat
non-deterministic. We might generate too many different combinations once
users try to package different types of connectors, formats, and maybe even
Hadoop releases. As far as I can tell, most open source and Apache projects
only release a few pre-defined distributions, which most users are already
familiar with and which are therefore hard to change, IMO. I have also seen
cases where users re-distribute the release package because of the unstable
network connection to the Apache website from China. With the web tool
solution, I don't think this kind of re-distribution would be possible
anymore.
In the meantime, I also have the concern that we will fall into our own trap
again if we try to offer this smart and flexible solution, because it
requires users to cooperate with such a mechanism. It's exactly the
situation we currently find ourselves in:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything works as expected if users follow the right instructions.

In reality, I suspect not all users will do the second step correctly. And
for new users who are only trying to get a quick first experience with
Flink, I would bet most will get it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, plus a
distribution that includes some popular built-in jars.
2. Only provide a distribution that includes some popular built-in jars.

If we are trying to reduce the number of distributions we release, I would
prefer 2 over 1.

Best,
Kurt

On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

> I think what Chesnay and Dawid proposed would be the ideal solution.
> Ideally, we would also have a nice web tool for the website which
> generates the corresponding distribution for download.
>
> To get things started, the script could initially only support
> downloading/creating the "fat" version. The fat version would then
> consist of the slim distribution plus whatever we deem important for new
> users to get started.
>
> Cheers,
> Till
>
> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
> <dwysakow...@apache.org> wrote:
>
> > Hi all,
> >
> > A few points from my side:
> >
> > 1. I like the idea of simplifying the experience for first-time users.
> > As for production use cases, I share Jark's opinion that there I would
> > expect users to assemble their distribution manually. In such
> > scenarios it is important to understand the interconnections.
> > Personally, I'd expect the slimmest possible distribution that I can
> > extend with whatever I need for my production scenario.
> >
> > 2. I think there is also the problem that the matrix of possibly
> > useful combinations is already big. Do we want to have a distribution
> > for:
> >
> > SQL users: which connectors should we include? Should we include Hive?
> > Which other catalogs?
> >
> > DataStream users: which connectors should we include?
> >
> > For both of the above, should we include YARN/Kubernetes?
> >
> > I would opt for providing only the "slim" distribution as a release
> > artifact.
> >
> > 3. However, as I said, I think it's worth investigating how we can
> > improve the user experience. What do you think of providing a tool,
> > e.g. a shell script, that constructs a distribution based on the
> > user's choice? I think that is also what Chesnay referred to as
> > "tooling to assemble custom distributions". In the end, the way I see
> > the difference between a slim and a fat distribution is simply which
> > jars we put into lib, right? The tool could have a few "screens":
> >
> > 1. Which API are you interested in?
> >    a. SQL API
> >    b. DataStream API
> >
> > 2. [SQL] Which connectors do you want to use? [multichoice]:
> >    a. Kafka
> >    b. Elasticsearch
> >    ...
> >
> > 3. [SQL] Which catalog do you want to use?
> >
> > ...
> >
> > Such a tool would download all the dependencies from Maven and put
> > them into the correct folder. In the future we could extend it with
> > additional rules, e.g. that kafka-0.9 cannot be chosen at the same
> > time as kafka-universal, etc.
> >
> > The benefit would be that the distribution we release could remain
> > "slim", or we could even make it slimmer. I might be missing something
> > here, though.
> >
> > Best,
> >
> > Dawid
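A minimal sketch of what the assembly tool Dawid describes could look like,
in Python rather than shell only for brevity. The script name, the
connector-to-Maven-coordinate mapping, and the artifact IDs/versions are
illustrative assumptions, not the real release coordinates:

    #!/usr/bin/env python3
    """assemble_dist.py -- hypothetical sketch of the assembly tool
    described above: ask the user which connectors they want and download
    the matching jars from Maven Central into the distribution's lib/
    folder."""
    import pathlib
    import urllib.request

    MAVEN = "https://repo1.maven.org/maven2"
    FLINK_VERSION = "1.10.0"

    # Illustrative mapping of choices to artifacts; the real artifact IDs
    # and versions would have to be taken from the Flink release.
    CONNECTORS = {
        "kafka": "flink-sql-connector-kafka_2.12",
        "elasticsearch": "flink-sql-connector-elasticsearch7_2.12",
    }

    def download(artifact, version, target_dir):
        jar = f"{artifact}-{version}.jar"
        url = f"{MAVEN}/org/apache/flink/{artifact}/{version}/{jar}"
        print(f"Fetching {url}")
        urllib.request.urlretrieve(url, str(target_dir / jar))

    if __name__ == "__main__":
        lib = pathlib.Path("flink-dist/lib")
        lib.mkdir(parents=True, exist_ok=True)
        # "Screen 2": which connectors do you want to use?
        chosen = input("Connectors (comma-separated, e.g. kafka,elasticsearch): ")
        for name in (c.strip() for c in chosen.split(",") if c.strip()):
            # Mutual-exclusion rules (kafka-0.9 vs. kafka-universal) would go here.
            download(CONNECTORS[name], FLINK_VERSION, lib)

A shell variant would do the same with curl against repo1.maven.org; the
interesting part is only the choice screens plus the mapping table, which
is also where the exclusion rules would live.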
> > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > > I want to reinforce my opinion from earlier: this is about improving
> > > the situation both for first-time users and for experienced users who
> > > want to use a Flink dist in production. The current Flink dist is too
> > > "thin" for first-time SQL users and too "fat" for production users,
> > > so we are serving no one properly with the current middle ground.
> > > That's why I think introducing those specialized "spins" of the Flink
> > > dist would be good.
> > >
> > > By the way, at some point in the future production users might not
> > > even need to get a Flink dist anymore. They should be able to have
> > > Flink as a dependency of their project (including the runtime) and
> > > then build an image from this for Kubernetes or a fat jar for YARN.
> > >
> > > Aljoscha
> > >
> > > On 15.04.20 18:14, wenlong.lwl wrote:
> > >> Hi all,
> > >>
> > >> Regarding slim and fat distributions, I think different kinds of
> > >> jobs may prefer different types of distribution:
> > >>
> > >> For DataStream jobs, I don't think we want a fat distribution
> > >> containing connectors, because users always need to depend on the
> > >> connector in user code anyway, and it is easy to include the
> > >> connector jar in the user lib. Fewer jars in lib means fewer class
> > >> conflicts and problems.
> > >>
> > >> For SQL jobs, I think we are trying to encourage users to use pure
> > >> SQL (DDL + DML) to construct their jobs. To improve the user
> > >> experience, it may be important for Flink not only to provide as
> > >> many connector jars in the distribution as possible (especially the
> > >> connectors and formats we have documented well), but also to
> > >> provide a mechanism that loads connectors according to the DDL.
> > >>
> > >> So I think it could be good to place connector/format jars in a
> > >> directory like opt/connector, which would not affect jobs by
> > >> default, and to introduce a mechanism of dynamic discovery for SQL
> > >> (a sketch of such a mechanism follows below).
> > >>
> > >> Best,
> > >> Wenlong
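A minimal sketch of the dynamic-discovery idea above, done as client-side
tooling rather than a runtime change. The opt/connector layout, the jar
naming scheme, and the regex over the DDL are all assumptions made for
illustration:

    #!/usr/bin/env python3
    """Hypothetical sketch: scan a SQL script for the connectors its DDL
    declares and pick the matching jars out of opt/connector."""
    import pathlib
    import re

    OPT_CONNECTOR = pathlib.Path("opt/connector")  # assumed layout, see above

    def connectors_in(sql_text):
        # Matches e.g. 'connector' = 'kafka' (or the older
        # 'connector.type') inside CREATE TABLE ... WITH (...) clauses.
        return set(re.findall(r"'connector(?:\.type)?'\s*=\s*'([\w-]+)'",
                              sql_text))

    def jars_for(script_path):
        wanted = connectors_in(script_path.read_text())
        # Assumes jars are named after the connector, e.g.
        # flink-sql-connector-kafka-*.jar.
        return [jar for jar in OPT_CONNECTOR.glob("*.jar")
                if any(name in jar.name for name in wanted)]

    if __name__ == "__main__":
        jars = jars_for(pathlib.Path("job.sql"))
        # The resulting list could then be handed to the SQL client,
        # e.g. via repeated -j/--jar options.
        print("\n".join(str(j) for j in jars))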
> > >> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I am thinking about both "improve the first experience" and
> > >>> "improve the production experience".
> > >>>
> > >>> I'm thinking about what the common modes of using Flink are:
> > >>> streaming jobs using Kafka? Batch jobs using Hive?
> > >>>
> > >>> The Hive 1.2.1 dependencies are compatible with most Hive server
> > >>> versions, which is why Spark and Presto ship a built-in Hive 1.2.1
> > >>> dependency. Flink is currently mainly used for streaming, so let's
> > >>> leave Hive aside.
> > >>>
> > >>> For streaming jobs, the jobs I have in mind are (in terms of
> > >>> connectors):
> > >>> - ETL jobs: Kafka -> Kafka
> > >>> - Join jobs: Kafka -> DimJDBC -> Kafka
> > >>> - Aggregation jobs: Kafka -> JDBCSink
> > >>> So Kafka and JDBC are probably the most commonly used, along with
> > >>> the CSV and JSON formats.
> > >>> So suppose we provide a fat distribution:
> > >>> - with CSV and JSON,
> > >>> - with flink-kafka-universal and its Kafka dependencies,
> > >>> - with flink-jdbc.
> > >>> Using this fat distribution, most users can run their jobs well.
> > >>> (A JDBC driver jar is still required, but that is very natural to
> > >>> add.)
> > >>> Can these dependencies lead to conflicts? Only Kafka might, but if
> > >>> our goal is for kafka-universal to support all Kafka versions, we
> > >>> can hope to cover the vast majority of users.
> > >>>
> > >>> We don't want to put every jar into the fat distribution, only the
> > >>> common ones with few conflicts. Of course, which jars go into the
> > >>> fat distribution is a matter for discussion.
> > >>> We have the opportunity to help the majority of users while still
> > >>> leaving room for customization.
> > >>>
> > >>> Best,
> > >>> Jingsong Lee
> > >>>
> > >>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> I think we should first reach a consensus on "what problem do we
> > >>>> want to solve?":
> > >>>> (1) improve the first experience, or (2) improve the production
> > >>>> experience?
> > >>>>
> > >>>> From the discussion above, I think what we want to solve is the
> > >>>> "first experience".
> > >>>> And I think the slim distribution is still the best for
> > >>>> production, because assembling jars is easier than excluding jars
> > >>>> and avoids potential class conflicts.
> > >>>>
> > >>>> If we want to improve the "first experience", I think it makes
> > >>>> sense to have a fat distribution that gives users a smoother
> > >>>> start.
> > >>>> But I would like to call it a "playground distribution" or
> > >>>> something like that, to explicitly set it apart from the slim,
> > >>>> production-purpose distribution.
> > >>>> The "playground distribution" can contain some widely used jars,
> > >>>> like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> > >>>> avro, json, csv, etc.
> > >>>> We could even provide a playground Docker image which contains
> > >>>> the fat distribution, Python 3, and Hive.
> > >>>>
> > >>>> Best,
> > >>>> Jark
> > >>>>
> > >>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler
> > >>>> <ches...@apache.org> wrote:
> > >>>>
> > >>>>> I don't see a lot of value in having multiple distributions.
> > >>>>>
> > >>>>> The simple reality is that no fat distribution we could provide
> > >>>>> would satisfy all use cases, so why even try?
> > >>>>> If users commonly run into issues with certain jars, then maybe
> > >>>>> those should be added to the current distribution.
> > >>>>>
> > >>>>> Personally, though, I still believe we should only distribute a
> > >>>>> slim version. I'd rather have users always add required jars to
> > >>>>> the distribution than only when they go outside our "expected"
> > >>>>> use cases.
> > >>>>> Then we might finally address this issue properly, i.e., tooling
> > >>>>> to assemble custom distributions and/or better error messages if
> > >>>>> Flink-provided extensions cannot be found.
> > >>>>>
> > >>>>> On 15/04/2020 15:23, Kurt Young wrote:
> > >>>>>> Regarding the specific solution, I'm not sure about the "fat"
> > >>>>>> and "slim" split, though. I get the idea that we can make the
> > >>>>>> slim one even more lightweight than the current distribution,
> > >>>>>> but what about the "fat" one? Do you mean that we would package
> > >>>>>> all connectors and formats into it? I'm not sure that is
> > >>>>>> feasible.
> > >>>>>> For example, we can't put all versions of the Kafka and Hive
> > >>>>>> connector jars into the lib directory, and we might also need
> > >>>>>> Hadoop jars when using the filesystem connector to access data
> > >>>>>> on HDFS.
> > >>>>>>
> > >>>>>> So my guess would be that we hand-pick some of the most
> > >>>>>> frequently used connectors and formats for our "lib" directory,
> > >>>>>> like the Kafka, CSV, and JSON ones mentioned above, and still
> > >>>>>> leave other connectors out.
> > >>>>>> If that is the case, then why not just provide this one
> > >>>>>> distribution to users? I'm not sure I see the benefit of
> > >>>>>> providing another super "slim" distribution (we would have to
> > >>>>>> pay some cost to maintain another suite of distributions).
> > >>>>>>
> > >>>>>> What do you think?
> > >>>>>>
> > >>>>>> Best,
> > >>>>>> Kurt
> > >>>>>>
> > >>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
> > >>>>>> <jingsongl...@gmail.com> wrote:
> > >>>>>>
> > >>>>>>> Big +1.
> > >>>>>>>
> > >>>>>>> I like "fat" and "slim".
> > >>>>>>>
> > >>>>>>> For CSV and JSON, as Jark said, they are quite small and don't
> > >>>>>>> have other dependencies. They are important to the Kafka
> > >>>>>>> connector, and important to the upcoming filesystem connector
> > >>>>>>> too.
> > >>>>>>> So can we include them in both "fat" and "slim"? They're that
> > >>>>>>> important, and they're that lightweight.
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Jingsong Lee
> > >>>>>>>
> > >>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he
> > >>>>>>> <godfre...@gmail.com> wrote:
> > >>>>>>>
> > >>>>>>>> Big +1.
> > >>>>>>>> This will improve the user experience (especially for new
> > >>>>>>>> Flink users).
> > >>>>>>>> We have answered so many questions about "class not found".
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Godfrey
> > >>>>>>>>
> > >>>>>>>> On Wed, Apr 15, 2020 at 4:30 PM Dian Fu
> > >>>>>>>> <dian0511...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> +1 to this proposal.
> > >>>>>>>>>
> > >>>>>>>>> Missing connector jars is also a big problem for PyFlink
> > >>>>>>>>> users. Currently, after Python users have installed PyFlink
> > >>>>>>>>> using `pip`, they have to manually copy the connector fat
> > >>>>>>>>> jars into the PyFlink installation directory if they want
> > >>>>>>>>> the connectors to be usable when running jobs locally (a
> > >>>>>>>>> sketch of this manual step is shown below). This process is
> > >>>>>>>>> very confusing for users and hurts the experience a lot.
> > >>>>>>>>>
> > >>>>>>>>> Regards,
> > >>>>>>>>> Dian
> > >>>>>>>>>
> > >>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> +1 to the proposal. I also found the "download additional
> > >>>>>>>>>> jars" step really cumbersome when preparing webinars.
> > >>>>>>>>>>
> > >>>>>>>>>> At the very least, I think flink-csv and flink-json should
> > >>>>>>>>>> be in the distribution; they are quite small and have no
> > >>>>>>>>>> other dependencies.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Jark
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Hi Aljoscha,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan
> > >>>>>>>>>>> to put these connectors: opt or lib?
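The manual fix-up Dian describes boils down to something like the following
sketch. It assumes the connector jar was downloaded beforehand and that a
pip-installed PyFlink keeps its jars in a lib/ folder next to the package
sources, which is worth verifying for the version at hand:

    #!/usr/bin/env python3
    """Sketch of the manual step for PyFlink users: copy a downloaded
    connector jar into the lib/ directory of a pip-installed PyFlink."""
    import pathlib
    import shutil

    import pyflink

    # Assumption: the pip-installed distribution keeps its jars under
    # site-packages/pyflink/lib.
    pyflink_lib = pathlib.Path(pyflink.__file__).parent / "lib"

    # Assumed to have been downloaded beforehand; name is illustrative.
    jar = pathlib.Path("flink-sql-connector-kafka_2.12-1.10.0.jar")
    shutil.copy(jar, pyflink_lib / jar.name)
    print(f"Copied {jar.name} into {pyflink_lib}")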
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 3:30 PM Aljoscha Krettek
> > >>>>>>>>>>> <aljos...@apache.org> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hi Everyone,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
> > >>>>>>>>>>>> distribution. The motivation is that there is friction
> > >>>>>>>>>>>> for SQL/Table API users who want to use Table connectors
> > >>>>>>>>>>>> that are not in the current Flink distribution. For these
> > >>>>>>>>>>>> users the workflow is currently roughly:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - download the Flink dist
> > >>>>>>>>>>>> - configure the csv/Kafka/json connectors in the
> > >>>>>>>>>>>>   configuration
> > >>>>>>>>>>>> - run the SQL client or program
> > >>>>>>>>>>>> - decipher the error message and research a solution
> > >>>>>>>>>>>> - download additional connector jars
> > >>>>>>>>>>>> - the program works correctly
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I realize that this can be made to work, but if every SQL
> > >>>>>>>>>>>> user has this as their first experience, that doesn't
> > >>>>>>>>>>>> seem good to me.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> My proposal is to provide two versions of the Flink
> > >>>>>>>>>>>> distribution in the future, "fat" and "slim" (names to be
> > >>>>>>>>>>>> discussed):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - slim would be even trimmer than today's distribution
> > >>>>>>>>>>>> - fat would contain a lot of convenience connectors
> > >>>>>>>>>>>>   (which ones is yet to be determined)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> And yes, I realize that there are already more dimensions
> > >>>>>>>>>>>> of Flink releases (Scala version and Java version).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For background, our current Flink dist has these in the
> > >>>>>>>>>>>> opt directory:
> > >>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
> > >>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
> > >>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
> > >>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
> > >>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The current Flink dist is 267M. If we removed everything
> > >>>>>>>>>>>> from opt we would go down to 126M. I would recommend
> > >>>>>>>>>>>> this, because the large majority of the files in opt are
> > >>>>>>>>>>>> probably unused.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What do you think?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Best,
> > >>>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best Regards
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jeff Zhang
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Best, Jingsong Lee
> > >>>
> > >>> --
> > >>> Best, Jingsong Lee
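For anyone who wants to check Aljoscha's 267M/126M numbers against an
unpacked distribution, the breakdown is easy to reproduce (a sketch; the
directory name is whatever the unpacked dist is called locally):

    #!/usr/bin/env python3
    """Sketch: size of an unpacked Flink distribution with and without
    opt/, to reproduce numbers like "267M total, 126M without opt"."""
    import pathlib

    def size_mb(path):
        # Sum the sizes of all regular files below the given directory.
        return sum(f.stat().st_size
                   for f in path.rglob("*") if f.is_file()) / 1024**2

    dist = pathlib.Path("flink-1.10.0")  # path to the unpacked distribution
    total = size_mb(dist)
    opt = size_mb(dist / "opt")
    print(f"total: {total:.0f}M, opt/: {opt:.0f}M, "
          f"without opt/: {total - opt:.0f}M")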