As a reference for a nice first experience I had, take a look at https://code.quarkus.io/. You reach this page after clicking "Start Coding" on the project homepage.

Rafi
On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> wrote:

I'm not saying pre-bundling some jars will make this problem go away, and you're right that it only hides the problem for some users. But what if this solution can hide the problem for 90% of users? Wouldn't that be good enough for us to try?

Regarding "would users following instructions really be such a big problem?": I'm afraid yes. Otherwise I wouldn't have answered such questions at least a dozen times, and I wouldn't see them coming up from time to time. During some periods, I even saw such questions every day.

Best,
Kurt

On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org> wrote:

The problem with having a distribution with "popular" stuff is that it doesn't really *solve* a problem, it just hides it for users who fall into these particular use-cases. Move outside of them and you once again run into the exact same problems outlined.

This is exactly why I like the tooling approach; you have to deal with it from the start, and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem? I would expect that users generally know *what* they need, just not necessarily how it is assembled correctly (where to get which jar, which directory to put it in). It seems like these are exactly the problems this would solve? I just don't see how moving a jar corresponding to some feature from opt to some directory (lib/plugins) is less error-prone than just selecting the feature and having the tool handle the rest.

As for re-distributions, it depends on the form that the tool would take. It could be an application that runs locally and works against Maven Central (note: not necessarily *using* Maven); this should work in China, no?

A web tool would of course be fancy, but I don't know how feasible this is with the ASF infrastructure. You wouldn't be able to mirror the distribution, so the load can't be distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case oriented distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution though. The concern I have with this approach is that the final generated distribution is kind of non-deterministic. We might generate too many different combinations when users try to package different types of connectors, formats, and maybe even Hadoop releases. As far as I can tell, most open source and Apache projects only release a few pre-defined distributions, which most users are already familiar with and which are thus hard to change, IMO. I have also seen cases where users re-distribute the release package because of the unstable network to the Apache website from China. With the web tool solution, I don't think this kind of re-distribution would be possible anymore.

In the meantime, I also have a concern that we will fall into our trap again if we try to offer this smart & flexible solution, because it needs users to cooperate with such a mechanism. It's exactly the situation we currently fell into:
1. We offered a smart solution.
2. We hoped users would follow the correct instructions.
3. Everything would work as expected if users followed the right instructions.

In reality, I suspect not all users will do the second step correctly. And for new users who are only trying to have a quick experience with Flink, I would bet most will do it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, and provide a distribution which has some popular built-in jars.
2. Only provide a distribution which has some popular built-in jars.

If we are trying to reduce the number of distributions we release, I would prefer 2 over 1.

Best,
Kurt
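For illustration, here is a minimal sketch of the kind of local tool Chesnay describes above, one that works against Maven Central without using Maven itself. The artifact coordinates and target path are hypothetical examples, not an agreed design; only the Maven Central repository layout is a known fact.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Sketch: download a connector jar from Maven Central into a Flink lib/ directory. */
public class ConnectorDownloader {

    // Maven Central layout: groupId (dots -> slashes)/artifactId/version/artifactId-version.jar
    static URL centralUrl(String group, String artifact, String version) throws Exception {
        return new URL(String.format(
                "https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar",
                group.replace('.', '/'), artifact, version, artifact, version));
    }

    static void download(String group, String artifact, String version, Path libDir)
            throws Exception {
        Files.createDirectories(libDir);
        Path target = libDir.resolve(artifact + "-" + version + ".jar");
        try (InputStream in = centralUrl(group, artifact, version).openStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println("Downloaded " + target);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical example: fetch the universal Kafka SQL connector into ./flink/lib
        download("org.apache.flink", "flink-sql-connector-kafka_2.11", "1.10.0",
                Paths.get("flink", "lib"));
    }
}
```

Such a tool runs entirely on the user's machine and downloads only the selected jars, which would sidestep the mirroring and re-distribution concerns raised for a web tool.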
On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution. Ideally, we would also have a nice web tool for the website which generates the corresponding distribution for download.

To get things started, we could begin with only supporting downloading/creating the "fat" version with the script. The fat version would then consist of the slim distribution plus whatever we deem important for new users to get started.

Cheers,
Till

On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to combine their distribution manually. In such scenarios it is important to understand the interconnections. Personally, I'd expect the slimmest possible distribution, which I can extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible useful combinations is already big. Do we want to have a distribution for:

- SQL users: which connectors should we include? Should we include Hive? Which other catalogs?
- DataStream users: which connectors should we include?
- For both of the above, should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices? I think that is also what Chesnay mentioned as "tooling to assemble custom distributions". In the end, the difference between a slim and a fat distribution, as I see it, is which jars we put into lib, right? It could have a few "screens":

1. Which API are you interested in?
   a. SQL API
   b. DataStream API

2. [SQL] Which connectors do you want to use? [multichoice]
   a. Kafka
   b. Elasticsearch
   ...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them into the correct folder. In the future we could extend it with additional rules, e.g. that kafka-0.9 cannot be chosen at the same time as kafka-universal, etc.

The benefit is that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here, though.

Best,
Dawid
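A rough sketch of how such "screens" could map user choices to artifacts to fetch. The choice-to-artifact table and all artifact names here are illustrative assumptions; a real tool would maintain a curated mapping and then download each selection (e.g. with the downloader sketched earlier) into lib/.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

/** Sketch: interactive "screens" that map a user's choices to connector artifacts. */
public class DistAssembler {

    // Illustrative choice -> Maven artifactId mapping (hypothetical entries).
    static final Map<String, String> SQL_CONNECTORS = new LinkedHashMap<>();
    static {
        SQL_CONNECTORS.put("kafka", "flink-sql-connector-kafka_2.11");
        SQL_CONNECTORS.put("elasticsearch7", "flink-sql-connector-elasticsearch7_2.11");
        SQL_CONNECTORS.put("jdbc", "flink-jdbc_2.11");
    }

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);

        System.out.println("Which API are you interested in? [sql/datastream]");
        String api = in.nextLine().trim();

        List<String> selected = new ArrayList<>();
        if (api.equals("sql")) {
            System.out.println("Which connectors do you want? " + SQL_CONNECTORS.keySet()
                    + " (comma-separated)");
            for (String choice : in.nextLine().split(",")) {
                String artifact = SQL_CONNECTORS.get(choice.trim());
                if (artifact == null) {
                    System.out.println("Unknown connector: " + choice.trim());
                } else {
                    selected.add(artifact);
                }
            }
        }

        // A real tool would validate exclusion rules (e.g. kafka-0.9 vs. kafka-universal)
        // and then download each artifact into lib/.
        selected.forEach(a -> System.out.println("Would fetch " + a + " into lib/"));
    }
}
```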
> > > > Best, > > > > Dawdi > > > > On 16/04/2020 11:02, Aljoscha Krettek wrote: > > > > I want to reinforce my opinion from earlier: This is about improving > > the situation both for first-time users and for experienced users that > > want to use a Flink dist in production. The current Flink dist is too > > "thin" for first-time SQL users and it is too "fat" for production > > users, that is where serving no-one properly with the current > > middle-ground. That's why I think introducing those specialized > > "spins" of Flink dist would be good. > > > > By the way, at some point in the future production users might not > > even need to get a Flink dist anymore. They should be able to have > > Flink as a dependency of their project (including the runtime) and > > then build an image from this for Kubernetes or a fat jar for YARN. > > > > Aljoscha > > > > On 15.04.20 18:14, wenlong.lwl wrote: > > > > Hi all, > > > > Regarding slim and fat distributions, I think different kinds of jobs > > may > > prefer different type of distribution: > > > > For DataStream job, I think we may not like fat distribution > > > > containing > > > > connectors because user would always need to depend on the connector > > > > in > > > > user code, it is easy to include the connector jar in the user lib. > > > > Less > > > > jar in lib means less class conflicts and problems. > > > > For SQL job, I think we are trying to encourage user to user pure > > sql(DDL + > > DML) to construct their job, In order to improve user experience, It > > may be > > important for flink, not only providing as many connector jar in > > distribution as possible especially the connector and format we have > > well > > documented, but also providing an mechanism to load connectors > > according > > to the DDLs, > > > > So I think it could be good to place connector/format jars in some > > dir like > > opt/connector which would not affect jobs by default, and introduce a > > mechanism of dynamic discovery for SQL. > > > > Best, > > Wenlong > > > > On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> < > jingsongl...@gmail.com> > > wrote: > > > > > > Hi, > > > > I am thinking both "improve first experience" and "improve production > > experience". > > > > I'm thinking about what's the common mode of Flink? > > Streaming job use Kafka? Batch job use Hive? > > > > Hive 1.2.1 dependencies can be compatible with most of Hive server > > versions. So Spark and Presto have built-in Hive 1.2.1 dependency. > > Flink is currently mainly used for streaming, so let's not talk > > about hive. > > > > For streaming jobs, first of all, the jobs in my mind is (related to > > connectors): > > - ETL jobs: Kafka -> Kafka > > - Join jobs: Kafka -> DimJDBC -> Kafka > > - Aggregation jobs: Kafka -> JDBCSink > > So Kafka and JDBC are probably the most commonly used. Of course, > > > > also > > > > includes CSV, JSON's formats. > > So when we provide such a fat distribution: > > - With CSV, JSON. > > - With flink-kafka-universal and kafka dependencies. > > - With flink-jdbc. > > Using this fat distribution, most users can run their jobs well. > > > > (jdbc > > > > driver jar required, but this is very natural to do) > > Can these dependencies lead to kinds of conflicts? Only Kafka may > > > > have > > > > conflicts, but if our goal is to use kafka-universal to support all > > Kafka > > versions, it is hopeful to target the vast majority of users. > > > > We don't want to plug all jars into the fat distribution. 
On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi,

I am thinking about both "improving the first experience" and "improving the production experience".

I'm thinking about what the common mode of Flink is. Streaming jobs use Kafka? Batch jobs use Hive?

Hive 1.2.1 dependencies are compatible with most Hive server versions, which is why Spark and Presto have a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs I have in mind are (in terms of connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink

So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats. So we could provide a fat distribution:
- with CSV and JSON;
- with flink-kafka-universal and Kafka dependencies;
- with flink-jdbc.

Using this fat distribution, most users can run their jobs well (a JDBC driver jar is still required, but that is very natural to provide). Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, we can hope to cover the vast majority of users.

We don't want to put all jars into the fat distribution; only common ones with few conflicts. Of course, which jars go into the fat distribution is a matter of consideration. We have the opportunity to help the majority of users while also leaving opportunities for customization.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want to solve?": (1) improve the first experience, or (2) improve the production experience?

As far as I can see from the above discussion, what we want to solve is the first experience. And I think the slim jar is still the best distribution for production, because it's easier to assemble jars than to exclude jars, and it avoids potential class conflicts.

If we want to improve the first experience, I think it makes sense to have a fat distribution to give users a smoother first experience. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the slim production-purpose distribution. The playground distribution could contain some widely used jars, like the universal Kafka SQL connector, the Elasticsearch 7 SQL connector, Avro, JSON, CSV, etc. We could even provide a playground Docker image which contains the fat distribution, Python 3, and Hive.

Best,
Jark

On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:

I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try? If users commonly run into issues with certain jars, then maybe those should be added to the current distribution.

Personally, though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases. Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not sure about the "fat" and "slim" proposal though. I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we might also need Hadoop jars when using the filesystem connector to access data from HDFS.

So my guess is that we would hand-pick some of the most frequently used connectors and formats for our lib directory, like the Kafka, CSV, and JSON ones mentioned above, and still leave some other connectors out. If this is the case, then why don't we just provide this distribution to users?
I'm not sure I get the benefit of providing another super "slim" jar (we would have to pay some cost to maintain another distribution).

What do you think?

Best,
Kurt

On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:

Big +1.

I like "fat" and "slim".

As for CSV and JSON: as Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming filesystem connector too. So can we include them in both "fat" and "slim"? They're that important, and they're that lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:

Big +1. This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".

Best,
Godfrey

On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <dian0511...@gmail.com> wrote:

+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable in locally-run jobs. This process is very confusing for users and hurts the experience a lot.

Regards,
Dian

On Apr 15, 2020 at 3:51 PM, Jark Wu <imj...@gmail.com> wrote:

+1 to the proposal. I also found the "download additional jars" step really tedious when preparing webinars.

At the very least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?

Best Regards,
Jeff Zhang

On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek <aljos...@apache.org> wrote:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution. For these users the workflow is currently roughly:

- download Flink dist
- configure CSV/Kafka/JSON connectors via configuration
- run SQL client or program
- decrypt the error message and research the solution
- download additional connector jars
- program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.
My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):

- slim would be even trimmer than today's distribution
- fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

- flink-azure-fs-hadoop-1.10.0.jar
- flink-cep-scala_2.12-1.10.0.jar
- flink-cep_2.12-1.10.0.jar
- flink-gelly-scala_2.12-1.10.0.jar
- flink-gelly_2.12-1.10.0.jar
- flink-metrics-datadog-1.10.0.jar
- flink-metrics-graphite-1.10.0.jar
- flink-metrics-influxdb-1.10.0.jar
- flink-metrics-prometheus-1.10.0.jar
- flink-metrics-slf4j-1.10.0.jar
- flink-metrics-statsd-1.10.0.jar
- flink-oss-fs-hadoop-1.10.0.jar
- flink-python_2.12-1.10.0.jar
- flink-queryable-state-runtime_2.12-1.10.0.jar
- flink-s3-fs-hadoop-1.10.0.jar
- flink-s3-fs-presto-1.10.0.jar
- flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
- flink-sql-client_2.12-1.10.0.jar
- flink-state-processor-api_2.12-1.10.0.jar
- flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267M. If we removed everything from opt we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?

Best,
Aljoscha