+1 for Jingsong's proposal to put flink-csv, flink-json and flink-avro under the lib/ directory. I have heard many SQL users (mostly newbies) complain about the out-of-box experience on the mailing list.

Best,
Leonard Xu

> On June 5, 2020, at 14:39, Benchao Li <libenc...@gmail.com> wrote:
>
> +1 to include them for sql-client by default;
> +0 to put into lib and exposed to all kinds of jobs, including DataStream.
>
> On Fri, June 5, 2020 at 2:31 PM, Danny Chan <yuzhao....@gmail.com> wrote:
>
>> +1, at least, we should keep an out of the box SQL-CLI, it's a very poor
>> experience to add such required format jars for SQL users.
>>
>> Best,
>> Danny Chan
>> On June 5, 2020 at 11:14 AM +0800, Jingsong Li <jingsongl...@gmail.com> wrote:
>>> Hi all,
>>>
>>> Considering that 1.11 will be released soon, what about my previous
>>> proposal? Put flink-csv, flink-json and flink-avro under lib.
>>> These three formats are very small and have no third-party dependencies,
>>> and they are widely used by table users.
>>>
>>> Best,
>>> Jingsong Lee
>>>
>>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>>
>>>> Thanks for your discussion.
>>>>
>>>> Sorry to start discussing another thing:
>>>>
>>>> The biggest problem I see is the variety of problems caused by users'
>>>> lack of format dependencies.
>>>> As Aljoscha said, these three formats are very small and have no
>>>> third-party dependencies, and they are widely used by table users.
>>>> Actually, we don't have any other built-in table formats now... In total
>>>> 151K...
>>>>
>>>> 73K flink-avro-1.10.0.jar
>>>> 36K flink-csv-1.10.0.jar
>>>> 42K flink-json-1.10.0.jar
>>>>
>>>> So, can we just put them into "lib/" or flink-table-uber?
>>>> It does not solve all problems and maybe it is independent of "fat" and
>>>> "slim", but it also improves usability.
>>>> What do you think? Any objections?
>>>>
>>>> Best,
>>>> Jingsong Lee
>>>>
>>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <ches...@apache.org>
>>>> wrote:
>>>>
>>>>> One downside would be that we're shipping more stuff when running on
>>>>> YARN for example, since the entire plugins directory is shipped by
>>>>> default.
>>>>>
>>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
>>>>>> @Aljoscha I think that is an interesting line of thinking. The swift-fs
>>>>>> may be rarely enough used to move it to an optional download.
>>>>>>
>>>>>> I would still drop two more thoughts:
>>>>>>
>>>>>> (1) Now that we have plugins support, is there a reason to have a
>>>>>> metrics reporter or file system in /opt instead of /plugins? They don't
>>>>>> spoil the class path any more.
>>>>>>
>>>>>> (2) I can imagine there still being a desire to have a "minimal" docker
>>>>>> file, for users that want to keep the container images as small as
>>>>>> possible, to speed up deployment. It is fine if that would not be the
>>>>>> default, though.
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <aljos...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I think having such tools and/or tailor-made distributions can be nice
>>>>>>> but I also think the discussion is missing the main point: The initial
>>>>>>> observation/motivation is that apparently a lot of users (Kurt and I
>>>>>>> talked about this) on the Chinese DingTalk support groups, and other
>>>>>>> support channels have problems when first using the SQL client because
>>>>>>> of these missing connectors/formats. For these, having additional tools
>>>>>>> would not solve anything because they would also not take that extra
>>>>>>> step. I think that even tiny friction should be avoided because the
>>>>>>> annoyance from it accumulates across the (hopefully) many users that we
>>>>>>> want to have.
>>>>>>>
>>>>>>> Maybe we should take a step back from discussing the "fat"/"slim" idea
>>>>>>> and instead think about the composition of the current dist.
>>>>>>> As mentioned we have these jars in opt/:
>>>>>>>
>>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
>>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
>>>>>>> 180K flink-cep_2.11-1.10.0.jar
>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
>>>>>>> 12K flink-metrics-statsd-1.10.0.jar
>>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
>>>>>>> 28M flink-python_2.11-1.10.0.jar
>>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
>>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
>>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
>>>>>>> 160M opt
>>>>>>>
>>>>>>> The "filesystem" connectors are the heavy hitters, there.
>>>>>>>
>>>>>>> I downloaded most of the SQL connectors/formats and this is what I got:
>>>>>>>
>>>>>>> 73K flink-avro-1.10.0.jar
>>>>>>> 36K flink-csv-1.10.0.jar
>>>>>>> 55K flink-hbase_2.11-1.10.0.jar
>>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
>>>>>>> 42K flink-json-1.10.0.jar
>>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>>>>> 24M sql-connectors-formats
>>>>>>>
>>>>>>> We could just add these to the Flink distribution without blowing it up
>>>>>>> by much. We could drop any of the existing "filesystem" connectors from
>>>>>>> opt and add the SQL connectors/formats and not change the size of Flink
>>>>>>> dist. So maybe we should do that instead?
>>>>>>>
>>>>>>> We would need some tooling for the sql-client shell script to pick up
>>>>>>> the connectors/formats from opt/ because we don't want to add them to
>>>>>>> lib/. We're already doing that for finding the flink-sql-client jar,
>>>>>>> which is also not in lib/.
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I like the idea of a web tool to assemble a fat distribution. And
>>>>>>>> https://code.quarkus.io/ looks very nice.
>>>>>>>> All the users need to do is select what they need (I think this step
>>>>>>>> can't be omitted anyway).
>>>>>>>> We can also provide a default fat distribution on the web which
>>>>>>>> selects some popular connectors by default.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>>
>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> As a reference for a nice first-experience I had, take a look at
>>>>>>>>> https://code.quarkus.io/
>>>>>>>>> You reach this page after you click "Start Coding" at the project
>>>>>>>>> homepage.
>>>>>>>>> Rafi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go away,
>>>>>>>>>> and you're right that it only hides the problem for some users. But
>>>>>>>>>> what if this solution can hide the problem for 90% of users?
>>>>>>>>>> Wouldn't that be good enough for us to try?
>>>>>>>>>>
>>>>>>>>>> Regarding "would users following instructions really be such a big
>>>>>>>>>> problem?":
>>>>>>>>>> I'm afraid yes. Otherwise I wouldn't have answered such questions at
>>>>>>>>>> least a dozen times and I wouldn't see such questions coming up from
>>>>>>>>>> time to time. During some periods, I even saw such questions every
>>>>>>>>>> day.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kurt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The problem with having a distribution with "popular" stuff is that it
>>>>>>>>>>> doesn't really *solve* a problem, it just hides it for users who fall
>>>>>>>>>>> into these particular use-cases.
>>>>>>>>>>> Move out of it and you once again run into the exact same problems
>>>>>>>>>>> outlined.
>>>>>>>>>>> This is exactly why I like the tooling approach; you have to deal with
>>>>>>>>>>> it from the start and transitioning to a custom use-case is easier.
>>>>>>>>>>>
>>>>>>>>>>> Would users following instructions really be such a big problem?
>>>>>>>>>>> I would expect that users generally know *what* they need, just not
>>>>>>>>>>> necessarily how it is assembled correctly (where to get which jar,
>>>>>>>>>>> which directory to put it in).
>>>>>>>>>>> It seems like these are exactly the problems this would solve?
>>>>>>>>>>> I just don't see how moving a jar corresponding to some feature from
>>>>>>>>>>> opt to some directory (lib/plugins) is less error-prone than just
>>>>>>>>>>> selecting the feature and having the tool handle the rest.
>>>>>>>>>>>
>>>>>>>>>>> As for re-distributions, it depends on the form that the tool would
>>>>>>>>>>> take. It could be an application that runs locally and works against
>>>>>>>>>>> maven central (note: not necessarily *using* maven); this should work
>>>>>>>>>>> in China, no?
>>>>>>>>>>>
>>>>>>>>>>> A web tool would of course be fancy, but I don't know how feasible
>>>>>>>>>>> this is with the ASF infrastructure.
>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load can't be
>>>>>>>>>>> distributed. I doubt INFRA would like this.
>>>>>>>>>>>
>>>>>>>>>>> Note that third-parties could also start distributing use-case
>>>>>>>>>>> oriented distributions, which would be perfectly fine as far as I'm
>>>>>>>>>>> concerned.
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>>>>>
>>>>>>>>>>> I'm not so sure about the web tool solution though. The concern I have
>>>>>>>>>>> for this approach is that the final generated distribution is kind of
>>>>>>>>>>> non-deterministic. We might generate too many different combinations
>>>>>>>>>>> when users try to package different types of connectors, formats, and
>>>>>>>>>>> maybe even hadoop releases. As far as I can tell, most open source
>>>>>>>>>>> projects and apache projects will only release some pre-defined
>>>>>>>>>>> distributions, which most users are already familiar with, thus hard
>>>>>>>>>>> to change IMO. And I have also seen cases where users try to
>>>>>>>>>>> re-distribute the release package, because of the unstable network
>>>>>>>>>>> connection to the apache website from China. With the web tool
>>>>>>>>>>> solution, I don't think this kind of re-distribution would be possible
>>>>>>>>>>> anymore.
>>>>>>>>>>>
>>>>>>>>>>> In the meantime, I also have a concern that we will fall back into our
>>>>>>>>>>> trap again if we try to offer this smart & flexible solution, because
>>>>>>>>>>> it needs users to cooperate with such a mechanism. It's exactly the
>>>>>>>>>>> situation we are currently in:
>>>>>>>>>>> 1. We offered a smart solution.
>>>>>>>>>>> 2. We hope users will follow the correct instructions.
>>>>>>>>>>> 3. Everything will work as expected if users followed the right
>>>>>>>>>>> instructions.
>>>>>>>>>>>
>>>>>>>>>>> In reality, I suspect not all users will do the second step correctly.
>>>>>>>>>>> And for new users who are only trying to have a quick experience with
>>>>>>>>>>> Flink, I would bet most users will do it wrong.
>>>>>>>>>>>
>>>>>>>>>>> So, my proposal would be one of the following 2 options:
>>>>>>>>>>> 1. Provide a slim distribution for advanced product users and provide
>>>>>>>>>>> a distribution which will have some popular builtin jars.
>>>>>>>>>>> 2. Only provide a distribution which will have some popular builtin
>>>>>>>>>>> jars.
>>>>>>>>>>> If we are trying to reduce the distributions we release, I would
>>>>>>>>>>> prefer 2 > 1.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal solution.
>>>>>>>>>>> Ideally, we would also have a nice web tool for the website which
>>>>>>>>>>> generates the corresponding distribution for download.
>>>>>>>>>>>
>>>>>>>>>>> To get things started we could start with only supporting
>>>>>>>>>>> downloading/creating the "fat" version with the script. The fat
>>>>>>>>>>> version would then consist of the slim distribution and whatever we
>>>>>>>>>>> deem important for new users to get started.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
>>>>>>>>>>> <dwysakow...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Few points from my side:
>>>>>>>>>>>
>>>>>>>>>>> 1. I like the idea of simplifying the experience for first time users.
>>>>>>>>>>> As for production use cases I share Jark's opinion that in this case I
>>>>>>>>>>> would expect users to combine their distribution manually. I think in
>>>>>>>>>>> such scenarios it is important to understand interconnections.
>>>>>>>>>>> Personally I'd expect the slimmest possible distribution that I can
>>>>>>>>>>> extend further with what I need in my production scenario.
>>>>>>>>>>>
>>>>>>>>>>> 2. I think there is also the problem that the matrix of possible
>>>>>>>>>>> combinations that can be useful is already big. Do we want to have a
>>>>>>>>>>> distribution for:
>>>>>>>>>>>
>>>>>>>>>>> SQL users: which connectors should we include? should we include
>>>>>>>>>>> hive? which other catalog?
>>>>>>>>>>>
>>>>>>>>>>> DataStream users: which connectors should we include?
>>>>>>>>>>>
>>>>>>>>>>> For both of the above should we include yarn/kubernetes?
>>>>>>>>>>>
>>>>>>>>>>> I would opt for providing only the "slim" distribution as a release
>>>>>>>>>>> artifact.
>>>>>>>>>>>
>>>>>>>>>>> 3. However, as I said I think it's worth investigating how we can
>>>>>>>>>>> improve the user experience. What do you think of providing a tool,
>>>>>>>>>>> e.g. a shell script, that constructs a distribution based on the
>>>>>>>>>>> user's choices? I think that was also what Chesnay mentioned as
>>>>>>>>>>> "tooling to assemble custom distributions". In the end, how I see the
>>>>>>>>>>> difference between a slim and fat distribution is which jars we put
>>>>>>>>>>> into the lib, right? It could have a few "screens".
>>>>>>>>>>>
>>>>>>>>>>> 1. Which API are you interested in:
>>>>>>>>>>> a. SQL API
>>>>>>>>>>> b. DataStream API
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
>>>>>>>>>>> a. Kafka
>>>>>>>>>>> b. Elasticsearch
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> 3. [SQL] Which catalog do you want to use?
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> Such a tool would download all the dependencies from maven and put
>>>>>>>>>>> them into the correct folder. In the future we can extend it with
>>>>>>>>>>> additional rules e.g. kafka-0.9 cannot be chosen at the same time with
>>>>>>>>>>> kafka-universal etc.
>>>>>>>>>>>
>>>>>>>>>>> The benefit of it would be that the distribution that we release could
>>>>>>>>>>> remain "slim" or we could even make it slimmer. I might be missing
>>>>>>>>>>> something here though.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Dawid
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>>>>>
>>>>>>>>>>> I want to reinforce my opinion from earlier: This is about improving
>>>>>>>>>>> the situation both for first-time users and for experienced users that
>>>>>>>>>>> want to use a Flink dist in production. The current Flink dist is too
>>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for production
>>>>>>>>>>> users, that is, we're serving no-one properly with the current
>>>>>>>>>>> middle-ground. That's why I think introducing those specialized
>>>>>>>>>>> "spins" of Flink dist would be good.
>>>>>>>>>>>
>>>>>>>>>>> By the way, at some point in the future production users might not
>>>>>>>>>>> even need to get a Flink dist anymore. They should be able to have
>>>>>>>>>>> Flink as a dependency of their project (including the runtime) and
>>>>>>>>>>> then build an image from this for Kubernetes or a fat jar for YARN.
>>>>>>>>>>>
>>>>>>>>>>> Aljoscha
>>>>>>>>>>>
>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds of jobs
>>>>>>>>>>> may prefer different types of distribution:
>>>>>>>>>>>
>>>>>>>>>>> For DataStream jobs, I think we may not like a fat distribution
>>>>>>>>>>> containing connectors because users would always need to depend on the
>>>>>>>>>>> connector in user code, and it is easy to include the connector jar in
>>>>>>>>>>> the user lib. Fewer jars in lib means fewer class conflicts and
>>>>>>>>>>> problems.
>>>>>>>>>>>
>>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>>>>>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve user
>>>>>>>>>>> experience, it may be important for Flink not only to provide as many
>>>>>>>>>>> connector jars in the distribution as possible, especially the
>>>>>>>>>>> connectors and formats we have well documented, but also to provide a
>>>>>>>>>>> mechanism to load connectors according to the DDLs.
>>>>>>>>>>>
>>>>>>>>>>> So I think it could be good to place connector/format jars in some
>>>>>>>>>>> dir like opt/connector which would not affect jobs by default, and
>>>>>>>>>>> introduce a mechanism of dynamic discovery for SQL.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Wenlong
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am thinking about both "improve first experience" and "improve
>>>>>>>>>>> production experience".
>>>>>>>>>>>
>>>>>>>>>>> I'm thinking about what's the common mode of Flink?
>>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
>>>>>>>>>>>
>>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most Hive server
>>>>>>>>>>> versions. So Spark and Presto have a built-in Hive 1.2.1 dependency.
>>>>>>>>>>> Flink is currently mainly used for streaming, so let's not talk
>>>>>>>>>>> about hive.
>>>>>>>>>>>
>>>>>>>>>>> For streaming jobs, first of all, the jobs I have in mind are (related
>>>>>>>>>>> to connectors):
>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of course, this
>>>>>>>>>>> also includes the CSV and JSON formats.
>>>>>>>>>>> So when we provide such a fat distribution:
>>>>>>>>>>> - With CSV, JSON.
>>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>>>>>>> - With flink-jdbc.
>>>>>>>>>>> Using this fat distribution, most users can run their jobs well.
>>>>>>>>>>> (jdbc driver jar required, but this is very natural to do)
>>>>>>>>>>> Can these dependencies lead to kinds of conflicts? Only Kafka may have
>>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to support all
>>>>>>>>>>> Kafka versions, we can hope to cover the vast majority of users.
>>>>>>>>>>>
>>>>>>>>>>> We don't want to plug all jars into the fat distribution, only the
>>>>>>>>>>> ones that are common and cause few conflicts. Of course, which jars to
>>>>>>>>>>> put into the fat distribution is a matter of consideration.
>>>>>>>>>>> We have the opportunity to help the majority of users, while still
>>>>>>>>>>> leaving room for customization.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>>>>>>> want to solve?"
>>>>>>>>>>> (1) improve first experience? or (2) improve production experience?
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see, with the above discussion, I think what we
>>>>>>>>>>> want to solve is the "first experience".
>>>>>>>>>>> And I think the slim jar is still the best distribution for
>>>>>>>>>>> production, because it's easier to assemble jars than to exclude jars,
>>>>>>>>>>> and it can avoid potential class conflicts.
>>>>>>>>>>>
>>>>>>>>>>> If we want to improve "first experience", I think it makes sense to
>>>>>>>>>>> have a fat distribution to give users a smoother first experience.
>>>>>>>>>>> But I would like to call it "playground distribution" or something
>>>>>>>>>>> like that to explicitly distinguish it from the "slim
>>>>>>>>>>> production-purpose distribution".
>>>>>>>>>>>
>>>>>>>>>>> The "playground distribution" can contain some widely used jars, like
>>>>>>>>>>> universal-kafka-sql-connector, elasticsearch7-sql-connector, avro,
>>>>>>>>>>> json, csv, etc.
>>>>>>>>>>> We can even provide a playground docker which may contain the fat
>>>>>>>>>>> distribution, python3, and hive.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I don't see a lot of value in having multiple distributions.
>>>>>>>>>>>
>>>>>>>>>>> The simple reality is that no fat distribution we could provide would
>>>>>>>>>>> satisfy all use-cases, so why even try.
>>>>>>>>>>> If users commonly run into issues for certain jars, then maybe those
>>>>>>>>>>> should be added to the current distribution.
>>>>>>>>>>>
>>>>>>>>>>> Personally though I still believe we should only distribute a slim
>>>>>>>>>>> version. I'd rather have users always add required jars to the
>>>>>>>>>>> distribution than only when they go outside our "expected" use-cases.
>>>>>>>>>>>
>>>>>>>>>>> Then we might finally address this issue properly, i.e., tooling to
>>>>>>>>>>> assemble custom distributions and/or better error messages if
>>>>>>>>>>> Flink-provided extensions cannot be found.
>>>>>>>>>>>
>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>>>>>
>>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
>>>>>>>>>>> "slim" solution though. I get the idea that we can make the slim one
>>>>>>>>>>> even more lightweight than the current distribution, but what about
>>>>>>>>>>> the "fat" one? Do you mean that we would package all connectors and
>>>>>>>>>>> formats into this? I'm not sure if this is feasible. For example, we
>>>>>>>>>>> can't put all versions of kafka and hive connector jars into the lib
>>>>>>>>>>> directory, and we also might need hadoop jars when using the
>>>>>>>>>>> filesystem connector to access data from HDFS.
>>>>>>>>>>>
>>>>>>>>>>> So my guess would be we might hand-pick some of the most frequently
>>>>>>>>>>> used connectors and formats into our "lib" directory, like kafka, csv,
>>>>>>>>>>> json mentioned above, and still leave some other connectors out of it.
>>>>>>>>>>> If this is the case, then why don't we just provide this distribution
>>>>>>>>>>> to users? I'm not sure I get the benefit of providing another super
>>>>>>>>>>> "slim" jar (we have to pay some costs to provide another suite of
>>>>>>>>>>> distributions).
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Big +1.
>>>>>>>>>>>
>>>>>>>>>>> I like "fat" and "slim".
>>>>>>>>>>>
>>>>>>>>>>> For csv and json, like Jark said, they are quite small and don't have
>>>>>>>>>>> other dependencies. They are important to the kafka connector, and
>>>>>>>>>>> important to the upcoming file system connector too.
>>>>>>>>>>> So can we move them to both "fat" and "slim"? They're so important,
>>>>>>>>>>> and they're so lightweight.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Big +1.
>>>>>>>>>>> This will improve user experience (especially for new Flink users).
>>>>>>>>>>> We answered so many questions about "class not found".
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Godfrey
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <dian0511...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1 to this proposal.
>>>>>>>>>>>
>>>>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
>>>>>>>>>>> Currently, after a Python user has installed PyFlink using `pip`, he
>>>>>>>>>>> has to manually copy the connector fat jars to the PyFlink
>>>>>>>>>>> installation directory for the connectors to be used if he wants to
>>>>>>>>>>> run jobs locally. This process is very confusing for users and
>>>>>>>>>>> affects the experience a lot.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dian
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On April 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1 to the proposal. I also found the "download additional jar" step
>>>>>>>>>>> is really verbose when I prepare webinars.
>>>>>>>>>>>
>>>>>>>>>>> At least, I think flink-csv and flink-json should be in the
>>>>>>>>>>> distribution, they are quite small and don't have other dependencies.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>>
>>>>>>>>>>> Big +1 for the fat flink distribution, where do you plan to put
>>>>>>>>>>> these connectors? opt or lib?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek
>>>>>>>>>>> <aljos...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
>>>>>>>>>>> distribution. The motivation is that there is friction for SQL/Table
>>>>>>>>>>> API users that want to use Table connectors which are not there in
>>>>>>>>>>> the current Flink Distribution. For these users the workflow is
>>>>>>>>>>> currently roughly:
>>>>>>>>>>>
>>>>>>>>>>> - download Flink dist
>>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
>>>>>>>>>>> - run SQL client or program
>>>>>>>>>>> - decipher error message and research the solution
>>>>>>>>>>> - download additional connector jars
>>>>>>>>>>> - program works correctly
>>>>>>>>>>>
>>>>>>>>>>> I realize that this can be made to work but if every SQL user has
>>>>>>>>>>> this as their first experience that doesn't seem good to me.
>>>>>>>>>>>
>>>>>>>>>>> My proposal is to provide two versions of the Flink Distribution in
>>>>>>>>>>> the future: "fat" and "slim" (names to be discussed):
>>>>>>>>>>>
>>>>>>>>>>> - slim would be even trimmer than today's distribution
>>>>>>>>>>> - fat would contain a lot of convenience connectors (yet to be
>>>>>>>>>>> determined which ones)
>>>>>>>>>>>
>>>>>>>>>>> And yes, I realize that there are already more dimensions of Flink
>>>>>>>>>>> releases (Scala version and Java version).
>>>>>>>>>>>
>>>>>>>>>>> For background, our current Flink dist has these in the opt
>>>>>>>>>>> directory:
>>>>>>>>>>>
>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>>>>
>>>>>>>>>>> Current Flink dist is 267M. If we removed everything from opt we
>>>>>>>>>>> would go down to 126M. I would recommend this, because the large
>>>>>>>>>>> majority of the files in opt are probably unused.
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Aljoscha
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> Jeff Zhang
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>>
>>>>
>>>> --
>>>> Best, Jingsong Lee
>>>>
>>>
>>> --
>>> Best, Jingsong Lee
>>
>
> --
>
> Best,
> Benchao Li