I'm not sure about filesystems though; is there a clear 1:1 mapping of scheme <-> filesystem?
On 24/04/2020 14:28, Aljoscha Krettek wrote:
re (1): I don't know about that; probably the people that did the metrics reporter plugin support had some thoughts on it.

re (2): I agree, that's why I initially suggested splitting it into "slim" and "fat": our current "medium fat" selection of jars in the Flink dist does not serve anyone too well. It's too fat for people that want to build lean application images. It's too lean for people that want a good first out-of-the-box experience.

Aljoscha

On 17.04.20 16:38, Stephan Ewen wrote:

@Aljoscha I think that is an interesting line of thinking. The swift-fs may be rarely enough used to move it to an optional download.

I would still drop two more thoughts:

(1) Now that we have plugins support, is there a reason to have a metrics reporter or file system in /opt instead of /plugins? They don't spoil the class path any more.

(2) I can imagine there still being a desire for a "minimal" docker file, for users that want to keep the container images as small as possible, to speed up deployment. It is fine if that would not be the default, though.

On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <aljos...@apache.org> wrote:

I think having such tools and/or tailor-made distributions can be nice, but I also think the discussion is missing the main point: the initial observation/motivation is that apparently a lot of users (Kurt and I talked about this) on the Chinese DingTalk support groups and other support channels have problems when first using the SQL client because of these missing connectors/formats. For them, additional tools would not solve anything, because they would not take that extra step either. I think even tiny friction should be avoided, because the annoyance from it accumulates over the (hopefully) many users that we want to have.

Maybe we should take a step back from discussing the "fat"/"slim" idea and instead think about the composition of the current dist.
As mentioned, we have these jars in opt/:

 17M flink-azure-fs-hadoop-1.10.0.jar
 52K flink-cep-scala_2.11-1.10.0.jar
180K flink-cep_2.11-1.10.0.jar
746K flink-gelly-scala_2.11-1.10.0.jar
626K flink-gelly_2.11-1.10.0.jar
512K flink-metrics-datadog-1.10.0.jar
159K flink-metrics-graphite-1.10.0.jar
1.0M flink-metrics-influxdb-1.10.0.jar
102K flink-metrics-prometheus-1.10.0.jar
 10K flink-metrics-slf4j-1.10.0.jar
 12K flink-metrics-statsd-1.10.0.jar
 36M flink-oss-fs-hadoop-1.10.0.jar
 28M flink-python_2.11-1.10.0.jar
 22K flink-queryable-state-runtime_2.11-1.10.0.jar
 18M flink-s3-fs-hadoop-1.10.0.jar
 31M flink-s3-fs-presto-1.10.0.jar
196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
518K flink-sql-client_2.11-1.10.0.jar
 99K flink-state-processor-api_2.11-1.10.0.jar
 25M flink-swift-fs-hadoop-1.10.0.jar
160M opt

The "filesystem" connectors are the heavy hitters there. I downloaded most of the SQL connectors/formats and this is what I got:

 73K flink-avro-1.10.0.jar
 36K flink-csv-1.10.0.jar
 55K flink-hbase_2.11-1.10.0.jar
 88K flink-jdbc_2.11-1.10.0.jar
 42K flink-json-1.10.0.jar
 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
 24M sql-connectors-formats

We could just add these to the Flink distribution without blowing it up by much. We could drop any of the existing "filesystem" connectors from opt, add the SQL connectors/formats, and not change the size of the Flink dist. So maybe we should do that instead?

We would need some tooling for the sql-client shell script to pick the connectors/formats up from opt/, because we don't want to add them to lib/. We're already doing that for finding the flink-sql-client jar, which is also not in lib/.

What do you think?

Best,
Aljoscha

On 17.04.20 05:22, Jark Wu wrote:

Hi,

I like the idea of a web tool to assemble a fat distribution, and https://code.quarkus.io/ looks very nice.
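The opt/ lookup described above could be sketched roughly like this (a hedged sketch only, not the actual sql-client.sh logic; the function name, the directory layout, and the jar name patterns are assumptions for illustration):

```shell
# Sketch: collect SQL connector/format jars from opt/ and build a classpath
# fragment for the SQL client, mirroring how the script already locates the
# flink-sql-client jar outside lib/. Patterns and names are assumptions.
collect_sql_jars() {
  opt_dir="$1"
  cp=""
  for jar in "$opt_dir"/flink-sql-connector-*.jar \
             "$opt_dir"/flink-csv-*.jar \
             "$opt_dir"/flink-json-*.jar; do
    # unmatched globs stay literal, so skip anything that is not a real file
    [ -f "$jar" ] || continue
    cp="$cp:$jar"
  done
  # strip the leading ':'
  printf '%s\n' "${cp#:}"
}
```

The result could then be appended to the classpath the script already builds for lib/.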
All the users need to do is select what they need (I think this step can't be omitted anyway). We can also provide a default fat distribution on the web which pre-selects some popular connectors.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com> wrote:

As a reference for a nice first experience I had, take a look at https://code.quarkus.io/. You reach this page after you click "Start Coding" at the project homepage.

Rafi

On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com> wrote:

I'm not saying pre-bundling some jars will make this problem go away, and you're right that it only hides the problem for some users. But what if this solution can hide the problem for 90% of users? Wouldn't that be good enough for us to try?

Regarding "would users following instructions really be such a big problem?": I'm afraid yes. Otherwise I wouldn't have answered such questions at least a dozen times, and I wouldn't see them coming up from time to time. During some periods, I even saw such questions every day.

Best,
Kurt

On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <ches...@apache.org> wrote:

The problem with having a distribution with "popular" stuff is that it doesn't really *solve* a problem, it just hides it for users who fall into these particular use-cases. Move out of them and you once again run into the exact same problems outlined. This is exactly why I like the tooling approach; you have to deal with it from the start, and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem? I would expect that users generally know *what* they need, just not necessarily how it is assembled correctly (where to get which jar, which directory to put it in). It seems like these are exactly the problems this would solve? I just don't see how moving a jar corresponding to some feature from opt to some directory (lib/plugins) is less error-prone than just selecting the feature and having the tool handle the rest.
As for re-distributions, it depends on the form that the tool would take. It could be an application that runs locally and works against Maven Central (note: not necessarily *using* Maven); this should work in China, no?

A web tool would of course be fancy, but I don't know how feasible this is with the ASF infrastructure. You wouldn't be able to mirror the distribution, so the load can't be distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case-oriented distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution though. The concern I have with this approach is that the final generated distribution is kind of non-deterministic. We might generate too many different combinations when users try to package different types of connectors, formats, and maybe even Hadoop releases. As far as I can tell, most open source projects and Apache projects only release a few pre-defined distributions, which most users are already familiar with; that is hard to change IMO.

I have also seen that in some cases users will try to re-distribute the release package, because of the unstable network to the Apache website from China. With the web tool solution, I don't think this kind of re-distribution would be possible anymore.

In the meantime, I also have a concern that we will fall into our trap again if we try to offer this smart & flexible solution, because it needs users to cooperate with such a mechanism. It's exactly the situation we currently fell into:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything will work as expected if users followed the right instructions.

In reality, I suspect not all users will do the second step correctly. And for new users who are only trying to have a quick experience with Flink, I would bet most will do it wrong.

So, my proposal would be one of the following 2 options:

1.
Provide a slim distribution for advanced production users, and also provide a distribution which has some popular built-in jars.
2. Only provide a distribution which has some popular built-in jars.

If we are trying to reduce the number of distributions we release, I would prefer 2.

Best,
Kurt

On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <trohrm...@apache.org> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution. Ideally, we would also have a nice web tool for the website which generates the corresponding distribution for download.

To get things started, we could begin with only supporting downloading/creating the "fat" version with the script. The fat version would then consist of the slim distribution plus whatever we deem important for new users to get started.

Cheers,
Till

On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:

Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to combine their distribution manually. I think in such scenarios it is important to understand the interconnections. Personally I'd expect the slimmest possible distribution, which I can extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible useful combinations is already big. Do we want to have a distribution for:
- SQL users: which connectors should we include? Should we include Hive? Which other catalog?
- DataStream users: which connectors should we include?
And for both of the above, should we include YARN/Kubernetes? I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices?
I think that was also what Chesnay meant by "tooling to assemble custom distributions". In the end, the difference between a slim and a fat distribution is which jars we put into lib, right?

It could have a few "screens":
1. Which API are you interested in?
   a. SQL API
   b. DataStream API
2. [SQL] Which connectors do you want to use? [multichoice]
   a. Kafka
   b. Elasticsearch
   ...
3. [SQL] Which catalog do you want to use?
   ...

Such a tool would download all the dependencies from Maven and put them into the correct folder. In the future we could extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal, etc. The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here though.

Best,
Dawid

On 16/04/2020 11:02, Aljoscha Krettek wrote:

I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users; that is, we are serving no-one properly with the current middle ground. That's why I think introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes, or a fat jar for YARN.

Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:

Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution. For DataStream jobs, we may not want a fat distribution containing connectors, because the user always needs to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib.
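In its simplest form, the assembly script sketched above could just map user choices to jar URLs on Maven Central and download them into lib/. The following is a rough sketch under assumed naming conventions; the artifact names, version, and repository URL are illustrative, and a real tool would also need version pinning, checksums, and the compatibility rules mentioned above:

```shell
#!/usr/bin/env sh
# Hypothetical sketch: resolve a user's connector choice to a jar URL on
# Maven Central. Artifact names, the version, and the repo URL are
# illustrative assumptions, not a definitive tool design.
FLINK_VERSION="1.10.0"
SCALA_VERSION="2.11"
REPO="https://repo1.maven.org/maven2/org/apache/flink"

artifact_url() {
  case "$1" in
    kafka)         a="flink-sql-connector-kafka_${SCALA_VERSION}" ;;
    elasticsearch) a="flink-sql-connector-elasticsearch6_${SCALA_VERSION}" ;;
    json)          a="flink-json" ;;
    csv)           a="flink-csv" ;;
    *) echo "unknown choice: $1" >&2; return 1 ;;
  esac
  # standard Maven repository layout: <group path>/<artifact>/<version>/<artifact>-<version>.jar
  printf '%s/%s/%s/%s-%s.jar\n' "$REPO" "$a" "$FLINK_VERSION" "$a" "$FLINK_VERSION"
}

# Usage sketch:
#   for choice in kafka json; do
#     url=$(artifact_url "$choice") && curl -fLo "lib/$(basename "$url")" "$url"
#   done
```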
Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. To improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible (especially the connectors and formats we have documented well), but also to provide a mechanism that loads connectors according to the DDLs. So I think it could be good to place connector/format jars in some dir like opt/connector, which would not affect jobs by default, and introduce a mechanism of dynamic discovery for SQL.

Best,
Wenlong

On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi,

I am thinking about both "improve the first experience" and "improve the production experience". What's the common mode of Flink? Streaming jobs use Kafka? Batch jobs use Hive? Hive 1.2.1 dependencies are compatible with most Hive server versions, which is why Spark and Presto have a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs in my mind are (in terms of connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink

So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats. So we could provide a fat distribution:
- with CSV and JSON;
- with flink-kafka-universal and its Kafka dependencies;
- with flink-jdbc.

Using this fat distribution, most users can run their jobs well. (A JDBC driver jar is still required, but that is very natural to add.) Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, it should cover the vast majority of users. We don't want to put every jar into the fat distribution; only common ones with few conflicts.
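A minimal form of the dynamic discovery suggested above could scan a SQL script's DDL statements for the declared connector type and resolve each one to a jar under a directory like opt/connector. This is purely a sketch: the 'connector.type' property matches Flink 1.10-era DDL, but the directory name, function, and everything else here are assumptions.

```shell
# Sketch: extract the distinct connector types declared in a SQL script's
# DDL WITH clauses, e.g. WITH ('connector.type' = 'kafka').
# A launcher could then map each type to a jar in opt/connector and add it
# to the classpath. Property name per Flink 1.10 DDL; the rest is assumed.
connectors_in_ddl() {
  grep -o "'connector.type' *= *'[a-z0-9-]*'" "$1" |
    sed "s/.*= *'\([a-z0-9-]*\)'/\1/" |
    sort -u
}
```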
Of course, which jars to put into the fat distribution is a matter of consideration. We have the opportunity to facilitate the majority of users, while also leaving opportunities for customization.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want to solve?": (1) improve the first experience, or (2) improve the production experience?

As far as I can see from the above discussion, what we want to solve is the "first experience". And I think the slim distribution is still the best for production, because it's easier to assemble jars than to exclude them, and it avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to have a fat distribution to give users a smoother start. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the "slim production-purpose distribution". The "playground distribution" can contain some widely used jars, like the universal kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc. We could even provide a playground Docker image which contains the fat distribution, Python 3, and Hive.

Best,
Jark

On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:

I don't see a lot of value in having multiple distributions. The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try? If users commonly run into issues with certain jars, then maybe those should be added to the current distribution.

Personally though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages when Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not sure about the "fat" and "slim" split though. I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we also might need Hadoop jars when using the filesystem connector to access data on HDFS.

So my guess is we would hand-pick some of the most frequently used connectors and formats for our lib directory, like kafka, csv, and json mentioned above, and still leave some other connectors out. If that is the case, then why not just provide this one distribution to users? I'm not sure I see the benefit of providing another super "slim" distribution (we would have to pay some cost to maintain another suite of distributions).

What do you think?

Best,
Kurt

On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:

Big +1. I like "fat" and "slim".

For csv and json, like Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming filesystem connector too. So can we include them in both "fat" and "slim"? They're that important, and they're that lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:

Big +1. This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".

Best,
Godfrey

Dian Fu <dian0511...@gmail.com> wrote on Wed, 15 Apr 2020 at 16:30:

+1 to this proposal. Missing connector jars is also a big problem for PyFlink users.
Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory if they want to run jobs locally. This process is very confusing for users and hurts the experience a lot.

Regards,
Dian

On 15 Apr 2020, at 15:51, Jark Wu <imj...@gmail.com> wrote:

+1 to the proposal. I also found the "download additional jar" step really verbose when preparing webinars. At the very least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?

Aljoscha Krettek <aljos...@apache.org> wrote on Wed, 15 Apr 2020 at 15:30:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution. For these users the workflow is currently roughly:

- download Flink dist
- configure csv/Kafka/json connectors per configuration
- run SQL client or program
- decrypt the error message and research the solution
- download additional connector jars
- program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.

My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):

- slim would be even trimmer than today's distribution
- fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).
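The manual copy step described above could be wrapped in a small helper like the following. This is a hypothetical sketch: the function name is invented, and where PyFlink's lib directory actually lives depends on the installation, so the target path must be supplied by the caller.

```shell
# Hypothetical helper: copy a connector jar into PyFlink's lib directory
# (e.g. <site-packages>/pyflink/lib) so locally run jobs can find it.
# Creates the target directory if needed and skips jars already present.
install_connector() {
  jar="$1"; lib="$2"
  mkdir -p "$lib"
  if [ -f "$lib/$(basename "$jar")" ]; then
    echo "already installed: $(basename "$jar")"
  else
    cp "$jar" "$lib/"
    echo "installed: $(basename "$jar")"
  fi
}

# Usage sketch (paths are assumptions):
#   install_connector flink-sql-connector-kafka_2.11-1.10.0.jar \
#     "$(python3 -c 'import pyflink, os; print(os.path.join(os.path.dirname(pyflink.__file__), "lib"))')"
```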
For background, our current Flink dist has these in the opt directory:

- flink-azure-fs-hadoop-1.10.0.jar
- flink-cep-scala_2.12-1.10.0.jar
- flink-cep_2.12-1.10.0.jar
- flink-gelly-scala_2.12-1.10.0.jar
- flink-gelly_2.12-1.10.0.jar
- flink-metrics-datadog-1.10.0.jar
- flink-metrics-graphite-1.10.0.jar
- flink-metrics-influxdb-1.10.0.jar
- flink-metrics-prometheus-1.10.0.jar
- flink-metrics-slf4j-1.10.0.jar
- flink-metrics-statsd-1.10.0.jar
- flink-oss-fs-hadoop-1.10.0.jar
- flink-python_2.12-1.10.0.jar
- flink-queryable-state-runtime_2.12-1.10.0.jar
- flink-s3-fs-hadoop-1.10.0.jar
- flink-s3-fs-presto-1.10.0.jar
- flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
- flink-sql-client_2.12-1.10.0.jar
- flink-state-processor-api_2.12-1.10.0.jar
- flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267M. If we removed everything from opt, we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?

Best,
Aljoscha

--
Best Regards
Jeff Zhang

--
Best, Jingsong Lee