Hi all,

Sorry for jumping in at this late point in the discussion. I see a lot of things I really like, and I would like to add my "needs" and observations here too so you can take them into account (where possible). I suspect there will be overlap with things you have already taken into account.
1. No more 'flink:latest' docker image tag.

Related to https://issues.apache.org/jira/browse/FLINK-15794

What I have learned is that the 'latest' version of a docker image only makes sense if the image is an almost standalone thing. So if I have a servlet that does something in isolation (like my hobby project https://hub.docker.com/r/nielsbasjes/yauaa ) then 'latest' makes sense. With Flink, the application code and all nodes in the cluster depend on each other and as such must run the exact same versions of the base software. So if you run Flink in a cluster (local/yarn/k8s/mesos/swarm/...) where the application and the nodes intercommunicate and closely depend on each other, then 'latest' is a bad idea:

   1. Assume I have an application built against the Flink N API and the cluster downloads the latest image, which is also Flink N. Then a week later Flink N+1 is released and the API I use changes (deprecated), and a while later Flink N+2 is released and the deprecated API is removed: now my application no longer works even though I have not changed anything. So I want my application to be 'pinned' to the exact version I built it with.
   2. I have a running cluster with my application and cluster running Flink N. I add some additional nodes and the new nodes pick up the Flink N+1 image ... now I have a cluster with mixed versions.
   3. The version of Flink is really the "Flink+Scala" version pair. If you have the right Flink but the wrong Scala you get really nasty errors: https://issues.apache.org/jira/browse/FLINK-16289

2. Deploy SNAPSHOT docker images (i.e. something like *flink:1.11-SNAPSHOT_2.12*).

More and more use cases will be running on code delivered via Docker images instead of bare jar files. So if a "SNAPSHOT" is released and deployed into a 'staging' maven repo (which may be local on the developer's workstation), then in my opinion a "SNAPSHOT" docker image should be created/deployed at the same moment.
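To make the pinning concrete, here is a minimal sketch of a derived application image that pins the exact "Flink+Scala" version pair instead of relying on 'latest' (the tag, jar name, and target path are illustrative, not prescriptive):

```dockerfile
# Pin the exact "Flink+Scala" version pair instead of 'latest',
# so new nodes and rebuilds always get the same base software.
FROM flink:1.10.0-scala_2.12

# Add only the application jar on top of the pinned base image.
COPY target/my-flink-job.jar /opt/flink/usrlib/my-flink-job.jar
```

With a pinned tag like this, scaling the cluster up a week after Flink N+1 ships still pulls the version the application was built against, and an accidental Scala version mismatch (FLINK-16289) cannot sneak in.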
Each time a "SNAPSHOT" docker image is released it will overwrite the previous "SNAPSHOT". Once the final version is released, the SNAPSHOTs of that version can/should be removed. This will make testing in clusters a lot easier. Also, building a local fix and then running it locally will work without additional modifications to the code.

3. Support for a 'single application cluster'.

I've been playing around with the S3 plugin and found that it essentially requires all nodes to have full access to the credentials needed to connect to S3. This means that a multi-tenant setup is not possible in these cases. So I think the single application cluster should be a feature available in all cases.

4. I would like a native-kubernetes-single-application base image.

I can then create a derived image where I only add the jar of my application. My desire is that I can then create a k8s yaml file for kubectl that adds the needed configs/secrets/arguments/environment variables and starts the cluster and application. Because the native kubernetes support scales automatically based on the application, this should 'just work'.

Additional note:

1. Job/Task attempt logging instead of task manager logging. *I realize this has nothing to do with the docker images.* I found something "hard to work with" while running some tests last week: all logging goes to a single log for the task manager. So if I have multiple things running in a single task manager, the logs are mixed together. Also, several attempts of the same task are mixed, which makes it very hard to find out 'what went wrong'.

On Fri, Apr 3, 2020 at 4:27 PM Ufuk Celebi <u...@apache.org> wrote:

> Thanks for the summary, Andrey. Good idea to link Patrick's document from
> the FLIP as a future direction so it doesn't get lost. Could you make sure
> to revive that discussion when FLIP-111 nears an end?
>
> This is good to go on my part. +1 to start the VOTE.
>
>
> @Till, @Yang: Thanks for the clarification with the output redirection. I
> didn't see that. The concern with the `tee` approach is that the file would
> grow indefinitely. I think we can solve this with regular logging by
> redirecting stderr to the ERROR log level, but I'm not sure. We can look at a
> potential solution when we get to that point. :-)
>
>
>
> On Fri, Apr 3, 2020 at 3:36 PM Andrey Zagrebin <azagre...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > Patrick and Ufuk, thanks a lot for more ideas and suggestions!
> >
> > I have updated the FLIP according to the current state of the discussion.
> > Now it also contains the implementation steps and future follow-ups.
> > Please review if there are any concerns.
> > The order of the steps aims to keep Flink releasable at any point if
> > something does not have enough time to get in.
> >
> > It looks like we are mostly reaching a consensus on the open questions.
> > There is also a list of items which have been discussed in this thread,
> > with a short summary below.
> > As soon as there are no concerns, I will create a voting thread.
> >
> > I also added some thoughts on further customising the logging setup. This may
> > be an optional follow-up
> > which is additional to the default logging into files for the Web UI.
> >
> > # FLIP scope
> > The focus is on users of the official releases.
> > Create docs for how to use the official docker image.
> > Remove other Dockerfiles in the Flink repo.
> > Rely on running the official docker image in different modes (JM/TM).
> > Customise running the official image with env vars (this should minimise
> > manual manipulation of local files and the creation of a custom image).
> >
> > # Base official image
> >
> > ## Java versions
> > There is a separate effort for this:
> > https://github.com/apache/flink-docker/pull/9
> >
> > # Run image
> >
> > ## Entry point modes
> > JM session, JM job, TM
> >
> > ## Entry point config
> > We use env vars for this, e.g.
> > FLINK_PROPERTIES and ENABLE_BUILT_IN_PLUGINS
> >
> > ## Flink config options
> > We document the existing FLINK_PROPERTIES env var to override config
> > options in flink-conf.yaml.
> > Then later, we do not need to expose and handle any other special env vars
> > for config options (address, port etc).
> > The future plan is to make the Flink process configurable by env vars, e.g.
> > 'some.yaml.option: val' -> FLINK_SOME_YAML_OPTION=val
> >
> > ## Extra files: jars, custom logging properties
> > We can provide env vars to point to custom locations, e.g. in mounted
> > volumes.
> >
> > # Extend image
> >
> > ## Python/hadoop versions, activating certain libs/plugins
> > Users can install extra dependencies and change configs in their custom
> > image which extends our base image.
> >
> > # Logging
> >
> > ## Web UI
> > Modify the *log4j-console.properties* to also output logs into files
> > for the Web UI. Limit the log file size.
> >
> > ## Container output
> > Separate effort for a proper split of the Flink process stdout and stderr into
> > files and container output
> > (idea with the tee command: `program start-foreground 2>&1 | tee
> > flink-user-taskexecutor.out`)
> >
> > # Docker bash utils
> > We are not going to expose it to users as an API.
> > Users should either be able to configure and run the standard entry point,
> > or the documentation should give short examples about how to extend and
> > customise the base image.
> >
> > During the implementation, we will see if it makes sense to factor out
> > certain bash procedures
> > to reuse them e.g. in custom dev versions of the docker image.
> >
> > # Dockerfile / image for developers
> > We keep it on our future roadmap. This effort should help us understand
> > what we can reuse there.
> >
> > Best,
> > Andrey
> >
> >
> > On Fri, Apr 3, 2020 at 12:57 PM Till Rohrmann <trohrm...@apache.org>
> > wrote:
> >
> >> Hi everyone,
> >>
> >> just a small inline comment.
> >>
> >> On Fri, Apr 3, 2020 at 11:42 AM Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> > Hey Yang,
> >> >
> >> > thanks! See inline answers.
> >> >
> >> > On Fri, Apr 3, 2020 at 5:11 AM Yang Wang <danrtsey...@gmail.com> wrote:
> >> >
> >> > > Hi Ufuk,
> >> > >
> >> > > Thanks for making the conclusion and directly pointing out what needs
> >> > > to be done in
> >> > > FLIP-111. I agree with you that we should narrow down the scope and
> >> > > focus on the
> >> > > most important and basic part: docker image unification.
> >> > >
> >> > > (1) Extend the entrypoint script in apache/flink-docker to start the job
> >> > >> cluster entry point
> >> > >
> >> > > I want to add a small requirement for the entry point script. Currently,
> >> > > for the native
> >> > > K8s integration, we are using the apache/flink-docker image, but with
> >> > > a different entry
> >> > > point ("kubernetes-entry.sh"): we generate the java cmd in KubernetesUtils
> >> > > and run it
> >> > > in the entry point. I really hope it could be merged into the apache/flink-docker
> >> > > "docker-entrypoint.sh".
> >> > >
> >> >
> >> > The script [1] only adds the FLINK_CLASSPATH env var, which seems generally
> >> > reasonable to me. But since principled classpath and entrypoint
> >> > configuration is somewhat related to the follow-up improvement proposals, I
> >> > could also see this being done after FLIP-111.
> >> >
> >> >
> >> > > (2) Extend the example log4j-console configuration
> >> > >> => support log retrieval from the Flink UI out of the box
> >> > >
> >> > > If you mean updating "flink-dist/conf/log4j-console.properties" to
> >> > > support console and
> >> > > local log files, I will say "+1". But we need to find a proper way to make
> >> > > the stdout/stderr output
> >> > > available both for the console and the log files. Maybe Till's proposal could help
> >> > > to solve this.
> >> > > "`program 2>&1 | tee flink-user-taskexecutor.out`"
> >> > >
> >> >
> >> > I think we can simply add a rolling file appender with a limit on the log
> >> > size.
> >> >
> >> I think this won't solve Yang's concern. What he wants to achieve is that
> >> STDOUT and STDERR go to STDOUT and STDERR as well as into some *.out and
> >> *.err files which are accessible from the web UI. I don't think that a log
> >> appender will help with this problem.
> >>
> >> Cheers,
> >> Till
> >>
> >>
> >> > – Ufuk
> >> >
> >> > [1]
> >> >
> >> > https://github.com/apache/flink/blob/master/flink-dist/src/main/flink-bin/kubernetes-bin/kubernetes-entry.sh
> >> >
>

--
Best regards / Met vriendelijke groeten,

Niels Basjes