The code from Daniel has been written for the old YARN client. I think the most important change is this one: https://github.com/warneke/flink/commit/9843a14637594fb7ee265f5326af9007f2a3191c and it can be backported easily to the new YARN client.
On Tue, Jan 27, 2015 at 7:00 AM, Stephan Ewen <se...@apache.org> wrote:

> Hi Ankit!
>
> Kerberos support is not yet in the system, but one of the Flink committers
> (Daniel Warneke) has made a prototype here:
> https://github.com/warneke/flink/tree/security
>
> @Daniel Can you give us an update on the status? What do you think is
> missing before a first version is ready to be merged into the master?
>
> Greetings,
> Stephan
>
>
> On Sun, Jan 18, 2015 at 10:00 AM, Robert Metzger <rmetz...@apache.org>
> wrote:
>
> > Hi Daniel,
> >
> > let me answer your questions:
> >
> > 1. Basically all the features you are requesting are implemented in this
> > pull request: https://github.com/apache/flink/pull/292 (per-job YARN
> > cluster & programmatic control of the cluster). Feel free to review the
> > pull request. It has been pending for more than a week now and hasn't
> > gotten much feedback. I would also recommend basing the work on security
> > support on that branch.
> >
> > 2. I agree that the whole configuration loading process is not nicely
> > implemented. When I was working on this, I didn't understand all the
> > features offered by Hadoop's Configuration object. I implemented it the
> > way I did to make it as easy as possible for users to run Flink on YARN.
> > As you can see in the code, it tries several commonly used environment
> > variables to detect the location of the configuration files. These
> > config files are then used and respected by the YARN client (for
> > example, the default file system name).
> > I'll have a look at the "yarn jar" command. One concern I have with this
> > is that it adds a new requirement: we expect the user to have the "yarn"
> > binary in the PATH. I know quite a few environments (for example, some
> > users of the Hortonworks Sandbox) which don't have "hadoop" and "yarn"
> > in the PATH.
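[Editor's note: the environment-variable detection Robert describes above could look roughly like the following minimal sketch. This is not the actual Flink code; the class name and the exact variables checked (HADOOP_CONF_DIR, YARN_CONF_DIR, HADOOP_HOME) and their lookup order are assumptions for illustration.]

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;

// Sketch of environment-variable-based discovery of the Hadoop
// configuration directory, similar in spirit to what the Flink YARN
// client does. Variable names and order are assumptions.
public class ConfDirLocator {

    // Returns the first existing directory named by the candidate
    // environment variables, or null if none is found. Taking the
    // environment as a Map makes the lookup testable.
    public static Path findHadoopConfDir(Map<String, String> env) {
        // HADOOP_CONF_DIR / YARN_CONF_DIR point directly at the config dir.
        for (String var : new String[] {"HADOOP_CONF_DIR", "YARN_CONF_DIR"}) {
            String value = env.get(var);
            if (value != null && Files.isDirectory(Paths.get(value))) {
                return Paths.get(value);
            }
        }
        // HADOOP_HOME conventionally holds the config under etc/hadoop.
        String home = env.get("HADOOP_HOME");
        if (home != null) {
            Path conf = Paths.get(home, "etc", "hadoop");
            if (Files.isDirectory(conf)) {
                return conf;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Path dir = findHadoopConfDir(System.getenv());
        System.out.println(dir != null
                ? "Found Hadoop config at " + dir
                : "No Hadoop config directory found");
    }
}
```

The fragility Daniel points out later in the thread is visible here: if none of the variables is set, or if a needed file such as hdfs-site.xml lives outside the discovered directory, the client silently works with incomplete configuration.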
> > The "yarn jar" command also accesses the environment variables required
> > to locate the Hadoop configuration. But I will carefully check whether
> > using the "yarn jar" command brings us an advantage.
> >
> > 3. I'm also not completely convinced that this is the right approach.
> > When I was implementing the first version of Flink on YARN, I thought
> > that deploying many small files to HDFS would cause some load on the
> > NameNode and take some time. Right now, we have 146 jars in the lib/
> > directory. I haven't done a performance comparison, but I guess it's
> > slower to upload 146 files to HDFS instead of 1. (It is not only
> > uploading the files to HDFS; YARN also needs to download and "localize"
> > them prior to allocating new containers.)
> > Also, when deploying Flink on YARN in the Google Compute cloud, the
> > Google Compute storage is configured by default ... and it's quite slow.
> > So this would probably lead to a bad user experience.
> > I completely agree that we need an option for users to use a
> > pre-installed Flink sitting on HDFS or somewhere else in the cluster.
> > There is another issue in this area in our project: I don't like that
> > the "hadoop2" build of Flink produces two binary directories with almost
> > the same content and layout. We could actually merge the whole YARN
> > stuff into the regular hadoop2 build. Therefore, I would suggest putting
> > one Flink fat jar into the lib/ directory. This would also make shading
> > of our dependencies much easier. I will start a separate discussion on
> > that when I have more time again. Right now, I have more pressing issues
> > to solve.
> >
> > Regarding your changes in the "security" branch: I'm super happy that
> > others are starting to work on the YARN client as well. The whole
> > codebase has grown over time, and it's certainly good to have more eyes
> > looking at it.
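[Editor's note: the many-small-files trade-off Robert weighs above is easy to quantify locally. The sketch below counts the jars under a Flink lib/ directory and totals their size; each file would become a separate HDFS upload and YARN local resource, while a single fat jar collapses them into one. The default "lib" path is an assumption, not a fixed Flink location.]

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Counts jar files in a directory and totals their size, to illustrate
// the "146 uploads vs. 1 upload" comparison from the thread.
public class LibJarStats {

    // Returns {jarCount, totalBytes} for the given directory.
    public static long[] jarStats(Path libDir) throws IOException {
        long count = 0;
        long bytes = 0;
        try (Stream<Path> files = Files.list(libDir)) {
            List<Path> jars = files
                    .filter(p -> p.toString().endsWith(".jar"))
                    .collect(Collectors.toList());
            for (Path p : jars) {
                count++;
                bytes += Files.size(p);
            }
        }
        return new long[] {count, bytes};
    }

    public static void main(String[] args) throws IOException {
        Path libDir = Paths.get(args.length > 0 ? args[0] : "lib");
        long[] stats = jarStats(libDir);
        System.out.printf("%d jars, %.1f MB total%n",
                stats[0], stats[1] / (1024.0 * 1024.0));
    }
}
```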
> > The security features of YARN and Hadoop in general are something that
> > I've avoided in the past, because they are so difficult to test
> > properly. But it's something we certainly need to address.
> >
> > Best,
> > Robert
> >
> >
> > On Sun, Jan 18, 2015 at 6:28 PM, Daniel Warneke <warn...@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > I just pushed my first version of Flink supporting YARN environments
> > > with security/Kerberos enabled [1]. While working with the current
> > > Flink version, I was really impressed by how easy it is to deploy the
> > > software on a YARN cluster. However, there are a few things I stumbled
> > > upon, and I would be interested in your opinion:
> > >
> > > 1. Separation between YARN session and Flink job
> > > Currently, we separate the Flink YARN session from the Flink jobs,
> > > i.e. a user first has to bring up the Flink cluster on YARN through a
> > > separate command and can then submit an arbitrary number of jobs to
> > > this cluster. Through this separation it is possible to submit
> > > individual jobs with really low latency, but it introduces two major
> > > problems: First, it is currently impossible to programmatically launch
> > > a Flink YARN cluster, submit a job, wait for its completion, and then
> > > tear the cluster down again (correct me if I'm wrong here), although
> > > this is actually a very important use case. Second, with security
> > > enabled, all jobs are executed with the security credentials of the
> > > user who launched the Flink cluster. This causes massive authorization
> > > problems. Therefore, I would propose moving to a model where we launch
> > > one Flink cluster per job (or at least making this a very prominent
> > > option).
> > >
> > > 2. Loading Hadoop configuration settings for Flink
> > > In the current release, we use custom code to identify and load the
> > > relevant Hadoop XML configuration files (e.g.
> > > core-site.xml, yarn-site.xml) for the Flink YARN client. I found this
> > > mechanism to be quite fragile, as it depends on certain environment
> > > variables being set and assumes certain configuration keys to be
> > > specified in certain files. For example, with Hadoop security enabled,
> > > the Flink YARN client needs to know what kind of authentication
> > > mechanism HDFS expects for the data transfer. This setting is usually
> > > specified in hdfs-site.xml. In the current Flink version, the YARN
> > > client ignores this file and hence cannot talk to HDFS when security
> > > is enabled.
> > > As an alternative, I propose to launch the Flink cluster on YARN
> > > through the "yarn jar" command. With this command, you get the entire
> > > configuration setup for free and no longer have to worry about the
> > > names of configuration files, configuration paths, and environment
> > > variables.
> > >
> > > 3. The uberjar deployment model
> > > In my opinion, the current Flink deployment model for YARN, with the
> > > one fat uberjar, is unnecessarily bulky. With the last release, the
> > > Flink uberjar has grown to over 100 MB in size, amounting to almost
> > > 400 MB of class files when uncompressed. Many of the includes are not
> > > even necessary. For example, when using the "yarn jar" hook to deploy
> > > Flink, all relevant Hadoop libraries are added to the classpath
> > > anyway, so there is no need to include them in the uberjar (unless you
> > > assume the client does not have a Hadoop environment installed).
> > > Personally, I would favor a more fine-granular deployment model.
> > > Especially when we move to a one-job-per-session model, I think we
> > > should allow having Flink preinstalled on the cluster nodes and not
> > > always require redistributing the 100 MB uberjar to each and every
> > > node.
> > >
> > > Any thoughts on that?
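[Editor's note: the programmatic lifecycle Daniel asks for in point 1 (launch a per-job cluster, submit, wait, tear down) can be sketched as an interface. This is a hypothetical API: none of these names exist in Flink; it only illustrates the desired shape of per-job control and why it helps with per-user credentials.]

```java
import java.nio.file.Path;

// Hypothetical sketch of programmatic per-job cluster control: launch a
// Flink cluster on YARN for one job, submit it, wait for completion, and
// tear the cluster down again. Illustrative only; not an existing Flink API.
public interface PerJobYarnCluster extends AutoCloseable {

    enum JobOutcome { SUCCEEDED, FAILED, CANCELED }

    // Allocates the ApplicationMaster and task containers on YARN.
    void start() throws Exception;

    // Submits the job jar and blocks until the job finishes. Because the
    // cluster is started per job, it can run under the submitting user's
    // Kerberos credentials, avoiding the shared-credential problem of a
    // long-running session.
    JobOutcome submitJobAndWait(Path jobJar, String... jobArgs) throws Exception;

    // Tears the YARN application down; try-with-resources friendly.
    @Override
    void close() throws Exception;
}
```

A caller could then drive the whole lifecycle in one try-with-resources block: start the cluster, submit, and have the cluster torn down automatically when the block exits.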
> > >
> > > Best regards,
> > >
> > > Daniel
> > >
> > > [1] https://github.com/warneke/flink/tree/security
> > >
> >