Hi Ankit,

It looks like your Flink client does not pick up the proper Hadoop
configuration. In my prototype I fixed this problem by starting the
Flink client through the "hadoop jar" command line interface. However,
as Robert pointed out, the code still needs to be merged with the master
branch and potentially with Robert's changes to the Flink YARN client
itself.
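
For illustration: when the client is started through "hadoop jar", the
Hadoop installation's *-site.xml files are on the classpath, so a plain
Hadoop Configuration object picks them up automatically. A minimal
sketch, assuming a Hadoop 2.x client (the ConfProbe class name is just
an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class ConfProbe {
        public static void main(String[] args) throws Exception {
            // core-site.xml, hdfs-site.xml etc. are loaded from the
            // classpath, which is exactly what "hadoop jar" sets up.
            Configuration conf = new Configuration();
            System.out.println("fs.defaultFS = " + conf.get("fs.defaultFS"));
            FileSystem fs = FileSystem.get(conf); // talks to the configured HDFS
        }
    }

Run it with "hadoop jar conf-probe.jar ConfProbe" and the printed file
system URI should match the cluster's core-site.xml.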

I hope I have time to do the merge tonight. Otherwise I'll try to have a
look at it on the weekend.

Best regards,

    Daniel

On 29.01.2015 00:43, Ankit Jhalaria wrote:
> Hi Robert,
> I tried adding Daniel's changes to the 0.9 version of flink. So far I haven't 
> been able to get it working. Still getting the same errors.
> Best,Ankit 
>
>      On Tuesday, January 27, 2015 2:57 AM, Robert Metzger 
> <rmetz...@apache.org> wrote:
>
>  The code from Daniel has been written for the old YARN client. I think the
> most important change is this one:
> https://github.com/warneke/flink/commit/9843a14637594fb7ee265f5326af9007f2a3191c
> and it can easily be backported to the new YARN client.
>
>
> On Tue, Jan 27, 2015 at 7:00 AM, Stephan Ewen <se...@apache.org> wrote:
>
> Hi Ankit!
>
> Kerberos support is not yet in the system, but one of the Flink committers
> (Daniel Warneke) has made a prototype here:
> https://github.com/warneke/flink/tree/security
>
> @Daniel Can you give us an update on the status? What do you think is
> missing before a first version is ready to be merged into master?
>
> Greetings,
> Stephan
>
>
> On Sun, Jan 18, 2015 at 10:00 AM, Robert Metzger <rmetz...@apache.org>
> wrote:
>
>> Hi Daniel,
>>
>> Let me answer your questions:
>> 1. Basically all the features you are requesting are implemented in this pull
>> request: https://github.com/apache/flink/pull/292 (per-job YARN cluster &
>> programmatic control of the cluster). Feel free to review the pull
>> request. It has been pending for more than a week now and hasn't gotten much
>> feedback. Also, I would recommend basing the work on security support
>> on that branch.
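>>
>> To illustrate the kind of control flow the pull request is aiming for,
>> here is a rough sketch (the class and method names are made up for
>> illustration, not the actual interfaces from the pull request, and
>> jobGraph stands for the job to run):
>>
>>    // Hypothetical per-job YARN cluster lifecycle:
>>    FlinkYarnClient client = new FlinkYarnClient();
>>    client.setTaskManagerCount(4);
>>    FlinkYarnCluster cluster = client.deploy();   // bring the cluster up
>>    try {
>>        cluster.submitJobAndWait(jobGraph);       // run exactly one job
>>    } finally {
>>        cluster.shutdown();                       // tear the cluster down
>>    }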
>>
>> 2. I agree that the whole configuration loading process is not nicely
>> implemented. When I was working on this, I didn't understand all the
>> features offered by Hadoop's Configuration object. I implemented it in
>> that complicated way to make it as easy as possible for users to use Flink
>> on YARN. As you can see in the code, it tries different commonly used
>> environment variables to detect the location of the configuration files.
>> These config files are then used and respected by the YARN client (for
>> example the default file system name).
>> I'll have a look at the "yarn jar" command. One concern I have with it is
>> that it adds another requirement: we expect the user to have the "yarn"
>> binary in the PATH. I know quite a few environments (for example some
>> users in the Hortonworks Sandbox) which don't have "hadoop" and "yarn" in
>> the PATH. The "yarn jar" command also relies on the environment variables
>> required to locate the Hadoop configuration. But I will carefully check
>> whether using the "yarn jar" command brings us an advantage.
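>>
>> To make that concern concrete, the detection logic is essentially of the
>> following shape (a simplified sketch, not the literal code; the probed
>> variables and the fallback path are examples):
>>
>>    import org.apache.hadoop.conf.Configuration;
>>    import org.apache.hadoop.fs.Path;
>>
>>    // Probe the commonly used environment variables in order.
>>    String confDir = System.getenv("HADOOP_CONF_DIR");
>>    if (confDir == null) {
>>        confDir = System.getenv("YARN_CONF_DIR");
>>    }
>>    if (confDir == null && System.getenv("HADOOP_HOME") != null) {
>>        confDir = System.getenv("HADOOP_HOME") + "/etc/hadoop";
>>    }
>>    Configuration conf = new Configuration(false);
>>    if (confDir != null) {
>>        conf.addResource(new Path(confDir, "core-site.xml"));
>>        conf.addResource(new Path(confDir, "yarn-site.xml"));
>>    }
>>    // If none of the variables is set, the client silently falls back
>>    // to defaults, which is exactly the fragility Daniel describes.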
>>
>> 3. I'm also not completely convinced that this is the right approach. When
>> I was implementing the first version of Flink on YARN, I thought that
>> deploying many small files to HDFS would cause some load on the NameNode
>> and take some time. Right now, we have 146 jars in the lib/ directory. I
>> haven't done a performance comparison, but I guess it's slower to upload
>> 146 files to HDFS instead of 1 (it is not only uploading the files to
>> HDFS; YARN also needs to download and "localize" them prior to allocating
>> new containers).
>> Also, when deploying Flink on YARN on Google Compute cloud, the Google
>> compute storage is configured by default ... and it's quite slow. So this
>> would probably lead to a bad user experience.
>> I completely agree that we need an option for users to use a pre-installed
>> Flink sitting on HDFS or somewhere else in the cluster.
>> There is another issue in this area in our project: I don't like that the
>> "hadoop2" build of Flink produces two binary directories with almost the
>> same content and layout. We could actually merge the whole YARN stuff into
>> the regular hadoop2 build. Therefore, I would suggest putting one Flink
>> fat jar into the lib/ directory. This would also make shading of our
>> dependencies much easier. I will start a separate discussion on that when
>> I have more time again. Right now, I have more pressing issues to solve.
>>
>> Regarding your changes in the "security" branch: I'm super happy that
>> others are starting to work on the YARN client as well. The whole codebase
>> has grown over time and it's certainly good to have more eyes looking at
>> it. The security features of YARN and Hadoop in general are something that
>> I've avoided in the past, because they are so difficult to test properly.
>> But it's something we certainly need to address.
>>
>> Best,
>> Robert
>>
>>
>> On Sun, Jan 18, 2015 at 6:28 PM, Daniel Warneke <warn...@apache.org>
>> wrote:
>>
>>> Hi,
>>>
>>> I just pushed my first version of Flink supporting YARN environments with
>>> security/Kerberos enabled [1]. While working with the current Flink
>>> version, I was really impressed by how easy it is to deploy the software
>>> on a YARN cluster. However, there are a few things I stumbled upon and I
>>> would be interested in your opinion:
>>>
>>> 1. Separation between YARN session and Flink job
>>> Currently, we separate the Flink YARN session from the Flink jobs, i.e. a
>>> user first has to bring up the Flink cluster on YARN through a separate
>>> command and can then submit an arbitrary number of jobs to this cluster.
>>> Through this separation it is possible to submit individual jobs with
>>> really low latency, but it introduces two major problems: First, it is
>>> currently impossible to programmatically launch a Flink YARN cluster,
>>> submit a job, wait for its completion and then tear the cluster down
>>> again (correct me if I’m wrong here), although this is actually a very
>>> important use case. Second, with security enabled, all jobs are executed
>>> with the security credentials of the user who launched the Flink cluster.
>>> This causes massive authorization problems. Therefore, I would propose to
>>> move to a model where we launch one Flink cluster per job (or at least to
>>> make this a very prominent option).
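>>>
>>> To illustrate the authorization problem: on a secure cluster, all HDFS
>>> accesses are tied to the Kerberos login of the JVM, so every job in a
>>> shared session acts as the session owner. Working around that would
>>> require explicit impersonation, roughly like this (a sketch using
>>> Hadoop's UserGroupInformation API; the user name is made up, and proxy
>>> users must additionally be whitelisted in core-site.xml):
>>>
>>>    import java.security.PrivilegedExceptionAction;
>>>    import org.apache.hadoop.security.UserGroupInformation;
>>>
>>>    UserGroupInformation realUser = UserGroupInformation.getLoginUser();
>>>    UserGroupInformation proxy =
>>>        UserGroupInformation.createProxyUser("jobOwner", realUser);
>>>    proxy.doAs(new PrivilegedExceptionAction<Void>() {
>>>        @Override
>>>        public Void run() throws Exception {
>>>            // file system calls here run with jobOwner's authorization
>>>            return null;
>>>        }
>>>    });
>>>
>>> With one cluster per job, none of this is necessary: the cluster simply
>>> runs with the credentials of the user who submitted the job.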
>>>
>>> 2. Loading Hadoop configuration settings for Flink
>>> In the current release, we use custom code to identify and load the
>>> relevant Hadoop XML configuration files (e.g. core-site.xml,
>>> yarn-site.xml) for the Flink YARN client. I found this mechanism to be
>>> quite fragile, as it depends on certain environment variables being set
>>> and assumes certain configuration keys to be specified in certain files.
>>> For example, with Hadoop security enabled, the Flink YARN client needs to
>>> know what kind of authentication mechanism HDFS expects for the data
>>> transfer. This setting is usually specified in hdfs-site.xml. In the
>>> current Flink version, the YARN client ignores this file and hence cannot
>>> talk to HDFS when security is enabled.
>>> As an alternative, I propose to launch the Flink cluster on YARN through
>>> the “yarn jar” command. With this command, you get the entire
>>> configuration setup for free and no longer have to worry about the names
>>> of configuration files, configuration paths and environment variables.
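>>>
>>> To make the failure mode concrete (a sketch; the configuration path and
>>> the keys below are just examples and depend on the cluster setup):
>>>
>>>    import org.apache.hadoop.conf.Configuration;
>>>    import org.apache.hadoop.fs.Path;
>>>    import org.apache.hadoop.security.UserGroupInformation;
>>>
>>>    Configuration conf = new Configuration();
>>>    // Unless hdfs-site.xml is added explicitly (or sits on the
>>>    // classpath, as with "yarn jar"), security keys such as
>>>    // "dfs.data.transfer.protection" keep their defaults and the
>>>    // client negotiates the wrong data transfer protocol with HDFS.
>>>    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
>>>    UserGroupInformation.setConfiguration(conf);
>>>    // e.g. "kerberos" on a secured cluster:
>>>    System.out.println(conf.get("hadoop.security.authentication"));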
>>>
>>> 3. The uberjar deployment model
>>> In my opinion, the current Flink deployment model for YARN, with the one
>>> fat uberjar, is unnecessarily bulky. With the last release, the Flink
>>> uberjar has grown to over 100 MB in size, amounting to almost 400 MB of
>>> class files when uncompressed. Many of the includes are not even
>>> necessary. For example, when using the “yarn jar” hook to deploy Flink,
>>> all relevant Hadoop libraries are added to the classpath anyway, so there
>>> is no need to include them in the uberjar (unless you assume the client
>>> does not have a Hadoop environment installed). Personally, I would favor
>>> a more fine-grained deployment model. Especially when we move to a
>>> one-job-per-session model, I think we should allow having Flink
>>> preinstalled on the cluster nodes and not always require redistributing
>>> the 100 MB uberjar to each and every node.
>>>
>>> Any thoughts on that?
>>>
>>> Best regards,
>>>
>>>      Daniel
>>>
>>> [1] https://github.com/warneke/flink/tree/security
>>>
