Great to hear you sorted things out. Looking forward to the pull request!
On Mon, Nov 9, 2015 at 4:50 PM, Stephan Ewen <se...@apache.org> wrote:
> Super nice to hear :-)
>
> On Mon, Nov 9, 2015 at 4:48 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>
>> Apparently I just had to wait a bit longer for the first run. Now I'm able to package the project in about 7 minutes.
>>
>> Current status: I am now able to access HBase from within Flink on a Kerberos-secured cluster. Cleaning up the patch so I can submit it in a few days.
>>
>> On Sat, Nov 7, 2015 at 10:01 PM, Stephan Ewen <se...@apache.org> wrote:
>>>
>>> The single shading step on my machine (SSD, 10 GB RAM) takes about 45 seconds. An HDD may take significantly longer, but it should really not be more than 10 minutes.
>>>
>>> Is your Maven build always stuck in that stage (flink-dist), showing a long list of dependencies (saying including org.x.y, including com.foo.bar, ...)?
>>>
>>> On Sat, Nov 7, 2015 at 9:57 PM, Sachin Goel <sachingoel0...@gmail.com> wrote:
>>>>
>>>> Usually, if all the dependencies are being downloaded, i.e. on the first build, it'll likely take 30-40 minutes. Subsequent builds might take approx. 10 minutes. [I have the same PC configuration.]
>>>>
>>>> -- Sachin Goel
>>>> Computer Science, IIT Delhi
>>>> m. +91-9871457685
>>>>
>>>> On Sun, Nov 8, 2015 at 2:05 AM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>>
>>>>> How long should this take if you have an HDD and about 8 GB of RAM? Is that 10 minutes? 20?
>>>>>
>>>>> Niels
>>>>>
>>>>> On Sat, Nov 7, 2015 at 2:51 PM, Stephan Ewen <se...@apache.org> wrote:
>>>>>>
>>>>>> Hi Niels!
>>>>>>
>>>>>> Usually, you simply build the binaries by invoking "mvn -DskipTests clean package" in the root flink directory. The resulting program should be in the "build-target" directory.
>>>>>>
>>>>>> If the build gets stuck, let us know where, and what the last message on the command line is.
>>>>>>
>>>>>> Please be aware that the final step of building the "flink-dist" project may take a while, especially on systems with hard disks (as opposed to SSDs) and a comparatively low amount of memory. The reason is that building the final JAR file is quite expensive, because the system re-packages certain libraries in order to avoid conflicts between different versions.
>>>>>>
>>>>>> Stephan
>>>>>>
>>>>>> On Sat, Nov 7, 2015 at 2:40 PM, Niels Basjes <ni...@basj.es> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Excellent. What you can help me with are the commands to build the binary distribution from source. I tried it last Thursday and the build seemed to get stuck at some point (at the end of / just after building the dist module). I haven't been able to figure out why yet.
>>>>>>>
>>>>>>> Niels
>>>>>>>
>>>>>>> On 5 Nov 2015 14:57, "Maximilian Michels" <m...@apache.org> wrote:
>>>>>>>>
>>>>>>>> Thank you for looking into the problem, Niels. Let us know if you need anything. We would be happy to merge a pull request once you have verified the fix.
>>>>>>>>
>>>>>>>> On Thu, Nov 5, 2015 at 1:38 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>>>>>>
>>>>>>>>> I created https://issues.apache.org/jira/browse/FLINK-2977
>>>>>>>>>
>>>>>>>>> On Thu, Nov 5, 2015 at 12:25 PM, Robert Metzger <rmetz...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Niels,
>>>>>>>>>> thank you for analyzing the issue so thoroughly. I agree with you.
>>>>>>>>>> It seems that HDFS and HBase are using their own tokens, which we need to transfer from the client to the YARN containers. We should be able to port the fix from Spark (which they got from Storm) into our YARN client. I think we would add this in org.apache.flink.yarn.Utils#setTokensFor().
>>>>>>>>>>
>>>>>>>>>> Do you want to implement and verify the fix yourself? If you are too busy at the moment, we can also discuss how to share the work (I implement it, you test the fix).
>>>>>>>>>>
>>>>>>>>>> Robert
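For context on the fix discussed above: the Spark change (SPARK-6918, referenced further down in this thread) boils down to obtaining an HBase delegation token on the already Kerberos-authenticated client and shipping it to the YARN containers together with the HDFS tokens. A minimal sketch of what such a port might look like, assuming the HBase 0.98-era TokenUtil API; the class and method names below (other than TokenUtil.obtainToken) are illustrative, not actual Flink code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.security.token.TokenUtil;
    import org.apache.hadoop.security.Credentials;
    import org.apache.hadoop.security.token.Token;

    public class HBaseTokens {
        /**
         * Obtains an HBase delegation token on the client and adds it to the
         * credentials that the YARN client ships to the containers (in Flink,
         * the natural place would be org.apache.flink.yarn.Utils#setTokensFor()).
         */
        public static void addHBaseDelegationToken(Credentials credentials) throws Exception {
            Configuration hbaseConf = HBaseConfiguration.create();
            // Only attempt this when HBase is actually secured with Kerberos.
            if ("kerberos".equals(hbaseConf.get("hbase.security.authentication"))) {
                Token<?> token = TokenUtil.obtainToken(hbaseConf);
                credentials.addToken(token.getService(), token);
            }
        }
    }

Note that Spark's version loads TokenUtil via reflection so that HBase remains an optional dependency; a direct call is shown here only for readability.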
>>>>>>>>>> On Tue, Nov 3, 2015 at 5:26 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Update on the status so far... I suspect I found a problem in a secure setup.
>>>>>>>>>>>
>>>>>>>>>>> I have created a very simple Flink topology consisting of a streaming Source (that outputs the timestamp a few times per second) and a Sink (that puts that timestamp into a single record in HBase). Running this on a non-secure YARN cluster works fine.
>>>>>>>>>>>
>>>>>>>>>>> To run it on a secured YARN cluster, my main routine now looks like this:
>>>>>>>>>>>
>>>>>>>>>>> public static void main(String[] args) throws Exception {
>>>>>>>>>>>     System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
>>>>>>>>>>>     UserGroupInformation.loginUserFromKeytab("nbas...@xxxxxx.net", "/home/nbasjes/.krb/nbasjes.keytab");
>>>>>>>>>>>
>>>>>>>>>>>     final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>>>>>>>     env.setParallelism(1);
>>>>>>>>>>>
>>>>>>>>>>>     DataStream<String> stream = env.addSource(new TimerTicksSource());
>>>>>>>>>>>     stream.addSink(new SetHBaseRowSink());
>>>>>>>>>>>     env.execute("Long running Flink application");
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> When I run this:
>>>>>>>>>>>
>>>>>>>>>>> flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 4096 ./kerberos-1.0-SNAPSHOT.jar
>>>>>>>>>>>
>>>>>>>>>>> I see after the startup messages:
>>>>>>>>>>>
>>>>>>>>>>> 17:13:24,466 INFO org.apache.hadoop.security.UserGroupInformation - Login successful for user nbas...@xxxxxx.net using keytab file /home/nbasjes/.krb/nbasjes.keytab
>>>>>>>>>>> 11/03/2015 17:13:25 Job execution switched to status RUNNING.
>>>>>>>>>>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to SCHEDULED
>>>>>>>>>>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to DEPLOYING
>>>>>>>>>>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to RUNNING
>>>>>>>>>>>
>>>>>>>>>>> Which looks good.
>>>>>>>>>>>
>>>>>>>>>>> However ... no data goes into HBase. After some digging I found this error in the task manager's log:
>>>>>>>>>>>
>>>>>>>>>>> 17:13:42,677 WARN org.apache.hadoop.hbase.ipc.RpcClient - Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>>>>>>>>>>> 17:13:42,677 FATAL org.apache.hadoop.hbase.ipc.RpcClient - SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
>>>>>>>>>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>>>>>>>>>>>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>>>>>>>>>>     at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:177)
>>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupSaslConnection(RpcClient.java:815)
>>>>>>>>>>>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.access$800(RpcClient.java:349)
>>>>>>>>>>>
>>>>>>>>>>> First starting a yarn-session and then loading my job gives the same error.
>>>>>>>>>>>
>>>>>>>>>>> My best guess at this point is that Flink needs the same fix as described here:
>>>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-6918 ( https://github.com/apache/spark/pull/5586 )
>>>>>>>>>>>
>>>>>>>>>>> What do you guys think?
>>>>>>>>>>>
>>>>>>>>>>> Niels Basjes
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 27, 2015 at 6:12 PM, Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Niels,
>>>>>>>>>>>>
>>>>>>>>>>>> You're welcome. Some more information on how this would be configured:
>>>>>>>>>>>>
>>>>>>>>>>>> In the kdc.conf, there are two variables:
>>>>>>>>>>>>
>>>>>>>>>>>> max_life = 2h 0m 0s
>>>>>>>>>>>> max_renewable_life = 7d 0h 0m 0s
>>>>>>>>>>>>
>>>>>>>>>>>> max_life is the maximum life of the current ticket. However, it may be renewed up to a time span of max_renewable_life from the first ticket issue on. This means that from the first ticket issue, new tickets may be requested for one week. Each renewed ticket has a lifetime of max_life (2 hours in this case).
>>>>>>>>>>>>
>>>>>>>>>>>> Please let us know about any difficulties with long-running streaming applications and Kerberos.
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Max
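To make the two settings above concrete: with max_life = 2h and max_renewable_life = 7d, the client would typically request a renewable ticket along these lines (the keytab path and principal are placeholders):

    kinit -r 7d -l 2h -kt /path/to/user.keytab <principal>

Such a ticket can then be renewed (e.g. with "kinit -R") every couple of hours, but only for up to one week after the first issue.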
>>>>>>>>>>>> On Tue, Oct 27, 2015 at 2:46 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your feedback. So I guess I'll have to talk to the security guys about having special Kerberos ticket expiry times for these types of jobs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Niels Basjes
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 11:45 AM, Maximilian Michels <m...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Niels,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you for your question. Flink relies entirely on the Kerberos support of Hadoop. So your question could also be rephrased to "Does Hadoop support long-term authentication using Kerberos?". And the answer is: Yes!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> While Hadoop uses Kerberos tickets to authenticate users with services initially, the authentication process continues differently afterwards. Instead of saving the ticket to authenticate on a later access, Hadoop creates its own security tokens (DelegationToken) that it passes around. These are authenticated against Kerberos periodically. To my knowledge, the tokens have a life span identical to the Kerberos ticket's maximum life span. So be sure to set the maximum life span very high for long streaming jobs. The renewal time, on the other hand, is not important because Hadoop abstracts this away using its own security tokens.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm afraid there is no Kerberos how-to yet. If you are on YARN, then it is sufficient to authenticate the client with Kerberos. On a Flink standalone cluster you need to ensure that, initially, all nodes are authenticated with Kerberos using the kinit tool.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Feel free to ask if you have more questions and let us know about any difficulties.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>> Max
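For the standalone-cluster case mentioned above, one common pattern for keeping a long-running JVM authenticated is to log in from a keytab once and then periodically re-check the TGT via Hadoop's UserGroupInformation. A minimal sketch; the PeriodicRelogin class name and the hourly interval are illustrative assumptions, not Flink code:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.security.UserGroupInformation;

    public class PeriodicRelogin {
        /** Logs in from a keytab and re-checks the TGT once per hour. */
        public static void start(String principal, String keytabPath) throws Exception {
            UserGroupInformation.loginUserFromKeytab(principal, keytabPath);

            ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
            scheduler.scheduleAtFixedRate(() -> {
                try {
                    // No-op unless the TGT is close to expiry; then it
                    // re-authenticates from the keytab.
                    UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }, 1, 1, TimeUnit.HOURS);
        }
    }

Note that this only keeps the Kerberos TGT fresh; it does not by itself solve the HBase delegation-token problem discussed further up in this thread.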
>>>>>>>>>>>>>> On Thu, Oct 22, 2015 at 2:06 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I want to write a long-running (i.e. never stop it) streaming Flink application on a Kerberos-secured Hadoop/YARN cluster. My application needs to do things with files on HDFS and HBase tables on that cluster, so having the correct Kerberos tickets is very important. The stream is to be ingested from Kafka.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > One of the things with Kerberos is that the tickets expire after a predetermined time. My knowledge about Kerberos is very limited, so I hope you guys can help me.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > My question is actually quite simple: Is there a how-to somewhere on how to correctly run a long-running Flink application with Kerberos that includes a solution for the Kerberos ticket timeout?
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Niels Basjes
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Best regards / Met vriendelijke groeten,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Niels Basjes
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best regards / Met vriendelijke groeten,
>>>>>>>>>>>
>>>>>>>>>>> Niels Basjes
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards / Met vriendelijke groeten,
>>>>>>>>>
>>>>>>>>> Niels Basjes
>>>>>
>>>>> --
>>>>> Best regards / Met vriendelijke groeten,
>>>>>
>>>>> Niels Basjes
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes