Hi Niels! Usually, you simply build the binaries by invoking "mvn -DskipTests clean package" in the root flink directory. The resulting program should be in the "build-target" directory.
If the program gets stuck, let us know where and what the last message on the command line is. Please be aware that the final step of building the "flink-dist" project may take a while, especially on systems with hard disks (as opposed to SSDs) and a comparatively low amount of memory. The reason is that building the final JAR file is quite expensive: the system re-packages certain libraries in order to avoid conflicts between different versions.

Stephan

On Sat, Nov 7, 2015 at 2:40 PM, Niels Basjes <[email protected]> wrote:

> Hi,
>
> Excellent.
> What you can help me with are the commands to build the binary
> distribution from source.
> I tried it last Thursday and the build seemed to get stuck at some point
> (at the end of/just after building the dist module).
> I haven't been able to figure out why yet.
>
> Niels
> On 5 Nov 2015 14:57, "Maximilian Michels" <[email protected]> wrote:
>
>> Thank you for looking into the problem, Niels. Let us know if you need
>> anything. We would be happy to merge a pull request once you have verified
>> the fix.
>>
>> On Thu, Nov 5, 2015 at 1:38 PM, Niels Basjes <[email protected]> wrote:
>>
>>> I created https://issues.apache.org/jira/browse/FLINK-2977
>>>
>>> On Thu, Nov 5, 2015 at 12:25 PM, Robert Metzger <[email protected]>
>>> wrote:
>>>
>>>> Hi Niels,
>>>> thank you for analyzing the issue so thoroughly. I agree with you. It
>>>> seems that HDFS and HBase are using their own tokens, which we need to
>>>> transfer from the client to the YARN containers. We should be able to port
>>>> the fix from Spark (which they got from Storm) into our YARN client.
>>>> I think we would add this in
>>>> org.apache.flink.yarn.Utils#setTokensFor().
>>>>
>>>> Do you want to implement and verify the fix yourself?
>>>> If you are too busy at the moment, we can also discuss how we share
>>>> the work (I'm implementing it, you test the fix).
>>>>
>>>>
>>>> Robert
>>>>
>>>> On Tue, Nov 3, 2015 at 5:26 PM, Niels Basjes <[email protected]> wrote:
>>>>
>>>>> Update on the status so far... I suspect I found a problem in a
>>>>> secure setup.
>>>>>
>>>>> I have created a very simple Flink topology consisting of a streaming
>>>>> Source (that outputs the timestamp a few times per second) and a Sink
>>>>> (that puts that timestamp into a single record in HBase).
>>>>> Running this on a non-secure Yarn cluster works fine.
>>>>>
>>>>> To run it on a secured Yarn cluster my main routine now looks like
>>>>> this:
>>>>>
>>>>> public static void main(String[] args) throws Exception {
>>>>>     System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
>>>>>     UserGroupInformation.loginUserFromKeytab("[email protected]",
>>>>>         "/home/nbasjes/.krb/nbasjes.keytab");
>>>>>
>>>>>     final StreamExecutionEnvironment env =
>>>>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>>>>     env.setParallelism(1);
>>>>>
>>>>>     DataStream<String> stream = env.addSource(new TimerTicksSource());
>>>>>     stream.addSink(new SetHBaseRowSink());
>>>>>     env.execute("Long running Flink application");
>>>>> }
>>>>>
>>>>> When I run this:
>>>>>
>>>>> flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 4096
>>>>>     ./kerberos-1.0-SNAPSHOT.jar
>>>>>
>>>>> I see after the startup messages:
>>>>>
>>>>> 17:13:24,466 INFO org.apache.hadoop.security.UserGroupInformation
>>>>>     - Login successful for user [email protected] using keytab
>>>>>     file /home/nbasjes/.krb/nbasjes.keytab
>>>>> 11/03/2015 17:13:25 Job execution switched to status RUNNING.
>>>>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to SCHEDULED
>>>>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to DEPLOYING
>>>>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to RUNNING
>>>>>
>>>>> Which looks good.
>>>>>
>>>>> However ... no data goes into HBase.
>>>>> After some digging I found this error in the task manager's log:
>>>>>
>>>>> 17:13:42,677 WARN org.apache.hadoop.hbase.ipc.RpcClient
>>>>>     - Exception encountered while connecting to the server :
>>>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>>>> GSSException: No valid credentials provided (Mechanism level: Failed to
>>>>> find any Kerberos tgt)]
>>>>> 17:13:42,677 FATAL org.apache.hadoop.hbase.ipc.RpcClient
>>>>>     - SASL authentication failed. The most likely cause is missing or
>>>>> invalid credentials. Consider 'kinit'.
>>>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>>>> GSSException: No valid credentials provided (Mechanism level: Failed to
>>>>> find any Kerberos tgt)]
>>>>>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>>>>     at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:177)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupSaslConnection(RpcClient.java:815)
>>>>>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.access$800(RpcClient.java:349)
>>>>>
>>>>> First starting a yarn-session and then loading my job gives the same
>>>>> error.
>>>>>
>>>>> My best guess at this point is that Flink needs the same fix as
>>>>> described here:
>>>>>
>>>>> https://issues.apache.org/jira/browse/SPARK-6918
>>>>> (https://github.com/apache/spark/pull/5586)
>>>>>
>>>>> What do you guys think?
>>>>>
>>>>> Niels Basjes
>>>>>
>>>>>
>>>>> On Tue, Oct 27, 2015 at 6:12 PM, Maximilian Michels <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Niels,
>>>>>>
>>>>>> You're welcome. Some more information on how this would be configured:
>>>>>>
>>>>>> In the kdc.conf, there are two variables:
>>>>>>
>>>>>> max_life = 2h 0m 0s
>>>>>> max_renewable_life = 7d 0h 0m 0s
>>>>>>
>>>>>> max_life is the maximum life of the current ticket.
>>>>>> However, it may be renewed up to a time span of max_renewable_life
>>>>>> from the first ticket issue on. This means that from the first ticket
>>>>>> issue, new tickets may be requested for one week. Each renewed ticket
>>>>>> has a lifetime of max_life (2 hours in this case).
>>>>>>
>>>>>> Please let us know about any difficulties with long-running streaming
>>>>>> applications and Kerberos.
>>>>>>
>>>>>> Best regards,
>>>>>> Max
>>>>>>
>>>>>> On Tue, Oct 27, 2015 at 2:46 PM, Niels Basjes <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thanks for your feedback.
>>>>>>> So I guess I'll have to talk to the security guys about having special
>>>>>>> Kerberos ticket expiry times for these types of jobs.
>>>>>>>
>>>>>>> Niels Basjes
>>>>>>>
>>>>>>> On Fri, Oct 23, 2015 at 11:45 AM, Maximilian Michels <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Niels,
>>>>>>>>
>>>>>>>> Thank you for your question. Flink relies entirely on the Kerberos
>>>>>>>> support of Hadoop. So your question could also be rephrased to "Does
>>>>>>>> Hadoop support long-term authentication using Kerberos?". And the
>>>>>>>> answer is: Yes!
>>>>>>>>
>>>>>>>> While Hadoop uses Kerberos tickets to authenticate users with services
>>>>>>>> initially, the authentication process continues differently
>>>>>>>> afterwards. Instead of saving the ticket to authenticate on a later
>>>>>>>> access, Hadoop creates its own security tokens (DelegationToken) that
>>>>>>>> it passes around. These are authenticated with Kerberos periodically.
>>>>>>>> To my knowledge, the tokens have a life span identical to the Kerberos
>>>>>>>> ticket maximum life span. So be sure to set the maximum life span very
>>>>>>>> high for long streaming jobs. The renewal time, on the other hand, is
>>>>>>>> not important because Hadoop abstracts this away using its own
>>>>>>>> security tokens.
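[Editor's note: the ticket-lifetime arithmetic Max describes (max_life = 2h, max_renewable_life = 7d) can be sketched in a few lines of Java. This is an illustration only; the class and variable names are made up for this sketch and do not come from Flink, Hadoop, or MIT Kerberos.]

```java
import java.time.Duration;

// Sketch of the renewal arithmetic from the kdc.conf values in this thread:
//   max_life           = 2h   (lifetime of each individual ticket)
//   max_renewable_life = 7d   (renewal window, counted from the first issue)
public class TicketLifetimeSketch {
    public static void main(String[] args) {
        Duration maxLife = Duration.ofHours(2);
        Duration maxRenewableLife = Duration.ofDays(7);

        // Within the 7-day renewable window a fresh 2-hour ticket can be
        // requested repeatedly; once the window closes, no further renewals.
        long renewals = maxRenewableLife.toHours() / maxLife.toHours();

        System.out.println("ticket lifetime (h): " + maxLife.toHours());
        System.out.println("renewable window (h): " + maxRenewableLife.toHours());
        System.out.println("renewals possible: " + renewals);
    }
}
```

So with these settings a job that renews its ticket on time can keep valid credentials for at most one week; a truly long-running streaming job needs either a much larger max_renewable_life or keytab-based re-login.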
>>>>>>>>
>>>>>>>> I'm afraid there is no Kerberos how-to yet. If you are on Yarn, then
>>>>>>>> it is sufficient to authenticate the client with Kerberos. On a Flink
>>>>>>>> standalone cluster you need to ensure that, initially, all nodes are
>>>>>>>> authenticated with Kerberos using the kinit tool.
>>>>>>>>
>>>>>>>> Feel free to ask if you have more questions and let us know about any
>>>>>>>> difficulties.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Max
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 22, 2015 at 2:06 PM, Niels Basjes <[email protected]>
>>>>>>>> wrote:
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > I want to write a long-running (i.e. never stop it) streaming Flink
>>>>>>>> > application on a Kerberos-secured Hadoop/Yarn cluster. My
>>>>>>>> > application needs to do things with files on HDFS and HBase tables
>>>>>>>> > on that cluster, so having the correct Kerberos tickets is very
>>>>>>>> > important. The stream is to be ingested from Kafka.
>>>>>>>> >
>>>>>>>> > One of the things with Kerberos is that the tickets expire after a
>>>>>>>> > predetermined time. My knowledge about Kerberos is very limited,
>>>>>>>> > so I hope you guys can help me.
>>>>>>>> >
>>>>>>>> > My question is actually quite simple: Is there a how-to somewhere
>>>>>>>> > on how to correctly run a long-running Flink application with
>>>>>>>> > Kerberos that includes a solution for the Kerberos ticket timeout?
>>>>>>>> >
>>>>>>>> > Thanks
>>>>>>>> >
>>>>>>>> > Niels Basjes
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards / Met vriendelijke groeten,
>>>>>>>
>>>>>>> Niels Basjes
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best regards / Met vriendelijke groeten,
>>>>>
>>>>> Niels Basjes
>>>>
>>>
>>>
>>> --
>>> Best regards / Met vriendelijke groeten,
>>>
>>> Niels Basjes
>>
>>
