I created https://issues.apache.org/jira/browse/FLINK-2977

On Thu, Nov 5, 2015 at 12:25 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Niels,
> thank you for analyzing the issue so thoroughly. I agree with you. It seems
> that HDFS and HBase use their own tokens, which we need to transfer
> from the client to the YARN containers. We should be able to port the fix
> from Spark (which they got from Storm) into our YARN client.
> I think we would add this in org.apache.flink.yarn.Utils#setTokensFor().
>
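> A minimal sketch of what that could look like, modeled on the Spark fix
> (SPARK-6918). This is hypothetical, not the actual Flink change: the
> Configuration and Credentials objects are assumed to be whatever
> setTokensFor() already has in scope, and TokenUtil.obtainToken(Configuration)
> is the HBase 0.98/1.x API, which may differ in other HBase versions.
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier;
> import org.apache.hadoop.hbase.security.token.TokenUtil;
> import org.apache.hadoop.security.Credentials;
> import org.apache.hadoop.security.token.Token;
>
> // Hypothetical sketch: fetch an HBase delegation token on the client
> // and add it to the credentials that are shipped to the YARN containers.
> private static void addHBaseToken(Configuration conf, Credentials credentials)
>         throws Exception {
>     Configuration hbaseConf = HBaseConfiguration.create(conf);
>     if ("kerberos".equals(hbaseConf.get("hbase.security.authentication"))) {
>         Token<AuthenticationTokenIdentifier> token = TokenUtil.obtainToken(hbaseConf);
>         credentials.addToken(token.getService(), token);
>     }
> }
>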
> Do you want to implement and verify the fix yourself? If you are too busy
> at the moment, we can also discuss how to share the work (for example, I
> implement it and you test the fix).
>
>
> Robert
>
> On Tue, Nov 3, 2015 at 5:26 PM, Niels Basjes <ni...@basjes.nl> wrote:
>
>> Update on the status so far... I suspect I have found a problem in a
>> secure setup.
>>
>> I have created a very simple Flink topology consisting of a streaming
>> Source (that outputs the timestamp a few times per second) and a Sink (that
>> puts that timestamp into a single record in HBase).
>> Running this on a non-secure Yarn cluster works fine.
>>
>> To run it on a secured Yarn cluster my main routine now looks like this:
>>
>> public static void main(String[] args) throws Exception {
>>     // Authenticate against the secured cluster before anything else:
>>     // point the JVM at the Kerberos config and log in from a keytab.
>>     System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
>>     UserGroupInformation.loginUserFromKeytab("nbas...@xxxxxx.net",
>>         "/home/nbasjes/.krb/nbasjes.keytab");
>>
>>     final StreamExecutionEnvironment env =
>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>     env.setParallelism(1);
>>
>>     // Source emits timestamps; sink writes them to one HBase record.
>>     DataStream<String> stream = env.addSource(new TimerTicksSource());
>>     stream.addSink(new SetHBaseRowSink());
>>     env.execute("Long running Flink application");
>> }
>>
>> When I run this
>>      flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 4096
>> ./kerberos-1.0-SNAPSHOT.jar
>>
>> I see after the startup messages:
>>
>> 17:13:24,466 INFO  org.apache.hadoop.security.UserGroupInformation - Login successful for user nbas...@xxxxxx.net using keytab file /home/nbasjes/.krb/nbasjes.keytab
>> 11/03/2015 17:13:25 Job execution switched to status RUNNING.
>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to SCHEDULED
>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to DEPLOYING
>> 11/03/2015 17:13:25 Custom Source -> Stream Sink(1/1) switched to RUNNING
>>
>> Which looks good.
>>
>> However ... no data goes into HBase.
>> After some digging I found this error in the task managers log:
>>
>> 17:13:42,677 WARN  org.apache.hadoop.hbase.ipc.RpcClient - Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>> 17:13:42,677 FATAL org.apache.hadoop.hbase.ipc.RpcClient - SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
>>      at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>      at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:177)
>>      at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupSaslConnection(RpcClient.java:815)
>>      at org.apache.hadoop.hbase.ipc.RpcClient$Connection.access$800(RpcClient.java:349)
>>
>>
>> First starting a yarn-session and then loading my job gives the same
>> error.
>>
>> My best guess at this point is that Flink needs the same fix as described
>> here:
>>
>> https://issues.apache.org/jira/browse/SPARK-6918   (
>> https://github.com/apache/spark/pull/5586 )
>>
>> What do you guys think?
>>
>> Niels Basjes
>>
>>
>>
>> On Tue, Oct 27, 2015 at 6:12 PM, Maximilian Michels <m...@apache.org>
>> wrote:
>>
>>> Hi Niels,
>>>
>>> You're welcome. Some more information on how this would be configured:
>>>
>>> In the kdc.conf, there are two variables:
>>>
>>>         max_life = 2h 0m 0s
>>>         max_renewable_life = 7d 0h 0m 0s
>>>
>>> max_life is the maximum lifetime of the current ticket. However, the
>>> ticket may be renewed repeatedly, up to a time span of max_renewable_life
>>> from the first ticket issue. This means that from the first ticket issue,
>>> new tickets may be requested for one week. Each renewed ticket has a
>>> lifetime of max_life (2 hours in this case).
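>>>
>>> As a concrete illustration with these limits (the principal name below
>>> is only a placeholder):
>>>
>>>         kinit -l 2h -r 7d nbasjes@EXAMPLE.COM   # 2h ticket, renewable for 7d
>>>         kinit -R                                # renew before the 2h expire
>>>         klist                                   # shows expiry and renew-until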
>>>
>>> Please let us know about any difficulties with long-running streaming
>>> applications and Kerberos.
>>>
>>> Best regards,
>>> Max
>>>
>>> On Tue, Oct 27, 2015 at 2:46 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for your feedback.
>>>> So I guess I'll have to talk to the security guys about having special
>>>> kerberos ticket expiry times for these types of jobs.
>>>>
>>>> Niels Basjes
>>>>
>>>> On Fri, Oct 23, 2015 at 11:45 AM, Maximilian Michels <m...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Niels,
>>>>>
>>>>> Thank you for your question. Flink relies entirely on the Kerberos
>>>>> support of Hadoop. So your question could also be rephrased to "Does
>>>>> Hadoop support long-term authentication using Kerberos?". And the
>>>>> answer is: Yes!
>>>>>
>>>>> While Hadoop uses Kerberos tickets to authenticate users with services
>>>>> initially, the authentication process continues differently
>>>>> afterwards. Instead of saving the ticket to authenticate on a later
>>>>> access, Hadoop creates its own security tokens (DelegationToken) that
>>>>> it passes around. These are authenticated against Kerberos
>>>>> periodically. To my knowledge, the tokens have a life span identical
>>>>> to the Kerberos ticket maximum life span. So be sure to set the
>>>>> maximum life span very high for long streaming jobs. The renewal time,
>>>>> on the other hand, is not important because Hadoop abstracts this away
>>>>> using its own security tokens.
>>>>>
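>>>>> To make that concrete, here is a minimal sketch of the standard Hadoop
>>>>> client calls involved (plain Hadoop API usage, not Flink-specific code;
>>>>> the principal, keytab path, and renewer below are placeholders):
>>>>>
>>>>> import org.apache.hadoop.conf.Configuration;
>>>>> import org.apache.hadoop.fs.FileSystem;
>>>>> import org.apache.hadoop.security.Credentials;
>>>>> import org.apache.hadoop.security.UserGroupInformation;
>>>>>
>>>>> // Log in via Kerberos once, then collect HDFS delegation tokens that
>>>>> // can be shipped around instead of the Kerberos ticket itself.
>>>>> UserGroupInformation.loginUserFromKeytab(
>>>>>     "nbasjes@EXAMPLE.COM", "/home/nbasjes/.krb/nbasjes.keytab");
>>>>> Credentials credentials = new Credentials();
>>>>> FileSystem fs = FileSystem.get(new Configuration());
>>>>> fs.addDelegationTokens("yarn", credentials);   // "yarn" is the renewer
>>>>>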
>>>>> I'm afraid there is no Kerberos how-to yet. If you are on YARN, it is
>>>>> sufficient to authenticate the client with Kerberos. On a Flink
>>>>> standalone cluster you need to ensure that, initially, all nodes are
>>>>> authenticated with Kerberos using the kinit tool.
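>>>>>
>>>>> For example, on each node of a standalone cluster (the keytab path and
>>>>> principal below are placeholders):
>>>>>
>>>>>     kinit -kt /etc/security/keytabs/flink.keytab flink@EXAMPLE.COM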
>>>>>
>>>>> Feel free to ask if you have more questions and let us know about any
>>>>> difficulties.
>>>>>
>>>>> Best regards,
>>>>> Max
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 22, 2015 at 2:06 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>> > Hi,
>>>>> >
>>>>> > I want to write a long-running (i.e. never stop it) streaming Flink
>>>>> > application on a Kerberos-secured Hadoop/YARN cluster. My application
>>>>> > needs to work with files on HDFS and HBase tables on that cluster, so
>>>>> > having the correct Kerberos tickets is very important. The stream is
>>>>> > to be ingested from Kafka.
>>>>> >
>>>>> > One of the things with Kerberos is that the tickets expire after a
>>>>> > predetermined time. My knowledge about Kerberos is very limited, so I
>>>>> > hope you guys can help me.
>>>>> >
>>>>> > My question is actually quite simple: is there a how-to somewhere on
>>>>> > how to correctly run a long-running Flink application with Kerberos
>>>>> > that includes a solution for the Kerberos ticket timeout?
>>>>> >
>>>>> > Thanks
>>>>> >
>>>>> > Niels Basjes
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best regards / Met vriendelijke groeten,
>>>>
>>>> Niels Basjes
>>>>
>>>
>>>
>>
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>>
>
>


-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
