Hi Niels,

Thanks for the feedback. As far as I know, Hadoop deliberately defaults to a maximum delegation token lifetime of one week. Have you tried increasing the maximum token lifetime, or was that not an option?
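In case you want to double-check the proxy setup you tried, here is a rough sketch of what I had in mind for option (b) further down in the thread. Treat it as a sketch only: I am assuming the property names referred to by the Cloudera guide linked below (yarn.resourcemanager.proxy-user-privileges.enabled plus the hadoop.proxyuser.* settings) and that the ResourceManager runs as the "yarn" user, so please verify both against your Hadoop version.

In yarn-site.xml on the ResourceManager:

<!-- Allow the ResourceManager to act with proxy-user privileges so it can
     request fresh HDFS delegation tokens on behalf of the job's user. -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>

In core-site.xml on the NameNode (assuming the ResourceManager principal maps to the "yarn" user; restrict hosts/groups if "*" is too permissive for your cluster):

<!-- Hosts from which the "yarn" user may impersonate other users. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>

<!-- Groups whose members the "yarn" user may impersonate. -->
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>

Both the ResourceManager and the NameNode need a restart (or a configuration refresh) for these settings to take effect.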
I wonder why you use a while loop, though. Would it be possible to use the YARN failover mechanism, which starts a new ApplicationMaster and resubmits the job?

Thanks,
Max

On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <ni...@basjes.nl> wrote:
> Hi,
>
> In my environment, doing the "proxy" thing didn't work.
> With a token expiry of 168 hours (1 week), the job consistently terminates
> at exactly 173.5 hours (within a margin of 10 seconds).
> So far we have not been able to solve this problem.
>
> Our teams now simply assume the thing fails once in a while and have an
> automatic restart feature (i.e. a shell script with a while-true loop).
> The best guess at a root cause is this:
> https://issues.apache.org/jira/browse/HDFS-9276
>
> If you have a real solution or a reference to a related bug report for
> this problem, then please share!
>
> Niels Basjes
>
>
> On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
> <thomas.lamira...@ericsson.com> wrote:
>>
>> Hi Max,
>>
>> I will try these workarounds.
>> Thanks
>>
>> Thomas
>>
>> ________________________________________
>> From: Maximilian Michels [m...@apache.org]
>> Sent: Tuesday, March 15, 2016 16:51
>> To: user@flink.apache.org
>> Cc: Niels Basjes
>> Subject: Re: Flink job on secure Yarn fails after many hours
>>
>> Hi Thomas,
>>
>> Niels (CC) and I found out that you need at least Hadoop version 2.6.1
>> to properly run Kerberos applications on Hadoop clusters. Versions
>> before that have critical bugs in the internal security token
>> handling that may expire a token even though it is still valid.
>>
>> That said, there is another limitation in Hadoop: the maximum
>> internal token lifetime is one week. To work around this limit, you
>> have two options:
>>
>> a) Increase the maximum token lifetime.
>>
>> In yarn-site.xml:
>>
>> <property>
>>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
>>   <value>9223372036854775807</value>
>> </property>
>>
>> In hdfs-site.xml:
>>
>> <property>
>>   <name>dfs.namenode.delegation.token.max-lifetime</name>
>>   <value>9223372036854775807</value>
>> </property>
>>
>> b) Set up the YARN ResourceManager as a proxy user for the HDFS NameNode.
>>
>> From
>> http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
>>
>> "You can work around this by configuring the ResourceManager as a
>> proxy user for the corresponding HDFS NameNode so that the
>> ResourceManager can request new tokens when the existing ones are past
>> their maximum lifetime."
>>
>> @Niels: Could you comment on what worked best for you?
>>
>> Best,
>> Max
>>
>>
>> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
>> <thomas.lamira...@ericsson.com> wrote:
>> >
>> > Hello everyone,
>> >
>> > We are now facing the same problem in our Flink applications, launched
>> > using YARN.
>> > Just wanted to know if there is any update on this exception?
>> >
>> > Thanks
>> >
>> > Thomas
>> >
>> > ________________________________
>> > From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes
>> > [ni...@basjes.nl]
>> > Sent: Friday, December 4, 2015 10:40
>> > To: user@flink.apache.org
>> > Subject: Re: Flink job on secure Yarn fails after many hours
>> >
>> > Hi Maximilian,
>> >
>> > I just downloaded the version from your Google Drive and used it to
>> > run my test topology that accesses HBase.
>> > I deliberately started it twice to double the chance of running into
>> > this situation.
>> >
>> > I'll keep you posted.
>> >
>> > Niels
>> >
>> >
>> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <m...@apache.org>
>> > wrote:
>> >>
>> >> Hi Niels,
>> >>
>> >> Just got back from our CI. The build above would fail with a
>> >> Checkstyle error. I corrected that. I have also built the binaries
>> >> for your Hadoop version, 2.6.0.
>> >>
>> >> Binaries:
>> >> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>> >>
>> >> Thanks,
>> >> Max
>> >>
>> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <m...@apache.org>
>> >> wrote:
>> >> [The remainder of this quoted message and the older messages it quotes
>> >> were truncated; only this JobManager log excerpt from Niels survives:]
>> >>
>> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >> >   - Actor akka://flink/user/jobmanager#403236912 terminated,
>> >> >   stopping process...
>> >> > 21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >> >   - Removing web root dir
>> >> >   /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes