Hi Niels,

Thanks for the feedback. As far as I know, Hadoop deliberately defaults to a maximum delegation token lifetime of one week. Have you tried increasing the maximum token lifetime, or was that not an option?
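In case you want to double-check the proxy setup you tried, here is a rough sketch of what I had in mind for option (b) further down in the thread. Treat it as a sketch only: I am assuming the property names referred to by the Cloudera guide linked below (yarn.resourcemanager.proxy-user-privileges.enabled plus the hadoop.proxyuser.* settings) and that the ResourceManager runs as the "yarn" user, so please verify both against your Hadoop version.

In yarn-site.xml on the ResourceManager:

<!-- Allow the ResourceManager to act with proxy-user privileges so it can
     request fresh HDFS delegation tokens on behalf of the job's user. -->
<property>
  <name>yarn.resourcemanager.proxy-user-privileges.enabled</name>
  <value>true</value>
</property>

In core-site.xml on the NameNode (assuming the ResourceManager principal maps to the "yarn" user; restrict hosts/groups if "*" is too permissive for your cluster):

<!-- Hosts from which the "yarn" user may impersonate other users. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>

<!-- Groups whose members the "yarn" user may impersonate. -->
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>

Both the ResourceManager and the NameNode need a restart (or a configuration refresh) for these settings to take effect.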
I wonder why you use a while loop, though. Would it be possible to use the YARN failover mechanism, which starts a new ApplicationMaster and resubmits the job?

Thanks,
Max

On Thu, Mar 17, 2016 at 12:43 PM, Niels Basjes <ni...@basjes.nl> wrote:
> Hi,
>
> In my environment, doing the "proxy" thing didn't work.
> With a token expiry of 168 hours (1 week), the job consistently terminates
> at exactly 173.5 hours (within a margin of 10 seconds).
> So far we have not been able to solve this problem.
>
> Our teams now simply assume the thing fails once in a while and have an
> automatic restart feature (i.e. a shell script with a while-true loop).
> The best guess at a root cause is this:
> https://issues.apache.org/jira/browse/HDFS-9276
>
> If you have a real solution or a reference to a related bug report for
> this problem, then please share!
>
> Niels Basjes
>
>
> On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault
> <thomas.lamira...@ericsson.com> wrote:
>>
>> Hi Max,
>>
>> I will try these workarounds.
>> Thanks
>>
>> Thomas
>>
>> ________________________________________
>> From: Maximilian Michels [m...@apache.org]
>> Sent: Tuesday, March 15, 2016 16:51
>> To: user@flink.apache.org
>> Cc: Niels Basjes
>> Subject: Re: Flink job on secure Yarn fails after many hours
>>
>> Hi Thomas,
>>
>> Niels (CC) and I found out that you need at least Hadoop version 2.6.1
>> to properly run Kerberos applications on Hadoop clusters. Versions
>> before that have critical bugs in the internal security token
>> handling that may expire a token even though it is still valid.
>>
>> That said, there is another limitation in Hadoop: the maximum
>> internal token lifetime is one week. To work around this limit, you
>> have two options:
>>
>> a) Increase the maximum token lifetime.
>>
>> In yarn-site.xml:
>>
>> <property>
>>   <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
>>   <value>9223372036854775807</value>
>> </property>
>>
>> In hdfs-site.xml:
>>
>> <property>
>>   <name>dfs.namenode.delegation.token.max-lifetime</name>
>>   <value>9223372036854775807</value>
>> </property>
>>
>> b) Set up the YARN ResourceManager as a proxy user for the HDFS NameNode.
>>
>> From
>> http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
>>
>> "You can work around this by configuring the ResourceManager as a
>> proxy user for the corresponding HDFS NameNode so that the
>> ResourceManager can request new tokens when the existing ones are past
>> their maximum lifetime."
>>
>> @Niels: Could you comment on what worked best for you?
>>
>> Best,
>> Max
>>
>>
>> On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
>> <thomas.lamira...@ericsson.com> wrote:
>> >
>> > Hello everyone,
>> >
>> > We are now facing the same problem in our Flink applications, launched
>> > using YARN.
>> > Just wanted to know if there is any update on this exception?
>> >
>> > Thanks
>> >
>> > Thomas
>> >
>> > ________________________________
>> > From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes
>> > [ni...@basjes.nl]
>> > Sent: Friday, December 4, 2015 10:40
>> > To: user@flink.apache.org
>> > Subject: Re: Flink job on secure Yarn fails after many hours
>> >
>> > Hi Maximilian,
>> >
>> > I just downloaded the version from your Google Drive and used it to
>> > run my test topology that accesses HBase.
>> > I deliberately started it twice to double the chance of running into
>> > this situation.
>> >
>> > I'll keep you posted.
>> >
>> > Niels
>> >
>> >
>> > On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels <m...@apache.org>
>> > wrote:
>> >>
>> >> Hi Niels,
>> >>
>> >> Just got back from our CI. The build above would fail with a
>> >> Checkstyle error. I corrected that. I have also built the binaries
>> >> for your Hadoop version, 2.6.0.
>> >>
>> >> Binaries:
>> >> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>> >>
>> >> Thanks,
>> >> Max
>> >>
>> >> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels <m...@apache.org>
>> >> wrote:
>> >> [The remainder of this quoted message and the older messages it quotes
>> >> were truncated; only this JobManager log excerpt from Niels survives:]
>> >>
>> >> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >> >   - Actor akka://flink/user/jobmanager#403236912 terminated,
>> >> >   stopping process...
>> >> > 21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >> >   - Removing web root dir
>> >> >   /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes