Re: Flink job on secure Yarn fails after many hours

2017-04-14 Thread Niels Basjes
ated bug report to this problem then please share! Niels Basjes On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault wrote: > Hi Max,

Re: Flink job on secure Yarn fails after many hours

2017-04-12 Thread Robert Metzger
Niels Basjes On Thu, Mar 17, 2016 at 10:20 AM, Thomas Lamirault wrote: > Hi Max, > I will try these workarounds. > Thanks > Thomas

Re: Flink job on secure Yarn fails after many hours

2016-03-19 Thread Niels Basjes
] > Sent: Tuesday, March 15, 2016 16:51 > To: user@flink.apache.org > Cc: Niels Basjes > Subject: Re: Flink job on secure Yarn fails after many hours > > Hi Thomas, > > Niels (CC) and I found out that you need at least Hadoop version 2.6.1 > to properly run Kerberos applications on Hadoop

Re: Flink job on secure Yarn fails after many hours

2016-03-18 Thread Maximilian Michels
irault > wrote: >> >> Hi Max, >> >> I will try these workarounds. >> Thanks >> >> Thomas >> >> >> From: Maximilian Michels [m...@apache.org] >> Sent: Tuesday, March 15, 2016 16:51 >> To: user

Re: Flink job on secure Yarn fails after many hours

2016-03-15 Thread Maximilian Michels
_ > > From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes > [ni...@basjes.nl] > Sent: Friday, December 4, 2015 10:40 > To: user@flink.apache.org > Subject: Re: Flink job on secure Yarn fails after many hours > > Hi Maximilian, > > I just do

Re: Flink job on secure Yarn fails after many hours

2015-12-04 Thread Niels Basjes
Hi Maximilian, I just downloaded the version from your Google Drive and used that to run my test topology that accesses HBase. I deliberately started it twice to double the chance of running into this situation. I'll keep you posted. Niels On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels wrote

Re: Flink job on secure Yarn fails after many hours

2015-12-03 Thread Maximilian Michels
Hi Niels, Just got back from our CI. The build above would fail with a Checkstyle error. I corrected that. Also I have built the binaries for your Hadoop version 2.6.0. Binaries: https://drive.google.com/file/d/0BziY9U_qva1sZ1FVR3RWeVNrNzA/view?usp=sharing Source: https://github.com/mxm/flink/

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
I forgot you're using Flink 0.10.1. The above was for the master. So here's the commit for Flink 0.10.1: https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd git fetch https://github.com/mxm/flink/ \ a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD http

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
Great. Here is the commit to try out: https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3 If you already have the Flink repository, check it out using git fetch https://github.com/mxm/flink/ f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD Alternativel

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
Sure, just give me the git repo url to build and I'll give it a try. Niels On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels wrote: > I mentioned that the exception gets thrown when requesting container > status information. We need this to send a heartbeat to YARN but it is > not very crucial

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
I mentioned that the exception gets thrown when requesting container status information. We need this to send a heartbeat to YARN but it is not very crucial if this fails once for the running job. Possibly, we could work around this problem by retrying N times in case of an exception. Would it be
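
The "retry N times in case of an exception" idea mentioned in this message can be illustrated with a minimal, language-neutral sketch. This is not Flink's or Hadoop's actual code: the names call_with_retries and FlakyStatusRequest are hypothetical, and the simulated "Invalid Token" failure only stands in for the container status request discussed in the thread.

```python
import time


def call_with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Call operation() and retry if it raises, up to max_attempts calls in total.

    Hypothetical helper for illustration; the last exception is re-raised
    once every attempt has failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            # Pause only if another attempt will follow.
            if attempt < max_attempts and delay_seconds > 0:
                time.sleep(delay_seconds)
    raise last_error


class FlakyStatusRequest:
    """Hypothetical stand-in for a status request that fails a few times first."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("Invalid Token (simulated)")
        return "RUNNING"


if __name__ == "__main__":
    request = FlakyStatusRequest(failures_before_success=2)
    status = call_with_retries(request, max_attempts=3)
    print(status, "after", request.calls, "calls")
```

A bounded retry like this only helps if the failure is transient, for example if a fresh token is obtained between attempts. If every attempt fails, the last exception still propagates to the caller.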

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
No, I was just asking. No upgrade is possible for the next month or two. This week includes our busiest day of the year ... Our shop is doing about 10 orders per second these days ... So they won't upgrade until next January/February. Niels On Wed, Dec 2, 2015 at 3:59 PM, Maximilian Michels wrote: >

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
Hi Niels, You mentioned you have the option to update Hadoop and redeploy the job. Would be great if you could do that and let us know how it turns out. Cheers, Max On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes wrote: > Hi, > > I posted the entire log from the first log line at the moment of fai

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
Hi, I posted the entire log from the first log line at the moment of failure to the very end of the logfile. This is all I have. As far as I understand, the Kerberos and keytab mechanism in Hadoop Yarn catches the "Invalid Token" and then (if keytab) gets a new Kerberos ticket (or tgt?)

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
Hi Niels, Sorry to hear you experienced this exception. At first glance, it looks like a bug in Hadoop to me. > "Not retrying because the invoked method is not idempotent, and unable to > determine whether it was invoked" That is nothing to worry about. This is Hadoop's internal retry mech

Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
Hi, We have a Kerberos-secured Yarn cluster here and I'm experimenting with Apache Flink on top of that. A few days ago I started a very simple Flink application (just stream the time as a String into HBase 10 times per second). I (deliberately) asked our IT-ops guys to make my account have a m