Re: All but one TMs connect when JM has more than 16G of memory

2015-10-01 Thread Robert Schmidtke
So for anyone who is interested, here are some code references for getting started with Flink on Slurm. I added basic start and stop scripts for Flink on Slurm in my fork: https://github.com/robert-schmidtke/flink/tree/flink-slurm/flink-dist/src/main/flink-bin/bin I also created an example of…
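Beyond the linked scripts, the core trick on a cluster without inter-node ssh is to let `srun` place the JobManager and TaskManagers. A minimal sketch of that idea (an assumption-laden illustration, not the contents of the linked fork; script names and arguments vary by Flink version):

```shell
#!/usr/bin/env bash
# Hypothetical sbatch script: start a standalone Flink cluster under SLURM
# without ssh, using srun for process placement. Assumes FLINK_HOME is set
# and the nodes share a filesystem.
set -eu

# Turn a whitespace-separated host list into one host per line and pick
# the first one as the JobManager host.
first_host() { printf '%s\n' $1 | head -n1; }

# Under SLURM, expand the compact nodelist first, e.g.:
#   NODES=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
NODES=${SLURM_JOB_NODELIST:-}
JM_HOST=$(first_host "$NODES")
echo "JobManager host: ${JM_HOST:-<none>}"

# Launch processes via srun instead of ssh (sketch only, not executed here;
# the flink-bin script names are assumptions):
#   srun --nodes=1 --ntasks=1 --nodelist="$JM_HOST" \
#        "$FLINK_HOME/bin/jobmanager.sh" start &
#   srun --nodes=$((SLURM_JOB_NUM_NODES - 1)) --exclude="$JM_HOST" \
#        "$FLINK_HOME/bin/taskmanager.sh" start &
#   wait
```

The design choice here is that `srun` replaces the ssh loop of Flink's stock `start-cluster.sh`, which is exactly what a no-ssh SLURM allocation requires.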

Re: All but one TMs connect when JM has more than 16G of memory

2015-10-01 Thread Robert Metzger
Feel free to contribute documentation to Flink on how to run it on SLURM. On Thu, Oct 1, 2015 at 11:45 AM, Robert Schmidtke wrote: > I see, thanks for the info. I only have access to my cluster via SLURM and we don't have ssh between our nodes, which is why I haven't really considered th…

Re: All but one TMs connect when JM has more than 16G of memory

2015-10-01 Thread Robert Schmidtke
I see, thanks for the info. I only have access to my cluster via SLURM and we don't have ssh between our nodes, which is why I haven't really considered the Standalone mode. A colleague has set up YARN on SLURM and it was just the easiest to use. I briefly looked into the Flink Standalone mode but d…

Re: All but one TMs connect when JM has more than 16G of memory

2015-10-01 Thread Robert Metzger
Hi, there is currently no option for forcing certain containers onto specific machines. For running the JM (or any other YARN container) on the AM host, you first need to have a NodeManager running on the host with the RM. Maybe YARN is smart enough to schedule the small JM container onto that mach…

Re: All but one TMs connect when JM has more than 16G of memory

2015-10-01 Thread Robert Schmidtke
Hi Robert, I had a job failure yesterday with what I believe is the setup I have described above. However, when trying to reproduce it now, the behavior is the same: Flink waiting for resources to become available. So no hard error. OK, the looping makes sense then. I haven't thought about shared set…

Re: All but one TMs connect when JM has more than 16G of memory

2015-10-01 Thread Robert Metzger
Hi, It is interesting to note that when I set both yarn.nodemanager.resource.memory-mb and yarn.scheduler.maximum-allocation-mb to 56G I get a proper error when requesting 56G and 1M, but when setting yarn.nodemanager.resource.memory-mb to 56G and yarn.scheduler.maximum-allocation-mb to 54G…
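The two properties being compared play different roles; a yarn-site.xml fragment with the values from the message (a sketch of the configuration under discussion, not the poster's actual file):

```xml
<!-- yarn-site.xml (sketch; values taken from the message above) -->
<property>
  <!-- Total memory one NodeManager may hand out to containers -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value> <!-- 56G -->
</property>
<property>
  <!-- Upper bound on any single container request: requests above this
       limit are rejected with a hard error, while requests within it
       that merely exceed free capacity simply wait -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>55296</value> <!-- 54G -->
</property>
```

This distinction explains the observed behavior: only the scheduler's maximum-allocation check produces an immediate error, which is why lowering it surfaces the problem.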

Re: All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Schmidtke
Hi Robert, thanks for your reply. It got me digging into my setup, and I discovered that one TM was scheduled next to the JM. When specifying -yn 7, the documentation suggests that this is the number of TMs (of which I wanted 7), and I thought an additional container would be used for the JM (my YAR…

Re: All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Metzger
Hi Robert, the problem here is that YARN's scheduler (there are different schedulers in YARN: FIFO, CapacityScheduler, ...) is not giving Flink's ApplicationMaster/JobManager all the containers it is requesting. When you increase the size of the AM/JM container, there is probably no memory left to fit…
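The container math behind this can be worked through directly (a sketch; the 56G NodeManager capacity and the -yjm/-ytm values are taken from elsewhere in the thread):

```shell
# Why a TM can vanish when the JM grows past 16G: when a TM is scheduled on
# the same node as the JM, both containers must share that NodeManager's
# capacity.
NODE_MB=$((56 * 1024))  # yarn.nodemanager.resource.memory-mb = 57344
TM_MB=40960             # -ytm (40G per TaskManager container)
JM_MB=16384             # -yjm (16G JobManager container)

echo $((JM_MB + TM_MB))   # 57344: exactly the node capacity, so it fits
if [ $((JM_MB + TM_MB)) -le "$NODE_MB" ]; then
  echo "co-located TM fits"
fi
# Any JM above 16384 MB pushes the sum past 57344, so YARN cannot place
# that TM's container anywhere and Flink keeps waiting for resources.
```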

Re: All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Schmidtke
I should say I'm running the current Flink master branch. On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke wrote: > It's me again. This is a strange issue, I hope I managed to find the right keywords. I got 8 machines, 1 for the JM, the other 7 are TMs with 64G of memory each. When runn…

All but one TMs connect when JM has more than 16G of memory

2015-09-30 Thread Robert Schmidtke
It's me again. This is a strange issue; I hope I managed to find the right keywords. I got 8 machines: 1 for the JM, the other 7 are TMs with 64G of memory each. When running my job like so: $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 . The job completes without any…
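The submission command from the message, built up flag by flag (flag meanings per the Flink-on-YARN CLI of that era; the FLINK_HOME default is a placeholder, and the command is printed rather than executed since no cluster is assumed):

```shell
# Annotated reconstruction of the poster's submission command.
FLINK_HOME=${FLINK_HOME:-/opt/flink}   # placeholder install path
CMD="$FLINK_HOME/bin/flink run"
CMD="$CMD -m yarn-cluster"  # start a dedicated YARN session for this job
CMD="$CMD -yjm 16384"       # JobManager container memory, in MB (16G)
CMD="$CMD -ytm 40960"       # memory per TaskManager container, in MB (40G)
CMD="$CMD -yn 7"            # number of TaskManager containers to request
echo "$CMD"                 # print instead of executing
```

Note that -yn counts only TaskManager containers; the JobManager container is requested in addition to these, which is central to the scheduling discussion above.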