Re: All but one TMs connect when JM has more than 16G of memory

Robert Schmidtke Thu, 01 Oct 2015 09:56:36 -0700

So for anyone who is interested, here are some code references for getting
started with Flink on Slurm.


I added basic start and stop scripts for Flink on Slurm in my fork:
https://github.com/robert-schmidtke/flink/tree/flink-slurm/flink-dist/src/main/flink-bin/bin

And I also created an example of how to configure and run it:
https://github.com/robert-schmidtke/flink-slurm/blob/master/flink-slurm-example.sh

I'm not sure I will put much more effort into it, because it works for my
setup right now. However, if there is wider interest, I can add a bit more
documentation and insight.
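
For readers who land here because of the original container problem quoted
below, here is a minimal sketch of the memory arithmetic in Python. It
assumes a single node offers 56G (the yarn.nodemanager.resource.memory-mb
value mentioned in the thread) and has to host both the JM container and one
TM container. The function name is hypothetical, and any rounding of
container sizes that YARN may apply is ignored.

```python
# Sketch of the container arithmetic from the quoted thread.
# Assumption: one node with a 56G limit must hold the JM and one TM.
# Assumption: YARN's own rounding of request sizes is ignored here.

NODE_LIMIT_MB = 56 * 1024  # 56G expressed in MB
TM_MB = 40960              # value passed via -ytm in the thread


def fits_on_node(jm_mb: int, tm_mb: int, node_limit_mb: int) -> bool:
    """Return True if a JM and a TM container fit on one node together."""
    return jm_mb + tm_mb <= node_limit_mb


if __name__ == "__main__":
    # The two -yjm values tried in the thread.
    for jm_mb in (16384, 16385):
        total = jm_mb + TM_MB
        verdict = "fits" if fits_on_node(jm_mb, TM_MB, NODE_LIMIT_MB) else "does not fit"
        print(f"-yjm {jm_mb}: {total} MB requested, limit {NODE_LIMIT_MB} MB, {verdict}")
```

With -yjm 16384 the two containers add up to exactly the 56G limit, while a
single extra MB for the JM pushes the co-located pair over it, which matches
the 6/7 TaskManagers seen below.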

Robert

On Thu, Oct 1, 2015 at 11:51 AM, Robert Metzger <rmetz...@apache.org> wrote:

> Feel free to contribute documentation to Flink on how to run Flink on
> SLURM.
>
> On Thu, Oct 1, 2015 at 11:45 AM, Robert Schmidtke <ro.schmid...@gmail.com>
> wrote:
>
>> I see, thanks for the info. I only have access to my cluster via SLURM
>> and we don't have ssh between our nodes which is why I haven't really
>> considered the Standalone mode. A colleague has set up YARN on SLURM and it
>> was just the easiest to use. I briefly looked into the Flink Standalone
>> mode but dropped it because I thought YARN would be possible after all. It
>> seems I'm going to have a deeper look into starting the master and slaves
>> with SLURM's srun instead of ssh (I guess a slight modification of
>> start-cluster.sh should do the job).
>>
>> On Thu, Oct 1, 2015 at 11:30 AM, Robert Metzger <rmetz...@apache.org>
>> wrote:
>>
>>> Hi,
>>> there is currently no option for forcing certain containers onto
>>> specific machines.
>>> For running the JM (or any other YARN container) on the AM host, you
>>> first need to have a NodeManager running on the host with the RM. Maybe
>>> YARN is smart enough to schedule the small JM container onto that machine.
>>>
>>> I don't know your exact setup, but maybe it would make sense for you to
>>> run Flink in the standalone cluster mode instead of on YARN. It seems that
>>> you have a very good idea of how and where you want to run the Flink
>>> services in your cluster. YARN is designed to be an abstraction between
>>> the cluster and the application, which is why it's a bit difficult to
>>> schedule the containers onto specific machines.
>>>
>>> Robert
>>>
>>>
>>>
>>> On Thu, Oct 1, 2015 at 11:24 AM, Robert Schmidtke <
>>> ro.schmid...@gmail.com> wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> I had a job failure yesterday with what I believe is the setup I have
>>>> described above. However when trying to reproduce now, the behavior is the
>>>> same: Flink waiting for resources to become available. So no hard error.
>>>>
>>>> Ok, the looping makes sense then. I haven't thought about shared
>>>> setups. I'm still figuring out how all parameters play together, i.e. -yn,
>>>> -yjm, -ytm and the memory limits in yarn-site.xml. This will need some
>>>> testing and I'll come back with a proper description once I think I know
>>>> what's going on.
>>>>
>>>> When running Flink on YARN, is it easily possible to place the Flink JM
>>>> where the YARN Resource Manager sits, and all the TMs with the remaining
>>>> Node Managers?
>>>>
>>>> Robert
>>>>
>>>> On Thu, Oct 1, 2015 at 10:53 AM, Robert Metzger <rmetz...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>> It is interesting to note that when I set both
>>>>>> yarn.nodemanager.resource.memory-mb and
>>>>>> yarn.scheduler.maximum-allocation-mb to 56G I get a proper error
>>>>>> when requesting 56G and 1M, but when setting
>>>>>> yarn.nodemanager.resource.memory-mb to 56G and
>>>>>> yarn.scheduler.maximum-allocation-mb to 54G I don't get an error
>>>>>> but the aforementioned endless loop.
>>>>>
>>>>>
>>>>> Is it a "hard error" (failing) you're getting, or just "WARN" log
>>>>> messages? I'm asking because some time ago I added code that runs some
>>>>> checks before deploying Flink on YARN. These checks will print WARN log
>>>>> messages if the requested YARN session/job does not fit onto the cluster.
>>>>> This "endless loop" exists because in many production environments
>>>>> Flink can just wait for resources to become available, for example when
>>>>> other containers are finishing.
>>>>>
>>>>>
>>>>> Robert
>>>>>
>>>>> On Wed, Sep 30, 2015 at 6:33 PM, Robert Schmidtke <
>>>>> ro.schmid...@gmail.com> wrote:
>>>>>
>>>>>> Hi Robert,
>>>>>>
>>>>>> thanks for your reply. It got me digging into my setup and I
>>>>>> discovered that one TM was scheduled next to the JM. When specifying
>>>>>> -yn 7 the documentation suggests that this is the number of TMs (of
>>>>>> which I wanted 7), and I thought an additional container would be used
>>>>>> for the JM (my YARN cluster has 8 containers). Anyway with this setup
>>>>>> the memory added up to 56G and 1M (40G per TM and 16G 1M for the JM),
>>>>>> but I set a hard maximum of 56G in my yarn-site.xml which is why the
>>>>>> request could not be fulfilled. It is interesting to note that when I
>>>>>> set both yarn.nodemanager.resource.memory-mb and
>>>>>> yarn.scheduler.maximum-allocation-mb to 56G I get a proper error when
>>>>>> requesting 56G and 1M, but when setting
>>>>>> yarn.nodemanager.resource.memory-mb to 56G and
>>>>>> yarn.scheduler.maximum-allocation-mb to 54G I don't get an error but
>>>>>> the aforementioned endless loop. Note I have
>>>>>> yarn.nodemanager.vmem-check-enabled set to false. This is probably a
>>>>>> YARN issue then / my bad configuration.
>>>>>>
>>>>>> I'm in a rush now (to get to the Flink meetup) and thus will check
>>>>>> the documentation later to see how to deploy the TMs and JM on separate
>>>>>> machines each, since that is not what's happening at the moment, but this
>>>>>> is what I'd like to have. Thanks again and see you in an hour.
>>>>>>
>>>>>> Cheers
>>>>>> Robert
>>>>>>
>>>>>> On Wed, Sep 30, 2015 at 5:19 PM, Robert Metzger <rmetz...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Robert,
>>>>>>>
>>>>>>> the problem here is that YARN's scheduler (there are different
>>>>>>> schedulers in YARN: FIFO, CapacityScheduler, ...) is not giving Flink's
>>>>>>> ApplicationMaster/JobManager all the containers it is requesting. By
>>>>>>> increasing the size of the AM/JM container, there is probably no memory
>>>>>>> left to fit the last TaskManager container.
>>>>>>> I also experienced this issue, when I wanted to run a Flink job on
>>>>>>> YARN and the containers were fitting theoretically, but YARN was not
>>>>>>> giving me all the containers I requested.
>>>>>>> Back then, I asked on the yarn-dev list [1] (there were also some
>>>>>>> off-list emails) but we could not resolve the issue.
>>>>>>>
>>>>>>> Can you check the resource manager logs? Maybe there is a log
>>>>>>> message which explains why the container request of Flink's AM is not
>>>>>>> fulfilled.
>>>>>>>
>>>>>>>
>>>>>>> [1]
>>>>>>> http://search-hadoop.com/m/AsBtCilK5r1pKLjf1&subj=Re+QUESTION+Allocating+a+full+YARN+cluster
>>>>>>>
>>>>>>> On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <
>>>>>>> ro.schmid...@gmail.com> wrote:
>>>>>>>
>>>>>>>> It's me again. This is a strange issue, I hope I managed to find
>>>>>>>> the right keywords. I got 8 machines, 1 for the JM, the other 7 are TMs
>>>>>>>> with 64G of memory each.
>>>>>>>>
>>>>>>>> When running my job like so:
>>>>>>>>
>>>>>>>> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
>>>>>>>>
>>>>>>>> The job completes without any problems. When running it like so:
>>>>>>>>
>>>>>>>> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
>>>>>>>>
>>>>>>>> (note the one more M of memory for the JM), the execution stalls,
>>>>>>>> continuously reporting:
>>>>>>>>
>>>>>>>> .....
>>>>>>>> TaskManager status (6/7)
>>>>>>>> TaskManager status (6/7)
>>>>>>>> TaskManager status (6/7)
>>>>>>>> .....
>>>>>>>>
>>>>>>>> I did some poking around, but I couldn't find any direct
>>>>>>>> correlation with the code.
>>>>>>>>
>>>>>>>> The JM log says:
>>>>>>>>
>>>>>>>> .....
>>>>>>>> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$  -  JVM Options:
>>>>>>>> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$  -     -Xmx12289M
>>>>>>>> .....
>>>>>>>>
>>>>>>>> but then continues to report
>>>>>>>>
>>>>>>>> .....
>>>>>>>> 16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
>>>>>>>> 16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
>>>>>>>> 16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1    - The user requested 7 containers, 6 running. 1 containers missing
>>>>>>>> .....
>>>>>>>>
>>>>>>>> forever until I cancel the job.
>>>>>>>>
>>>>>>>> If you have any ideas I'm happy to try them out. Thanks in advance
>>>>>>>> for any hints! Cheers.
>>>>>>>>
>>>>>>>> Robert
>>>>>>>> --
>>>>>>>> My GPG Key ID: 336E2680
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> My GPG Key ID: 336E2680
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> My GPG Key ID: 336E2680
>>>>
>>>
>>>
>>
>>
>> --
>> My GPG Key ID: 336E2680
>>
>
>


-- 
My GPG Key ID: 336E2680
