I meant that I'll check when the current jobs are done and will let you know.
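In the meantime, for anyone following along, the change I made to start-cluster.sh is roughly along these lines (a simplified sketch of the idea, not the exact patch; the FLINK_* variables are the ones the existing scripts use, but the loop itself is illustrative):

    # slaves now lists one "IP N" pair per line: the worker host,
    # then the number of TaskManagers to spawn on it
    while read HOST COUNT; do
        [ -z "$HOST" ] && continue    # skip blank lines
        COUNT=${COUNT:-1}             # default to one TM per host
        # one ssh per host; the remote loop starts COUNT TMs.
        # -n keeps ssh from consuming the while loop's stdin.
        ssh -n $FLINK_SSH_OPTS "$HOST" \
            "for i in \$(seq 1 $COUNT); do \"$FLINK_BIN_DIR/taskmanager.sh\" start; done" &
    done < "$FLINK_CONF_DIR/slaves"
    wait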
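Greg, regarding the pinning script I mentioned below: it boils down to something like the following (an illustrative sketch, not the actual script; it assumes numactl is installed and that NUM_TMS holds the per-node TM count). Since the NUMA policy set by numactl is inherited by child processes, it carries over to the daemonized JVM:

    # spread the node's TaskManagers round-robin over the NUMA sockets,
    # binding each TM's CPUs and memory to its socket
    SOCKETS=$(numactl --hardware | awk '/^available:/ {print $2}')
    for i in $(seq 0 $((NUM_TMS - 1))); do
        NODE=$((i % SOCKETS))
        numactl --cpunodebind=$NODE --membind=$NODE \
            "$FLINK_BIN_DIR/taskmanager.sh" start
    done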
On Mon, Jul 11, 2016 at 12:19 PM, Saliya Ekanayake <esal...@gmail.com> wrote:

> I am running some jobs now. I'll stop and restart using pdsh to see again
> what the issue was.
>
> On Mon, Jul 11, 2016 at 12:15 PM, Greg Hogan <c...@greghogan.com> wrote:
>
>> I'd definitely be interested to hear any insight into what failed when
>> starting the TaskManagers with pdsh. Did the command fail, fall back to
>> standard ssh, or was there a parse error on the slaves file?
>>
>> I'm wondering if we need to escape
>>   PDSH_SSH_ARGS_APPEND=$FLINK_SSH_OPTS
>> as
>>   PDSH_SSH_ARGS_APPEND="${FLINK_SSH_OPTS}"
>>
>> On Mon, Jul 11, 2016 at 12:02 AM, Saliya Ekanayake <esal...@gmail.com>
>> wrote:
>>
>>> pdsh is available on the head node only, but when I tried to run
>>> *start-cluster* from the head node (note that the JobManager node is
>>> not the head node) it didn't work, which is why I modified the scripts.
>>>
>>> Yes, exactly, this is what I was trying to do. My research area has
>>> been these NUMA-related issues, and binding a process to a socket (CPU)
>>> and then its threads to individual cores has shown a great advantage. I
>>> actually have Java code that automatically (and user-configurably)
>>> binds processes and threads. For Flink, I've done this manually using a
>>> shell script that scans the TMs on a node and pins them appropriately.
>>> This approach is OK, but it would be better if the support were
>>> integrated into Flink.
>>>
>>> On Sun, Jul 10, 2016 at 8:33 PM, Greg Hogan <c...@greghogan.com> wrote:
>>>
>>>> Hi Saliya,
>>>>
>>>> Would you happen to have pdsh (parallel distributed shell) installed?
>>>> If so, the TaskManager startup in start-cluster.sh will run in
>>>> parallel.
>>>>
>>>> As to running 24 TaskManagers together, are these running across
>>>> multiple NUMA nodes? I filed FLINK-3163
>>>> (https://issues.apache.org/jira/browse/FLINK-3163) last year, as I
>>>> have seen that even with only two NUMA nodes, performance is improved
>>>> by binding TaskManagers, both memory and CPU. I think we can improve
>>>> the configuration of task slots as we do with memory, where the latter
>>>> can be a fixed measure or a fraction relative to total memory.
>>>>
>>>> Greg
>>>>
>>>> On Sat, Jul 9, 2016 at 3:44 AM, Saliya Ekanayake <esal...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> The current start/stop scripts SSH into worker nodes once for each
>>>>> time they appear in the slaves file. When spawning multiple TMs
>>>>> (like 24 per node), this is very inefficient.
>>>>>
>>>>> I've changed the scripts to do one SSH per node and then spawn a
>>>>> given number N of TMs. I can make a pull request if this seems
>>>>> usable to others. For now, I assume the slaves file will indicate
>>>>> the number of TMs per slave in "IP N" format.
>>>>>
>>>>> Thank you,
>>>>> Saliya
>>>>>
>>>>> --
>>>>> Saliya Ekanayake
>>>>> Ph.D. Candidate | Research Assistant
>>>>> School of Informatics and Computing | Digital Science Center
>>>>> Indiana University, Bloomington
>>>>>
>>>
>>> --
>>> Saliya Ekanayake
>>> Ph.D. Candidate | Research Assistant
>>> School of Informatics and Computing | Digital Science Center
>>> Indiana University, Bloomington
>>>
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington

--
Saliya Ekanayake
Ph.D. Candidate | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington