Hi,

I have a problem placing the processes of a parallel job in a specific order.
Suppose a job with 6 processes (rank0 to rank5) needs to run on 3 hosts
(A, B, C) in the following order:
        Rank0  -- A
        Rank1  -- B
        Rank2  -- B
        Rank3  -- C
        Rank4  -- A
        Rank5  -- C
Specifying this order (ABBCAC) in a hostfile doesn't work because Open MPI only
supports "byslot" (AABBCC) or "bynode" (ABCABC) ranking orders.

However, if I use rankfile to implement this order in the format of
        rank 0=A slot=<slot setting>
        rank 1=B slot=<slot setting>
        rank 2=B slot=<slot setting>
        rank 3=C slot=<slot setting>
        rank 4=A slot=<slot setting>
        rank 5=C slot=<slot setting>
I run into another problem: how to determine the <slot setting> for each rank.
If I bind each rank to all cores/CPUs on a node (e.g. rank 0=A slot=0-n, where
n is the highest CPU number), I run into the following errors:

*** An error occurred in MPI_comm_size
*** on a NULL communicator
*** Unknown error
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
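
For concreteness, with 12-core nodes the "bind each rank to all cores" rankfile
looks roughly like this (core range 0-11 assumed; 0-15 on the 16-core nodes):

        rank 0=A slot=0-11
        rank 1=B slot=0-11
        rank 2=B slot=0-11
        rank 3=C slot=0-11
        rank 4=A slot=0-11
        rank 5=C slot=0-11

and is passed to mpirun with something like "mpirun -np 6 --rankfile myrankfile
./my_app" (the file and application names are just examples).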

If I don't select all cores, I need to identify which cores are actually
available to my job in order to avoid CPU oversubscription, since the nodes are
shared by multiple jobs.
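
The only way I can think of to find out which cores have actually been granted
to the job is to look at what LSF and the kernel report, assuming our LSF setup
exports these variables (this is just a sketch, not something I have verified):

        # hosts and slot counts LSF allocated to this job
        echo $LSB_MCPU_HOSTS
        # CPU list from LSF affinity scheduling, if it is enabled
        echo $LSB_BIND_CPU_LIST
        # actual CPU affinity mask of the current shell
        taskset -cp $$

but even then I would still have to translate that into per-rank slot settings
by hand.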

Our system is an Intel-based cluster (12 or 16 cores per node) and the job is
submitted through the LSF batch system.

Here is my question: how can I enforce a specific order of processes at the
node level without binding at the core/CPU level?

Any help and suggestions would be appreciated.

Thanks,
Chee

