Errr...what version of OMPI are you using?
> On Feb 2, 2022, at 3:03 PM, David Perozzi via users
> <users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I'm trying to run a code that uses Open MPI and OpenMP (for threading) on a
> large cluster that uses LSF for job scheduling and dispatch. The problem with
> LSF is that it is not very straightforward to allocate and bind the right
> number of threads to an MPI rank inside a single node. Therefore, I have to
> create a rankfile myself as soon as the (a priori unknown) resources are
> allocated.
>
> So, after my job gets dispatched, I run:
>
> mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1
> --bind-to core mpi_allocation/show_numactl.sh
> >mpi_allocation/allocation_files/allocation.txt
>
> where show_numactl.sh consists of just one line (the sed simply joins
> numactl's multi-line output into a single line per process):
>
> { hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'
>
> If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R "span[block=4]"),
> I get something like:
>
> ====================== ALLOCATED NODES ======================
> eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> eu-g1-006-1 policy: default preferred node: current physcpubind: 16 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 24 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 32 cpubind:
> 2 nodebind: 2 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 21 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 22 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 0 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 1 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 2 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 19 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 23 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 52 cpubind:
> 3 nodebind: 3 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 3 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 90 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 91 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 94 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 95 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
>
> After that, I parse this allocation file in Python and create a hostfile and
> a rankfile.
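>
> In essence, the parsing does something like the following (a simplified
> sketch: my real script has more checks, and the hostfile path and the
> chunking of each host's cores into groups of threads_per_rank are just
> illustrative):
>
> import sys
> from collections import OrderedDict
>
> alloc_file, threads_per_rank = sys.argv[1], int(sys.argv[2])
>
> # collect the physical CPU ids reported by numactl, grouped per host
> cpus = OrderedDict()
> with open(alloc_file) as f:
>     for line in f:
>         if "physcpubind:" not in line:
>             continue  # skip the "ALLOCATED NODES" banner
>         tokens = line.split()
>         host = tokens[0]
>         cpu = tokens[tokens.index("physcpubind:") + 1]
>         cpus.setdefault(host, []).append(cpu)
>
> # one hostfile line per node
> with open("mpi_allocation/hostfiles/hostfile", "w") as hf:
>     for host in cpus:
>         hf.write(host + "\n")
>
> # one rank per group of threads_per_rank physical cores on a node
> rank = 0
> with open("mpi_allocation/hostfiles/rankfile", "w") as rf:
>     for host, ids in cpus.items():
>         for i in range(0, len(ids), threads_per_rank):
>             slots = ",".join(ids[i:i + threads_per_rank])
>             rf.write("rank %d=%s slot=%s\n" % (rank, host, slots))
>             rank += 1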
>
> The hostfile reads:
>
> eu-g1-006-1
> eu-g1-009-2
> eu-g1-002-3
> eu-g1-005-1
>
> The rankfile:
>
> rank 0=eu-g1-006-1 slot=16,24,32,52
> rank 1=eu-g1-009-2 slot=0,1,2,3
> rank 2=eu-g1-002-3 slot=21,22,19,23
> rank 3=eu-g1-005-1 slot=90,91,94,95
>
> Following Open MPI's man pages and FAQ, I then run my application using
>
> mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca
> rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"
>
> where the bash variables are passed directly on the bsub command line (I
> basically run bsub -n 16 -R "span[block=4]" "my_script.sh num_slots
> num_thread_per_rank executable_name input_file").
>
>
> Now, this procedure sometimes works just fine, sometimes not. When it
> doesn't, the problem is that I don't get any error message (I noticed that if
> an error is made inside the rankfile, one does not get any error message
> either). Strangely, for 16 slots and four threads per rank (so 4 MPI ranks),
> it seems to work better when the slots are allocated as 8 slots on each of
> two nodes than as 4 slots on each of four nodes. My goal is to run the
> application with 256 slots and 32 threads per rank (the cluster has mainly
> AMD EPYC-based nodes).
>
> The Open MPI information for the nodes that ran a failed job and the rankfile
> for that failed job can be found at https://pastebin.com/40f6FigH, and the
> allocation file at https://pastebin.com/jeWnkU40.
>
>
> Do you see any problem with my procedure? Why is it failing seemingly
> randomly? Can I somehow get more information about what's failing from
> mpirun?
>
>
> I hope I haven't omitted too much information; if I have, just ask and I'll
> provide more details.
>
>
> Cheers,
>
> David
>
>