Errr...what version of OMPI are you using?
> On Feb 2, 2022, at 3:03 PM, David Perozzi via users
> <users@lists.open-mpi.org> wrote:
>
> Hello,
>
> I'm trying to run a code that uses Open MPI and OpenMP (for threading) on a
> large cluster that uses LSF for job scheduling and dispatch. The problem with
> LSF is that it is not very straightforward to allocate and bind the right
> number of threads to an MPI rank inside a single node. Therefore, I have to
> create a rankfile myself as soon as the (a priori unknown) resources are
> allocated.
>
> So, after my job gets dispatched, I run:
>
> mpirun -n "$nslots" -display-allocation -nooversubscribe --map-by core:PE=1
> --bind-to core mpi_allocation/show_numactl.sh
> >mpi_allocation/allocation_files/allocation.txt
>
> where show_numactl.sh consists of just one line (the sed simply joins
> numactl's multi-line output into a single line per process):
>
> { hostname; numactl --show; } | sed ':a;N;s/\n/ /;ba'
>
> If I ask for 16 slots, in blocks of 4 (i.e., bsub -n 16 -R "span[block=4]"),
> I get something like:
>
> ====================== ALLOCATED NODES ======================
> eu-g1-006-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-009-2: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-002-3: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> eu-g1-005-1: flags=0x11 slots=4 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> eu-g1-006-1 policy: default preferred node: current physcpubind: 16 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 24 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 32 cpubind:
> 2 nodebind: 2 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 21 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 22 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 0 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 1 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 2 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 19 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-002-3 policy: default preferred node: current physcpubind: 23 cpubind:
> 1 nodebind: 1 membind: 0 1 2 3 4 5 6 7
> eu-g1-006-1 policy: default preferred node: current physcpubind: 52 cpubind:
> 3 nodebind: 3 membind: 0 1 2 3 4 5 6 7
> eu-g1-009-2 policy: default preferred node: current physcpubind: 3 cpubind:
> 0 nodebind: 0 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 90 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 91 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 94 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
> eu-g1-005-1 policy: default preferred node: current physcpubind: 95 cpubind:
> 5 nodebind: 5 membind: 0 1 2 3 4 5 6 7
>
> After that, I parse this allocation file in Python and create a hostfile and
> a rankfile.
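>
> In essence, the parsing does something like the following (a simplified
> sketch: my real script has more checks, and the hostfile path and the
> chunking of each host's cores into groups of threads_per_rank are just
> illustrative):
>
> import sys
> from collections import OrderedDict
>
> alloc_file, threads_per_rank = sys.argv[1], int(sys.argv[2])
>
> # collect the physical CPU ids reported by numactl, grouped per host
> cpus = OrderedDict()
> with open(alloc_file) as f:
>     for line in f:
>         if "physcpubind:" not in line:
>             continue  # skip the "ALLOCATED NODES" banner
>         tokens = line.split()
>         host = tokens[0]
>         cpu = tokens[tokens.index("physcpubind:") + 1]
>         cpus.setdefault(host, []).append(cpu)
>
> # one hostfile line per node
> with open("mpi_allocation/hostfiles/hostfile", "w") as hf:
>     for host in cpus:
>         hf.write(host + "\n")
>
> # one rank per group of threads_per_rank physical cores on a node
> rank = 0
> with open("mpi_allocation/hostfiles/rankfile", "w") as rf:
>     for host, ids in cpus.items():
>         for i in range(0, len(ids), threads_per_rank):
>             slots = ",".join(ids[i:i + threads_per_rank])
>             rf.write("rank %d=%s slot=%s\n" % (rank, host, slots))
>             rank += 1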
>
> The hostfile reads:
>
> eu-g1-006-1
> eu-g1-009-2
> eu-g1-002-3
> eu-g1-005-1
>
> The rankfile:
>
> rank 0=eu-g1-006-1 slot=16,24,32,52
> rank 1=eu-g1-009-2 slot=0,1,2,3
> rank 2=eu-g1-002-3 slot=21,22,19,23
> rank 3=eu-g1-005-1 slot=90,91,94,95
>
> Following Open MPI's man pages and FAQ, I then run my application using
>
> mpirun -n "$nmpiproc" --rankfile mpi_allocation/hostfiles/rankfile --mca
> rmaps_rank_file_physical 1 ./build/"$executable_name" true "$input_file"
>
> where the bash variables are passed directly on the bsub command line (I
> basically run bsub -n 16 -R "span[block=4]" "my_script.sh num_slots
> num_thread_per_rank executable_name input_file").
>
>
> Now, this procedure sometimes works just fine, sometimes not. When it
> doesn't, the problem is that I don't get any error message (I noticed that if
> an error is made inside the rankfile, one does not get any error message
> either). Strangely, for 16 slots and four threads per rank (so 4 MPI ranks),
> it seems to work better when the slots are allocated as 8 slots on each of
> two nodes than as 4 slots on each of four nodes. My goal is to run the
> application with 256 slots and 32 threads per rank (the cluster has mainly
> AMD EPYC-based nodes).
>
> The Open MPI information for the nodes that ran a failed job and the rankfile
> for that failed job can be found at https://pastebin.com/40f6FigH, and the
> allocation file at https://pastebin.com/jeWnkU40.
>
>
> Do you see any problem with my procedure? Why is it failing seemingly
> randomly? Can I somehow get more information about what's failing from
> mpirun?
>
>
> I hope I haven't omitted too much information; if I have, just ask and I'll
> provide more details.
>
>
> Cheers,
>
> David
>
>