Sigh - of course it wouldn’t be simple :-(

All right, let’s suppose we look for SLURM_CPU_BIND:

* if it includes the word “none”, then we know the user specified that they don’t want us to bind
* if it includes the word mask_cpu, then we have to check the value of that option:
  * if it is all F’s, then they didn’t specify a binding and we should do our thing
  * if it is anything else, then we assume they _did_ specify a binding, and we leave it alone

Would that make sense? Is there anything else that could be in that envar which would trip us up?
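In shell terms, the proposed check might look roughly like the sketch below. This is illustrative only: the real test would live in Open MPI's launch code, the function name is invented, and the SLURM_CPU_BIND values it expects are the ones shown in Andy's output further down (e.g. "quiet,none", "quiet,mask_cpu:0xFFFF", "quiet,mask_cpu:0x1111,0x2222").

#!/bin/bash
# Sketch of the proposed decision logic (hypothetical helper name).
should_apply_default_binding() {
    bind="${SLURM_CPU_BIND:-}"

    # Nothing set at all: apply our default binding.
    if [ -z "$bind" ]; then
        return 0
    fi

    # The user explicitly asked for no binding: leave it alone.
    case "$bind" in
        *none*) return 1 ;;
    esac

    case "$bind" in
        *mask_cpu:*)
            # Grab the mask list after the "mask_cpu:" keyword, then strip
            # the "0x" prefixes and the commas between per-task masks.
            masks="${bind#*mask_cpu:}"
            digits="${masks//0x/}"
            digits="${digits//,/}"
            if [ -z "${digits//[Ff]/}" ]; then
                # All F's: Slurm did not really restrict us, so go ahead.
                return 0
            fi
            # A real mask: the user specified a binding, leave it alone.
            return 1
            ;;
    esac

    # Any other binding type: assume the user asked for it.
    return 1
}

if should_apply_default_binding; then
    echo "no explicit srun binding detected - apply the default binding"
else
    echo "srun binding (or explicit 'none') detected - do not touch it"
fi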
> On Oct 27, 2016, at 10:37 AM, Andy Riebs <andy.ri...@hpe.com> wrote:
>
> Yes, they still exist:
>
> $ srun --ntasks-per-node=2 -N1 env | grep BIND | sort -u
> SLURM_CPU_BIND_LIST=0xFFFF
> SLURM_CPU_BIND=quiet,mask_cpu:0xFFFF
> SLURM_CPU_BIND_TYPE=mask_cpu:
> SLURM_CPU_BIND_VERBOSE=quiet
>
> Here are the relevant Slurm configuration options that could conceivably
> change the behavior from system to system:
>
> SelectType           = select/cons_res
> SelectTypeParameters = CR_CPU
>
> On 10/27/2016 01:17 PM, r...@open-mpi.org wrote:
>> And if there is no --cpu_bind on the cmd line? Do these not exist?
>>
>>> On Oct 27, 2016, at 10:14 AM, Andy Riebs <andy.ri...@hpe.com> wrote:
>>>
>>> Hi Ralph,
>>>
>>> I think I've found the magic keys...
>>>
>>> $ srun --ntasks-per-node=2 -N1 --cpu_bind=none env | grep BIND
>>> SLURM_CPU_BIND_VERBOSE=quiet
>>> SLURM_CPU_BIND_TYPE=none
>>> SLURM_CPU_BIND_LIST=
>>> SLURM_CPU_BIND=quiet,none
>>> SLURM_CPU_BIND_VERBOSE=quiet
>>> SLURM_CPU_BIND_TYPE=none
>>> SLURM_CPU_BIND_LIST=
>>> SLURM_CPU_BIND=quiet,none
>>>
>>> $ srun --ntasks-per-node=2 -N1 --cpu_bind=core env | grep BIND
>>> SLURM_CPU_BIND_VERBOSE=quiet
>>> SLURM_CPU_BIND_TYPE=mask_cpu:
>>> SLURM_CPU_BIND_LIST=0x1111,0x2222
>>> SLURM_CPU_BIND=quiet,mask_cpu:0x1111,0x2222
>>> SLURM_CPU_BIND_VERBOSE=quiet
>>> SLURM_CPU_BIND_TYPE=mask_cpu:
>>> SLURM_CPU_BIND_LIST=0x1111,0x2222
>>> SLURM_CPU_BIND=quiet,mask_cpu:0x1111,0x2222
>>>
>>> Andy
>>>
>>> On 10/27/2016 11:57 AM, r...@open-mpi.org wrote:
>>>> Hey Andy
>>>>
>>>> Is there a SLURM envar that would tell us the binding option from the srun
>>>> cmd line? We automatically bind when direct launched due to user complaints
>>>> of poor performance if we don’t. If the user specifies a binding option,
>>>> then we detect that we were already bound and don’t do it.
>>>>
>>>> However, if the user specifies that they not be bound, then we think they
>>>> simply didn’t specify anything - and that isn’t the case. If we can see
>>>> something that tells us “they explicitly said not to do it”, then we can
>>>> avoid the situation.
>>>>
>>>> Ralph
>>>>
>>>>> On Oct 27, 2016, at 8:48 AM, Andy Riebs <andy.ri...@hpe.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> We are running Open MPI version 1.10.2, built with support for Slurm
>>>>> version 16.05.0. When a user specifies "--cpu_bind=none", MPI tries to
>>>>> bind by core, which segv's if there are more processes than cores.
>>>>>
>>>>> The user reports:
>>>>>
>>>>> What I found is that
>>>>>
>>>>> % srun --ntasks-per-node=8 --cpu_bind=none \
>>>>>     env SHMEM_SYMMETRIC_HEAP_SIZE=1024M bin/all2all.shmem.exe 0
>>>>>
>>>>> will have the problem, but:
>>>>>
>>>>> % srun --ntasks-per-node=8 --cpu_bind=none \
>>>>>     env SHMEM_SYMMETRIC_HEAP_SIZE=1024M ./bindit.sh bin/all2all.shmem.exe 0
>>>>>
>>>>> will run as expected and print out the usage message, because I didn’t
>>>>> provide the right arguments to the code.
>>>>>
>>>>> So, it appears that the binding has something to do with the issue. My
>>>>> binding script is as follows:
>>>>>
>>>>> % cat bindit.sh
>>>>> #!/bin/bash
>>>>>
>>>>> #echo SLURM_LOCALID=$SLURM_LOCALID
>>>>>
>>>>> stride=1
>>>>>
>>>>> if [ ! -z "$SLURM_LOCALID" ]; then
>>>>>     let bindCPU=$SLURM_LOCALID*$stride
>>>>>     exec numactl --membind=0 --physcpubind=$bindCPU $*
>>>>> fi
>>>>>
>>>>> $*
>>>>> %
>>>>>
>>>>> --
>>>>> Andy Riebs
>>>>> andy.ri...@hpe.com
>>>>> Hewlett-Packard Enterprise
>>>>> High Performance Computing Software Engineering
>>>>> +1 404 648 9024
>>>>> My opinions are not necessarily those of HPE
>>>>> May the source be with you!
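As a hypothetical aside (not something from the thread itself): the hex values in SLURM_CPU_BIND_LIST above are plain CPU bitmasks, so a tiny helper makes it easy to see which CPUs each mask covers.

#!/bin/bash
# Illustrative helper: decode a CPU mask like those shown in
# SLURM_CPU_BIND_LIST into the list of CPU indices it selects.
mask_to_cpus() {
    mask=$(( $1 ))
    cpu=0
    cpus=""
    while [ "$mask" -ne 0 ]; do
        if (( mask & 1 )); then
            cpus="$cpus $cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "${cpus# }"
}

mask_to_cpus 0xFFFF   # -> CPUs 0 through 15 (the unrestricted default mask)
mask_to_cpus 0x1111   # -> 0 4 8 12 (from the --cpu_bind=core example)
mask_to_cpus 0x2222   # -> 1 5 9 13 (from the --cpu_bind=core example)

That makes it easy to eyeball whether a given mask is the "all F's" case or a real restriction.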
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users