It's in a container. Specifically horovod/horovod on the Docker hub. I'm
going into the container to investigate now (I think I have a link to the
dockerfile as well).

Thanks!

Jeff


On Mon, Aug 12, 2024 at 10:01 AM Paul Edmon <ped...@cfa.harvard.edu> wrote:

> Certainly a strange setup. I would probably talk with whoever is
> providing MPI for you and ask them to build it against Slurm properly,
> since to get correct process binding you definitely want MPI integrated
> with Slurm, either via PMI2 or PMIx. If you just use the bare hostlist,
> your ranks may not end up bound to the specific cores they were allocated.
> So definitely proceed with caution and validate that your ranks are being
> laid out properly, as you will be relying on mpirun/mpiexec to bootstrap
> rather than on the scheduler.
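
A sketch of what the integrated route can look like in a batch script
(assuming both Slurm and the MPI library were built with PMIx support; the
node counts and binary name are placeholders):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4

# With PMIx integration, srun bootstraps the MPI ranks itself, so
# Slurm controls placement and core binding - no hostfile involved.
srun --mpi=pmix ./my_mpi_app
```

Running srun --mpi=list on the cluster shows which PMI plugins the local
Slurm build actually supports.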
>
> -Paul Edmon-
> On 8/12/2024 9:55 AM, Jeffrey Layton wrote:
>
> Paul,
>
> I tend not to rely on the MPI being built with Slurm :)  I find that the
> systems I use haven't done that. :(  I'm not exactly sure why, but that is
> the way it is :)
>
> Up to now, using scontrol has always worked for me. However, a new system
> is not cooperating (the job runs on the submission host instead of the
> compute nodes) and I'm trying to debug it. My first step was to check that
> the job was getting the compute node names (the list of nodes from Slurm
> is empty). This led to my question about the "canonical" way to get the
> hostlist (I tried both passing the hostlist explicitly and relying on
> Slurm being integrated into the MPI - neither works since the hostlist is
> empty).
>
> It looks like there is a canonical way to do it as you mentioned. FAQ
> worthy? Definitely for my own Slurm FAQ. Others will decide if it is worthy
> for Slurm docs :)
>
> Thanks everyone for your help!
>
> Jeff
>
>
> On Mon, Aug 12, 2024 at 9:36 AM Paul Edmon via slurm-users <
> slurm-users@lists.schedmd.com> wrote:
>
>> Normally MPI will just pick up the host list from Slurm itself: build MPI
>> against Slurm and it will grab it automatically. Typically this is
>> transparent to the user, so you shouldn't need to pass a host list at
>> all. See: https://slurm.schedmd.com/mpi_guide.html
>>
>> If you do need to, the canonical way is to run scontrol show hostnames
>> against $SLURM_JOB_NODELIST (
>> https://slurm.schedmd.com/scontrol.html#OPT_hostnames). That will give
>> you the list of hosts your job is set to run on.
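
A minimal sketch of that pattern (the hostfile name and application binary
are placeholders; assumes this runs inside a Slurm allocation, where
SLURM_JOB_NODELIST and SLURM_NTASKS are set):

```shell
# Expand the compressed node list (e.g. holy7c[26401-26405]) into
# one hostname per line, the format OpenMPI's --hostfile expects.
scontrol show hostnames "$SLURM_JOB_NODELIST" > hosts.txt

# Launch with an explicit hostfile; note that without PMI2/PMIx
# integration, mpirun does the bootstrap instead of the scheduler.
mpirun -np "$SLURM_NTASKS" --hostfile hosts.txt ./my_mpi_app
```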
>>
>> -Paul Edmon-
>> On 8/12/2024 8:34 AM, Jeffrey Layton via slurm-users wrote:
>>
>> Thanks! I admit I'm not that experienced in Bash. I will give this a
>> whirl as a test.
>>
>> In the meantime, let me ask: what is the "canonical" way to create the
>> host list? It would be nice to have this in the Slurm FAQ somewhere.
>>
>> Thanks!
>>
>> Jeff
>>
>>
>>
>> On Fri, Aug 9, 2024 at 1:32 PM Hermann Schwärzler via slurm-users <
>> slurm-users@lists.schedmd.com> wrote:
>>
>>> Hi Paul,
>>>
>>> On 8/9/24 18:45, Paul Edmon via slurm-users wrote:
>>> > As I recall, OpenMPI needs a list that has an entry on each line,
>>> > rather than one separated by spaces. See:
>>> >
>>> > [root@holy7c26401 ~]# echo $SLURM_JOB_NODELIST
>>> > holy7c[26401-26405]
>>> > [root@holy7c26401 ~]# scontrol show hostnames $SLURM_JOB_NODELIST
>>> > holy7c26401
>>> > holy7c26402
>>> > holy7c26403
>>> > holy7c26404
>>> > holy7c26405
>>> >
>>> > [root@holy7c26401 ~]# list=$(scontrol show hostname $SLURM_NODELIST)
>>> > [root@holy7c26401 ~]# echo $list
>>> > holy7c26401 holy7c26402 holy7c26403 holy7c26404 holy7c26405
>>>
>>> Proper quoting does wonders here (please consult the bash man page).
>>> If you try
>>>
>>> echo "$list"
>>>
>>> you will see that you will get
>>>
>>> holy7c26401
>>> holy7c26402
>>> holy7c26403
>>> holy7c26404
>>> holy7c26405
>>>
>>> So you *can* pass this around in a variable if you use "$variable"
>>> whenever you provide it to a utility.
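
Hermann's point can be demonstrated without a cluster (the node names below
stand in for real scontrol output):

```shell
# Simulate the multi-line output of 'scontrol show hostnames'.
list=$(printf 'node01\nnode02\nnode03')

# Unquoted: word splitting collapses the newlines into single spaces.
echo $list          # node01 node02 node03

# Quoted: the embedded newlines survive intact.
echo "$list"        # node01
                    # node02
                    # node03
```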
>>>
>>> Regards,
>>> Hermann
>>>
>>> --
>>> slurm-users mailing list -- slurm-users@lists.schedmd.com
>>> To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
>>>
>>
>>
>>
>
