I believe fragmentation only happens on routers when passing traffic from one 
subnet to another. Since this traffic was all on a single subnet, there was no 
router involved to fragment the packets.
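
For what it's worth, a quick way to compare MTUs across machines 
programmatically; this is a minimal sketch of my own (not Slurm code) using 
the standard SIOCGIFMTU ioctl, with "eth0" as a placeholder default interface 
name:

/* mtu_check.c - minimal sketch (not Slurm code): print an interface's
 * MTU via the SIOCGIFMTU ioctl so values can be compared across nodes. */
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *ifname = (argc > 1) ? argv[1] : "eth0"; /* placeholder */
        struct ifreq ifr;
        int fd = socket(AF_INET, SOCK_DGRAM, 0); /* any socket works for the ioctl */

        if (fd < 0) {
                perror("socket");
                return 1;
        }
        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {
                perror("ioctl(SIOCGIFMTU)");
                close(fd);
                return 1;
        }
        printf("%s MTU = %d\n", ifr.ifr_name, ifr.ifr_mtu);
        close(fd);
        return 0;
}

Running that (or just "ip link") on the master and each compute node and 
comparing the numbers would have flagged the mismatch right away.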

Mike

On 11/26/18 1:49 PM, Kenneth Roberts wrote:
D’oh!

The compute nodes had a different MTU on their network interfaces than the 
master did. Once they were all set to 1500, it works!

So ... any ideas why that was a problem? Maybe the interfaces had 
fragmentation disabled and packets were being dropped?

Thanks for listening.
Ken

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Kenneth Roberts
Sent: Monday, November 26, 2018 9:38 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Slurm / OpenHPC socket timeout errors

I wasn't looking closely enough at the times in the log file.

c2: [2018-11-26T10:09:40.963] debug3: in the service_connection
c2: [2018-11-26T10:10:00.983] debug:  slurm_recv_timeout at 0 of 9589, timeout
c2: [2018-11-26T10:10:00.983] error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation
c2: [2018-11-26T10:10:00.994] error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation
c2: [2018-11-26T10:10:01.106] debug3: in the service_connection

It looks like slurm_recv_timeout keeps retrying for 20 seconds, and each pass 
just hits continue without reading any data:

if ((rc = poll(&ufds, 1, timeleft)) <= 0) {
        if ((errno == EINTR) || (errno == EAGAIN) || (rc == 0))
                continue;
        else {
                debug("%s at %d of %zu, poll error: %m",
                      __func__, recvlen, size);
                slurm_seterrno(SLURM_COMMUNICATIONS_RECEIVE_ERROR);
                recvlen = SLURM_ERROR;
                goto done;
        }
}

So poll is timing out after 20 seconds.
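
Here is a standalone toy sketch (mine, not Slurm's) of that behavior: with 
nothing arriving on the descriptor, poll() blocks for the whole timeout and 
returns 0, which is exactly the rc == 0 / continue path above; fd 0 (stdin) 
is just a stand-in for the connection:

/* poll_timeout.c - toy sketch (not Slurm code): poll() on a quiet fd
 * returns 0 when the timeout expires without any data to read. */
#include <poll.h>
#include <stdio.h>

int main(void)
{
        struct pollfd ufds = { .fd = 0 /* stdin as a stand-in */, .events = POLLIN };
        int rc = poll(&ufds, 1, 2000);  /* wait up to 2000 ms */

        if (rc == 0)
                printf("poll timed out with no data\n");
        else if (rc > 0)
                printf("fd is readable\n");
        else
                perror("poll");
        return 0;
}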

Back to finding out why ...

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Kenneth Roberts
Sent: Monday, November 26, 2018 8:35 AM
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Slurm / OpenHPC socket timeout errors

Here is the debug log on a node (c2) when the job fails ...

c2: [2018-11-26T07:35:56.261] debug3: in the service_connection
c2: [2018-11-26T07:36:16.281] debug:  slurm_recv_timeout at 0 of 9680, timeout
c2: [2018-11-26T07:36:16.282] error: slurm_receive_msg_and_forward: Socket timed out on send/recv operation
c2: [2018-11-26T07:36:16.292] error: service_connection: slurm_receive_msg: Socket timed out on send/recv operation
c2: [2018-11-26T07:36:16.334] debug3: in the service_connection

The line "debug:  slurm_recv_timeout at 0 of 9680, timeout" suggests it times 
out before reading even the first byte of the message.

Here is the code snippet that generates that debug message:

extern int slurm_recv_timeout(int fd, char *buffer, size_t size,
                              uint32_t flags, int timeout)
...
while (recvlen < size) {
        timeleft = timeout - _tot_wait(&tstart);
        if (timeleft <= 0) {
                debug("%s at %d of %zu, timeout", __func__, recvlen, size);
                slurm_seterrno(SLURM_PROTOCOL_SOCKET_IMPL_TIMEOUT);
                recvlen = SLURM_ERROR;
                goto done;
        }

recvlen is 0 in the log message, which might indicate it errored on the very 
first pass through the loop (timeleft <= 0).

We have MessageTimeout=20 in our slurm.conf, but this code acts as though it 
was passed timeout = 0??
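
To spell out the arithmetic with a toy sketch (mine, not Slurm's): timeleft 
goes non-positive either when timeout is 0 to begin with, or once the full 
timeout elapses with no data, and both cases would print the same log line:

/* toy sketch (not Slurm code) of the deadline arithmetic above;
 * timeout and elapsed are both in milliseconds. */
#include <stdio.h>

int main(void)
{
        int timeout  = 0;   /* suspect value; would be 20 * 1000 if MessageTimeout=20 took effect */
        int elapsed  = 0;   /* stand-in for _tot_wait(&tstart) */
        int timeleft = timeout - elapsed;

        if (timeleft <= 0)
                printf("slurm_recv_timeout at 0 of 9680, timeout\n");
        return 0;
}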

Up the call stack, slurm_receive_msg_and_forward sets the timeout to the 
default:

if (timeout <= 0)
        /* convert secs to msec */
        timeout = slurm_get_msg_timeout() * 1000;

Unless slurm_get_msg_timeout() is not working?

It may be that the slurm.conf values aren't being parsed or used correctly, 
though I don't see anything like permission errors when reading slurm.conf ...
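
One quick sanity check might be "scontrol show config", which prints the 
MessageTimeout value the controller actually parsed; if that reports 20, at 
least slurmctld is reading slurm.conf correctly.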

Continuing the search ...

From: slurm-users <slurm-users-boun...@lists.schedmd.com> On Behalf Of Kenneth Roberts
Sent: Friday, November 23, 2018 4:15 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Slurm / OpenHPC socket timeout errors

Hi –

I'm seeing the following on a new cluster with OpenHPC & Slurm, built from 
the latest OpenHPC recipe and packages (built this week).

One master node and 4 compute nodes.
NodeName=c[1-4] Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 State=UNKNOWN

With simple test scripts, sbatch produces the following error when running 
across more than one node –

The batch script is –

#!/bin/bash
srun hostname

$ sbatch -N4 -n4 hostname.sh

Out file --
c1
srun: error: Task launch for 151.0 failed on node c4: Socket timed out on send/recv operation
srun: error: Task launch for 151.0 failed on node c3: Socket timed out on send/recv operation
srun: error: Task launch for 151.0 failed on node c2: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

Searching on this turns up a lot of info about large jobs that launch many 
tasks very quickly, along with recommendations for timeouts and large-cluster 
settings. BUT I'm running four tasks that are just 'hostname'!

AND if I just execute srun from the command line, it works across the nodes:

$ srun -N4 -n4 hostname
c1
c2
c3
c4

Also, if I sbatch 20 tasks capped at one node, they all launch fine. But 21 
tasks (which has to launch on two nodes) works on c1 (20 lines of output) and 
fails on the 21st task on c2:
c1
c1
c1
... (17 more)
srun: error: Task launch for 156.0 failed on node c2: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


Maybe I completely don't get the sbatch options/params (I'm using the 
defaults), BUT I'm attempting the simplest thing I could think of just to 
test this out.

Trying another approach to test: a batch script that uses a job array to run 
32 copies of a simple Python script (so there's no srun in the batch script) 
appears to work properly and uses all the nodes. But sbatch with an srun 
inside the script gives the errors.

Really hoping this is something obvious that, as a noob to OpenHPC and Slurm, 
I'm getting wrong.

Thanks in advance for any pointers or answers!

Ken
