Many thanks, Matthieu!
Andy
On 02/12/2018 06:42 PM, Matthieu Hautreux wrote:
Hi,
Your login node may be under heavy load while starting such a large
number of independent sruns. That can induce issues not seen under
normal load, such as partial reads/writes on sockets, which trigger
bugs in Slurm in functions that are not properly protected against
such events.
Quickly looking at the source code of the function generating the
"io_init_msg_read too small" message, it seems that at least this one
is not properly protected against partial writes:
int
io_init_msg_write_to_fd(int fd, struct slurm_io_init_msg *msg)
{
    Buf buf;
    void *ptr;
    int n;

    xassert(msg);

    debug2("Entering io_init_msg_write_to_fd");
    msg->version = IO_PROTOCOL_VERSION;
    buf = init_buf(io_init_msg_packed_size());
    debug2(" msg->nodeid = %d", msg->nodeid);
    io_init_msg_pack(msg, buf);

    ptr = get_buf_data(buf);
again:
    /* => write() may return a short count on a loaded socket; only
     * EINTR is retried, while a partial write falls through to the
     * error branch below */
    if ((n = write(fd, ptr, io_init_msg_packed_size())) < 0) {
        if (errno == EINTR)
            goto again;
        free_buf(buf);
        return SLURM_ERROR;
    }
    if (n != io_init_msg_packed_size()) {
        error("io init msg write too small");
        free_buf(buf);
        return SLURM_ERROR;
    }

    free_buf(buf);
    debug2("Leaving io_init_msg_write_to_fd");
    return SLURM_SUCCESS;
}
A proper way to handle partial writes is the following (taken from
elsewhere in the Slurm codebase):
ssize_t fd_write_n(int fd, void *buf, size_t n)
{
    size_t nleft;
    ssize_t nwritten;
    unsigned char *p;

    p = buf;
    nleft = n;
    while (nleft > 0) {
        /* => keep writing until all n bytes are out, resuming
         * after short writes and EINTR */
        if ((nwritten = write(fd, p, nleft)) < 0) {
            if (errno == EINTR)
                continue;
            else
                return(-1);
        }
        nleft -= nwritten;
        p += nwritten;
    }
    return(n);
}
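For illustration, here is a minimal, untested sketch of how
io_init_msg_write_to_fd could be reworked on top of fd_write_n
(assuming fd_write_n is visible from that file; the _fixed suffix is
only to set the sketch apart):

static int
io_init_msg_write_to_fd_fixed(int fd, struct slurm_io_init_msg *msg)
{
    Buf buf;
    int rc = SLURM_SUCCESS;

    xassert(msg);

    msg->version = IO_PROTOCOL_VERSION;
    buf = init_buf(io_init_msg_packed_size());
    io_init_msg_pack(msg, buf);

    /* fd_write_n() resumes after short writes and EINTR, so a
     * partially written message is completed instead of being
     * reported as "write too small" */
    if (fd_write_n(fd, get_buf_data(buf),
                   io_init_msg_packed_size()) < 0)
        rc = SLURM_ERROR;

    free_buf(buf);
    return rc;
}

The read side that produces the "io_init_msg_read too small" error
would presumably need the same treatment, if there is a matching
fd_read_n helper.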
It seems that some code cleanup/refactoring could be done in Slurm to
limit the risk of this kind of issue. I am not sure it would resolve
your problem, but it seems harmful to leave code like that in place.
You should file a bug for that.
HTH
Matthieu
2018-02-12 22:42 GMT+01:00 Andy Riebs <andy.ri...@hpe.com>:
We have a user who wants to run multiple instances of a
single-process job across a cluster, using a loop like
-----
for N in $nodelist; do
srun -w $N program &
done
wait
-----
This works up to a thousand nodes or so (jobs are allocated by node
here), but as the number of jobs submitted increases, we periodically
see a variety of error messages, such as
* srun: error: Ignoring job_complete for job 100035 because our job ID is 102937
* srun: error: io_init_msg_read too small
* srun: error: task 0 launch failed: Unspecified error
* srun: error: Unable to allocate resources: Job/step already completing or completed
* srun: error: Unable to allocate resources: No error
* srun: error: unpack error in io_init_msg_unpack
* srun: Job step 211042.0 aborted before step completely launched.
We have tried setting
-----
ulimit -n 500000    # max open file descriptors
ulimit -u 64000     # max user processes
-----
but that wasn't sufficient.
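One workaround that might reduce the login-node load (an untested
sketch; the cap of 100 is arbitrary) would be to throttle the loop so
that only a bounded number of sruns are in flight at once:
-----
MAX_INFLIGHT=100    # arbitrary cap, tune for the cluster
count=0
for N in $nodelist; do
    srun -w $N program &
    count=$((count + 1))
    if [ "$count" -ge "$MAX_INFLIGHT" ]; then
        wait    # drain this batch before launching more
        count=0
    fi
done
wait
-----
but we would rather understand the underlying failures than paper
over them.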
The environment:
* CentOS 7.3 (x86_64)
* Slurm 17.11.0
Does this ring any bells? Any thoughts about how we should proceed?
Andy
--
Andy Riebs
andy.ri...@hpe.com
Hewlett Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!