I know there was an issue with Torque and cpusets at one time, but I believe that has been fixed (likely later in the 1.10 series).
Regardless, the error message you are seeing indicates the failure to open a socket between daemons on different nodes. Could be hitting a file descriptor limit, or it may just be an issue with specific nodes (and so you’d only see it if your allocation included those nodes). You should only see that message during MPI_Init, not in the middle of the job.

> On Nov 27, 2015, at 6:26 AM, Grigory Shamov <grigory.sha...@umanitoba.ca> wrote:
>
> Hi Ralph,
>
> Thanks for the reply!
> I have tried, but couldn't get 1.8.8 or 1.10 (tried 1.10.0 back then) to
> work with our pretty old Torque 2.5.13 with cpusets. Under some
> circumstances (process/node layout as given by Torque), it fails to bind
> cores with messages like:
>
> Error message: hwloc_set_cpubind returned "Error" for bitmap "0"
> Location:
> ../../../../../openmpi-1.10.0/orte/mca/odls/default/odls_default_module.c:551
>
> --
> Grigory Shamov
> HPC Analist,
> Westgrid/ComputeCanada Site Lead
> University of Manitoba
> E2-588 EITC Building,
> (204) 474-9625
>
> On 15-11-26 6:42 PM, "users on behalf of Ralph Castain"
> <users-boun...@open-mpi.org on behalf of r...@open-mpi.org> wrote:
>
>> You might want to upgrade to 1.10.1, or at least to 1.8.8 as 1.6.5 is
>> pretty old
>>
>>> On Nov 26, 2015, at 1:49 PM, Grigory Shamov
>>> <grigory.sha...@umanitoba.ca> wrote:
>>>
>>> Hi All,
>>>
>>> For a parallel MPI job, we sometimes (not always) get the following
>>> message:
>>>
>>> [n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number
>>> of attempts to create TCP connection has been exceeded. Can not
>>> communicate with peer
>>> [n047:25850] [[36630,0],1] ORTE_ERROR_LOG: Unreachable in file
>>> ../../../../../openmpi-1.6.5/orte/mca/grpcomm/bad/grpcomm_bad_module.c
>>> at
>>> line 412
>>> [n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number
>>> of attempts to create TCP connection has been exceeded. Can not
>>> communicate with peer
>>>
>>> These appear in the middle of a running job; we use OpenMPI 1.6.5 and
>>> OFED
>>> 2.4 on CentOS 6.
>>>
>>> --
>>> Grigory Shamov
>>> HPC Analist,
>>> Westgrid/ComputeCanada Site Lead
>>> University of Manitoba
>>> E2-588 EITC Building,
>>> (204) 474-9625
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/11/28113.php
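
One way to check the file descriptor theory from the reply above is to compare how many descriptors a process has open against its limit. The snippet below is only a rough sketch, not anything taken from Open MPI: the helper name `fd_usage` is made up for illustration, and it assumes a Linux-like system where `/proc/self/fd` (or `/dev/fd`) lists the open descriptors. It reports on the process that runs it, so it shows the limits a job inherits rather than what the Open MPI daemons are actually using.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process.

    Hypothetical helper for illustration; not part of Open MPI or Torque.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Each entry in these directories corresponds to one open descriptor.
    # The count includes the descriptor briefly opened to read the directory.
    for path in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(path):
            return len(os.listdir(path)), soft, hard
    raise RuntimeError("no per-process fd directory found on this system")


if __name__ == "__main__":
    open_fds, soft, hard = fd_usage()
    # A limit equal to resource.RLIM_INFINITY means "unlimited".
    print(f"open descriptors: {open_fds}, soft limit: {soft}, hard limit: {hard}")
```

It is probably most informative when run from inside a batch job on the affected nodes (n047 and n230 in the log above), since the limits a job inherits under the batch system may differ from those of an interactive login shell.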