Hi Ralph,

Thanks for the reply! I have tried, but couldn't get 1.8.8 or 1.10 (I tried 1.10.0 back then) to work with our pretty old Torque 2.5.13 with cpusets. Under some circumstances (depending on the process/node layout given by Torque), it fails to bind cores, with messages like:
Error message: hwloc_set_cpubind returned "Error" for bitmap "0"
Location: ../../../../../openmpi-1.10.0/orte/mca/odls/default/odls_default_module.c:551

--
Grigory Shamov
HPC Analist,
Westgrid/ComputeCanada Site Lead
University of Manitoba
E2-588 EITC Building,
(204) 474-9625

On 15-11-26 6:42 PM, "users on behalf of Ralph Castain" <users-boun...@open-mpi.org on behalf of r...@open-mpi.org> wrote:

>You might want to upgrade to 1.10.1, or at least to 1.8.8 as 1.6.5 is
>pretty old
>
>> On Nov 26, 2015, at 1:49 PM, Grigory Shamov
>> <grigory.sha...@umanitoba.ca> wrote:
>>
>> Hi All,
>>
>> For a parallel MPI job, we sometimes (not always) get the following
>> message:
>>
>> [n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number
>> of attempts to create TCP connection has been exceeded. Can not
>> communicate with peer
>> [n047:25850] [[36630,0],1] ORTE_ERROR_LOG: Unreachable in file
>> ../../../../../openmpi-1.6.5/orte/mca/grpcomm/bad/grpcomm_bad_module.c
>> at line 412
>> [n047:25850] [[36630,0],1] -> [[36630,0],0] (node: n230) oob-tcp: Number
>> of attempts to create TCP connection has been exceeded. Can not
>> communicate with peer
>>
>> These appear in the middle of a running job; we use OpenMPI 1.6.5 and
>> OFED 2.4 on CentOS 6.
>>
>> --
>> Grigory Shamov
>> HPC Analist,
>> Westgrid/ComputeCanada Site Lead
>> University of Manitoba
>> E2-588 EITC Building,
>> (204) 474-9625
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/11/28113.php
>
>_______________________________________________
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post:
>http://www.open-mpi.org/community/lists/users/2015/11/28114.php