[OMPI users] OpenFabrics (openib)
Hello all,

First time poster. I recently installed Open MPI 1.6.4 on my cluster with resource manager support:

./configure --with-tm --prefix=/opt/openmpi/1.6.2/

It works well, but I always get an error saying:

[[58551,1],0]: A high-performance Open MPI point-to-point messaging module was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: login-1-2

Another transport will be used instead, although this may result in lower performance.

This is probably looking for an InfiniBand network; the build apparently defaulted to including InfiniBand support when I installed it. However, I do not have any InfiniBand in my cluster, so how do I fix this problem?

Perhaps re-configure Open MPI without openib? If that's the case, what flag should I use with configure?

K
Re: [OMPI users] OpenFabrics (openib)
--without-openib will do the trick.

On Feb 27, 2013, at 7:24 AM, Khapare Joshi wrote:
> [original question quoted above; snipped]
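For example, assuming the same Torque support (--with-tm) and install prefix as in the original build, the rebuild might look like this (a sketch, not tested on the poster's system):

./configure --with-tm --without-openib --prefix=/opt/openmpi/1.6.2/
make all install

After reinstalling, the warning should no longer appear because the openib BTL is not built at all.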
Re: [OMPI users] rcu_sched stalls on CPU
We've resolved this issue, which appears to have been an early warning of a large-scale hardware failure. Twelve hours later the machine was unable to power on or self-test.

We are now running on a new machine, and the same jobs are finishing normally, relying solely on blocking communication and without having to worry about Send/Ssend/Isend buffering differences.

Simon

Research Fellow
Santa Fe Institute
http://santafe.edu/~simon

On 25 Feb 2013, at 4:04 PM, Simon DeDeo wrote:

> I have been having some trouble tracing the source of a CPU stall with Open MPI on Gentoo.
>
> My code is very simple: each process does a Monte Carlo run, saves some data to disk, and sends back a single MPI_DOUBLE to node zero, which picks the best value from all the computations (including the one it did itself).
>
> For some reason, this can cause CPUs to "stall" (see the dmesg output below); the stall actually causes the system to crash and reboot, which seems pretty crazy.
>
> My best guess is that some of the nodes greater than zero have MPI_Sends outstanding, but node zero is not finished with its own computation yet and so has not posted an MPI_Recv. They get mad waiting? This happens when I give the Monte Carlo runs large numbers, so the variance in end time is larger.
>
> However, the behavior seems a bit extreme, and I am wondering if something more subtle is going on. My sysadmin was trying to fix something on the machine the last time it crashed, and it trashed the kernel! So I am also in the sysadmin doghouse.
>
> Any help or advice greatly appreciated! Is it likely to be an MPI_Send/MPI_Recv problem, or is there something else going on?
>
> [ 1273.079260] INFO: rcu_sched detected stalls on CPUs/tasks: { 12 13} (detected by 17, t=60002 jiffies)
> [ 1273.079272] Pid: 2626, comm: cluster Not tainted 3.6.11-gentoo #10
> [ 1273.079275] Call Trace:
> [ 1273.079277] [] rcu_check_callbacks+0x5a7/0x600
> [ 1273.079294] [] update_process_times+0x43/0x80
> [ 1273.079298] [] tick_sched_timer+0x76/0xc0
> [ 1273.079303] [] __run_hrtimer.isra.33+0x4e/0x100
> [ 1273.079306] [] hrtimer_interrupt+0xeb/0x220
> [ 1273.079311] [] smp_apic_timer_interrupt+0x64/0xa0
> [ 1273.079316] [] apic_timer_interrupt+0x67/0x70
> [ 1273.079317]
>
> Simon
>
> Research Fellow
> Santa Fe Institute
> http://santafe.edu/~simon
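The pattern Simon describes (every rank computes independently, then sends one MPI_DOUBLE back to rank 0, which keeps the best result) might look roughly like the following sketch. This is a hypothetical reconstruction, not code from the thread: monte_carlo_run() is a stand-in for the real computation, and "best" is assumed here to mean the smallest value.

/* Hypothetical sketch of the pattern described above; monte_carlo_run()
 * is a placeholder, not code from the thread. */
#include <mpi.h>
#include <stdio.h>

/* stand-in for the real Monte Carlo computation */
static double monte_carlo_run(int rank)
{
    return (double)((rank * 37) % 11);  /* fake "score" that differs by rank */
}

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double my_best = monte_carlo_run(rank);

    if (rank == 0) {
        double best = my_best;
        double val;
        int src;
        for (src = 1; src < size; src++) {
            /* blocking receive; results are taken in whatever order they arrive */
            MPI_Recv(&val, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (val < best) {
                best = val;  /* assumes "best" means smallest */
            }
        }
        printf("best value: %g\n", best);
    } else {
        /* blocking send of a single double back to rank 0 */
        MPI_Send(&my_best, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

The same "pick the best" step could also be written as a single collective, MPI_Reduce(&my_best, &best, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD), which sidesteps questions about when each blocking MPI_Send is allowed to complete.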
Re: [OMPI users] rcu_sched stalls on CPU
I'm glad you figured this out. Your mail was on my to-do list to reply to today; I didn't reply earlier simply because I had no idea what the problem could have been.

I'm also kinda glad it wasn't related to MPI. ;-)

On Feb 27, 2013, at 11:20 AM, Simon DeDeo wrote:
> [resolution and original report quoted above; snipped]

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] OpenFabrics (openib)
You can also just disable/unload the OpenFabrics drivers on your systems. Open MPI is reacting to the fact that it found the drivers loaded (even though, apparently, there is no OpenFabrics-based hardware active). If you unload the drivers, this message should go away.

On Feb 27, 2013, at 10:29 AM, Ralph Castain wrote:
> --without-openib will do the trick
> [rest of quoted thread snipped]

--
Jeff Squyres
jsquy...@cisco.com
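A third option, not raised in the thread but standard Open MPI usage, is to exclude the openib BTL at run time with an MCA parameter; this silences the same warning without rebuilding Open MPI or unloading drivers. For example (./my_app is just a placeholder for whatever program is being launched):

mpiexec --mca btl ^openib -np 4 ./my_app

The remaining BTLs (e.g., sm and tcp) are then selected automatically.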
[OMPI users] Option -cpus-per-proc 2 not working with given machinefile?
Hi,

I have an issue using the option -cpus-per-proc 2. As I have Bulldozer machines and I want only one process per FP core, I thought using -cpus-per-proc 2 would be the way to go. Initially I had this issue inside GridEngine, but then I tried it outside any queuing system and see exactly the same behavior.

Each machine has 4 CPUs, each having 16 integer cores, hence 64 integer cores per machine in total. The Open MPI version used is 1.6.4.

a) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello

with a hostfile containing only the two lines listing the machines:

node006
node007

This works as I would like (see working.txt) when initiated on node006.

b) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello

But changing the hostfile so that it has a slot count, which might mimic the machinefile parsed out of a queuing system:

node006 slots=64
node007 slots=64

This fails with:

--
An invalid physical processor ID was returned when attempting to bind an MPI process to a unique processor on node:

Node: node006

This usually means that you requested binding to more processors than exist (e.g., trying to bind N MPI processes to M processors, where N > M), or that the node has an unexpectedly different topology.

Double check that you have enough unique processors for all the MPI processes that you are launching on this host, and that all nodes have identical topologies.

You job will now abort.
--

(see failed.txt)

b1) mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 32 ./mpihello

This works, and the found universe is 128 as expected (see only32.txt).

c) Maybe the machinefile is not parsed in the correct way, so I checked:

c1) mpiexec -hostfile machines -np 64 ./mpihello => works
c2) mpiexec -hostfile machines -np 128 ./mpihello => works
c3) mpiexec -hostfile machines -np 129 ./mpihello => fails as expected

So, it got the slot counts in the correct way. What do I miss?

-- Reuti

reuti@node006:~> mpiexec -cpus-per-proc 2 -report-bindings -hostfile machines -np 64 ./mpihello
--
An invalid physical processor ID was returned when attempting to bind an MPI process to a unique processor on node:

Node: node006

This usually means that you requested binding to more processors than exist (e.g., trying to bind N MPI processes to M processors, where N > M), or that the node has an unexpectedly different topology.

Double check that you have enough unique processors for all the MPI processes that you are launching on this host, and that all nodes have identical topologies.

You job will now abort.
--
[node006:44140] MCW rank 0 bound to socket 0[core 0-1]: [B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 1 bound to socket 0[core 2-3]: [. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 2 bound to socket 0[core 4-5]: [. . . . B B . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 3 bound to socket 0[core 6-7]: [. . . . . . B B . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 4 bound to socket 0[core 8-9]: [. . . . . . . . B B . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 5 bound to socket 0[core 10-11]: [. . . . . . . . . . B B . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 6 bound to socket 0[core 12-13]: [. . . . . . . . . . . . B B . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 7 bound to socket 0[core 14-15]: [. . . . . . . . . . . . . . B B][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 8 bound to socket 1[core 0-1]: [. . . . . . . . . . . . . . . .][B B . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 9 bound to socket 1[core 2-3]: [. . . . . . . . . . . . . . . .][. . B B . . . . . . . . . . . .][. . . . . . . . . . . . . . . .][. . . . . . . . . . . . . . . .]
[node006:44140] MCW rank 10 bound to socket 1[core 4-5]: [. . . . . . . . . . . . . . . .][. . . . B B . . . . . . . . . .][. . . . . .
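For reference, the source of the ./mpihello test program is not shown in the thread. A minimal stand-in (hypothetical, not Reuti's actual code) that prints each rank and the host it landed on, which is enough to exercise the hostfile and binding options above, could be:

/* Hypothetical minimal "mpihello": prints rank, world size, and host name. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}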