Re: [OMPI users] Infiniband performance Problem and stalling
Yevgeny,

The ibstat results:

CA 'mthca0'
        CA type: MT25208 (MT23108 compat mode)
        Number of ports: 2
        Firmware version: 4.7.600
        Hardware version: a0
        Node GUID: 0x0005ad0c21e0
        System image GUID: 0x0005ad000100d050
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 10
                Base lid: 4
                LMC: 0
                SM lid: 2
                Capability mask: 0x02510a68
                Port GUID: 0x0005ad0c21e1
                Link layer: IB
        Port 2:
                State: Down
                Physical state: Polling
                Rate: 10
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x02510a68
                Port GUID: 0x0005ad0c21e2
                Link layer: IB

And more interestingly, ib_write_bw:

                    RDMA_Write BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth        : 300
 CQ Moderation   : 50
 Link type       : IB
 Mtu             : 2048
 Inline data is used up to 0 bytes message
 local address:  LID 0x04 QPN 0x1c0407 PSN 0x48ad9e RKey 0xd86a0051 VAddr 0x002ae36287
 remote address: LID 0x03 QPN 0x2e0407 PSN 0xf57209 RKey 0x8d98003b VAddr 0x002b533d366000
--
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]
Conflicting CPU frequency values detected: 1600.00 != 3301.00
 65536      5000           0.00               0.00
--

What does "Conflicting CPU frequency values" mean?

Examining the /proc/cpuinfo file, however, shows:

processor       : 0
cpu MHz         : 3301.000
processor       : 1
cpu MHz         : 3301.000
processor       : 2
cpu MHz         : 1600.000
processor       : 3
cpu MHz         : 1600.000

which seems oddly weird to me...

From: Yevgeny Kliteynik
To: Randolph Pullen ; OpenMPI Users
Sent: Thursday, 6 September 2012 6:03 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling

On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
Could you run "ibstat" and post the results?
What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No I haven't used 1.6 I was trying to stick with the standards on the
> mellanox disk.
> Is there a known problem with 1.4.3 ?
>
> --
> *From:* Yevgeny Kliteynik
> *To:* Randolph Pullen ; Open MPI Users
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>
> Randolph,
>
> Some clarification on the setup:
>
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to
> Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
>
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
>
> -- YK
>
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
> > (reposted with consolidated information)
> > I have a test rig comprising 2 i7 systems, 8GB RAM, with Melanox III HCA 10G cards,
> > running Centos 5.7 Kernel 2.6.18-274
> > Open MPI 1.4.3
> > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2)
> > on a Cisco 24-port switch.
> > Normal performance is:
> > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
> > results in:
> > Max rate = 958.388867 MB/sec  Min latency = 4.529953 usec
> > and:
> > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
> > Max rate = 653.547293 MB/sec  Min latency = 19.550323 usec
> > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes, which seems fine.
> > log_num_mtt = 20 and log_mtts_per_seg = 2
> > My application exchanges about a gig of data between the processes with 2 send
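The "Conflicting CPU frequency values" message comes from the perftest tools themselves: they calibrate their cycle-counter timing from the per-core "cpu MHz" figures, and when those disagree (3301 vs. 1600 MHz here) the warning is printed and the bandwidth columns typically come out as 0.00, which usually reflects the timing calibration rather than a link problem. A small check along the following lines, offered only as an illustrative sketch that assumes the standard Linux cpufreq sysfs files are present, shows whether a frequency-scaling governor such as "ondemand" is clocking some cores down:

/* cpu_freq_check.c - print each core's cpufreq governor and current frequency.
 * Illustrative sketch; assumes the usual Linux cpufreq sysfs layout
 * (/sys/devices/system/cpu/cpuN/cpufreq/...) is available.
 * Build: gcc -std=c99 -o cpu_freq_check cpu_freq_check.c */
#include <stdio.h>
#include <string.h>

/* Read the first line of a sysfs file into buf, without the trailing newline. */
static void read_line(const char *path, char *buf, size_t len)
{
    FILE *f = fopen(path, "r");
    if (f != NULL && fgets(buf, (int)len, f) != NULL)
        buf[strcspn(buf, "\n")] = '\0';
    else
        snprintf(buf, len, "n/a");
    if (f != NULL)
        fclose(f);
}

int main(void)
{
    char path[128], gov[64], khz[64];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        FILE *probe = fopen(path, "r");
        if (probe == NULL)          /* no more CPUs with cpufreq entries */
            break;
        fclose(probe);

        read_line(path, gov, sizeof(gov));
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        read_line(path, khz, sizeof(khz));

        printf("cpu%d: governor=%-12s current=%s kHz\n", cpu, gov, khz);
    }
    return 0;
}

If the governor turns out to be "ondemand" or "powersave", switching it to "performance" for the duration of the benchmark (or simply keeping the cores busy) is the usual way to get consistent readings; that is generic Linux practice, not something prescribed in this thread.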
Re: [OMPI users] Infiniband performance Problem and stalling
One system is actually an i5-2400 - maybe it's throttling back on 2 cores to save power?
The other (i7) shows consistent CPU MHz on all cores.

From: Yevgeny Kliteynik
To: Randolph Pullen ; OpenMPI Users
Sent: Thursday, 6 September 2012 6:03 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling

On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
Could you run "ibstat" and post the results?
What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No I haven't used 1.6 I was trying to stick with the standards on the
> mellanox disk.
> Is there a known problem with 1.4.3 ?
>
> --
> *From:* Yevgeny Kliteynik
> *To:* Randolph Pullen ; Open MPI Users
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>
> Randolph,
>
> Some clarification on the setup:
>
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to
> Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
>
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
>
> -- YK
>
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
> > (reposted with consolidated information)
> > I have a test rig comprising 2 i7 systems, 8GB RAM, with Melanox III HCA 10G cards,
> > running Centos 5.7 Kernel 2.6.18-274
> > Open MPI 1.4.3
> > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2)
> > on a Cisco 24-port switch.
> > Normal performance is:
> > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
> > results in:
> > Max rate = 958.388867 MB/sec  Min latency = 4.529953 usec
> > and:
> > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
> > Max rate = 653.547293 MB/sec  Min latency = 19.550323 usec
> > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes, which seems fine.
> > log_num_mtt = 20 and log_mtts_per_seg = 2
> > My application exchanges about a gig of data between the processes, with 2 sender
> > and 2 consumer processes on each node and 1 additional controller process on
> > the starting node.
> > The program splits the data into 64K blocks and uses non-blocking sends and
> > receives with busy/sleep loops to monitor progress until completion.
> > Each process owns a single buffer for these 64K blocks.
> > My problem is I see better performance under IPoIB than I do on native IB (RDMA_CM).
> > My understanding is that IPoIB is limited to about 1G/s, so I am at a loss to
> > know why it is faster.
> > These 2 configurations are equivalent (about 8-10 seconds per cycle):
> > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog
> > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl tcp,self -H vh2,vh1 -np 9 --bycore prog

When you say "--mca btl tcp,self", it means that openib btl is not enabled.
Hence "--mca btl_openib_flags" is irrelevant.

> > And this one produces similar run times but seems to degrade with repeated cycles:
> > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl openib,self -H vh2,vh1 -np 9 --bycore prog

You're running 9 ranks on two machines, but you're using IB for intra-node
communication. Is it intentional? If not, you can add the "sm" btl and have
performance improved.

-- YK

> > Other btl_openib_flags settings result in much lower performance.
> > Changing the first of the above configs to use openIB results in a 21-second
> > run time at best. Sometimes it takes up to 5 minutes.
> > In all cases, openIB runs in twice the time it takes TCP, except if I push the
> > small message max to 64K and force short messages. Then the openib times are
> > the same as TCP and no faster.
> > With openib:
> > - Repeated cycles during a single run seem to slow down wit
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
George,

I have done some modifications to the code; however, this is the first part of
my zmp_list:

!ZEUSMP2 CONFIGURATION FILE
 &GEOMCONF
   LGEOM   = 2,
   LDIMEN  = 2 /
 &PHYSCONF
   LRAD    = 0,
   XHYDRO  = .TRUE.,
   XFORCE  = .TRUE.,
   XMHD    = .false.,
   XTOTNRG = .false.,
   XGRAV   = .false.,
   XGRVFFT = .false.,
   XPTMASS = .false.,
   XISO    = .false.,
   XSUBAV  = .false.,
   XVGRID  = .false.,
   !- - - - - - - - - - - - - - - - - - -
   XFIXFORCE  = .TRUE.,
   XFIXFORCE2 = .TRUE.,
   !- - - - - - - - - - - - - - - - - - -
   XSOURCEENERGY = .TRUE.,
   XSOURCEMASS   = .TRUE.,
   !- - - - - - - - - - - - - - - - - - -
   XRADCOOL     = .TRUE.,
   XA_RGB_WINDS = .TRUE.,
   XSNIa        = .TRUE. /
 !=
 &IOCONF
   XASCII   = .false.,
   XA_MULT  = .false.,
   XHDF     = .TRUE.,
   XHST     = .TRUE.,
   XRESTART = .TRUE.,
   XTSL     = .false.,
   XDPRCHDF = .TRUE.,
   XTTY     = .TRUE.,
   XAGRID   = .false. /
 &PRECONF
   SMALL_NO = 1.0D-307,
   LARGE_NO = 1.0D+307 /
 &ARRAYCONF
   IZONES = 100,
   JZONES = 125,
   KZONES = 1,
   MAXIJK = 125 /
 &mpitop
   ntiles(1)=5, ntiles(2)=2, ntiles(3)=1,
   periodic=2*.false.,.true. /

I have done some tests, and currently I'm able to perform a run with 10 processes
on 10 nodes, i.e. I use only 1 of the two CPUs in each node. It crashes after
6 hours, and not after 20 minutes!

2012/9/6 :
> Send users mailing list submissions to
>        us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>        users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>        users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
> Today's Topics:
>
>    1. Re: error compiling openmpi-1.6.1 on Windows 7 (Siegmar Gross)
>    2. Re: OMPI 1.6.x Hang on khugepaged 100% CPU time (Yong Qin)
>    3. Regarding the Pthreads (seshendra seshu)
>    4. Re: some mpi processes "disappear" on a cluster of servers (George Bosilca)
>    5. SIGSEGV in OMPI 1.6.x (Yong Qin)
>    6. Re: error compiling openmpi-1.6.1 on Windows 7 (Siegmar Gross)
>    7. Re: Infiniband performance Problem and stalling (Yevgeny Kliteynik)
>    8. Re: SIGSEGV in OMPI 1.6.x (Jeff Squyres)
>    9. Re: Regarding the Pthreads (Jeff Squyres)
>   10. Re: python-mrmpi() failed (Jeff Squyres)
>   11. Re: MPI_Cart_sub periods (Jeff Squyres)
>   12. Re: error compiling openmpi-1.6.1 on Windows 7 (Shiqing Fan)
>
> --
>
> Message: 1
> Date: Wed, 5 Sep 2012 17:43:50 +0200 (CEST)
> From: Siegmar Gross
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: f...@hlrs.de
> Cc: us...@open-mpi.org
> Message-ID: <201209051543.q85fhoba021...@tyr.informatik.hs-fulda.de>
> Content-Type: TEXT/plain; charset=ISO-8859-1
>
> Hi Shiqing,
>
>> Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>> This env is a backup option for the registry.
>
> It solves one problem but there is a new problem now :-((
>
> Without OPENMPI_HOME: wrong pathname to help files.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --
> Sorry! You were supposed to get help about:
>     invalid if_inexclude
> But I couldn't open the help file:
>     D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>     No such file or directory. Sorry!
> --
> ...
>
> With OPENMPI_HOME: it nearly uses the correct directory. Unfortunately
> the pathname contains the character " in the wrong place, so that it
> couldn't find the available help file.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --
> Sorry! You were supposed to get help about:
>     no-hostfile
> But I couldn't open the help file:
>     "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt:
>     Invalid argument. Sorry!
> --
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\ras\base\ras_base_allocate.c at line 200
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file
> ..\..\openmpi-1.6.1\orte\mca\plm\base\plm_base_launch_suppo
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote:

> I have tried with these flags (I use gcc 4.7 and open mpi 1.6), but
> the program doesn't crash; a node goes down and the rest of them remain
> waiting for a signal (there is an ALLREDUCE in the code).
>
> Anyway, yesterday some processes died (without a log) on node 10.

I suggest that you should probably start adding your own monitoring.  *Something*
is happening, but apparently it's not being captured in any logs that you see.
For example:

- run your program through valgrind, or other memory-checking debugger
- ask your admin to increase the syslog levels to get more information
- ensure that sys logging is going to both the local disk and to a remote server
  (in case your machines are getting re-imaged and local disk syslogs get wiped
  out upon reboot)
- look at dmesg output immediately upon reboot
- look at /var/log/syslog output immediately upon reboot
- when your job launches, continually capture some linux statistics (e.g., every
  N seconds -- pick N to meet your needs), such as:
  - top -b -n 999 -d N (use the same N value as above)
  - numastat -H
  - cat /proc/meminfo
  - ...etc.

When a crash occurs, look at these logs you've made and see if you can find any
trends, like running out of memory on any particular NUMA node (or overall), if
any process size is growing arbitrarily large, etc.

Also look for hardware errors.  Perhaps you have some bad RAM somewhere.  Is it
always the same node that crashes?  And so on.

> I logged almost immediately in the node and I found the process
>
> /usr/sbin/hal_lpadmin -x /org/freedesktop/Hal/devices/pci_10de_267
>
> What is it? I know that hal is a device daemon, but hal_lpadmin?

It has to do with managing printers.

> PS: What is the correct method to reply in this mailing list? I use
> gmail and I usually hit the reply button and replace the subject, but
> here it seems I am opening a new thread each time I post.

You seem to be replying to the daily digest mail rather than the individual mails
in this thread.  That's why it creates a new thread in the web mail archives.
If you replied to the individual mails, they would thread properly on the web
mail archives.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
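As one concrete, minimal form of the "capture some statistics every N seconds" suggestion above, a small sampler like the following could be left running on each node for the life of the job. It is only a sketch: the default interval, the log file name, and the choice of /proc/loadavg and /proc/meminfo are illustrative, and the top -b and numastat commands listed above capture considerably more detail.

/* node_sampler.c - append a timestamped snapshot of /proc/loadavg and
 * /proc/meminfo to a log file every N seconds.
 * Build: gcc -std=c99 -o node_sampler node_sampler.c
 * Run:   ./node_sampler 10 /tmp/node_stats.log   (both arguments optional) */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Copy the contents of 'path' into the open log stream. */
static void append_file(FILE *log, const char *path)
{
    char line[512];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return;
    fprintf(log, "==> %s\n", path);
    while (fgets(line, sizeof(line), f) != NULL)
        fputs(line, log);
    fclose(f);
}

int main(int argc, char **argv)
{
    int interval = (argc > 1) ? atoi(argv[1]) : 10;
    const char *logname = (argc > 2) ? argv[2] : "node_stats.log";

    for (;;) {
        FILE *log = fopen(logname, "a");
        if (log == NULL) {
            perror(logname);
            return 1;
        }
        time_t now = time(NULL);
        fprintf(log, "----- %s", ctime(&now));   /* ctime() already ends with '\n' */
        append_file(log, "/proc/loadavg");
        append_file(log, "/proc/meminfo");
        fclose(log);    /* reopen each pass so the log survives a node crash */
        sleep((unsigned)interval);
    }
    return 0;
}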
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:

> Also look for hardware errors.  Perhaps you have some bad RAM somewhere.
> Is it always the same node that crashes?  And so on.

Another thought on hardware errors...

I actually have seen bad RAM cause spontaneous reboots with no Linux warnings.
Do you have any hardware diagnostics from your server vendor that you can run?

A simple way to test your RAM (it's not completely comprehensive, but it does
check for a surprisingly wide array of memory issues) is to do something like
this (pseudocode):

-
size_t i, size, increment;
increment = 1GB;
size = 1GB;
int *ptr;

// Find the biggest amount of memory that you can malloc
while (increment >= 1024) {
    ptr = malloc(size);
    if (NULL != ptr) {
        free(ptr);
        size += increment;
    } else {
        size -= increment;
        increment /= 2;
    }
}
printf("I can malloc %lu bytes\n", size);

// Malloc that huge chunk of memory
ptr = malloc(size);
for (i = 0; i < size / sizeof(int); ++i, ++ptr) {
    *ptr = 37;
    if (*ptr != 37) {
        printf("Readback error!\n");
    }
}
printf("All done\n");
-

Depending on how much memory you have, that might take a little while to run
(all the memory has to be paged in, etc.).  You might want to add a status
output to show progress, and/or write/read a page at a time for better
efficiency, etc.  But you get the idea.

Hope that helps.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
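For anyone who wants to run the sweep rather than treat it as pseudocode, a self-contained variant might look like the following. It is only a sketch of the same idea: the 0x37 test pattern, the page-at-a-time loop, and the progress output are arbitrary choices, and on Linux the malloc probe is approximate because of memory overcommit; a dedicated tester such as memtest86+ or a vendor diagnostic remains the more thorough option.

/* mem_sweep.c - crude RAM write/readback check, not a substitute for
 * memtest86+ or vendor diagnostics.
 * Build: gcc -std=c99 -O2 -o mem_sweep mem_sweep.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    const size_t GB = 1024UL * 1024UL * 1024UL;
    size_t size = GB, increment = GB;
    char *buf;

    /* Find roughly the largest block malloc() will hand out.  With Linux
     * overcommit this can overestimate, so the write phase below may swap
     * heavily or attract the OOM killer on a loaded node. */
    while (increment >= 1024) {
        buf = malloc(size);
        if (buf != NULL) {
            free(buf);
            size += increment;
        } else {
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %lu bytes\n", (unsigned long)size);

    buf = malloc(size);
    if (buf == NULL) {
        fprintf(stderr, "second allocation attempt failed\n");
        return 1;
    }

    /* Write a pattern one page at a time, then read it back. */
    const size_t page = 4096;
    size_t errors = 0;
    for (size_t off = 0; off < size; off += page) {
        size_t n = (size - off < page) ? (size - off) : page;
        memset(buf + off, 0x37, n);
        for (size_t i = 0; i < n; i++)
            if ((unsigned char)buf[off + i] != 0x37)
                errors++;
        if ((off / page) % (256 * 1024) == 0)   /* status line roughly every 1 GB */
            printf("... %lu bytes checked\n", (unsigned long)off);
    }

    printf("All done, %lu readback errors\n", (unsigned long)errors);
    free(buf);
    return errors != 0;
}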
Re: [OMPI users] problem with rankfile
Hi,

are the following outputs helpful to find the error with a rankfile on
Solaris? I wrapped long lines so that they are easier to read. Have you
had time to look at the segmentation fault with a rankfile which I
reported in my last email (see below)?

"tyr" is a two-processor, single-core machine.

tyr fd1026 116 mpiexec -report-bindings -np 4 \
  -bind-to-socket -bycore rank_size
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],1] to socket 1 cpus 0002
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],2] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],3] to socket 1 cpus 0002
I'm process 0 of 4 ...

tyr fd1026 121 mpiexec -report-bindings -np 4 \
  -bind-to-socket -bysocket rank_size
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],1] to socket 1 cpus 0002
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],2] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],3] to socket 1 cpus 0002
I'm process 0 of 4 ...

tyr fd1026 117 mpiexec -report-bindings -np 4 \
  -bind-to-core -bycore rank_size
[tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
  fork binding child [[27307,1],2] to cpus 0004
--
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI
developers.
--
[tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
  fork binding child [[27307,1],0] to cpus 0001
[tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
  fork binding child [[27307,1],1] to cpus 0002
--
mpiexec was unable to start the specified application
as it encountered an error
on node tyr.informatik.hs-fulda.de. More information may be
available above.
--
4 total processes failed to start

tyr fd1026 118 mpiexec -report-bindings -np 4 \
  -bind-to-core -bysocket rank_size
--
An invalid physical processor ID was returned when attempting to
bind an MPI process to a unique processor.

This usually means that you requested binding to more processors than
exist (e.g., trying to bind N MPI processes to M processors,
where N > M). Double check that you have enough unique processors for
all the MPI processes that you are launching on this host.

You job will now abort.
--
[tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
  fork binding child [[27347,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
  fork binding child [[27347,1],1] to socket 1 cpus 0002
--
mpiexec was unable to start the specified application as it
encountered an error
on node tyr.informatik.hs-fulda.de. More information may be
available above.
--
4 total processes failed to start
tyr fd1026 119


"linpc3" and "linpc4" are two-processor, dual-core machines.

linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
  -np 4 -bind-to-core -bycore rank_size
[linpc4:16842] [[40914,0],0] odls:default:
  fork binding child [[40914,1],1] to cpus 0001
[linpc4:16842] [[40914,0],0] odls:default:
  fork binding child [[40914,1],3] to cpus 0002
[linpc3:31384] [[40914,0],1] odls:default:
  fork binding child [[40914,1],0] to cpus 0001
[linpc3:31384] [[40914,0],1] odls:default:
  fork binding child [[40914,1],2] to cpus 0002
I'm process 1 of 4 ...

linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
  -np 4 -bind-to-core -bysocket rank_size
[linpc4:16846] [[40918,0],0] odls:default:
  fork binding child [[40918,1],1] to socket 0 cpus 0001
[linpc4:16846] [[40918,0],0] odls:default:
  fork binding child [[40918,1],3] to socket 0 cpus 0002
[linpc3:31435] [[40918,0],1] odls:default:
  fork binding child [[40918,1],0] to socket 0 cpus 0001
[linpc3:31435] [[40918,0],1] odls:default:
  fork binding child [[40918,1],2] to socket 0 cpus 0002
I'm process 1 of 4 ...

linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \
  -np 4 -bind-to-socket -bycore rank_size
--
Re: [OMPI users] problem with rankfile
On Sep 7, 2012, at 5:41 AM, Siegmar Gross wrote:

> Hi,
>
> are the following outputs helpful to find the error with
> a rankfile on Solaris?

If you can't bind on the new Solaris machine, then the rankfile won't do you
any good. It looks like we are getting the incorrect number of cores on that
machine - is it possible that it has hardware threads, and doesn't report
"cores"?

Can you download and run a copy of lstopo to check the output? You get that
from the hwloc folks:

http://www.open-mpi.org/software/hwloc/v1.5/

> I wrapped long lines so that they
> are easier to read. Have you had time to look at the
> segmentation fault with a rankfile which I reported in my
> last email (see below)?

I'm afraid not - been too busy lately. I'd suggest first focusing on getting
binding to work.

>
> "tyr" is a two processor single core machine.
>
> tyr fd1026 116 mpiexec -report-bindings -np 4 \
>   -bind-to-socket -bycore rank_size
> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>   fork binding child [[27298,1],0] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>   fork binding child [[27298,1],1] to socket 1 cpus 0002
> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>   fork binding child [[27298,1],2] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
>   fork binding child [[27298,1],3] to socket 1 cpus 0002
> I'm process 0 of 4 ...
>
> tyr fd1026 121 mpiexec -report-bindings -np 4 \
>   -bind-to-socket -bysocket rank_size
> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>   fork binding child [[27380,1],0] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>   fork binding child [[27380,1],1] to socket 1 cpus 0002
> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>   fork binding child [[27380,1],2] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
>   fork binding child [[27380,1],3] to socket 1 cpus 0002
> I'm process 0 of 4 ...
>
> tyr fd1026 117 mpiexec -report-bindings -np 4 \
>   -bind-to-core -bycore rank_size
> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>   fork binding child [[27307,1],2] to cpus 0004
> --
> An attempt to set processor affinity has failed - please check to
> ensure that your system supports such functionality. If so, then
> this is probably something that should be reported to the OMPI
> developers.
> --
> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>   fork binding child [[27307,1],0] to cpus 0001
> [tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
>   fork binding child [[27307,1],1] to cpus 0002
> --
> mpiexec was unable to start the specified application
> as it encountered an error
> on node tyr.informatik.hs-fulda.de. More information may be
> available above.
> --
> 4 total processes failed to start
>
> tyr fd1026 118 mpiexec -report-bindings -np 4 \
>   -bind-to-core -bysocket rank_size
> --
> An invalid physical processor ID was returned when attempting to
> bind an MPI process to a unique processor.
>
> This usually means that you requested binding to more processors than
> exist (e.g., trying to bind N MPI processes to M processors,
> where N > M). Double check that you have enough unique processors for
> all the MPI processes that you are launching on this host.
>
> You job will now abort.
> --
> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
>   fork binding child [[27347,1],0] to socket 0 cpus 0001
> [tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
>   fork binding child [[27347,1],1] to socket 1 cpus 0002
> --
> mpiexec was unable to start the specified application as it
> encountered an error
> on node tyr.informatik.hs-fulda.de. More information may be
> available above.
> --
> 4 total processes failed to start
> tyr fd1026 119
>
>
> "linpc3" and "linpc4" are two processor dual core machines.
>
> linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
>   -np 4 -bind-to-core -bycore rank_size
> [linpc4:16842] [[40914,0],0] odls:default:
>   fork binding child [[40914,1],1] to cpus 0001
> [linpc4:16842] [[40914,0],0] odls:default:
>   fork binding child [[40914,1],3] to cpus 0002
> [linpc3:31384] [[40914,0],1] odls:default:
>   fork binding child [[40914,1],0] to cpus 0001
> [linpc3:31384] [[40914,0],1] odls:default:
>   fork binding ch
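Independently of the -report-bindings output above, the Linux nodes (linpc3/linpc4) can report the binding each rank actually received by printing their own affinity mask. The program below is only a debugging sketch of that idea, and it is Linux-specific (sched_getaffinity does not exist on Solaris, so it will not help on tyr):

/* print_binding.c - each MPI rank reports the CPUs it is allowed to run on.
 * Linux-only; build with: mpicc -std=c99 -o print_binding print_binding.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, len;
    char host[MPI_MAX_PROCESSOR_NAME];
    cpu_set_t mask;
    char cpus[256] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Ask the kernel which CPUs this process may use. */
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        int pos = 0;
        for (int cpu = 0; cpu < CPU_SETSIZE && pos < (int)sizeof(cpus) - 8; cpu++)
            if (CPU_ISSET(cpu, &mask))
                pos += sprintf(cpus + pos, "%d ", cpu);
        printf("rank %d on %s is bound to cpus: %s\n", rank, host, cpus);
    } else {
        perror("sched_getaffinity");
    }

    MPI_Finalize();
    return 0;
}

It could be launched the same way as rank_size, for example with mpiexec -report-bindings -host linpc3,linpc4 -np 4 -bind-to-core -bycore print_binding, to compare the kernel's view with what odls reports.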
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
On 09/03/2012 04:39 PM, Andrea Negri wrote:
> max locked memory       (kbytes, -l) 32
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> stack size              (kbytes, -s) 10240

Hi Andrea,

Besides the possibilities of running out of physical memory, or even defective
memory chips, which Jeff, Ralph, John, and George have addressed, I still think
that the system limits above may play a role. In an 8-year-old cluster, hardware
failures are not unexpected.

1) System limits

For what it is worth, virtually none of the programs we run here, mostly
atmosphere/ocean/climate codes, would run with these limits. On our compute
nodes we set max locked memory and stack size to unlimited, to avoid problems
with symptoms very similar to those you describe. Typically there are lots of
automatic arrays in subroutines, etc., which require a large stack.

Your sys admin could add these lines to the bottom of /etc/security/limits.conf
[the last one sets the max number of open files]:

*   -   memlock   -1
*   -   stack     -1
*   -   nofile    4096

2) Defective network interface/cable/switch port

Yet another possibility, following Ralph's suggestion, is that you may have a
failing network interface, or a bad Ethernet cable or connector, on the node
that goes south, or on the switch port that serves that node. [I am assuming
your network is Ethernet, probably GigE.] Again, in an 8-year-old cluster,
hardware failures are not unexpected. We had this sort of problem with old
clusters and old nodes.

3) Quarantine the bad node

Is it always the same node that fails, or does it vary? [Please answer, it
helps us understand what's going on.] If it is always the same node, have you
tried to quarantine it, either temporarily removing it from your job submission
system or just turning it off, and running the job on the remaining nodes?

I hope this helps,
Gus Correa
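Since batch systems and non-interactive ssh sessions do not always pick up changes to /etc/security/limits.conf, it can be worth printing the limits from inside the job itself. The small program below is a sketch of one way to do that; the three limits it prints simply mirror the ones discussed above.

/* show_limits.c - print the resource limits this process actually runs with.
 * Build: gcc -std=c99 -o show_limits show_limits.c
 * It can also be launched under mpirun to check every compute node. */
#include <stdio.h>
#include <sys/resource.h>

static void show(const char *name, int resource)
{
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0) {
        perror(name);
        return;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("%-8s soft=unlimited", name);
    else
        printf("%-8s soft=%llu", name, (unsigned long long)rl.rlim_cur);
    if (rl.rlim_max == RLIM_INFINITY)
        printf("  hard=unlimited\n");
    else
        printf("  hard=%llu\n", (unsigned long long)rl.rlim_max);
}

int main(void)
{
    show("memlock", RLIMIT_MEMLOCK);  /* bytes (ulimit -l reports kB) */
    show("stack",   RLIMIT_STACK);    /* bytes; automatic arrays live here */
    show("nofile",  RLIMIT_NOFILE);   /* open file descriptors */
    return 0;
}

Running it on each compute node, either directly or under mpirun, would confirm whether memlock and stack really are unlimited where the application runs.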
Re: [OMPI users] some mpi processes "disappear" on a cluster of servers
On 09/07/2012 08:02 AM, Jeff Squyres wrote:
> On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:
>
>> Also look for hardware errors.  Perhaps you have some bad RAM somewhere.
>> Is it always the same node that crashes?  And so on.
>
> Another thought on hardware errors...
>
> I actually have seen bad RAM cause spontaneous reboots with no Linux warnings.
> Do you have any hardware diagnostics from your server vendor that you can run?

If you don't have a vendor-provided diagnostic tool, you or your sys admin
could try Advanced Clustering's "breakin":

http://www.advancedclustering.com/our-software/view-category.html

Download the ISO version, burn a CD, put it in the node's CD drive (assuming it
has one), reboot, and choose breakin in the menu options. If there is no CD
drive, there is an alternative with network boot, although more involved.

I hope it helps,
Gus Correa

> A simple way to test your RAM (it's not completely comprehensive, but it does
> check for a surprisingly wide array of memory issues) is to do something like
> this (pseudocode):
>
> -
> size_t i, size, increment;
> increment = 1GB;
> size = 1GB;
> int *ptr;
>
> // Find the biggest amount of memory that you can malloc
> while (increment >= 1024) {
>     ptr = malloc(size);
>     if (NULL != ptr) {
>         free(ptr);
>         size += increment;
>     } else {
>         size -= increment;
>         increment /= 2;
>     }
> }
> printf("I can malloc %lu bytes\n", size);
>
> // Malloc that huge chunk of memory
> ptr = malloc(size);
> for (i = 0; i < size / sizeof(int); ++i, ++ptr) {
>     *ptr = 37;
>     if (*ptr != 37) {
>         printf("Readback error!\n");
>     }
> }
> printf("All done\n");
> -
>
> Depending on how much memory you have, that might take a little while to run
> (all the memory has to be paged in, etc.).  You might want to add a status
> output to show progress, and/or write/read a page at a time for better
> efficiency, etc.  But you get the idea.
>
> Hope that helps.