Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
Yevgeny,
The ibstat results:
CA 'mthca0'
    CA type: MT25208 (MT23108 compat mode)
    Number of ports: 2
    Firmware version: 4.7.600
    Hardware version: a0
    Node GUID: 0x0005ad0c21e0
    System image GUID: 0x0005ad000100d050
    Port 1:
    State: Active
    Physical state: LinkUp
    Rate: 10
    Base lid: 4
    LMC: 0
    SM lid: 2
    Capability mask: 0x02510a68
    Port GUID: 0x0005ad0c21e1
    Link layer: IB
    Port 2:
    State: Down
    Physical state: Polling
    Rate: 10
    Base lid: 0
    LMC: 0
    SM lid: 0
    Capability mask: 0x02510a68
    Port GUID: 0x0005ad0c21e2
    Link layer: IB

And more interestingly, ib_write_bw: 
   RDMA_Write BW Test
 Number of qps   : 1
 Connection type : RC
 TX depth    : 300
 CQ Moderation   : 50
 Link type   : IB
 Mtu : 2048
 Inline data is used up to 0 bytes message
 local address: LID 0x04 QPN 0x1c0407 PSN 0x48ad9e RKey 0xd86a0051 VAddr 
0x002ae36287
 remote address: LID 0x03 QPN 0x2e0407 PSN 0xf57209 RKey 0x8d98003b VAddr 
0x002b533d366000
--
 #bytes #iterations    BW peak[MB/sec]    BW average[MB/sec]
Conflicting CPU frequency values detected: 1600.00 != 3301.00
 65536 5000   0.00   0.00   
--

What does "Conflicting CPU frequency values" mean?

Examining the /proc/cpuinfo file, however, shows:
processor   : 0
cpu MHz : 3301.000
processor   : 1
cpu MHz : 3301.000
processor   : 2
cpu MHz : 1600.000
processor   : 3
cpu MHz : 1600.000


Which seems oddly weird to me...
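
One way to check whether dynamic frequency scaling (e.g. the ondemand cpufreq governor) is behind this is to read each core's governor and current frequency from sysfs. A minimal sketch, assuming the standard Linux cpufreq sysfs entries are present on these Centos 5.7 nodes:

-
#include <stdio.h>

int main(void) {
    char path[128], buf[64];
    FILE *f;
    int cpu;

    /* Walk the per-CPU cpufreq entries until one is missing
       (64 is an arbitrary upper bound for this sketch). */
    for (cpu = 0; cpu < 64; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor", cpu);
        f = fopen(path, "r");
        if (f == NULL)
            break;                    /* no such CPU, or no cpufreq support */
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("cpu%d governor : %s", cpu, buf);
        fclose(f);

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        f = fopen(path, "r");
        if (f != NULL) {
            if (fgets(buf, sizeof(buf), f) != NULL)
                printf("cpu%d freq kHz : %s", cpu, buf);
            fclose(f);
        }
    }
    return 0;
}
-

If some cores report the ondemand governor and a lower scaling_cur_freq, that would explain the mixed MHz values in /proc/cpuinfo and the warning from ib_write_bw, which uses the reported CPU frequency for its timing.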




 From: Yevgeny Kliteynik 
To: Randolph Pullen ; OpenMPI Users 
 
Sent: Thursday, 6 September 2012 6:03 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
Could you run "ibstat" and post the results?

What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No, I haven't used 1.6; I was trying to stick with the standards on the 
> mellanox disk.
> Is there a known problem with 1.4.3 ?
> 
>
 
--

> *From:* Yevgeny Kliteynik 
> *To:* Randolph Pullen ; Open MPI Users 
> 
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
> 
> Randolph,
> 
> Some clarification on the setup:
> 
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to 
> Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
> 
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
> 
> 
> -- YK
> 
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>  > (reposted with consolidated information)
>  > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G 
>cards
>  > running Centos 5.7 Kernel 2.6.18-274
>  > Open MPI 1.4.3
>  > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>  > On a Cisco 24 pt switch
>  > Normal performance is:
>  > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>  > results in:
>  > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
>  > and:
>  > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
>  > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
>  > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems 
>fine.
>  > log_num_mtt =20 and log_mtts_per_seg params =2
>  > My application exchanges about a gig of data between the processes with 2 
>sender and 2 consumer processes on each node with 1 additional controller 
>process on the starting node.

Re: [OMPI users] Infiniband performance Problem and stalling

2012-09-07 Thread Randolph Pullen
One system is actually an i5-2400 - maybe it's throttling back two of its cores to 
save power?
The other (i7) shows a consistent CPU MHz on all cores.



 From: Yevgeny Kliteynik 
To: Randolph Pullen ; OpenMPI Users 
 
Sent: Thursday, 6 September 2012 6:03 PM
Subject: Re: [OMPI users] Infiniband performance Problem and stalling
 
On 9/3/2012 4:14 AM, Randolph Pullen wrote:
> No RoCE, Just native IB with TCP over the top.

Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
Could you run "ibstat" and post the results?

What is the expected BW on your cards?
Could you run "ib_write_bw" between two machines?

Also, please see below.

> No, I haven't used 1.6; I was trying to stick with the standards on the 
> mellanox disk.
> Is there a known problem with 1.4.3 ?
> 
>
 
--

> *From:* Yevgeny Kliteynik 
> *To:* Randolph Pullen ; Open MPI Users 
> 
> *Sent:* Sunday, 2 September 2012 10:54 PM
> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
> 
> Randolph,
> 
> Some clarification on the setup:
> 
> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to 
> Ethernet?
> That is, when you're using openib BTL, you mean RoCE, right?
> 
> Also, have you had a chance to try some newer OMPI release?
> Any 1.6.x would do.
> 
> 
> -- YK
> 
> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>  > (reposted with consolidated information)
>  > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 10G 
>cards
>  > running Centos 5.7 Kernel 2.6.18-274
>  > Open MPI 1.4.3
>  > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>  > On a Cisco 24 pt switch
>  > Normal performance is:
>  > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>  > results in:
>  > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
>  > and:
>  > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
>  > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
>  > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems 
>fine.
>  > log_num_mtt =20 and log_mtts_per_seg params =2
>  > My application exchanges about a gig of data between the processes with 2 
>sender and 2 consumer processes on each node with 1 additional controller 
>process on the starting node.
>  > The program splits the data into 64K blocks and uses non-blocking sends 
>and receives with busy/sleep loops to monitor progress until completion.
>  > Each process owns a single buffer for these 64K blocks.
>  > My problem is I see better performance under IPoIB than I do on native IB 
>(RDMA_CM).
>  > My understanding is that IPoIB is limited to about 1G/s so I am at a loss 
>to know why it is faster.
>  > These 2 configurations are equivalent (about 8-10 seconds per cycle)
>  > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl 
>tcp,self -H vh2,vh1 -np 9 --bycore prog
>  > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl 
>tcp,self -H vh2,vh1 -np 9 --bycore prog

When you say "--mca btl tcp,self", it means that openib btl is not enabled.
Hence "--mca btl_openib_flags" is irrelevant.

>  > And this one produces similar run times but seems to degrade with repeated 
>cycles:
>  > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl 
>openib,self -H vh2,vh1 -np 9 --bycore prog

You're running 9 ranks on two machines, but you're using IB for intra-node 
communication as well.
Is that intentional? If not, you can add the "sm" btl and improve performance.
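For example, something like this (untested here, adapted from the command lines you posted):

mpirun --mca btl openib,sm,self --mca mpi_leave_pinned 1 -H vh2,vh1 -np 9 --bycore prog

would keep inter-node traffic on IB while ranks on the same node talk over shared memory.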

-- YK

>  > Other btl_openib_flags settings result in much lower performance.
>  > Changing the first of the above configs to use openIB results in a 21 
>second run time at best. Sometimes it takes up to 5 minutes.
>  > In all cases, OpenIB runs in twice the time it takes TCP, except if I push 
>the small message max to 64K and force short messages. Then the openib times 
>are the same as TCP and no faster.
>  > With openib:
>  > - Repeated cycles during a single run seem to slow down wit

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Andrea Negri
George,

I have made some modifications to the code; however, this is the first
part of my zmp_list:
!ZEUSMP2 CONFIGURATION FILE
 &GEOMCONF  LGEOM= 2,
LDIMEN   = 2 /
 &PHYSCONF  LRAD = 0,
XHYDRO   = .TRUE.,
XFORCE   = .TRUE.,
XMHD = .false.,
XTOTNRG  = .false.,
XGRAV= .false.,
XGRVFFT  = .false.,
XPTMASS  = .false.,
XISO = .false.,
XSUBAV   = .false.,
XVGRID   = .false.,
!- - - - - - - - - - - - - - - - - - -
XFIXFORCE   = .TRUE.,
XFIXFORCE2  = .TRUE.,
!- - - - - - - - - - - - - - - - - - -
XSOURCEENERGY   = .TRUE.,
XSOURCEMASS = .TRUE.,
!- - - - - - - - - - - - - - - - - - -
XRADCOOL= .TRUE.,
XA_RGB_WINDS= .TRUE.,
XSNIa   = .TRUE./
!=
 &IOCONF    XASCII   = .false.,
XA_MULT  = .false.,
XHDF = .TRUE.,
XHST = .TRUE.,
XRESTART = .TRUE.,
XTSL = .false.,
XDPRCHDF = .TRUE.,
XTTY = .TRUE. ,
XAGRID   = .false. /
 &PRECONF   SMALL_NO = 1.0D-307,
LARGE_NO = 1.0D+307 /
 &ARRAYCONF IZONES   = 100,
JZONES   = 125,
KZONES   = 1,
MAXIJK   = 125/
 &mpitop ntiles(1)=5,ntiles(2)=2,ntiles(3)=1,periodic=2*.false.,.true. /

I have done some tests, and currently I'm able to perform a run with
10 processes on 10 nodes, i.e. I use only one of the two CPUs in each node. It
crashes after 6 hours instead of after 20 minutes!


2012/9/6  :
> --
>
> Message: 1
> Date: Wed, 5 Sep 2012 17:43:50 +0200 (CEST)
> From: Siegmar Gross 
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: f...@hlrs.de
> Cc: us...@open-mpi.org
> Message-ID: <201209051543.q85fhoba021...@tyr.informatik.hs-fulda.de>
> Content-Type: TEXT/plain; charset=ISO-8859-1
>
> Hi Shiqing,
>
>> Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>> This env is a backup option for the registry.
>
> It solves one problem but there is a new problem now :-((
>
>
> Without OPENMPI_HOME: Wrong pathname to help files.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --
> Sorry!  You were supposed to get help about:
> invalid if_inexclude
> But I couldn't open the help file:
> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
> No such file or directory.  Sorry!
> --
> ...
>
>
>
> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
> the pathname contains the " character in the wrong place, so it
> couldn't find the help file that is actually available.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --
> Sorry!  You were supposed to get help about:
> no-hostfile
> But I couldn't open the help file:
> "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: 
> Invalid argument.  Sorry
> !
> --
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mca\ras\base
> \ras_base_allocate.c at line 200
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mca\plm\base
> \plm_base_launch_suppo

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 5, 2012, at 3:59 AM, Andrea Negri wrote:

> I have tried with these flags (I use gcc 4.7 and open mpi 1.6), but
> the program doesn't crash; a node goes down and the rest of them remain
> waiting for a signal (there is an ALLREDUCE in the code).
> 
> Anyway, yesterday some processes died (without a log) on the node 10,

I suggest that you should probably start adding your own monitoring.  
*Something* is happening, but apparently it's not being captured in any logs 
that you see.  For example:

- run your program through valgrind, or other memory-checking debugger
- ask your admin to increase the syslog levels to get more information
- ensure that sys logging is going to both the local disk and to a remote 
server (in case your machines are getting re-imaged and local disk syslogs get 
wiped out upon reboot)
- look at dmesg output immediately upon reboot
- look at /var/log/syslog output immediately upon reboot
- when your job launches, continually capture some Linux statistics (e.g., every 
N seconds -- pick N to meet your needs; see the sketch after this list), such as:
  - top -b -n 999 -d N (use the same N value as above)
  - numastat -H
  - cat /proc/meminfo
  - ...etc.
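
As a minimal sketch of that last suggestion -- a plain C logger that appends /proc/meminfo to a local file every N seconds (the log path below is just an example):

-
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void) {
    const unsigned int N = 30;        /* sample interval in seconds -- pick to taste */
    char buf[4096];
    size_t n;
    FILE *in, *out;

    /* Example log path -- put it somewhere that survives a reboot/re-image. */
    out = fopen("/tmp/node-stats.log", "a");
    if (out == NULL)
        return 1;

    for (;;) {
        time_t now = time(NULL);
        fprintf(out, "==== %s", ctime(&now));   /* timestamp each sample */

        in = fopen("/proc/meminfo", "r");
        if (in != NULL) {
            while ((n = fread(buf, 1, sizeof(buf), in)) > 0)
                fwrite(buf, 1, n, out);
            fclose(in);
        }
        fflush(out);
        sleep(N);
    }
    return 0;                         /* never reached; kill the logger to stop it */
}
-

The same pattern works for /proc/loadavg, or you can drive top and numastat from cron instead; the point is to leave a trail you can inspect after the node goes down.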

When a crash occurs, look at these logs you've made and see if you can find any 
trends, like running out of memory on any particular NUMA node (or overall), if 
any process size is growing arbitrarily large, etc.

Also look for hardware errors.  Perhaps you have some bad RAM somewhere.  Is it 
always the same node that crashes?  And so on.

> I logged in to the node almost immediately and I found the process
> 
> /usr/sbin/hal_lpadmin -x /org/freedesktop/Hal/devices/pci_10de_267
> 
> What is it? I know that hal is a device daemon, but hal_lpadmin?

It has to do with managing printers.

> PS: What is the correct method to reply on this mailing list? I use
> gmail and I usually hit the reply button and replace the subject, but here
> it seems that I am opening a new thread each time I post.


You seem to be replying to the daily digest mail rather than the individual 
mails in this thread.  That's why it creates a new thread in the web mail 
archives.  If you replied to the individual mails, they would thread properly 
on the web mail archives.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Jeff Squyres
On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:

> Also look for hardware errors.  Perhaps you have some bad RAM somewhere.  Is 
> it always the same node that crashes?  And so on.


Another thought on hardware errors... I actually have seen bad RAM cause 
spontaneous reboots with no Linux warnings.

Do you have any hardware diagnostics from your server vendor that you can run?

A simple way to test your RAM (it's not completely comprehensive, but it does 
check for a surprisingly wide array of memory issues) is to do something like 
this (pseudocode):

-
// Compile without optimization (e.g., "gcc -O0") so the write/read-back
// loop below is not optimized away.
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t i, size, increment;
    int *ptr;

    increment = (size_t)1 << 30;   // 1GB
    size = (size_t)1 << 30;        // 1GB

    // Find the biggest amount of memory that you can malloc
    while (increment >= 1024) {
        ptr = malloc(size);
        if (NULL != ptr) {
            free(ptr);
            size += increment;
        } else {
            size -= increment;
            increment /= 2;
        }
    }
    printf("I can malloc %lu bytes\n", (unsigned long) size);

    // Malloc that huge chunk of memory, write a pattern, and read it back
    ptr = malloc(size);
    if (NULL == ptr) {
        printf("Final malloc failed\n");
        return 1;
    }
    for (i = 0; i < size / sizeof(int); ++i) {
        ptr[i] = 37;
        if (ptr[i] != 37) {
            printf("Readback error!\n");
        }
    }
    free(ptr);

    printf("All done\n");
    return 0;
}
-

Depending on how much memory you have, that might take a little while to run 
(all the memory has to be paged in, etc.).  You might want to add a status 
output to show progress, and/or write/read a page at a time for better 
efficiency, etc.  But you get the idea.

Hope that helps.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] problem with rankfile

2012-09-07 Thread Siegmar Gross
Hi,

are the following outputs helpful to find the error with
a rankfile on Solaris? I wrapped long lines so that they
are easier to read. Have you had time to look at the
segmentation fault with a rankfile which I reported in my
last email (see below)?

"tyr" is a two processor single core machine.

tyr fd1026 116 mpiexec -report-bindings -np 4 \
  -bind-to-socket -bycore rank_size
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],1] to socket 1 cpus 0002
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],2] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18614] [[27298,0],0] odls:default:
  fork binding child [[27298,1],3] to socket 1 cpus 0002
I'm process 0 of 4 ...


tyr fd1026 121 mpiexec -report-bindings -np 4 \
 -bind-to-socket -bysocket rank_size
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],1] to socket 1 cpus 0002
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],2] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18656] [[27380,0],0] odls:default:
  fork binding child [[27380,1],3] to socket 1 cpus 0002
I'm process 0 of 4 ...


tyr fd1026 117 mpiexec -report-bindings -np 4 \
  -bind-to-core -bycore rank_size
[tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
  fork binding child [[27307,1],2] to cpus 0004
--
An attempt to set processor affinity has failed - please check to
ensure that your system supports such functionality. If so, then
this is probably something that should be reported to the OMPI
  developers.
--
[tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
  fork binding child [[27307,1],0] to cpus 0001
[tyr.informatik.hs-fulda.de:18623] [[27307,0],0] odls:default:
  fork binding child [[27307,1],1] to cpus 0002
--
mpiexec was unable to start the specified application
  as it encountered an error
on node tyr.informatik.hs-fulda.de. More information may be
  available above.
--
4 total processes failed to start



tyr fd1026 118 mpiexec -report-bindings -np 4 \
  -bind-to-core -bysocket rank_size
--
An invalid physical processor ID was returned when attempting to
  bind
an MPI process to a unique processor.

This usually means that you requested binding to more processors
  than

exist (e.g., trying to bind N MPI processes to M processors,
  where N >
M).  Double check that you have enough unique processors for
  all the
MPI processes that you are launching on this host.

You job will now abort.
--
[tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
  fork binding child [[27347,1],0] to socket 0 cpus 0001
[tyr.informatik.hs-fulda.de:18631] [[27347,0],0] odls:default:
  fork binding child [[27347,1],1] to socket 1 cpus 0002
--
mpiexec was unable to start the specified application as it
  encountered an error
on node tyr.informatik.hs-fulda.de. More information may be
  available above.
--
4 total processes failed to start
tyr fd1026 119 



"linpc3" and "linpc4" are two processor dual core machines.

linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
 -np 4 -bind-to-core -bycore rank_size
[linpc4:16842] [[40914,0],0] odls:default:
  fork binding child [[40914,1],1] to cpus 0001
[linpc4:16842] [[40914,0],0] odls:default:
  fork binding child [[40914,1],3] to cpus 0002
[linpc3:31384] [[40914,0],1] odls:default:
  fork binding child [[40914,1],0] to cpus 0001
[linpc3:31384] [[40914,0],1] odls:default:
  fork binding child [[40914,1],2] to cpus 0002
I'm process 1 of 4 ...


linpc4 fd1026 102 mpiexec -report-bindings -host linpc3,linpc4 \
  -np 4 -bind-to-core -bysocket rank_size
[linpc4:16846] [[40918,0],0] odls:default:
  fork binding child [[40918,1],1] to socket 0 cpus 0001
[linpc4:16846] [[40918,0],0] odls:default:
  fork binding child [[40918,1],3] to socket 0 cpus 0002
[linpc3:31435] [[40918,0],1] odls:default:
  fork binding child [[40918,1],0] to socket 0 cpus 0001
[linpc3:31435] [[40918,0],1] odls:default:
  fork binding child [[40918,1],2] to socket 0 cpus 0002
I'm process 1 of 4 ...




linpc4 fd1026 104 mpiexec -report-bindings -host linpc3,linpc4 \
  -np 4 -bind-to-socket -bycore rank_size
--

Re: [OMPI users] problem with rankfile

2012-09-07 Thread Ralph Castain

On Sep 7, 2012, at 5:41 AM, Siegmar Gross 
 wrote:

> Hi,
> 
> are the following outputs helpful to find the error with
> a rankfile on Solaris?

If you can't bind on the new Solaris machine, then the rankfile won't do you 
any good. It looks like we are getting the incorrect number of cores on that 
machine - is it possible that it has hardware threads, and doesn't report 
"cores"? Can you download and run a copy of lstopo to check the output? You get 
that from the hwloc folks:

http://www.open-mpi.org/software/hwloc/v1.5/


> I wrapped long lines so that they
> are easier to read. Have you had time to look at the
> segmentation fault with a rankfile which I reported in my
> last email (see below)?

I'm afraid not - been too busy lately. I'd suggest first focusing on getting 
binding to work.


Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa

On 09/03/2012 04:39 PM, Andrea Negri wrote:

max locked memory (kbytes, -l) 32

 max memory size(kbytes, -m) unlimited
 open files   (-n) 1024
 pipe size(512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 stack size   (kbytes, -s) 10240



Hi Andrea
Besides the possibilities of
running out of physical memory,
or even of defective memory chips, which Jeff, Ralph,
John, and George have addressed, I still think that the
system limits above may play a role.
In an 8-year-old cluster, hardware failures are not unexpected.


1) System limits

For what it is worth, virtually none of the programs we run here,
mostly atmosphere/ocean/climate codes,
would run with these limits.
On our compute nodes we set
max locked memory and stack size to
unlimited, to avoid problems with symptoms very
similar to those you describe.
Typically there are lots of automatic arrays in subroutines,
etc, which require a large stack.
Your sys admin could add these lines to the bottom
of /etc/security/limits.conf [the last one sets the
max number of open files]:

*   -   memlock -1
*   -   stack   -1
*   -   nofile  4096
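
If you want to verify what limits the MPI processes themselves end up with (batch systems and remote shells sometimes override the login-shell values), here is a small sketch using getrlimit; the show_limit helper is just for illustration:

-
#include <stdio.h>
#include <sys/resource.h>

/* Illustrative helper: print one limit.  memlock and stack are reported
   in bytes, nofile is a plain count. */
static void show_limit(const char *name, int resource) {
    struct rlimit rl;
    if (getrlimit(resource, &rl) != 0)
        return;
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("%-17s : unlimited\n", name);
    else
        printf("%-17s : %llu\n", name, (unsigned long long) rl.rlim_cur);
}

int main(void) {
    show_limit("max locked memory", RLIMIT_MEMLOCK);
    show_limit("stack size",        RLIMIT_STACK);
    show_limit("open files",        RLIMIT_NOFILE);
    return 0;
}
-

Run it with mpirun on the compute nodes; if it still reports roughly 32 KB of locked memory and a 10 MB stack from inside a job, the limits.conf change (or its equivalent in your batch system's startup files) has not reached the nodes yet.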

2) Defective network interface/cable/switch port

Yet another possibility, following Ralph's suggestion,
is that you may have a failing network interface, or a bad
Ethernet cable or connector, on the node that goes south,
or on the switch port that serves that node.
[I am assuming your network is Ethernet, probably GigE.]

Again, in an 8-year-old cluster, hardware failures are not unexpected.

We had this sort of problems with old clusters, old nodes.

3) Quarantine the bad node

Is it always the same node that fails, or does it vary?
[Please answer, it helps us understand what's going on.]

If it is always the same node, have you tried to quarantine it,
either by temporarily removing it from your job submission system
or by just turning it off, and running the job on the remaining
nodes?

I hope this helps,
Gus Correa


Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

2012-09-07 Thread Gus Correa

On 09/07/2012 08:02 AM, Jeff Squyres wrote:

On Sep 7, 2012, at 5:58 AM, Jeff Squyres wrote:


Also look for hardware errors.  Perhaps you have some bad RAM somewhere.  Is it 
always the same node that crashes?  And so on.



Another thought on hardware errors... I actually have seen bad RAM cause 
spontaneous reboots with no Linux warnings.

Do you have any hardware diagnostics from your server
vendor that you can run?



If you don't have a vendor-provided diagnostic tool,
you or your sys admin could try Advanced Clustering's "breakin":

http://www.advancedclustering.com/our-software/view-category.html

Download the ISO version, burn a CD, put it in the node's CD drive,
assuming it has one, reboot, and choose breakin from the menu options.
If there is no CD drive, there is a network-boot alternative,
although it is more involved.

I hope it helps,
Gus Correa

