Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

Andrea Negri Fri, 7 Sep 2012 05:38:49 -0400

George,

I hace done some modifications to the code, however this is the first
part my zmp_list:
!                ZEUSMP2 CONFIGURATION FILE
 &GEOMCONF  LGEOM    = 2,
            LDIMEN   = 2 /
 &PHYSCONF  LRAD     = 0,
            XHYDRO   = .TRUE.,
            XFORCE   = .TRUE.,
            XMHD     = .false.,
            XTOTNRG  = .false.,
            XGRAV    = .false.,
            XGRVFFT  = .false.,
            XPTMASS  = .false.,
            XISO     = .false.,
            XSUBAV   = .false.,
            XVGRID   = .false.,
!- - - - - - - - - - - - - - - - - - -
            XFIXFORCE       = .TRUE.,
            XFIXFORCE2      = .TRUE.,
!- - - - - - - - - - - - - - - - - - -
            XSOURCEENERGY   = .TRUE.,
            XSOURCEMASS     = .TRUE.,
!- - - - - - - - - - - - - - - - - - -
            XRADCOOL        = .TRUE.,
            XA_RGB_WINDS    = .TRUE.,
            XSNIa           = .TRUE./
!=====================================
 &IOCONF    XASCII   = .false.,
            XA_MULT  = .false.,
            XHDF     = .TRUE.,
            XHST     = .TRUE.,
            XRESTART = .TRUE.,
            XTSL     = .false.,
            XDPRCHDF = .TRUE.,
            XTTY     = .TRUE. ,
            XAGRID   = .false. /
 &PRECONF   SMALL_NO = 1.0D-307,
            LARGE_NO = 1.0D+307 /
 &ARRAYCONF IZONES   = 100,
            JZONES   = 125,
            KZONES   = 1,
            MAXIJK   = 125/
 &mpitop ntiles(1)=5,ntiles(2)=2,ntiles(3)=1,periodic=2*.false.,.true. /


I have done some tests, and currently I'm able to perform a run with
10 processes on 10 nodes, ie I use only 1 of two CPUs in a node. It
crashes after 6 hours, and not after 20 minutes!


2012/9/6  <users-requ...@open-mpi.org>:
> Send users mailing list submissions to
>         us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>         http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>         users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>         users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>    1. Re: error compiling openmpi-1.6.1 on Windows 7 (Siegmar Gross)
>    2. Re: OMPI 1.6.x Hang on khugepaged 100% CPU time (Yong Qin)
>    3. Regarding the Pthreads (seshendra seshu)
>    4. Re: some mpi processes "disappear" on a cluster of        servers
>       (George Bosilca)
>    5. SIGSEGV in OMPI 1.6.x (Yong Qin)
>    6. Re: error compiling openmpi-1.6.1 on Windows 7 (Siegmar Gross)
>    7. Re: Infiniband performance Problem and stalling
>       (Yevgeny Kliteynik)
>    8. Re: SIGSEGV in OMPI 1.6.x (Jeff Squyres)
>    9. Re: Regarding the Pthreads (Jeff Squyres)
>   10. Re: python-mrmpi() failed (Jeff Squyres)
>   11. Re: MPI_Cart_sub periods (Jeff Squyres)
>   12. Re: error compiling openmpi-1.6.1 on Windows 7 (Shiqing Fan)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 5 Sep 2012 17:43:50 +0200 (CEST)
> From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: f...@hlrs.de
> Cc: us...@open-mpi.org
> Message-ID: <201209051543.q85fhoba021...@tyr.informatik.hs-fulda.de>
> Content-Type: TEXT/plain; charset=ISO-8859-1
>
> Hi Shiqing,
>
>> Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>> This env is a backup option for the registry.
>
> It solves one problem but there is a new problem now :-((
>
>
> Without OPENMPI_HOME: Wrong pathname to help files.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>     invalid if_inexclude
> But I couldn't open the help file:
>     D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>     No such file or directory.  Sorry!
> --------------------------------------------------------------------------
> ...
>
>
>
> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
> the pathname contains the character " in the wrong place so that it
> couldn't find the available help file.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>     no-hostfile
> But I couldn't open the help file:
>     "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: 
> Invalid argument.  Sorry
> !
> --------------------------------------------------------------------------
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mca\ras\base
> \ras_base_allocate.c at line 200
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mca\plm\base
> \plm_base_launch_support.c at line 99
> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mca\plm\proc
> ess\plm_process_module.c at line 996
>
>
>
> It looks like that the environment variable can also solve my
> problem in the 64-bit environment.
>
> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>
> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 f?r x64
> ...
>
>
> The process hangs without OPENMPI_HOME.
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> ^C
>
>
> With OPENMPI_HOME:
>
> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>
> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> Sorry!  You were supposed to get help about:
>     no-hostfile
> But I couldn't open the help file:
>     "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid 
> argument.  S
> orry!
> --------------------------------------------------------------------------
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mc
> a\ras\base\ras_base_allocate.c at line 200
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mc
> a\plm\base\plm_base_launch_support.c at line 99
> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
> ..\..\openmpi-1.6.1\orte\mc
> a\plm\process\plm_process_module.c at line 996
>
>
> At least the program doesn't block any longer. Do you have any ideas
> how this new problem can be solved?
>
>
> Kind regards
>
> Siegmar
>
>
>
>> On 2012-09-05 1:02 PM, Siegmar Gross wrote:
>> > Hi Shiqing,
>> >
>> >>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> >>>> ---------------------------------------------------------------------
>> >>>> Sorry!  You were supposed to get help about:
>> >>>>       invalid if_inexclude
>> >>>> But I couldn't open the help file:
>> >>>>       D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>> >>>>       No such file or directory.  Sorry!
>> >>>> ---------------------------------------------------------------------
>> >>> ...
>> >>>> Why does "mpiexec" look for the help file relativ to my current
>> >>>> program and not relative to itself? The file is part of the
>> >>>> package.
>> >>> Do you know how I can solve this problem?
>> >> I have similar issue with message from tcp, but it's not finding the
>> >> file, it's something else, which doesn't affect the execution of the
>> >> application. Could you make sure the help-mpi-btl-tcp.txt is actually in
>> >> the path D:\...\prog\mpi\small_prog\..\share\openmpi\?
>> > That wouldn't be a good idea because I have MPI programs in different
>> > directories so that I would have to install all help files in several
>> > places (<my_directory>/../share/openmpi/help*.txt). All help files are
>> > available in the installation directory of Open MPI.
>> >
>> > dir "c:\Program Files (x86)\openmpi-1.6.1\bin\mpiexec.exe"
>> > ...
>> > 29.08.2012  10:59            38.912 mpiexec.exe
>> > ...
>> > dir "c:\Program Files 
>> > (x86)\openmpi-1.6.1\bin\..\share\openmpi\help-mpi-btl-tcp.txt"
>> > ...
>> > 03.04.2012  16:30               631 help-mpi-btl-tcp.txt
>> > ...
>> >
>> > I don't know if "mpiexec" or my program "init_finilize" is responsible
>> > for the error message but whoever is responsible shouldn't use the path
>> > to my program but the prefix_dir from MPI to find the help files. Perhaps
>> > you can change the behaviour in the Open MPI source code.
>> >
>> >
>> >>>> I can also compile in 64-bit mode but the program hangs.
>> >>> Do you have any ideas why the program hangs? Thank you very much for any
>> >>> help in advance.
>> >> To be honest I don't know. I couldn't reproduce it. Did you try
>> >> installing the binary installer, will it also behave the same?
>> > I like to have different versions of Open MPI which I activate via
>> > a batch file so that I can still run my program in an old version if
>> > something goes wrong in a new one. I have no entries in the system
>> > environment or registry so that I can even run different versions in
>> > different command windows without problems (everything is only known
>> > within the command window in which a have run my batch file). It seems
>> > that you put something in the registry when I use your installer.
>> > Perhaps you remember an earlier email where I had to uninstall an old
>> > version because the environment in my own installation was wrong
>> > as long as your installation was active. Nevertheless I can give it
>> > a try. Perhaps I find out if you set more than just the path to your
>> > binaries. Do you know if there is something similar to "truss" or
>> > "strace" in the UNIX world so that I can see where the program hangs?
>> > Thank you very much for your help in advance.
>> >
>> >
>> > Kind regards
>> >
>> > Siegmar
>> >
>>
>>
>> --
>> ---------------------------------------------------------------
>> Shiqing Fan
>> High Performance Computing Center Stuttgart (HLRS)
>> Tel: ++49(0)711-685-87234      Nobelstrasse 19
>> Fax: ++49(0)711-685-65832      70569 Stuttgart
>> http://www.hlrs.de/organization/people/shiqing-fan/
>> email: f...@hlrs.de
>>
>
>
>
>
> ------------------------------
>
> Message: 2
> Date: Wed, 5 Sep 2012 09:07:35 -0700
> From: Yong Qin <yong....@gmail.com>
> Subject: Re: [OMPI users] OMPI 1.6.x Hang on khugepaged 100% CPU time
> To: klit...@dev.mellanox.co.il
> Cc: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>         <CADEJBEWq0Rzfi_uKx8U4Uz4tjz=vJzn1=rdtphpyul04cv9...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Yes, so far this has only been observed in VASP and a specific dataset.
>
> Thanks,
>
> On Wed, Sep 5, 2012 at 4:52 AM, Yevgeny Kliteynik
> <klit...@dev.mellanox.co.il> wrote:
>> On 9/4/2012 7:21 PM, Yong Qin wrote:
>>> On Tue, Sep 4, 2012 at 5:42 AM, Yevgeny Kliteynik
>>> <klit...@dev.mellanox.co.il>  wrote:
>>>> On 8/30/2012 10:28 PM, Yong Qin wrote:
>>>>> On Thu, Aug 30, 2012 at 5:12 AM, Jeff Squyres<jsquy...@cisco.com>   wrote:
>>>>>> On Aug 29, 2012, at 2:25 PM, Yong Qin wrote:
>>>>>>
>>>>>>> This issue has been observed on OMPI 1.6 and 1.6.1 with openib btl but
>>>>>>> not on 1.4.5 (tcp btl is always fine). The application is VASP and
>>>>>>> only one specific dataset is identified during the testing, and the OS
>>>>>>> is SL 6.2 with kernel 2.6.32-220.23.1.el6.x86_64. The issue is that
>>>>>>> when a certain type of load is put on OMPI 1.6.x, khugepaged thread
>>>>>>> always runs with 100% CPU load, and it looks to me like that OMPI is
>>>>>>> waiting for some memory to be available thus appears to be hung.
>>>>>>> Reducing the per node processes would sometimes ease the problem a bit
>>>>>>> but not always. So I did some further testing by playing around with
>>>>>>> the kernel transparent hugepage support.
>>>>>>>
>>>>>>> 1. Disable transparent hugepage support completely (echo never
>>>>>>>> /sys/kernel/mm/redhat_transparent_hugepage/enabled). This would allow
>>>>>>> the program to progress as normal (as in 1.4.5). Total run time for an
>>>>>>> iteration is 3036.03 s.
>>>>>>
>>>>>> I'll admit that we have not tested using transparent hugepages.  I 
>>>>>> wonder if there's some kind of bad interaction going on here...
>>>>>
>>>>> The transparent hugepage is "transparent", which means it is
>>>>> automatically applied to all applications unless it is explicitly told
>>>>> otherwise. I highly suspect that it is not working properly in this
>>>>> case.
>>>>
>>>> Like Jeff said - I don't think we've ever tested OMPI with transparent
>>>> huge pages.
>>>>
>>>
>>> Thanks. But have you tested OMPI under RHEL 6 or its variants (CentOS
>>> 6, SL 6)? THP is on by default in RHEL 6 so no matter you want it or
>>> not it's there.
>>
>> Interesting. Indeed, THP is on be default in RHEL 6.x.
>> I run OMPI 1.6.x constantly on RHEL 6.2, and I've never seen this problem.
>>
>> I'm checking it with OFED folks, but I doubt that there are some dedicated
>> tests for THP.
>>
>> So do you see it only with a specific application and only on a specific
>> data set? Wonder if I can somehow reproduce it in-house...
>>
>> -- YK
>
>
> ------------------------------
>
> Message: 3
> Date: Wed, 5 Sep 2012 20:23:05 +0200
> From: seshendra seshu <seshu...@gmail.com>
> Subject: [OMPI users] Regarding the Pthreads
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>         <CAJ_xm3AYtMt22NgjtY67TuwOpZxev0ZYSW4fEYGxKA=2yvd...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> Hi,
> I am learning pthreads and trying to implement the pthreads in my
> quicksortprogram.
> My problem is iam unable to understand how to implement the pthreads at
> data received at a node from the master (In detail: In my program Master
> will divide the data and send to the slaves and each slave will do the
> sorting independently of The received data and send back to master after
> sorting is done. Now Iam having a problem in Implementing the pthreads at
> the slaves,i.e how to implement the pthreads in order to share data among
> the cores in each slave and sort the data and send it back to master.
> So could anyone help in solving this problem by providing some suggestions
> and clues.
>
> Thanking you very much.
>
> --
>  WITH REGARDS
> M.L.N.Seshendra
> -------------- next part --------------
> HTML attachment scrubbed and removed
>
> ------------------------------
>
> Message: 4
> Date: Thu, 6 Sep 2012 02:40:19 +0200
> From: George Bosilca <bosi...@eecs.utk.edu>
> Subject: Re: [OMPI users] some mpi processes "disappear" on a cluster
>         of      servers
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <f6f521b2-df90-4827-8abf-abe0f3599...@eecs.utk.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Andrea,
>
> As suggested by the previous answers I guess the size of your problem is too 
> large for the memory available on the nodes. I can runs ZeusMP without any 
> issues up to 64 processes, both over Ethernet and Infiniband. I tried the 1.6 
> and the current trunk, and both perform as expected.
>
> What is the content of your zmp_inp file?
>
>   george.
>
> On Sep 1, 2012, at 16:01 , Andrea Negri <negri.an...@gmail.com> wrote:
>
>> I have tried to run with a single process (i.e. the entire grid is
>> contained by one process) and the the command free -m on the compute
>> node returns
>>
>>             total       used       free     shared    buffers     cached
>> Mem:          3913       1540       2372          0         49       1234
>> -/+ buffers/cache:        257       3656
>> Swap:         1983          0       1983
>>
>>
>> while top returns
>> top - 16:01:09 up 4 days,  5:56,  1 user,  load average: 0.53, 0.16, 0.10
>> Tasks:  63 total,   3 running,  60 sleeping,   0 stopped,   0 zombie
>> Cpu(s): 49.4% us,  0.7% sy,  0.0% ni, 49.9% id,  0.0% wa,  0.0% hi,  0.0% si
>> Mem:   4007720k total,  1577968k used,  2429752k free,    50664k buffers
>> Swap:  2031608k total,        0k used,  2031608k free,  1263844k cached
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> ------------------------------
>
> Message: 5
> Date: Wed, 5 Sep 2012 21:06:12 -0700
> From: Yong Qin <yong....@gmail.com>
> Subject: [OMPI users] SIGSEGV in OMPI 1.6.x
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID:
>         <CADEJBEVFcsyh5WnK=3yj6w7b2aasrf7yc4uimcvaqia-j6c...@mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi,
>
> While debugging a mysterious crash of a code, I was able to trace down
> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>
> (gdb) c
> Continuing.
>
> Program received signal SIGSEGV, Segmentation fault.
> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
> at malloc.c:4385
> 4385          nextsize = chunksize(nextchunk);
> (gdb) l
> 4380           Consolidate other non-mmapped chunks as they arrive.
> 4381        */
> 4382
> 4383        else if (!chunk_is_mmapped(p)) {
> 4384          nextchunk = chunk_at_offset(p, size);
> 4385          nextsize = chunksize(nextchunk);
> 4386          assert(nextsize > 0);
> 4387
> 4388          /* consolidate backward */
> 4389          if (!prev_inuse(p)) {
> (gdb) bt
> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637,
> mem=0x203a746f74512000) at malloc.c:4385
> #1  0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
> at malloc.c:3511
> #2  0x00002ae6b18ea736 in opal_memory_linux_free_hook
> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
> #3  0x0000000001412fcc in for_dealloc_allocatable ()
> #4  0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
> ) at alloc.F90:1357
> #5  0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
> lasto=..., iphorb=...,
>     numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
> fa=..., stress=..., h=...,
>     first=@0x0, last=@0x0) at ldau.F:752
> #6  0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
> #7  0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
> #8  0x000000000070e475 in siesta () at siesta.F:23
> #9  0x000000000045e47c in main ()
>
> Can anybody shed some light here on what could be wrong?
>
> Thanks,
>
> Yong Qin
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 6 Sep 2012 07:48:34 +0200 (CEST)
> From: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: f...@hlrs.de
> Cc: us...@open-mpi.org
> Message-ID: <201209060548.q865myke023...@tyr.informatik.hs-fulda.de>
> Content-Type: TEXT/plain; charset=ISO-8859-1
>
> Hi Shiqing,
>
> I have solved the problem with the double quotes in OPENMPI_HOME but
> there is still something wrong.
>
> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>
> mpicc init_finalize.c
> Cannot open configuration file "c:\Program Files 
> (x86)\openmpi-1.6.1"/share/openmpi\mpicc-wrapper-data.txt
> Error parsing data file mpicc: Not found
>
>
> Everything is OK if you remove the double quotes which Windows
> automatically adds.
>
> set OPENMPI_HOME=c:\Program Files (x86)\openmpi-1.6.1
>
> mpicc init_finalize.c
> Microsoft (R) 32-Bit C/C++-Optimierungscompiler Version 16.00.40219.01 f?r 
> 80x86
> ...
>
> mpiexec init_finalize.exe
> --------------------------------------------------------------------------
> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
> value will be ignored.
>
>   Local host: hermes
>   Value:      127.0.0.1/8
>   Message:    Did not find interface matching this subnet
> --------------------------------------------------------------------------
>
> Hello!
>
>
> I get the output from my program but also a warning from Open MPI.
> The new value for the loopback device was introduced a short time
> ago when I have had problems with the loopback device on Solaris
> (it used "lo0" instead of your default "lo"). How can I avoid this
> message? The 64-bit version of my program still hangs.
>
>
> Kind regards
>
> Siegmar
>
>
>> > Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>> > This env is a backup option for the registry.
>>
>> It solves one problem but there is a new problem now :-((
>>
>>
>> Without OPENMPI_HOME: Wrong pathname to help files.
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>     invalid if_inexclude
>> But I couldn't open the help file:
>>     D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>>     No such file or directory.  Sorry!
>> --------------------------------------------------------------------------
>> ...
>>
>>
>>
>> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
>> the pathname contains the character " in the wrong place so that it
>> couldn't find the available help file.
>>
>> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>     no-hostfile
>> But I couldn't open the help file:
>>     "c:\Program Files (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: 
>> Invalid argument.  Sorry
>> !
>> --------------------------------------------------------------------------
>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
>> ..\..\openmpi-1.6.1\orte\mca\ras\base
>> \ras_base_allocate.c at line 200
>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
>> ..\..\openmpi-1.6.1\orte\mca\plm\base
>> \plm_base_launch_support.c at line 99
>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
>> ..\..\openmpi-1.6.1\orte\mca\plm\proc
>> ess\plm_process_module.c at line 996
>>
>>
>>
>> It looks like that the environment variable can also solve my
>> problem in the 64-bit environment.
>>
>> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>>
>> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 f?r x64
>> ...
>>
>>
>> The process hangs without OPENMPI_HOME.
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> ^C
>>
>>
>> With OPENMPI_HOME:
>>
>> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>>
>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> Sorry!  You were supposed to get help about:
>>     no-hostfile
>> But I couldn't open the help file:
>>     "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: 
>> Invalid argument.  S
>> orry!
>> --------------------------------------------------------------------------
>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
>> ..\..\openmpi-1.6.1\orte\mc
>> a\ras\base\ras_base_allocate.c at line 200
>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
>> ..\..\openmpi-1.6.1\orte\mc
>> a\plm\base\plm_base_launch_support.c at line 99
>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
>> ..\..\openmpi-1.6.1\orte\mc
>> a\plm\process\plm_process_module.c at line 996
>>
>>
>> At least the program doesn't block any longer. Do you have any ideas
>> how this new problem can be solved?
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>
>> > On 2012-09-05 1:02 PM, Siegmar Gross wrote:
>> > > Hi Shiqing,
>> > >
>> > >>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>> > >>>> ---------------------------------------------------------------------
>> > >>>> Sorry!  You were supposed to get help about:
>> > >>>>       invalid if_inexclude
>> > >>>> But I couldn't open the help file:
>> > >>>>       
>> > >>>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>> > >>>>       No such file or directory.  Sorry!
>> > >>>> ---------------------------------------------------------------------
>> > >>> ...
>> > >>>> Why does "mpiexec" look for the help file relativ to my current
>> > >>>> program and not relative to itself? The file is part of the
>> > >>>> package.
>> > >>> Do you know how I can solve this problem?
>> > >> I have similar issue with message from tcp, but it's not finding the
>> > >> file, it's something else, which doesn't affect the execution of the
>> > >> application. Could you make sure the help-mpi-btl-tcp.txt is actually in
>> > >> the path D:\...\prog\mpi\small_prog\..\share\openmpi\?
>> > > That wouldn't be a good idea because I have MPI programs in different
>> > > directories so that I would have to install all help files in several
>> > > places (<my_directory>/../share/openmpi/help*.txt). All help files are
>> > > available in the installation directory of Open MPI.
>> > >
>> > > dir "c:\Program Files (x86)\openmpi-1.6.1\bin\mpiexec.exe"
>> > > ...
>> > > 29.08.2012  10:59            38.912 mpiexec.exe
>> > > ...
>> > > dir "c:\Program Files 
>> > > (x86)\openmpi-1.6.1\bin\..\share\openmpi\help-mpi-btl-tcp.txt"
>> > > ...
>> > > 03.04.2012  16:30               631 help-mpi-btl-tcp.txt
>> > > ...
>> > >
>> > > I don't know if "mpiexec" or my program "init_finilize" is responsible
>> > > for the error message but whoever is responsible shouldn't use the path
>> > > to my program but the prefix_dir from MPI to find the help files. Perhaps
>> > > you can change the behaviour in the Open MPI source code.
>> > >
>> > >
>> > >>>> I can also compile in 64-bit mode but the program hangs.
>> > >>> Do you have any ideas why the program hangs? Thank you very much for 
>> > >>> any
>> > >>> help in advance.
>> > >> To be honest I don't know. I couldn't reproduce it. Did you try
>> > >> installing the binary installer, will it also behave the same?
>> > > I like to have different versions of Open MPI which I activate via
>> > > a batch file so that I can still run my program in an old version if
>> > > something goes wrong in a new one. I have no entries in the system
>> > > environment or registry so that I can even run different versions in
>> > > different command windows without problems (everything is only known
>> > > within the command window in which a have run my batch file). It seems
>> > > that you put something in the registry when I use your installer.
>> > > Perhaps you remember an earlier email where I had to uninstall an old
>> > > version because the environment in my own installation was wrong
>> > > as long as your installation was active. Nevertheless I can give it
>> > > a try. Perhaps I find out if you set more than just the path to your
>> > > binaries. Do you know if there is something similar to "truss" or
>> > > "strace" in the UNIX world so that I can see where the program hangs?
>> > > Thank you very much for your help in advance.
>> > >
>> > >
>> > > Kind regards
>> > >
>> > > Siegmar
>> > >
>> >
>> >
>> > --
>> > ---------------------------------------------------------------
>> > Shiqing Fan
>> > High Performance Computing Center Stuttgart (HLRS)
>> > Tel: ++49(0)711-685-87234      Nobelstrasse 19
>> > Fax: ++49(0)711-685-65832      70569 Stuttgart
>> > http://www.hlrs.de/organization/people/shiqing-fan/
>> > email: f...@hlrs.de
>> >
>>
>>
>
>
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 06 Sep 2012 11:03:04 +0300
> From: Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
> Subject: Re: [OMPI users] Infiniband performance Problem and stalling
> To: Randolph Pullen <randolph_pul...@yahoo.com.au>,     OpenMPI Users
>         <us...@open-mpi.org>
> Message-ID: <504858b8.3050...@dev.mellanox.co.il>
> Content-Type: text/plain; charset=UTF-8
>
> On 9/3/2012 4:14 AM, Randolph Pullen wrote:
>> No RoCE, Just native IB with TCP over the top.
>
> Sorry, I'm confused - still not clear what is "Melanox III HCA 10G card".
> Could you run "ibstat" and post the results?
>
> What is the expected BW on your cards?
> Could you run "ib_write_bw" between two machines?
>
> Also, please see below.
>
>> No I haven't used 1.6 I was trying to stick with the standards on the 
>> mellanox disk.
>> Is there a known problem with 1.4.3 ?
>>
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------!
>  ---
>> *From:* Yevgeny Kliteynik <klit...@dev.mellanox.co.il>
>> *To:* Randolph Pullen <randolph_pul...@yahoo.com.au>; Open MPI Users 
>> <us...@open-mpi.org>
>> *Sent:* Sunday, 2 September 2012 10:54 PM
>> *Subject:* Re: [OMPI users] Infiniband performance Problem and stalling
>>
>> Randolph,
>>
>> Some clarification on the setup:
>>
>> "Melanox III HCA 10G cards" - are those ConnectX 3 cards configured to 
>> Ethernet?
>> That is, when you're using openib BTL, you mean RoCE, right?
>>
>> Also, have you had a chance to try some newer OMPI release?
>> Any 1.6.x would do.
>>
>>
>> -- YK
>>
>> On 8/31/2012 10:53 AM, Randolph Pullen wrote:
>>  > (reposted with consolidatedinformation)
>>  > I have a test rig comprising 2 i7 systems 8GB RAM with Melanox III HCA 
>> 10G cards
>>  > running Centos 5.7 Kernel 2.6.18-274
>>  > Open MPI 1.4.3
>>  > MLNX_OFED_LINUX-1.5.3-1.0.0.2 (OFED-1.5.3-1.0.0.2):
>>  > On a Cisco 24 pt switch
>>  > Normal performance is:
>>  > $ mpirun --mca btl openib,self -n 2 -hostfile mpi.hosts PingPong
>>  > results in:
>>  > Max rate = 958.388867 MB/sec Min latency = 4.529953 usec
>>  > and:
>>  > $ mpirun --mca btl tcp,self -n 2 -hostfile mpi.hosts PingPong
>>  > Max rate = 653.547293 MB/sec Min latency = 19.550323 usec
>>  > NetPipeMPI results show a max of 7.4 Gb/s at 8388605 bytes which seems 
>> fine.
>>  > log_num_mtt =20 and log_mtts_per_seg params =2
>>  > My application exchanges about a gig of data between the processes with 2 
>> sender and 2 consumer processes on each node with 1 additional controller 
>> process on the starting node.
>>  > The program splits the data into 64K blocks and uses non blocking sends 
>> and receives with busy/sleep loops to monitor progress until completion.
>>  > Each process owns a single buffer for these 64K blocks.
>>  > My problem is I see better performance under IPoIB then I do on native IB 
>> (RDMA_CM).
>>  > My understanding is that IPoIB is limited to about 1G/s so I am at a loss 
>> to know why it is faster.
>>  > These 2 configurations are equivelant (about 8-10 seconds per cycle)
>>  > mpirun --mca btl_openib_flags 2 --mca mpi_leave_pinned 1 --mca btl 
>> tcp,self -H vh2,vh1 -np 9 --bycore prog
>>  > mpirun --mca btl_openib_flags 3 --mca mpi_leave_pinned 1 --mca btl 
>> tcp,self -H vh2,vh1 -np 9 --bycore prog
>
> When you say "--mca btl tcp,self", it means that openib btl is not enabled.
> Hence "--mca btl_openib_flags" is irrelevant.
>
>>  > And this one produces similar run times but seems to degrade with 
>> repeated cycles:
>>  > mpirun --mca btl_openib_eager_limit 64 --mca mpi_leave_pinned 1 --mca btl 
>> openib,self -H vh2,vh1 -np 9 --bycore prog
>
> You're running 9 ranks on two machines, but you're using IB for intra-node 
> communication.
> Is it intentional? If not, you can add "sm" btl and have performance improved.
>
> -- YK
>
>>  > Other btl_openib_flags settings result in much lower performance.
>>  > Changing the first of the above configs to use openIB results in a 21 
>> second run time at best. Sometimes it takes up to 5 minutes.
>>  > In all cases, OpenIB runs in twice the time it takes TCP,except if I push 
>> the small message max to 64K and force short messages. Then the openib times 
>> are the same as TCP and no faster.
>>  > With openib:
>>  > - Repeated cycles during a single run seem to slow down with each cycle
>>  > (usually by about 10 seconds).
>>  > - On occasions it seems to stall indefinitely, waiting on a single 
>> receive.
>>  > I'm still at a loss as to why. I can?t find any errors logged during the 
>> runs.
>>  > Any ideas appreciated.
>>  > Thanks in advance,
>>  > Randolph
>>  >
>>  >
>>  > _______________________________________________
>>  > users mailing list
>>  > us...@open-mpi.org <mailto:us...@open-mpi.org>
>>  > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>>
>
>
>
> ------------------------------
>
> Message: 8
> Date: Thu, 6 Sep 2012 08:01:01 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] SIGSEGV in OMPI 1.6.x
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <256da22f-f9ac-4746-acd9-501f8208e...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> If you run into a segv in this code, it almost certainly means that you have 
> heap corruption somewhere.  FWIW, that has *always* been what it meant when 
> I've run into segv's in any code under in opal/mca/memory/linux/.  Meaning: 
> my user code did something wrong, it created heap corruption, and then later 
> some malloc() or free() caused a segv in this area of the code.
>
> This code is the same ptmalloc memory allocator that has shipped in glibc for 
> years.  I'll be hard-pressed to say that any code is 100% bug free :-), but 
> I'd be surprised if there is a bug in this particular chunk of code.
>
> I'd run your code through valgrind or some other memory-checking debugger and 
> see if that can shed any light on what's going on.
>
>
> On Sep 6, 2012, at 12:06 AM, Yong Qin wrote:
>
>> Hi,
>>
>> While debugging a mysterious crash of a code, I was able to trace down
>> to a SIGSEGV in OMPI 1.6 and 1.6.1. The offending code is in
>> opal/mca/memory/linux/malloc.c. Please see the following gdb log.
>>
>> (gdb) c
>> Continuing.
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> opal_memory_ptmalloc2_int_free (av=0x2fd0637, mem=0x203a746f74512000)
>> at malloc.c:4385
>> 4385          nextsize = chunksize(nextchunk);
>> (gdb) l
>> 4380           Consolidate other non-mmapped chunks as they arrive.
>> 4381        */
>> 4382
>> 4383        else if (!chunk_is_mmapped(p)) {
>> 4384          nextchunk = chunk_at_offset(p, size);
>> 4385          nextsize = chunksize(nextchunk);
>> 4386          assert(nextsize > 0);
>> 4387
>> 4388          /* consolidate backward */
>> 4389          if (!prev_inuse(p)) {
>> (gdb) bt
>> #0  opal_memory_ptmalloc2_int_free (av=0x2fd0637,
>> mem=0x203a746f74512000) at malloc.c:4385
>> #1  0x00002ae6b18ea0c0 in opal_memory_ptmalloc2_free (mem=0x2fd0637)
>> at malloc.c:3511
>> #2  0x00002ae6b18ea736 in opal_memory_linux_free_hook
>> (__ptr=0x2fd0637, caller=0x203a746f74512000) at hooks.c:705
>> #3  0x0000000001412fcc in for_dealloc_allocatable ()
>> #4  0x00000000007767b1 in ALLOC::dealloc_d2 (array=@0x2fd0647,
>> name=@0x6f6e6f69006f6e78, routine=Cannot access memory at address 0x0
>> ) at alloc.F90:1357
>> #5  0x000000000082628c in M_LDAU::hubbard_term (scell=..., nua=@0xd5,
>> na=@0xd5, isa=..., xa=..., indxua=..., maxnh=@0xcf4ff, maxnd=@0xcf4ff,
>> lasto=..., iphorb=...,
>>    numd=..., listdptr=..., listd=..., numh=..., listhptr=...,
>> listh=..., nspin=@0xcf4ff00000002, dscf=..., eldau=@0x0, deldau=@0x0,
>> fa=..., stress=..., h=...,
>>    first=@0x0, last=@0x0) at ldau.F:752
>> #6  0x00000000006cd532 in M_SETUP_HAMILTONIAN::setup_hamiltonian
>> (first=@0x0, last=@0x0, iscf=@0x2) at setup_hamiltonian.F:199
>> #7  0x000000000070e257 in M_SIESTA_FORCES::siesta_forces
>> (istep=@0xf9a4d07000000000) at siesta_forces.F:90
>> #8  0x000000000070e475 in siesta () at siesta.F:23
>> #9  0x000000000045e47c in main ()
>>
>> Can anybody shed some light here on what could be wrong?
>>
>> Thanks,
>>
>> Yong Qin
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 6 Sep 2012 08:03:06 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Regarding the Pthreads
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <7fd0702a-4a29-4ff6-a80a-170d2002f...@cisco.com>
> Content-Type: text/plain; charset=iso-8859-1
>
> Your question is somewhat outside the scope of this list.  Perhaps people may 
> chime in with some suggestions, but that's more of a threading question than 
> an MPI question.
>
> Be warned that you need to call MPI_Init_thread (not MPI_Init) with 
> MPI_THREAD_MULTIPLE in order to get true multi-threaded support in Open MPI.  
> And we only support that on the TCP and shared memory transports if you built 
> Open MPI with threading support enabled.
>
>
> On Sep 5, 2012, at 2:23 PM, seshendra seshu wrote:
>
>> Hi,
>> I am learning pthreads and trying to implement the pthreads in my quicksort 
>> program.
>> My problem is iam unable to understand how to implement the pthreads at data 
>> received at a node from the master (In detail: In my program Master will 
>> divide the data and send to the slaves and each slave will do the sorting 
>> independently of The received data and send back to master after sorting is 
>> done. Now Iam having a problem in Implementing the pthreads at the 
>> slaves,i.e how to implement the pthreads in order to share data among the 
>> cores in each slave and sort the data and send it back to master.
>> So could anyone help in solving this problem by providing some suggestions 
>> and clues.
>>
>> Thanking you very much.
>>
>> --
>>  WITH REGARDS
>> M.L.N.Seshendra
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 10
> Date: Thu, 6 Sep 2012 08:05:30 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] python-mrmpi() failed
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <e8aefb84-8702-432c-9fb0-0c34451b0...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On Sep 4, 2012, at 3:09 PM, mariana Vargas wrote:
>
>> I 'am new in this, I have some codes that use mpi for python and I
>> just installed (openmpi, mrmpi, mpi4py) in my home (from a  cluster
>> account) without apparent errors and I tried to perform this simple
>> test in python and I get the following error related with openmpi,
>> could you help to figure out what is going on? I attach as many
>> informations as possible...
>
> I think I know what's happening here.
>
> It's a complicated linker issue that we've discussed before -- I'm not sure 
> whether it was on this users list or the OMPI developers list.
>
> The short version is that you should remove your prior Open MPI installation, 
> and then rebuild Open MPI with the --disable-dlopen configure switch.  See if 
> that fixes the problem.
>
>> Thanks.
>>
>> Mariana
>>
>>
>>  From a python console
>>  >>> from mrmpi import mrmpi
>>  >>> mr=mrmpi()
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_paffinity_hwloc: /home/mvargas/lib/openmpi/
>> mca_paffinity_hwloc.so: undefined symbol: opal_hwloc_topology (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_carto_auto_detect: /home/mvargas/lib/openmpi/
>> mca_carto_auto_detect.so: undefined symbol:
>> opal_carto_base_graph_get_host_graph_fn (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_carto_file: /home/mvargas/lib/openmpi/
>> mca_carto_file.so: undefined symbol:
>> opal_carto_base_graph_get_host_graph_fn (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_shmem_mmap: /home/mvargas/lib/openmpi/
>> mca_shmem_mmap.so: undefined symbol: opal_show_help (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_shmem_posix: /home/mvargas/lib/openmpi/
>> mca_shmem_posix.so: undefined symbol: opal_show_help (ignored)
>> [ferrari:23417] mca: base: component_find: unable to open /home/
>> mvargas/lib/openmpi/mca_shmem_sysv: /home/mvargas/lib/openmpi/
>> mca_shmem_sysv.so: undefined symbol: opal_show_help (ignored)
>> --------------------------------------------------------------------------
>> It looks like opal_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during opal_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>>   opal_shmem_base_select failed
>>   --> Returned value -1 instead of OPAL_SUCCESS
>> --------------------------------------------------------------------------
>> [ferrari:23417] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
>> runtime/orte_init.c at line 79
>> --------------------------------------------------------------------------
>> It looks like MPI_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during MPI_INIT; some of which are due to configuration or
>> environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open MPI
>> developer):
>>
>>   ompi_mpi_init: orte_init failed
>>   --> Returned "Error" (-1) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> *** An error occurred in MPI_Init
>> *** on a NULL communicator
>> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
>> [ferrari:23417] Local abort before MPI_INIT completed successfully;
>> not able to aggregate error messages, and not able to guarantee that
>> all other processes were killed!
>>
>>
>>
>> echo $PATH
>>
>> /home/mvargas/idl/pro/LibsSDSSS/idlutilsv5_4_15/bin:/usr/local/itt/
>> idl70/bin:/opt/local/bin:/home/mvargas/bin:/home/mvargas/lib:/home/
>> mvargas/lib/openmpi/:/home/mvargas:/home/vargas/bin/:/home/mvargas/idl/
>> pro/LibsSDSSS/idlutilsv5_4_15/bin:/usr/local/itt/idl70/bin:/opt/local/
>> bin:/home/mvargas/bin:/home/mvargas/lib:/home/mvargas/lib/openmpi/:/
>> home/mvargas:/home/vargas/bin/:/usr/lib64/qt3.3/bin:/usr/kerberos/bin:/
>> usr/local/bin:/bin:/usr/bin:/opt/pbs/bin:/opt/pbs/lib/xpbs/bin:/opt/
>> envswitcher/bin:/opt/pvm3/lib:/opt/pvm3/lib/LINUX64:/opt/pvm3/bin/
>> LINUX64:/opt/c3-4/
>>
>> echo $LD_LIBRARY_PATH
>> /usr/local/mpich2/lib:/home/mvargas/lib:/home/mvargas/:/home/mvargas/
>> lib64:/home/mvargas/lib/openmpi/:/usr/lib64/openmpi/1.4-gcc/lib/:/user/
>> local/:/usr/local/mpich2/lib:/home/mvargas/lib:/home/mvargas/:/home/
>> mvargas/lib64:/home/mvargas/lib/openmpi/:/usr/lib64/openmpi/1.4-gcc/
>> lib/:/user/local/:
>>
>> Version: openmpi-1.6
>>
>>
>>
>> mpirun --bynode --tag-output ompi_info -v ompi full --parsable
>> [1,0]<stdout>:package:Open MPI mvargas@ferrari Distribution
>> [1,0]<stdout>:ompi:version:full:1.6
>> [1,0]<stdout>:ompi:version:svn:r26429
>> [1,0]<stdout>:ompi:version:release_date:May 10, 2012
>> [1,0]<stdout>:orte:version:full:1.6
>> [1,0]<stdout>:orte:version:svn:r26429
>> [1,0]<stdout>:orte:version:release_date:May 10, 2012
>> [1,0]<stdout>:opal:version:full:1.6
>> [1,0]<stdout>:opal:version:svn:r26429
>> [1,0]<stdout>:opal:version:release_date:May 10, 2012
>> [1,0]<stdout>:mpi-api:version:full:2.1
>> [1,0]<stdout>:ident:1.6
>>
>>
>> eth0      Link encap:Ethernet  HWaddr 00:30:48:95:99:CC
>>          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
>>          inet6 addr: fe80::230:48ff:fe95:99cc/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:4739875255 errors:0 dropped:1636 overruns:0 frame:0
>>          TX packets:5196871012 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:4959384349297 (4.5 TiB)  TX bytes:3933641883577 (3.5
>> TiB)
>>          Memory:ef300000-ef320000
>>
>> eth1      Link encap:Ethernet  HWaddr 00:30:48:95:99:CD
>>          inet addr:128.2.116.104  Bcast:128.2.119.255  Mask:
>> 255.255.248.0
>>          inet6 addr: fe80::230:48ff:fe95:99cd/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:2645952109 errors:0 dropped:13353 overruns:0 frame:0
>>          TX packets:2974763570 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:2024044043824 (1.8 TiB)  TX bytes:3390935387820 (3.0
>> TiB)
>>          Memory:ef400000-ef420000
>>
>> lo        Link encap:Local Loopback
>>          inet addr:127.0.0.1  Mask:255.0.0.0
>>          inet6 addr: ::1/128 Scope:Host
>>          UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>          RX packets:143359307 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:143359307 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:0
>>          RX bytes:80413513464 (74.8 GiB)  TX bytes:80413513464 (74.8
>> GiB)
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> <files.tar.gz>
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 11
> Date: Thu, 6 Sep 2012 10:23:04 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] MPI_Cart_sub periods
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <346c2878-a5a6-4043-b890-09dab6880...@cisco.com>
> Content-Type: text/plain; charset=iso-8859-1
>
> John --
>
> This cartesian stuff always makes my head hurt.  :-)
>
> You seem to have hit on a bona-fide bug.  I have fixed the issue in our SVN 
> trunk and will get the fixed moved over to the v1.6 and v1.7 branches.
>
> Thanks for the report!
>
>
> On Aug 29, 2012, at 5:32 AM, Craske, John wrote:
>
>> Hello,
>>
>> We are partitioning a two-dimensional Cartesian communicator into
>> two one-dimensional subgroups. In this situation we have found
>> that both one-dimensional communicators inherit the period
>> logical of the first dimension of the original two-dimensional
>> communicator when using Open MPI.  Using MPICH each
>> one-dimensional communicator inherits the period corresponding to
>> the dimensions specified in REMAIN_DIMS, as expected.  Could this
>> be a bug, or are we making a mistake? The relevant calls we make in a
>> Fortran code are
>>
>> CALL MPI_CART_CREATE(MPI_COMM_WORLD, 2, (/ NDIMX, NDIMY /), (/ .True., 
>> .False. /), .TRUE.,
>>                  COMM_CART_2D, IERROR)
>>
>> CALL MPI_CART_SUB(COMM_CART_2D, (/ .True., .False. /), COMM_CART_X, IERROR)
>> CALL MPI_CART_SUB(COMM_CART_2D, (/ .False., .True. /), COMM_CART_Y, IERROR)
>>
>> Following these requests,
>>
>> CALL MPI_CART_GET(COMM_CART_X, MAXDIM_X, DIMS_X, PERIODS_X, COORDS_X, IERROR)
>> CALL MPI_CART_GET(COMM_CART_Y, MAXDIM_Y, DIMS_Y, PERIODS_Y, COORDS_Y, IERROR)
>>
>> will result in
>>
>> PERIODS_X = T
>> PERIODS_Y = T
>>
>> If, on the other hand we define the two-dimensional communicator
>> using PERIODS = (/ .False., .True. /), we find
>>
>> PERIODS_X = F
>> PERIODS_Y = F
>>
>> Your advice on the matter would be greatly appreciated.
>>
>> Regards,
>>
>> John.
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>
>
> ------------------------------
>
> Message: 12
> Date: Thu, 06 Sep 2012 16:58:03 +0200
> From: Shiqing Fan <f...@hlrs.de>
> Subject: Re: [OMPI users] error compiling openmpi-1.6.1 on Windows 7
> To: Siegmar Gross <siegmar.gr...@informatik.hs-fulda.de>
> Cc: us...@open-mpi.org
> Message-ID: <5048b9fb.3070...@hlrs.de>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi Siegmar,
>
> Glad to hear that it's working for you.
>
> The warning message is because the loopback adapter is excluded by
> default, but this adapter is actually not installed on Windows.
>
> One solution might be installing the loopback adapter on Windows. It
> very easy, only a few minutes.
>
> Or it may be possible to avoid this message from internal Open MPI. But
> I'm not sure about how this can be done.
>
>
> Regards,
> Shiqing
>
>
> On 2012-09-06 7:48 AM, Siegmar Gross wrote:
>> Hi Shiqing,
>>
>> I have solved the problem with the double quotes in OPENMPI_HOME but
>> there is still something wrong.
>>
>> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>>
>> mpicc init_finalize.c
>> Cannot open configuration file "c:\Program Files 
>> (x86)\openmpi-1.6.1"/share/openmpi\mpicc-wrapper-data.txt
>> Error parsing data file mpicc: Not found
>>
>>
>> Everything is OK if you remove the double quotes which Windows
>> automatically adds.
>>
>> set OPENMPI_HOME=c:\Program Files (x86)\openmpi-1.6.1
>>
>> mpicc init_finalize.c
>> Microsoft (R) 32-Bit C/C++-Optimierungscompiler Version 16.00.40219.01 f?r 
>> 80x86
>> ...
>>
>> mpiexec init_finalize.exe
>> --------------------------------------------------------------------------
>> WARNING: An invalid value was given for btl_tcp_if_exclude.  This
>> value will be ignored.
>>
>>    Local host: hermes
>>    Value:      127.0.0.1/8
>>    Message:    Did not find interface matching this subnet
>> --------------------------------------------------------------------------
>>
>> Hello!
>>
>>
>> I get the output from my program but also a warning from Open MPI.
>> The new value for the loopback device was introduced a short time
>> ago when I have had problems with the loopback device on Solaris
>> (it used "lo0" instead of your default "lo"). How can I avoid this
>> message? The 64-bit version of my program still hangs.
>>
>>
>> Kind regards
>>
>> Siegmar
>>
>>
>>>> Could you try set OPENMPI_HOME env var to the root of the Open MPI dir?
>>>> This env is a backup option for the registry.
>>> It solves one problem but there is a new problem now :-((
>>>
>>>
>>> Without OPENMPI_HOME: Wrong pathname to help files.
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> --------------------------------------------------------------------------
>>> Sorry!  You were supposed to get help about:
>>>      invalid if_inexclude
>>> But I couldn't open the help file:
>>>      D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>>>      No such file or directory.  Sorry!
>>> --------------------------------------------------------------------------
>>> ...
>>>
>>>
>>>
>>> With OPENMPI_HOME: It nearly uses the correct directory. Unfortunately
>>> the pathname contains the character " in the wrong place so that it
>>> couldn't find the available help file.
>>>
>>> set OPENMPI_HOME="c:\Program Files (x86)\openmpi-1.6.1"
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> --------------------------------------------------------------------------
>>> Sorry!  You were supposed to get help about:
>>>      no-hostfile
>>> But I couldn't open the help file:
>>>      "c:\Program Files 
>>> (x86)\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: Invalid argument.  
>>> Sorry
>>> !
>>> --------------------------------------------------------------------------
>>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
>>> ..\..\openmpi-1.6.1\orte\mca\ras\base
>>> \ras_base_allocate.c at line 200
>>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
>>> ..\..\openmpi-1.6.1\orte\mca\plm\base
>>> \plm_base_launch_support.c at line 99
>>> [hermes:04964] [[12187,0],0] ORTE_ERROR_LOG: Not found in file 
>>> ..\..\openmpi-1.6.1\orte\mca\plm\proc
>>> ess\plm_process_module.c at line 996
>>>
>>>
>>>
>>> It looks like that the environment variable can also solve my
>>> problem in the 64-bit environment.
>>>
>>> D:\g...\prog\mpi\small_prog>mpicc init_finalize.c
>>>
>>> Microsoft (R) C/C++-Optimierungscompiler Version 16.00.40219.01 f?r x64
>>> ...
>>>
>>>
>>> The process hangs without OPENMPI_HOME.
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> ^C
>>>
>>>
>>> With OPENMPI_HOME:
>>>
>>> set OPENMPI_HOME="c:\Program Files\openmpi-1.6.1"
>>>
>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>> --------------------------------------------------------------------------
>>> Sorry!  You were supposed to get help about:
>>>      no-hostfile
>>> But I couldn't open the help file:
>>>      "c:\Program Files\openmpi-1.6.1"\share\openmpi\help-hostfile.txt: 
>>> Invalid argument.  S
>>> orry!
>>> --------------------------------------------------------------------------
>>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
>>> ..\..\openmpi-1.6.1\orte\mc
>>> a\ras\base\ras_base_allocate.c at line 200
>>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
>>> ..\..\openmpi-1.6.1\orte\mc
>>> a\plm\base\plm_base_launch_support.c at line 99
>>> [hermes:05248] [[10367,0],0] ORTE_ERROR_LOG: Not found in file 
>>> ..\..\openmpi-1.6.1\orte\mc
>>> a\plm\process\plm_process_module.c at line 996
>>>
>>>
>>> At least the program doesn't block any longer. Do you have any ideas
>>> how this new problem can be solved?
>>>
>>>
>>> Kind regards
>>>
>>> Siegmar
>>>
>>>
>>>
>>>> On 2012-09-05 1:02 PM, Siegmar Gross wrote:
>>>>> Hi Shiqing,
>>>>>
>>>>>>>> D:\...\prog\mpi\small_prog>mpiexec init_finalize.exe
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> Sorry!  You were supposed to get help about:
>>>>>>>>        invalid if_inexclude
>>>>>>>> But I couldn't open the help file:
>>>>>>>>        
>>>>>>>> D:\...\prog\mpi\small_prog\..\share\openmpi\help-mpi-btl-tcp.txt:
>>>>>>>>        No such file or directory.  Sorry!
>>>>>>>> ---------------------------------------------------------------------
>>>>>>> ...
>>>>>>>> Why does "mpiexec" look for the help file relativ to my current
>>>>>>>> program and not relative to itself? The file is part of the
>>>>>>>> package.
>>>>>>> Do you know how I can solve this problem?
>>>>>> I have similar issue with message from tcp, but it's not finding the
>>>>>> file, it's something else, which doesn't affect the execution of the
>>>>>> application. Could you make sure the help-mpi-btl-tcp.txt is actually in
>>>>>> the path D:\...\prog\mpi\small_prog\..\share\openmpi\?
>>>>> That wouldn't be a good idea because I have MPI programs in different
>>>>> directories so that I would have to install all help files in several
>>>>> places (<my_directory>/../share/openmpi/help*.txt). All help files are
>>>>> available in the installation directory of Open MPI.
>>>>>
>>>>> dir "c:\Program Files (x86)\openmpi-1.6.1\bin\mpiexec.exe"
>>>>> ...
>>>>> 29.08.2012  10:59            38.912 mpiexec.exe
>>>>> ...
>>>>> dir "c:\Program Files 
>>>>> (x86)\openmpi-1.6.1\bin\..\share\openmpi\help-mpi-btl-tcp.txt"
>>>>> ...
>>>>> 03.04.2012  16:30               631 help-mpi-btl-tcp.txt
>>>>> ...
>>>>>
>>>>> I don't know if "mpiexec" or my program "init_finilize" is responsible
>>>>> for the error message but whoever is responsible shouldn't use the path
>>>>> to my program but the prefix_dir from MPI to find the help files. Perhaps
>>>>> you can change the behaviour in the Open MPI source code.
>>>>>
>>>>>
>>>>>>>> I can also compile in 64-bit mode but the program hangs.
>>>>>>> Do you have any ideas why the program hangs? Thank you very much for any
>>>>>>> help in advance.
>>>>>> To be honest I don't know. I couldn't reproduce it. Did you try
>>>>>> installing the binary installer, will it also behave the same?
>>>>> I like to have different versions of Open MPI which I activate via
>>>>> a batch file so that I can still run my program in an old version if
>>>>> something goes wrong in a new one. I have no entries in the system
>>>>> environment or registry so that I can even run different versions in
>>>>> different command windows without problems (everything is only known
>>>>> within the command window in which a have run my batch file). It seems
>>>>> that you put something in the registry when I use your installer.
>>>>> Perhaps you remember an earlier email where I had to uninstall an old
>>>>> version because the environment in my own installation was wrong
>>>>> as long as your installation was active. Nevertheless I can give it
>>>>> a try. Perhaps I find out if you set more than just the path to your
>>>>> binaries. Do you know if there is something similar to "truss" or
>>>>> "strace" in the UNIX world so that I can see where the program hangs?
>>>>> Thank you very much for your help in advance.
>>>>>
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Siegmar
>>>>>
>>>>
>>>> --
>>>> ---------------------------------------------------------------
>>>> Shiqing Fan
>>>> High Performance Computing Center Stuttgart (HLRS)
>>>> Tel: ++49(0)711-685-87234      Nobelstrasse 19
>>>> Fax: ++49(0)711-685-65832      70569 Stuttgart
>>>> http://www.hlrs.de/organization/people/shiqing-fan/
>>>> email: f...@hlrs.de
>>>>
>>>
>>
>
>
> --
> ---------------------------------------------------------------
> Shiqing Fan
> High Performance Computing Center Stuttgart (HLRS)
> Tel: ++49(0)711-685-87234      Nobelstrasse 19
> Fax: ++49(0)711-685-65832      70569 Stuttgart
> http://www.hlrs.de/organization/people/shiqing-fan/
> email: f...@hlrs.de
>
>
>
> ------------------------------
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> End of users Digest, Vol 2345, Issue 1
> **************************************

Re: [OMPI users] some mpi processes "disappear" on a cluster of servers

Reply via email to