Have you set up your shell startup files so that they point to the new OMPI 
installation (/opt/local/openmpi/) even for non-interactive logins?
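
For example, on each machine the startup file that non-interactive shells read 
(e.g. ~/.bashrc for bash -- adjust for your shell) would need something along 
these lines:

   export PATH=/opt/local/openmpi/bin:$PATH
   export LD_LIBRARY_PATH=/opt/local/openmpi/lib:$LD_LIBRARY_PATH

(That's just a sketch; on OS X you may also want DYLD_LIBRARY_PATH.)  A quick 
sanity check is something like "ssh othernode which mpirun" -- substituting the 
other machine's real hostname -- since that goes through the same 
non-interactive path that mpirun's ssh launcher uses.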


On Aug 10, 2011, at 6:14 AM, Christopher Jones wrote:

> Hi,
> 
> Thanks for the quick response. I managed to compile 1.5.3 on both 
> computers using gcc-4.2 with the proper flags set (this took a bit of 
> playing with, but I did eventually get it to compile). Once that was done, I 
> installed it to a different directory from 1.2.8 (/opt/local/openmpi/), 
> set the PATH and LD_LIBRARY_PATH for the new version on each computer, 
> and got the hello_world test running again, launching every 
> process as before. However, I'm still in the same place - ring_c freezes 
> up. I tried changing the hostname in the host file (just for poops and 
> giggles - I see the response stating it doesn't matter), but to no avail. I 
> made sure the firewall is off on both computers.
> 
> I'm hoping I'm not doing something overly dumb here, but I'm still a bit 
> stuck. I see in the FAQ that there were some issues with Nehalem processors 
> - I have two Xeons in one box and a Nehalem in the other. Could this make any 
> difference?
> 
> Thanks again,
> Chris
> 
> On Aug 9, 2011, at 6:50 PM, Jeff Squyres wrote:
> 
>> No, Open MPI doesn't use the names in the hostfile to figure out which 
>> TCP/IP addresses to use (for example).  Each process ends up publishing a 
>> list of IP addresses at which it can be contacted, and OMPI does routability 
>> computations to figure out which is the "best" address to contact a given 
>> peer on.
>> 
>> If you're just starting with Open MPI, can you upgrade?  1.2.8 is pretty 
>> ancient.  Open MPI 1.4.3 is the most recent stable release; 1.5.3 is our 
>> "feature" series, but it's also relatively stable (new releases are coming 
>> in both the 1.4.x and 1.5.x series soon, FWIW).
>> 
>> 
>> On Aug 9, 2011, at 12:14 PM, David Warren wrote:
>> 
>>> I don't know if this is it, but if you use the name localhost, won't 
>>> processes on both machines try to talk to 127.0.0.1? I believe you need to 
>>> use the real hostname in your host file. I think that your two tests work 
>>> because there is no interprocess communication, just stdout.
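>>> 
>>> For example, a host file along these lines (hostnames taken from your 
>>> hostname output; the slot counts are just a guess at your core counts):
>>> 
>>>    quadcore.mikrob.slu.se slots=4
>>>    allana-welshs-mac-pro.local slots=4
>>> 
>>> rather than one that lists localhost.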
>>> 
>>> On 08/08/11 23:46, Christopher Jones wrote:
>>>> Hi again,
>>>> 
>>>> I changed the subject of my previous posting to reflect a new problem 
>>>> encountered when I changed my strategy to using SSH instead of Xgrid on 
>>>> two mac pros. I've set up password-less ssh between the two 
>>>> macs (connected via direct ethernet, both running openmpi 1.2.8 on OS X 
>>>> 10.6.8) per the instructions in the FAQ. I can type 'ssh 
>>>> computer-name.local' on either computer and connect without a password 
>>>> prompt. From what I can see, the ssh-agent is up and running - the 
>>>> following is listed in my environment:
>>>> 
>>>> SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners
>>>> SSH_AGENT_PID=61058
>>>> 
>>>> My host file simply lists 'localhost' and 
>>>> 'chrisjones2@allana-welshs-mac-pro.local'. When I run a simple hello_world 
>>>> test, I get what seems like a reasonable output:
>>>> 
>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile 
>>>> ./test_hello
>>>> Hello world from process 0 of 8
>>>> Hello world from process 1 of 8
>>>> Hello world from process 2 of 8
>>>> Hello world from process 3 of 8
>>>> Hello world from process 4 of 8
>>>> Hello world from process 7 of 8
>>>> Hello world from process 5 of 8
>>>> Hello world from process 6 of 8
>>>> 
>>>> I can also run hostname and get what seems to be an ok response (unless 
>>>> I'm wrong about this):
>>>> 
>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname
>>>> allana-welshs-mac-pro.local
>>>> allana-welshs-mac-pro.local
>>>> allana-welshs-mac-pro.local
>>>> allana-welshs-mac-pro.local
>>>> quadcore.mikrob.slu.se
>>>> quadcore.mikrob.slu.se
>>>> quadcore.mikrob.slu.se
>>>> quadcore.mikrob.slu.se
>>>> 
>>>> 
>>>> However, when I run the ring_c test, it freezes:
>>>> 
>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
>>>> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
>>>> Process 0 sent to 1
>>>> Process 0 decremented value: 9
>>>> 
>>>> (I noted that processors on both computers are active).
>>>> 
>>>> ring_c was compiled separately on each computer; however, both have the 
>>>> same version of openmpi and OS X. I've gone through the FAQ and searched 
>>>> the user forum, but I can't quite seem to get this problem unstuck.
>>>> 
>>>> Many thanks for your time,
>>>> Chris
>>>> 
>>>> On Aug 5, 2011, at 6:00 PM, <users-requ...@open-mpi.org> wrote:
>>>> 
>>>> 
>>>>> Send users mailing list submissions to
>>>>>      us...@open-mpi.org
>>>>> 
>>>>> To subscribe or unsubscribe via the World Wide Web, visit
>>>>>      http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> or, via email, send a message with subject or body 'help' to
>>>>>      users-requ...@open-mpi.org
>>>>> 
>>>>> You can reach the person managing the list at
>>>>>      users-ow...@open-mpi.org
>>>>> 
>>>>> When replying, please edit your Subject line so it is more specific
>>>>> than "Re: Contents of users digest..."
>>>>> 
>>>>> 
>>>>> Today's Topics:
>>>>> 
>>>>> 1. Re: OpenMPI causing WRF to crash (Jeff Squyres)
>>>>> 2. Re: OpenMPI causing WRF to crash (Anthony Chan)
>>>>> 3. Re: Program hangs on send when run with nodes on  remote
>>>>>    machine (Jeff Squyres)
>>>>> 4. Re: openmpi 1.2.8 on Xgrid noob issue (Jeff Squyres)
>>>>> 5. Re: parallel I/O on 64-bit indexed arays (Rob Latham)
>>>>> 
>>>>> 
>>>>> ----------------------------------------------------------------------
>>>>> 
>>>>> Message: 1
>>>>> Date: Thu, 4 Aug 2011 19:18:36 -0400
>>>>> From: Jeff Squyres<jsquy...@cisco.com>
>>>>> Subject: Re: [OMPI users] OpenMPI causing WRF to crash
>>>>> To: Open MPI Users<us...@open-mpi.org>
>>>>> Message-ID:<3f0e661f-a74f-4e51-86c0-1f84feb07...@cisco.com>
>>>>> Content-Type: text/plain; charset=windows-1252
>>>>> 
>>>>> Signal 15 is usually SIGTERM on Linux, meaning that some external entity 
>>>>> probably killed the job.
>>>>> 
>>>>> The OMPI error message you describe is also typical for that kind of 
>>>>> scenario -- i.e., a process exiting without calling MPI_Finalize could 
>>>>> mean that it called exit() or that some external process killed it.
>>>>> 
>>>>> 
>>>>> On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:
>>>>> 
>>>>> 
>>>>>> I am trying to run a rather heavy wrf simulation with spectral nudging, 
>>>>>> but the simulation crashes after 1.8 minutes of integration.
>>>>>> The simulation has two domains, with d01 = 601x601, d02 = 721x721 
>>>>>> and 51 vertical levels. I tried this simulation on two different systems 
>>>>>> but the result was more or less the same. For example:
>>>>>> 
>>>>>> On our Bluegene/P with SUSE Linux Enterprise Server 10 ppc and the XLF 
>>>>>> compiler I tried to run wrf on 2048 shared-memory nodes (1 compute node 
>>>>>> = 4 cores, 32-bit, 850 MHz). For the parallel run I used mpixlc, 
>>>>>> mpixlcxx and mpixlf90.  I got the following error message in the wrf.err 
>>>>>> file:
>>>>>> 
>>>>>> <Aug 01 19:50:21.244540>  BE_MPI (ERROR): The error message in the job
>>>>>> record is as follows:
>>>>>> <Aug 01 19:50:21.244657>  BE_MPI (ERROR):   "killed with signal 15"
>>>>>> 
>>>>>> I also tried to run the same simulation on our Linux cluster (Red Hat 
>>>>>> Enterprise Linux 5.4, x86_64, Intel compiler) with 8, 16 and 64 nodes 
>>>>>> (1 compute node = 8 cores). For the parallel run I used 
>>>>>> mpi/openmpi/1.4.2-intel-11. I got the following error message in the 
>>>>>> error log after a couple of minutes of integration.
>>>>>> 
>>>>>> "mpirun has exited due to process rank 45 with PID 19540 on
>>>>>> node ci118 exiting without calling "finalize". This may
>>>>>> have caused other processes in the application to be
>>>>>> terminated by signals sent by mpirun (as reported here)."
>>>>>> 
>>>>>> I tried many things but nothing seems to work. However, if I 
>>>>>> reduce the grid points below 200, the simulation runs fine. It appears 
>>>>>> that OpenMPI may have a problem with a large number of grid points, but 
>>>>>> I have no idea how to fix it. I would greatly appreciate it if you could 
>>>>>> suggest a solution.
>>>>>> 
>>>>>> Best regards,
>>>>>> ---
>>>>>> Basit A. Khan, Ph.D.
>>>>>> Postdoctoral Fellow
>>>>>> Division of Physical Sciences & Engineering
>>>>>> Office# 3204, Level 3, Building 1,
>>>>>> King Abdullah University of Science & Technology
>>>>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
>>>>>> Kingdom of Saudi Arabia.
>>>>>> 
>>>>>> Office: +966(0)2 808 0276,  Mobile: +966(0)5 9538 7592
>>>>>> E-mail: basitali.k...@kaust.edu.sa
>>>>>> Skype name: basit.a.khan
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>> 
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> Message: 2
>>>>> Date: Thu, 4 Aug 2011 18:59:59 -0500 (CDT)
>>>>> From: Anthony Chan<c...@mcs.anl.gov>
>>>>> Subject: Re: [OMPI users] OpenMPI causing WRF to crash
>>>>> To: Open MPI Users<us...@open-mpi.org>
>>>>> Message-ID:
>>>>>      <660521091.191111.1312502399225.javamail.r...@zimbra.anl.gov>
>>>>> Content-Type: text/plain; charset=utf-8
>>>>> 
>>>>> 
>>>>> If you want to debug this on BGP, you could set BG_COREDUMPONERROR=1
>>>>> and look at the backtrace in the light weight core files
>>>>> (you probably need to recompile everything with -g).
>>>>> 
>>>>> A.Chan
>>>>> 
>>>>> ----- Original Message -----
>>>>> 
>>>>>> Hi Dmitry,
>>>>>> Thanks for a prompt and fairly detailed response. I have also forwarded 
>>>>>> the email to the wrf community in the hope that somebody will have a 
>>>>>> straightforward solution. I will try to debug the error as you suggested 
>>>>>> if I don't have much luck with the wrf forum.
>>>>>> 
>>>>>> Cheers,
>>>>>> ---
>>>>>> 
>>>>>> Basit A. Khan, Ph.D.
>>>>>> Postdoctoral Fellow
>>>>>> Division of Physical Sciences & Engineering
>>>>>> Office# 3204, Level 3, Building 1,
>>>>>> King Abdullah University of Science & Technology
>>>>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
>>>>>> Kingdom of Saudi Arabia.
>>>>>> 
>>>>>> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
>>>>>> E-mail: basitali.k...@kaust.edu.sa
>>>>>> Skype name: basit.a.khan
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 8/3/11 2:46 PM, "Dmitry N. Mikushin"<maemar...@gmail.com>  wrote:
>>>>>> 
>>>>>> 
>>>>>>> 5 apparently means one of WRF's MPI processes has been
>>>>>>> unexpectedly terminated, maybe by program decision. No matter, if it
>>>>>>> is OpenMPI-specifi
>>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> Message: 3
>>>>> Date: Thu, 4 Aug 2011 20:46:16 -0400
>>>>> From: Jeff Squyres<jsquy...@cisco.com>
>>>>> Subject: Re: [OMPI users] Program hangs on send when run with nodes on
>>>>>      remote machine
>>>>> To: Open MPI Users<us...@open-mpi.org>
>>>>> Message-ID:<f344f301-ad7b-4e83-b0df-a6e001072...@cisco.com>
>>>>> Content-Type: text/plain; charset=us-ascii
>>>>> 
>>>>> I notice that in the worker, you have:
>>>>> 
>>>>> eth2      Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d4
>>>>>        inet addr:192.168.1.155  Bcast:192.168.1.255  Mask:255.255.255.0
>>>>>        inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link
>>>>>        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>        RX packets:9225846 errors:0 dropped:75175 overruns:0 frame:0
>>>>>        TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
>>>>>        collisions:0 txqueuelen:1000
>>>>>        RX bytes:1336628768 (1.3 GB)  TX bytes:552 (552.0 B)
>>>>> 
>>>>> eth3      Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d5
>>>>>        inet addr:192.168.1.156  Bcast:192.168.1.255  Mask:255.255.255.0
>>>>>        inet6 addr: fe80::21b:21ff:fe77:c5d5/64 Scope:Link
>>>>>        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>>        RX packets:26481809 errors:0 dropped:75059 overruns:0 frame:0
>>>>>        TX packets:18030236 errors:0 dropped:0 overruns:0 carrier:0
>>>>>        collisions:0 txqueuelen:1000
>>>>>        RX bytes:70061260271 (70.0 GB)  TX bytes:11844181778 (11.8 GB)
>>>>> 
>>>>> Two different NICs are on the same subnet -- that doesn't seem like a 
>>>>> good idea...?  I think this topic has come up on the users list before, 
>>>>> and, IIRC, the general consensus is "don't do that" because it's not 
>>>>> clear which NIC Linux will actually use to send outgoing traffic 
>>>>> bound for the 192.168.1.x subnet.
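>>>>> 
>>>>> If re-addressing one of the NICs isn't an option, you can at least tell 
>>>>> Open MPI which interface to use by restricting the TCP BTL, e.g. 
>>>>> something along the lines of:
>>>>> 
>>>>>   mpirun --mca btl_tcp_if_include eth3 -np 2 -host localhost,remotehost ./mpi-test
>>>>> 
>>>>> (there is also a corresponding btl_tcp_if_exclude), so the choice isn't 
>>>>> left to the kernel's routing table.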
>>>>> 
>>>>> 
>>>>> 
>>>>> On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:
>>>>> 
>>>>> 
>>>>>> I am having trouble running my MPI program on multiple nodes. I can
>>>>>> run multiple processes on a single node, and I can spawn processes on
>>>>>> remote nodes, but when I call Send from a remote node, the call
>>>>>> never returns, even though there is a matching Recv waiting. I'm
>>>>>> pretty sure this is an issue with my configuration, not my code. I've
>>>>>> tried some other sample programs I found and had the same problem of
>>>>>> hanging on a send from one host to another.
>>>>>> 
>>>>>> Here's an in-depth description:
>>>>>> 
>>>>>> I wrote a quick test program where each process with rank > 0 sends an
>>>>>> int to the master (rank 0), and the master receives until it gets
>>>>>> something from every other process.
>>>>>> 
>>>>>> My test program works fine when I run multiple processes on a single 
>>>>>> machine.
>>>>>> 
>>>>>> either the local node:
>>>>>> 
>>>>>> $ ./mpirun -n 4 ./mpi-test
>>>>>> Hi I'm localhost:2
>>>>>> Hi I'm localhost:1
>>>>>> localhost:1 sending 11...
>>>>>> localhost:2 sending 12...
>>>>>> localhost:2 sent 12
>>>>>> localhost:1 sent 11
>>>>>> Hi I'm localhost:0
>>>>>> localhost:0 received 11 from 1
>>>>>> localhost:0 received 12 from 2
>>>>>> Hi I'm localhost:3
>>>>>> localhost:3 sending 13...
>>>>>> localhost:3 sent 13
>>>>>> localhost:0 received 13 from 3
>>>>>> all workers checked in!
>>>>>> 
>>>>>> or a remote one:
>>>>>> 
>>>>>> $ ./mpirun -np 2 -host remotehost ./mpi-test
>>>>>> Hi I'm remotehost:0
>>>>>> remotehost:0 received 11 from 1
>>>>>> all workers checked in!
>>>>>> Hi I'm remotehost:1
>>>>>> remotehost:1 sending 11...
>>>>>> remotehost:1 sent 11
>>>>>> 
>>>>>> But when I try to run the master locally and the worker(s) remotely
>>>>>> (this is the way I am actually interested in running it), Send never
>>>>>> returns and it hangs indefinitely.
>>>>>> 
>>>>>> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
>>>>>> Hi I'm localhost:0
>>>>>> Hi I'm remotehost:1
>>>>>> remotehost:1 sending 11...
>>>>>> 
>>>>>> Just to see if it would work, I tried spawning the master on the
>>>>>> remotehost and the worker on the localhost.
>>>>>> 
>>>>>> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
>>>>>> Hi I'm localhost:1
>>>>>> localhost:1 sending 11...
>>>>>> localhost:1 sent 11
>>>>>> Hi I'm remotehost:0
>>>>>> remotehost:0 received 0 from 1
>>>>>> all workers checked in!
>>>>>> 
>>>>>> It doesn't hang on Send, but the wrong value is received.
>>>>>> 
>>>>>> Any idea what's going on? I've attached my code, my config.log,
>>>>>> ifconfig output, and ompi_info output.
>>>>>> 
>>>>>> Thanks,
>>>>>> Keith
>>>>>> <mpi.tgz>_______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>> 
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> Message: 4
>>>>> Date: Thu, 4 Aug 2011 20:48:30 -0400
>>>>> From: Jeff Squyres<jsquy...@cisco.com>
>>>>> Subject: Re: [OMPI users] openmpi 1.2.8 on Xgrid noob issue
>>>>> To: Open MPI Users<us...@open-mpi.org>
>>>>> Message-ID:<c2ea7fd0-badb-4d05-851c-c444be26f...@cisco.com>
>>>>> Content-Type: text/plain; charset=us-ascii
>>>>> 
>>>>> I'm afraid our Xgrid support has lagged, and Apple hasn't shown much 
>>>>> interest in MPI + Xgrid support -- much less HPC.  :-\
>>>>> 
>>>>> Have you seen the FAQ items about Xgrid?
>>>>> 
>>>>>  http://www.open-mpi.org/faq/?category=osx#xgrid-howto
>>>>> 
>>>>> 
>>>>> On Aug 4, 2011, at 4:16 AM, Christopher Jones wrote:
>>>>> 
>>>>> 
>>>>>> Hi there,
>>>>>> 
>>>>>> I'm currently trying to set up a small xgrid between two mac pros (one 
>>>>>> with a single quad-core CPU and one with two dual-cores), where both are 
>>>>>> directly connected via an ethernet cable. I've set up xgrid using 
>>>>>> password authentication (rather than Kerberos), and from what I can tell 
>>>>>> in the Xgrid admin tool it seems to be working. However, when I try a 
>>>>>> simple hello world program, I get this error:
>>>>>> 
>>>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
>>>>>> mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited 
>>>>>> on signal 15 (Terminated).
>>>>>> 1 additional process aborted (not shown)
>>>>>> 2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to 
>>>>>> uncaught exception 'NSInvalidArgumentException', reason: '*** 
>>>>>> -[NSKVONotifying_XGConnection<0x1001325a0>  finalize]: called when 
>>>>>> collecting not enabled'
>>>>>> *** Call stack at first throw:
>>>>>> (
>>>>>>    0   CoreFoundation                      0x00007fff814237b4 
>>>>>> __exceptionPreprocess + 180
>>>>>>    1   libobjc.A.dylib                     0x00007fff84fe8f03 
>>>>>> objc_exception_throw + 45
>>>>>>    2   CoreFoundation                      0x00007fff8143e631 
>>>>>> -[NSObject(NSObject) finalize] + 129
>>>>>>    3   mca_pls_xgrid.so                    0x00000001002a9ce3 
>>>>>> -[PlsXGridClient dealloc] + 419
>>>>>>    4   mca_pls_xgrid.so                    0x00000001002a9837 
>>>>>> orte_pls_xgrid_finalize + 40
>>>>>>    5   libopen-rte.0.dylib                 0x000000010002d0f9 
>>>>>> orte_pls_base_close + 249
>>>>>>    6   libopen-rte.0.dylib                 0x0000000100012027 
>>>>>> orte_system_finalize + 119
>>>>>>    7   libopen-rte.0.dylib                 0x000000010000e968 
>>>>>> orte_finalize + 40
>>>>>>    8   mpirun                              0x00000001000011ff orterun + 
>>>>>> 2042
>>>>>>    9   mpirun                              0x0000000100000a03 main + 27
>>>>>>    10  mpirun                              0x00000001000009e0 start + 52
>>>>>>    11  ???                                 0x0000000000000004 0x0 + 4
>>>>>> )
>>>>>> terminate called after throwing an instance of 'NSException'
>>>>>> [chris-joness-mac-pro:00350] *** Process received signal ***
>>>>>> [chris-joness-mac-pro:00350] Signal: Abort trap (6)
>>>>>> [chris-joness-mac-pro:00350] Signal code:  (0)
>>>>>> [chris-joness-mac-pro:00350] [ 0] 2   libSystem.B.dylib                  
>>>>>>  0x00007fff81ca51ba _sigtramp + 26
>>>>>> [chris-joness-mac-pro:00350] [ 1] 3   ???                                
>>>>>>  0x00000001000cd400 0x0 + 4295808000
>>>>>> [chris-joness-mac-pro:00350] [ 2] 4   libstdc++.6.dylib                  
>>>>>>  0x00007fff830965d2 __tcf_0 + 0
>>>>>> [chris-joness-mac-pro:00350] [ 3] 5   libobjc.A.dylib                    
>>>>>>  0x00007fff84fecb39 _objc_terminate + 100
>>>>>> [chris-joness-mac-pro:00350] [ 4] 6   libstdc++.6.dylib                  
>>>>>>  0x00007fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
>>>>>> [chris-joness-mac-pro:00350] [ 5] 7   libstdc++.6.dylib                  
>>>>>>  0x00007fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
>>>>>> [chris-joness-mac-pro:00350] [ 6] 8   libstdc++.6.dylib                  
>>>>>>  0x00007fff83094bfc 
>>>>>> _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
>>>>>> [chris-joness-mac-pro:00350] [ 7] 9   libobjc.A.dylib                    
>>>>>>  0x00007fff84fe8fa2 object_getIvar + 0
>>>>>> [chris-joness-mac-pro:00350] [ 8] 10  CoreFoundation                     
>>>>>>  0x00007fff8143e631 -[NSObject(NSObject) finalize] + 129
>>>>>> [chris-joness-mac-pro:00350] [ 9] 11  mca_pls_xgrid.so                   
>>>>>>  0x00000001002a9ce3 -[PlsXGridClient dealloc] + 419
>>>>>> [chris-joness-mac-pro:00350] [10] 12  mca_pls_xgrid.so                   
>>>>>>  0x00000001002a9837 orte_pls_xgrid_finalize + 40
>>>>>> [chris-joness-mac-pro:00350] [11] 13  libopen-rte.0.dylib                
>>>>>>  0x000000010002d0f9 orte_pls_base_close + 249
>>>>>> [chris-joness-mac-pro:00350] [12] 14  libopen-rte.0.dylib                
>>>>>>  0x0000000100012027 orte_system_finalize + 119
>>>>>> [chris-joness-mac-pro:00350] [13] 15  libopen-rte.0.dylib                
>>>>>>  0x000000010000e968 orte_finalize + 40
>>>>>> [chris-joness-mac-pro:00350] [14] 16  mpirun                             
>>>>>>  0x00000001000011ff orterun + 2042
>>>>>> [chris-joness-mac-pro:00350] [15] 17  mpirun                             
>>>>>>  0x0000000100000a03 main + 27
>>>>>> [chris-joness-mac-pro:00350] [16] 18  mpirun                             
>>>>>>  0x00000001000009e0 start + 52
>>>>>> [chris-joness-mac-pro:00350] [17] 19  ???                                
>>>>>>  0x0000000000000004 0x0 + 4
>>>>>> [chris-joness-mac-pro:00350] *** End of error message ***
>>>>>> Abort trap
>>>>>> 
>>>>>> 
>>>>>> I've seen this error in a previous posting, and it seems that the issue 
>>>>>> has something to do with forcing everything to use Kerberos (SSO). 
>>>>>> However, I noticed that on the computer being used as an agent, this 
>>>>>> option is grayed out in the Xgrid sharing configuration (I have no idea 
>>>>>> why). I would therefore ask whether it is absolutely necessary to use 
>>>>>> SSO to get openmpi to run with xgrid, or whether I am missing something 
>>>>>> with the password setup. The Kerberos option seems much more 
>>>>>> complicated, and I may even want to switch to just using openmpi with ssh.
>>>>>> 
>>>>>> Many thanks,
>>>>>> Chris
>>>>>> 
>>>>>> 
>>>>>> Chris Jones
>>>>>> Post-doctoral Research Assistant,
>>>>>> 
>>>>>> Department of Microbiology
>>>>>> Swedish University of Agricultural Sciences
>>>>>> Uppsala, Sweden
>>>>>> phone: +46 (0)18 67 3222
>>>>>> email: chris.jo...@slu.se
>>>>>> 
>>>>>> Department of Soil and Environmental Microbiology
>>>>>> National Institute for Agronomic Research
>>>>>> Dijon, France
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>> 
>>>>> --
>>>>> Jeff Squyres
>>>>> jsquy...@cisco.com
>>>>> For corporate legal information go to:
>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> Message: 5
>>>>> Date: Fri, 5 Aug 2011 08:41:58 -0500
>>>>> From: Rob Latham<r...@mcs.anl.gov>
>>>>> Subject: Re: [OMPI users] parallel I/O on 64-bit indexed arays
>>>>> To: Open MPI Users<us...@open-mpi.org>
>>>>> Cc: Quincey Koziol<koz...@hdfgroup.org>, Fab Tillier
>>>>>      <ftill...@microsoft.com>
>>>>> Message-ID:<20110805134158.ga28...@mcs.anl.gov>
>>>>> Content-Type: text/plain; charset=us-ascii
>>>>> 
>>>>> On Wed, Jul 27, 2011 at 06:13:05PM +0200, Troels Haugboelle wrote:
>>>>> 
>>>>>> and we get good (+GB/s) performance when writing files from large runs.
>>>>>> 
>>>>>> Interestingly, an alternative and conceptually simpler option is to
>>>>>> use MPI_FILE_WRITE_ORDERED, but the performance of that function on
>>>>>> Blue-Gene/P sucks - 20 MB/s instead of GB/s. I do not know why.
>>>>>> 
>>>>> Ordered mode as implemented in ROMIO is awful.  Entirely serialized.
>>>>> We pass a token from process to process. Each process acquires the
>>>>> token, updates the shared file pointer, does its i/o, then passes the
>>>>> token to the next process.
>>>>> 
>>>>> What we should do, and have done in test branches [1], is use MPI_SCAN
>>>>> to look at the shared file pointer once, tell all the processors their
>>>>> offset, then update the shared file pointer while all processes do I/O
>>>>> in parallel.
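>>>>> 
>>>>> (A user-level sketch of the same prefix-sum idea, in case it's useful -- 
>>>>> untested, and unlike a library-side fix it doesn't maintain the shared 
>>>>> file pointer; it just replaces ordered-mode writes with explicit offsets:)
>>>>> 
>>>>>   #include <mpi.h>
>>>>> 
>>>>>   /* Write each rank's buffer in rank order starting at 'base',
>>>>>    * using MPI_Exscan to compute per-rank offsets in parallel. */
>>>>>   void write_in_rank_order(MPI_File fh, MPI_Offset base,
>>>>>                            void *buf, int count, MPI_Datatype type)
>>>>>   {
>>>>>       int rank, tsize;
>>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>       MPI_Type_size(type, &tsize);
>>>>> 
>>>>>       long long mine = (long long)count * tsize, before = 0;
>>>>>       MPI_Exscan(&mine, &before, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
>>>>>       if (rank == 0)
>>>>>           before = 0;   /* MPI_Exscan leaves rank 0's result undefined */
>>>>> 
>>>>>       /* collective write at an explicit offset; no token passing */
>>>>>       MPI_File_write_at_all(fh, base + (MPI_Offset)before, buf,
>>>>>                             count, type, MPI_STATUS_IGNORE);
>>>>>   }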
>>>>> 
>>>>> [1]: Robert Latham, Robert Ross, and Rajeev Thakur. "Implementing
>>>>> MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided
>>>>> Communication". International Journal of High Performance Computing
>>>>> Applications, 21(2):132-143, 2007
>>>>> 
>>>>> Since no one uses the shared file pointers, and even fewer people use
>>>>> ordered mode, we just haven't seen the need to do so.
>>>>> 
>>>>> Do you want to rebuild your MPI library on BlueGene?  I can pretty
>>>>> quickly generate and send a patch that will make ordered mode go whip
>>>>> fast.
>>>>> 
>>>>> ==rob
>>>>> 
>>>>> 
>>>>>> Troels
>>>>>> 
>>>>>> On 6/7/11 15:04 , Jeff Squyres wrote:
>>>>>> 
>>>>>>> On Jun 7, 2011, at 4:53 AM, Troels Haugboelle wrote:
>>>>>>> 
>>>>>>> 
>>>>>>>> In principle yes, but the problem is we have an unequal number of 
>>>>>>>> particles on each node, so the length of each array is not guaranteed 
>>>>>>>> to be divisible by 2, 4 or any other number. If I have understood the 
>>>>>>>> definition of MPI_TYPE_CREATE_SUBARRAY correctly, the offset can be 
>>>>>>>> 64-bit, but not the global array size, so, optimally, what I am 
>>>>>>>> looking for is a simple vector with an unequal size for each thread 
>>>>>>>> and with 64-bit offsets and global array size.
>>>>>>>> 
>>>>>>> It's a bit awkward, but you can still make datatypes to give the offset 
>>>>>>> that you want.  E.g., if you need an offset of 2B+31 bytes, you can 
>>>>>>> make datatype A with type contig of N=(2B/sizeof(int)) int's.  Then 
>>>>>>> make datatype B with type struct, containing type A and 31 MPI_BYTEs.  
>>>>>>> Then use 1 instance of datatype B to get the offset that you want.
>>>>>>> 
>>>>>>> You could make a utility function that, given a specific (64-bit) 
>>>>>>> offset, makes an MPI datatype that matches the offset, and later 
>>>>>>> frees it (and all sub-datatypes).
>>>>>>> 
>>>>>>> There is a bit of overhead in creating these datatypes, but it should 
>>>>>>> be dwarfed by the amount of data that you're reading/writing, right?
>>>>>>> 
>>>>>>> It's awkward, but it should work.
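>>>>>>> 
>>>>>>> (A rough, untested sketch of such a utility function -- the name and 
>>>>>>> the int-based chunking are just illustrative:)
>>>>>>> 
>>>>>>>   #include <mpi.h>
>>>>>>> 
>>>>>>>   /* Build a datatype whose extent equals 'offset_bytes' (e.g. 2B+31),
>>>>>>>    * out of a contiguous run of ints plus a few leftover bytes. */
>>>>>>>   MPI_Datatype make_offset_type(MPI_Offset offset_bytes)
>>>>>>>   {
>>>>>>>       int nints = (int)(offset_bytes / sizeof(int));  /* assumes this fits in an int */
>>>>>>>       int rem   = (int)(offset_bytes % sizeof(int));
>>>>>>> 
>>>>>>>       MPI_Datatype a, b;
>>>>>>>       MPI_Type_contiguous(nints, MPI_INT, &a);              /* datatype A */
>>>>>>> 
>>>>>>>       int          lens[2]   = { 1, rem };
>>>>>>>       MPI_Aint     displs[2] = { 0, (MPI_Aint)nints * sizeof(int) };
>>>>>>>       MPI_Datatype types[2]  = { a, MPI_BYTE };
>>>>>>>       MPI_Type_create_struct(2, lens, displs, types, &b);   /* datatype B */
>>>>>>> 
>>>>>>>       MPI_Type_commit(&b);
>>>>>>>       MPI_Type_free(&a);   /* B keeps its own reference to A */
>>>>>>>       return b;
>>>>>>>   }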
>>>>>>> 
>>>>>>> 
>>>>>>>> Another possible workaround would be to identify subsections that do 
>>>>>>>> not exceed 2B elements, make sub-communicators, and then let each of 
>>>>>>>> them dump their elements with the proper offsets. It may work. The 
>>>>>>>> problematic architecture is a BG/P. On other clusters, doing simple 
>>>>>>>> I/O - letting all threads open the file, seek to their position, and 
>>>>>>>> then write their chunk - works fine, but somehow on BG/P performance 
>>>>>>>> drops dramatically. My guess is that there is some file locking, or we 
>>>>>>>> are overwhelming the I/O nodes...
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> This ticket for the MPI-3 standard is a first step in the right 
>>>>>>>>> direction, but won't do everything you need (this is more FYI):
>>>>>>>>> 
>>>>>>>>>  https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/265
>>>>>>>>> 
>>>>>>>>> See the PDF attached to the ticket; it's going up for a "first 
>>>>>>>>> reading" in a month.  It'll hopefully be part of the MPI-3 standard 
>>>>>>>>> by the end of the year (Fab Tillier, CC'ed, has been the chief 
>>>>>>>>> proponent of this ticket for the past several months).
>>>>>>>>> 
>>>>>>>>> Quincey Koziol from the HDF group is going to propose a follow on to 
>>>>>>>>> this ticket, specifically about the case you're referring to -- large 
>>>>>>>>> counts for file functions and datatype constructors.  Quincey -- can 
>>>>>>>>> you expand on what you'll be proposing, perchance?
>>>>>>>>> 
>>>>>>>> Interesting, I think something along the lines of the note would be 
>>>>>>>> very useful and needed for large applications.
>>>>>>>> 
>>>>>>>> Thanks a lot for the pointers and your suggestions,
>>>>>>>> 
>>>>>>>> cheers,
>>>>>>>> 
>>>>>>>> Troels
>>>>>>>> 
>>>>>>> 
>>>>>> _______________________________________________
>>>>>> users mailing list
>>>>>> us...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>>> 
>>>>> --
>>>>> Rob Latham
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Lab, IL USA
>>>>> 
>>>>> 
>>>>> ------------------------------
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> 
>>>>> End of users Digest, Vol 1977, Issue 1
>>>>> **************************************
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> 
>>> <warren.vcf>_______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>> 
>> 
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

