Have you set up your shell startup files so that they point to the new OMPI installation (/opt/local/openmpi/) even for non-interactive logins?
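For anyone reproducing this: ring_c from the examples/ directory of the Open MPI tarball passes a token around all of the ranks with point-to-point sends, so unlike the hello_world and hostname runs quoted below it actually exercises the TCP connection between the two machines. Here is a minimal sketch of that kind of ring test -- it only illustrates the pattern, it is not the actual examples/ring_c.c source, and it assumes at least two ranks:

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative ring-style token pass (not the real examples/ring_c.c).
     * Every rank must complete a receive from its left neighbor and a send
     * to its right neighbor, so a hang here usually means the cross-host
     * TCP connection between a pair of ranks never came up. */
    int main(int argc, char **argv)
    {
        int rank, size, token;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 10;                        /* rank 0 starts the token */
            MPI_Send(&token, 1, MPI_INT, 1, 201, MPI_COMM_WORLD);
        }

        /* Receive from the left neighbor; everyone but rank 0 forwards it. */
        MPI_Recv(&token, 1, MPI_INT, (rank + size - 1) % size, 201,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank != 0) {
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 201,
                     MPI_COMM_WORLD);
        }
        printf("Process %d got the token %d\n", rank, token);

        MPI_Finalize();
        return 0;
    }

If a test like this stalls on the first cross-host hop while hello_world and hostname run cleanly, the usual suspects are a firewall or the peers picking an unreachable interface; restricting Open MPI to the back-to-back ethernet link with the btl_tcp_if_include / btl_tcp_if_exclude MCA parameters is a reasonable next experiment.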
On Aug 10, 2011, at 6:14 AM, Christopher Jones wrote: > Hi, > > Thanks for the quick response.....I managed to compile 1.5.3 on both > computers using gcc-4.2, with the proper flags set (this took a bit of > playing with, but I did eventually get it to compile). Once that was done, I > installed it to a different directory from 1.2.8 (/opt/local/openmpi/), > specified the PATH and LD_LIBRARY_PATH for the new version on each computer, > then managed to get the hello_world script to run again so it could call each > process, like before. However, I'm still in the same place - ring_c freezes > up. I tried changing the hostname in the host file (just for poops and > giggles - I see the response stating it doesn't matter), but to no avail. I > made sure the firewall is off on both computers. > > I'm hoping I'm not doing something overly dumb here, but I'm still a bit > stuck...I see in the FAQ that there were some issues with nehalem processors > - I have two Xeons in one box and a nehalem in another. Could this make any > difference? > > Thanks again, > Chris > > On Aug 9, 2011, at 6:50 PM, Jeff Squyres wrote: > >> No, Open MPI doesn't use the names in the hostfile to figure out which >> TCP/IP addresses to use (for example). Each process ends up publishing a >> list of IP addresses at which it can be connected, and OMPI does routability >> computations to figure out which is the "best" address to contact a given >> peer on. >> >> If you're just starting with Open MPI, can you upgrade? 1.2.8 is pretty >> ancient. Open MPI 1.4.3 is the most recent stable release; 1.5.3 is our >> "feature" series, but it's also relatively stable (new releases are coming >> in both the 1.4.x and 1.5.x series soon, FWIW). >> >> >> On Aug 9, 2011, at 12:14 PM, David Warren wrote: >> >>> I don't know if this is it, but if you use the name localhost, won't >>> processes on both machines try to talk to 127.0.0.1? I believe you need to >>> use the real hostname in you host file. I think that your two tests work >>> because there is no interprocess communication, just stdout. >>> >>> On 08/08/11 23:46, Christopher Jones wrote: >>>> Hi again, >>>> >>>> I changed the subject of my previous posting to reflect a new problem >>>> encountered when I changed my strategy to using SSH instead of Xgrid on >>>> two mac pros. I've set up a login-less ssh communication between the two >>>> macs (connected via direct ethernet, both running openmpi 1.2.8 on OSX >>>> 10.6.8) per the instructions on the FAQ. I can type in 'ssh >>>> computer-name.local' on either computer and connect without a password >>>> prompt. From what I can see, the ssh-agent is up and running - the >>>> following is listed in my ENV: >>>> >>>> SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners >>>> SSH_AGENT_PID=61058 >>>> >>>> My host file simply lists 'localhost' and >>>> 'chrisjones2@allana-welshs-mac-pro.local'. 
When I run a simple hello_world >>>> test, I get what seems like a reasonable output: >>>> >>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile >>>> ./test_hello >>>> Hello world from process 0 of 8 >>>> Hello world from process 1 of 8 >>>> Hello world from process 2 of 8 >>>> Hello world from process 3 of 8 >>>> Hello world from process 4 of 8 >>>> Hello world from process 7 of 8 >>>> Hello world from process 5 of 8 >>>> Hello world from process 6 of 8 >>>> >>>> I can also run hostname and get what seems to be an ok response (unless >>>> I'm wrong about this): >>>> >>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname >>>> allana-welshs-mac-pro.local >>>> allana-welshs-mac-pro.local >>>> allana-welshs-mac-pro.local >>>> allana-welshs-mac-pro.local >>>> quadcore.mikrob.slu.se >>>> quadcore.mikrob.slu.se >>>> quadcore.mikrob.slu.se >>>> quadcore.mikrob.slu.se >>>> >>>> >>>> However, when I run the ring_c test, it freezes: >>>> >>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c >>>> Process 0 sending 10 to 1, tag 201 (8 processes in ring) >>>> Process 0 sent to 1 >>>> Process 0 decremented value: 9 >>>> >>>> (I noted that processors on both computers are active). >>>> >>>> ring_c was compiled separately on each computer, however both have the >>>> same version of openmpi and OSX. I've gone through the FAQ and searched >>>> the user forum, but I can't quite seems to get this problem unstuck. >>>> >>>> Many thanks for your time, >>>> Chris >>>> >>>> On Aug 5, 2011, at 6:00 PM,<users-requ...@open-mpi.org> >>>> <users-requ...@open-mpi.org> wrote: >>>> >>>> >>>>> Send users mailing list submissions to >>>>> us...@open-mpi.org >>>>> >>>>> To subscribe or unsubscribe via the World Wide Web, visit >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>> or, via email, send a message with subject or body 'help' to >>>>> users-requ...@open-mpi.org >>>>> >>>>> You can reach the person managing the list at >>>>> users-ow...@open-mpi.org >>>>> >>>>> When replying, please edit your Subject line so it is more specific >>>>> than "Re: Contents of users digest..." >>>>> >>>>> >>>>> Today's Topics: >>>>> >>>>> 1. Re: OpenMPI causing WRF to crash (Jeff Squyres) >>>>> 2. Re: OpenMPI causing WRF to crash (Anthony Chan) >>>>> 3. Re: Program hangs on send when run with nodes on remote >>>>> machine (Jeff Squyres) >>>>> 4. Re: openmpi 1.2.8 on Xgrid noob issue (Jeff Squyres) >>>>> 5. Re: parallel I/O on 64-bit indexed arays (Rob Latham) >>>>> >>>>> >>>>> ---------------------------------------------------------------------- >>>>> >>>>> Message: 1 >>>>> Date: Thu, 4 Aug 2011 19:18:36 -0400 >>>>> From: Jeff Squyres<jsquy...@cisco.com> >>>>> Subject: Re: [OMPI users] OpenMPI causing WRF to crash >>>>> To: Open MPI Users<us...@open-mpi.org> >>>>> Message-ID:<3f0e661f-a74f-4e51-86c0-1f84feb07...@cisco.com> >>>>> Content-Type: text/plain; charset=windows-1252 >>>>> >>>>> Signal 15 is usually SIGTERM on Linux, meaning that some external entity >>>>> probably killed the job. >>>>> >>>>> The OMPI error message you describe is also typical for that kind of >>>>> scenario -- i.e., a process exited without calling MPI_Finalize could >>>>> mean that it called exit() or some external process killed it. >>>>> >>>>> >>>>> On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote: >>>>> >>>>> >>>>>> I am trying to run a rather heavy wrf simulation with spectral nudging >>>>>> but the simulation crashes after 1.8 minutes of integration. 
>>>>>> The simulation has two domains with d01 = 601x601 and d02 = 721x721 >>>>>> and 51 vertical levels. I tried this simulation on two different systems >>>>>> but result was more or less same. For example >>>>>> >>>>>> On our Bluegene/P with SUSE Linux Enterprise Server 10 ppc and XLF >>>>>> compiler I tried to run wrf on 2048 shared memory nodes (1 compute node >>>>>> = 4 cores , 32 bit, 850 Mhz). For the parallel run I used mpixlc, >>>>>> mpixlcxx and mpixlf90. I got the following error message in the wrf.err >>>>>> file >>>>>> >>>>>> <Aug 01 19:50:21.244540> BE_MPI (ERROR): The error message in the job >>>>>> record is as follows: >>>>>> <Aug 01 19:50:21.244657> BE_MPI (ERROR): "killed with signal 15" >>>>>> >>>>>> I also tried to run the same simulation on our linux cluster (Linux Red >>>>>> Hat Enterprise 5.4m x86_64 and Intel compiler) with 8, 16 and 64 nodes >>>>>> (1 compute node=8 cores). For the parallel run I am used >>>>>> mpi/openmpi/1.4.2-intel-11. I got the following error message in the >>>>>> error log after couple of minutes of integration. >>>>>> >>>>>> "mpirun has exited due to process rank 45 with PID 19540 on >>>>>> node ci118 exiting without calling "finalize". This may >>>>>> have caused other processes in the application to be >>>>>> terminated by signals sent by mpirun (as reported here)." >>>>>> >>>>>> I tried many things but nothing seems to be working. However, if I >>>>>> reduce grid points below 200, the simulation goes fine. It appears that >>>>>> probably OpenMP has problem with large number of grid points but I have >>>>>> no idea how to fix it. I will greatly appreciate if you could suggest >>>>>> some solution. >>>>>> >>>>>> Best regards, >>>>>> --- >>>>>> Basit A. Khan, Ph.D. >>>>>> Postdoctoral Fellow >>>>>> Division of Physical Sciences& Engineering >>>>>> Office# 3204, Level 3, Building 1, >>>>>> King Abdullah University of Science& Technology >>>>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955 ?6900, >>>>>> Kingdom of Saudi Arabia. >>>>>> >>>>>> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592 >>>>>> E-mail: basitali.k...@kaust.edu.sa >>>>>> Skype name: basit.a.khan >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> >>>>> >>>>> -- >>>>> Jeff Squyres >>>>> jsquy...@cisco.com >>>>> For corporate legal information go to: >>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>> >>>>> >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> Message: 2 >>>>> Date: Thu, 4 Aug 2011 18:59:59 -0500 (CDT) >>>>> From: Anthony Chan<c...@mcs.anl.gov> >>>>> Subject: Re: [OMPI users] OpenMPI causing WRF to crash >>>>> To: Open MPI Users<us...@open-mpi.org> >>>>> Message-ID: >>>>> <660521091.191111.1312502399225.javamail.r...@zimbra.anl.gov> >>>>> Content-Type: text/plain; charset=utf-8 >>>>> >>>>> >>>>> If you want to debug this on BGP, you could set BG_COREDUMPONERROR=1 >>>>> and look at the backtrace in the light weight core files >>>>> (you probably need to recompile everything with -g). >>>>> >>>>> A.Chan >>>>> >>>>> ----- Original Message ----- >>>>> >>>>>> Hi Dmitry, >>>>>> Thanks for a prompt and fairly detailed response. I have also >>>>>> forwarded >>>>>> the email to wrf community in the hope that somebody would have some >>>>>> straight forward solution. I will try to debug the error as suggested >>>>>> by >>>>>> you if I would not have much luck from the wrf forum. 
>>>>>> >>>>>> Cheers, >>>>>> --- >>>>>> >>>>>> Basit A. Khan, Ph.D. >>>>>> Postdoctoral Fellow >>>>>> Division of Physical Sciences& Engineering >>>>>> Office# 3204, Level 3, Building 1, >>>>>> King Abdullah University of Science& Technology >>>>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955 ?6900, >>>>>> Kingdom of Saudi Arabia. >>>>>> >>>>>> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592 >>>>>> E-mail: basitali.k...@kaust.edu.sa >>>>>> Skype name: basit.a.khan >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On 8/3/11 2:46 PM, "Dmitry N. Mikushin"<maemar...@gmail.com> wrote: >>>>>> >>>>>> >>>>>>> 5 apparently means one of the WRF's MPI processes has been >>>>>>> unexpectedly terminated, maybe by program decision. No matter, if it >>>>>>> is OpenMPI-specifi >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> Message: 3 >>>>> Date: Thu, 4 Aug 2011 20:46:16 -0400 >>>>> From: Jeff Squyres<jsquy...@cisco.com> >>>>> Subject: Re: [OMPI users] Program hangs on send when run with nodes on >>>>> remote machine >>>>> To: Open MPI Users<us...@open-mpi.org> >>>>> Message-ID:<f344f301-ad7b-4e83-b0df-a6e001072...@cisco.com> >>>>> Content-Type: text/plain; charset=us-ascii >>>>> >>>>> I notice that in the worker, you have: >>>>> >>>>> eth2 Link encap:Ethernet HWaddr 00:1b:21:77:c5:d4 >>>>> inet addr:192.168.1.155 Bcast:192.168.1.255 Mask:255.255.255.0 >>>>> inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link >>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >>>>> RX packets:9225846 errors:0 dropped:75175 overruns:0 frame:0 >>>>> TX packets:8 errors:0 dropped:0 overruns:0 carrier:0 >>>>> collisions:0 txqueuelen:1000 >>>>> RX bytes:1336628768 (1.3 GB) TX bytes:552 (552.0 B) >>>>> >>>>> eth3 Link encap:Ethernet HWaddr 00:1b:21:77:c5:d5 >>>>> inet addr:192.168.1.156 Bcast:192.168.1.255 Mask:255.255.255.0 >>>>> inet6 addr: fe80::21b:21ff:fe77:c5d5/64 Scope:Link >>>>> UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 >>>>> RX packets:26481809 errors:0 dropped:75059 overruns:0 frame:0 >>>>> TX packets:18030236 errors:0 dropped:0 overruns:0 carrier:0 >>>>> collisions:0 txqueuelen:1000 >>>>> RX bytes:70061260271 (70.0 GB) TX bytes:11844181778 (11.8 GB) >>>>> >>>>> Two different NICs are on the same subnet -- that doesn't seem like a >>>>> good idea...? I think this topic has come up on the users list before, >>>>> and, IIRC, the general consensus is "don't do that" because it's not >>>>> clear as to which NIC Linux will actually send outgoing traffic across >>>>> bound for the 192.168.1.x subnet. >>>>> >>>>> >>>>> >>>>> On Aug 4, 2011, at 1:59 PM, Keith Manville wrote: >>>>> >>>>> >>>>>> I am having trouble running my MPI program on multiple nodes. I can >>>>>> run multiple processes on a single node, and I can spawn processes on >>>>>> on remote nodes, but when I call Send from a remote node, the node >>>>>> never returns, even though there is an appropriate Recv waiting. I'm >>>>>> pretty sure this is an issue with my configuration, not my code. I've >>>>>> tried some other sample programs I found and had the same problem of >>>>>> hanging on a send from one host to another. >>>>>> >>>>>> Here's an in depth description: >>>>>> >>>>>> I wrote a quick test program where each process with rank> 1 sends an >>>>>> int to the master (rank 0), and the master receives until it gets >>>>>> something from every other process. 
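Keith's attachment is not reproduced in the digest, so purely as a reference point, here is a hypothetical reconstruction of the pattern he describes: every nonzero rank sends one int to rank 0, which receives one message from each of them. The 10 + rank payload and the host:rank banner are guesses based on the output he pasted, not his actual code:

    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical reconstruction of the described test, not Keith's code:
     * every rank > 0 sends one int to rank 0, and rank 0 receives one
     * message from every other rank, in whatever order they arrive. */
    int main(int argc, char **argv)
    {
        int rank, size, hostlen;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &hostlen);

        printf("Hi I'm %s:%d\n", host, rank);

        if (rank == 0) {
            for (int i = 1; i < size; i++) {
                int value;
                MPI_Status status;
                MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                printf("%s:0 received %d from %d\n", host, value,
                       status.MPI_SOURCE);
            }
            printf("all workers checked in!\n");
        } else {
            int value = 10 + rank;               /* guessed from the output */
            printf("%s:%d sending %d...\n", host, rank, value);
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
            printf("%s:%d sent %d\n", host, rank, value);
        }

        MPI_Finalize();
        return 0;
    }

Nothing in a program like this should care which host a rank lands on, which is consistent with Jeff's suggestion above that the hang is a TCP-level problem (two NICs on the same subnet) rather than a bug in the code.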
>>>>>> >>>>>> My test program works fine when I run multiple processes on a single >>>>>> machine. >>>>>> >>>>>> either the local node: >>>>>> >>>>>> $ ./mpirun -n 4 ./mpi-test >>>>>> Hi I'm localhost:2 >>>>>> Hi I'm localhost:1 >>>>>> localhost:1 sending 11... >>>>>> localhost:2 sending 12... >>>>>> localhost:2 sent 12 >>>>>> localhost:1 sent 11 >>>>>> Hi I'm localhost:0 >>>>>> localhost:0 received 11 from 1 >>>>>> localhost:0 received 12 from 2 >>>>>> Hi I'm localhost:3 >>>>>> localhost:3 sending 13... >>>>>> localhost:3 sent 13 >>>>>> localhost:0 received 13 from 3 >>>>>> all workers checked in! >>>>>> >>>>>> or a remote one: >>>>>> >>>>>> $ ./mpirun -np 2 -host remotehost ./mpi-test >>>>>> Hi I'm remotehost:0 >>>>>> remotehost:0 received 11 from 1 >>>>>> all workers checked in! >>>>>> Hi I'm remotehost:1 >>>>>> remotehost:1 sending 11... >>>>>> remotehost:1 sent 11 >>>>>> >>>>>> But when I try to run the master locally and the worker(s) remotely >>>>>> (this is the way I am actually interested in running it), Send never >>>>>> returns and it hangs indefinitely. >>>>>> >>>>>> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test >>>>>> Hi I'm localhost:0 >>>>>> Hi I'm remotehost:1 >>>>>> remotehost:1 sending 11... >>>>>> >>>>>> Just to see if it would work, I tried spawning the master on the >>>>>> remotehost and the worker on the localhost. >>>>>> >>>>>> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test >>>>>> Hi I'm localhost:1 >>>>>> localhost:1 sending 11... >>>>>> localhost:1 sent 11 >>>>>> Hi I'm remotehost:0 >>>>>> remotehost:0 received 0 from 1 >>>>>> all workers checked in! >>>>>> >>>>>> It doesn't hang on Send, but the wrong value is received. >>>>>> >>>>>> Any idea what's going on? I've attached my code, my config.log, >>>>>> ifconfig output, and ompi_info output. >>>>>> >>>>>> Thanks, >>>>>> Keith >>>>>> <mpi.tgz>_______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> >>>>> >>>>> -- >>>>> Jeff Squyres >>>>> jsquy...@cisco.com >>>>> For corporate legal information go to: >>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>> >>>>> >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> Message: 4 >>>>> Date: Thu, 4 Aug 2011 20:48:30 -0400 >>>>> From: Jeff Squyres<jsquy...@cisco.com> >>>>> Subject: Re: [OMPI users] openmpi 1.2.8 on Xgrid noob issue >>>>> To: Open MPI Users<us...@open-mpi.org> >>>>> Message-ID:<c2ea7fd0-badb-4d05-851c-c444be26f...@cisco.com> >>>>> Content-Type: text/plain; charset=us-ascii >>>>> >>>>> I'm afraid our Xgrid support has lagged, and Apple hasn't show much >>>>> interest in MPI + Xgrid support -- much less HPC. :-\ >>>>> >>>>> Have you see the FAQ items about Xgrid? >>>>> >>>>> http://www.open-mpi.org/faq/?category=osx#xgrid-howto >>>>> >>>>> >>>>> On Aug 4, 2011, at 4:16 AM, Christopher Jones wrote: >>>>> >>>>> >>>>>> Hi there, >>>>>> >>>>>> I'm currently trying to set up a small xgrid between two mac pros (a >>>>>> single quadcore and a 2 duo core), where both are directly connected via >>>>>> an ethernet cable. I've set up xgrid using the password authentication >>>>>> (rather than the kerberos), and from what I can tell in the Xgrid admin >>>>>> tool it seems to be working. 
However, once I try a simple hello world >>>>>> program, I get this error: >>>>>> >>>>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello >>>>>> mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited >>>>>> on signal 15 (Terminated). >>>>>> 1 additional process aborted (not shown) >>>>>> 2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to >>>>>> uncaught exception 'NSInvalidArgumentException', reason: '*** >>>>>> -[NSKVONotifying_XGConnection<0x1001325a0> finalize]: called when >>>>>> collecting not enabled' >>>>>> *** Call stack at first throw: >>>>>> ( >>>>>> 0 CoreFoundation 0x00007fff814237b4 >>>>>> __exceptionPreprocess + 180 >>>>>> 1 libobjc.A.dylib 0x00007fff84fe8f03 >>>>>> objc_exception_throw + 45 >>>>>> 2 CoreFoundation 0x00007fff8143e631 >>>>>> -[NSObject(NSObject) finalize] + 129 >>>>>> 3 mca_pls_xgrid.so 0x00000001002a9ce3 >>>>>> -[PlsXGridClient dealloc] + 419 >>>>>> 4 mca_pls_xgrid.so 0x00000001002a9837 >>>>>> orte_pls_xgrid_finalize + 40 >>>>>> 5 libopen-rte.0.dylib 0x000000010002d0f9 >>>>>> orte_pls_base_close + 249 >>>>>> 6 libopen-rte.0.dylib 0x0000000100012027 >>>>>> orte_system_finalize + 119 >>>>>> 7 libopen-rte.0.dylib 0x000000010000e968 >>>>>> orte_finalize + 40 >>>>>> 8 mpirun 0x00000001000011ff orterun + >>>>>> 2042 >>>>>> 9 mpirun 0x0000000100000a03 main + 27 >>>>>> 10 mpirun 0x00000001000009e0 start + 52 >>>>>> 11 ??? 0x0000000000000004 0x0 + 4 >>>>>> ) >>>>>> terminate called after throwing an instance of 'NSException' >>>>>> [chris-joness-mac-pro:00350] *** Process received signal *** >>>>>> [chris-joness-mac-pro:00350] Signal: Abort trap (6) >>>>>> [chris-joness-mac-pro:00350] Signal code: (0) >>>>>> [chris-joness-mac-pro:00350] [ 0] 2 libSystem.B.dylib >>>>>> 0x00007fff81ca51ba _sigtramp + 26 >>>>>> [chris-joness-mac-pro:00350] [ 1] 3 ??? 
>>>>>> 0x00000001000cd400 0x0 + 4295808000 >>>>>> [chris-joness-mac-pro:00350] [ 2] 4 libstdc++.6.dylib >>>>>> 0x00007fff830965d2 __tcf_0 + 0 >>>>>> [chris-joness-mac-pro:00350] [ 3] 5 libobjc.A.dylib >>>>>> 0x00007fff84fecb39 _objc_terminate + 100 >>>>>> [chris-joness-mac-pro:00350] [ 4] 6 libstdc++.6.dylib >>>>>> 0x00007fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11 >>>>>> [chris-joness-mac-pro:00350] [ 5] 7 libstdc++.6.dylib >>>>>> 0x00007fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0 >>>>>> [chris-joness-mac-pro:00350] [ 6] 8 libstdc++.6.dylib >>>>>> 0x00007fff83094bfc >>>>>> _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0 >>>>>> [chris-joness-mac-pro:00350] [ 7] 9 libobjc.A.dylib >>>>>> 0x00007fff84fe8fa2 object_getIvar + 0 >>>>>> [chris-joness-mac-pro:00350] [ 8] 10 CoreFoundation >>>>>> 0x00007fff8143e631 -[NSObject(NSObject) finalize] + 129 >>>>>> [chris-joness-mac-pro:00350] [ 9] 11 mca_pls_xgrid.so >>>>>> 0x00000001002a9ce3 -[PlsXGridClient dealloc] + 419 >>>>>> [chris-joness-mac-pro:00350] [10] 12 mca_pls_xgrid.so >>>>>> 0x00000001002a9837 orte_pls_xgrid_finalize + 40 >>>>>> [chris-joness-mac-pro:00350] [11] 13 libopen-rte.0.dylib >>>>>> 0x000000010002d0f9 orte_pls_base_close + 249 >>>>>> [chris-joness-mac-pro:00350] [12] 14 libopen-rte.0.dylib >>>>>> 0x0000000100012027 orte_system_finalize + 119 >>>>>> [chris-joness-mac-pro:00350] [13] 15 libopen-rte.0.dylib >>>>>> 0x000000010000e968 orte_finalize + 40 >>>>>> [chris-joness-mac-pro:00350] [14] 16 mpirun >>>>>> 0x00000001000011ff orterun + 2042 >>>>>> [chris-joness-mac-pro:00350] [15] 17 mpirun >>>>>> 0x0000000100000a03 main + 27 >>>>>> [chris-joness-mac-pro:00350] [16] 18 mpirun >>>>>> 0x00000001000009e0 start + 52 >>>>>> [chris-joness-mac-pro:00350] [17] 19 ??? >>>>>> 0x0000000000000004 0x0 + 4 >>>>>> [chris-joness-mac-pro:00350] *** End of error message *** >>>>>> Abort trap >>>>>> >>>>>> >>>>>> I've seen this error in a previous mailing, and it seems that the issue >>>>>> has something to do with forcing everything to use kerberos (SSO). >>>>>> However, I noticed that in the computer being used as an agent, this >>>>>> option is grayed on in the Xgrid sharing configuration (I have no idea >>>>>> why). I would therefore ask if it is absolutely necessary to use SSO to >>>>>> get openmpi to run with xgrid, or am I missing something with the >>>>>> password setup. Seems that the kerberos option is much more complicated, >>>>>> and I may even want to switch to just using openmpi with ssh. 
>>>>>> >>>>>> Many thanks, >>>>>> Chris >>>>>> >>>>>> >>>>>> Chris Jones >>>>>> Post-doctoral Research Assistant, >>>>>> >>>>>> Department of Microbiology >>>>>> Swedish University of Agricultural Sciences >>>>>> Uppsala, Sweden >>>>>> phone: +46 (0)18 67 3222 >>>>>> email: chris.jo...@slu.se >>>>>> >>>>>> Department of Soil and Environmental Microbiology >>>>>> National Institute for Agronomic Research >>>>>> Dijon, France >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> users mailing list >>>>>> us...@open-mpi.org >>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users >>>>>> >>>>> >>>>> -- >>>>> Jeff Squyres >>>>> jsquy...@cisco.com >>>>> For corporate legal information go to: >>>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>>> >>>>> >>>>> >>>>> >>>>> ------------------------------ >>>>> >>>>> Message: 5 >>>>> Date: Fri, 5 Aug 2011 08:41:58 -0500 >>>>> From: Rob Latham<r...@mcs.anl.gov> >>>>> Subject: Re: [OMPI users] parallel I/O on 64-bit indexed arays >>>>> To: Open MPI Users<us...@open-mpi.org> >>>>> Cc: Quincey Koziol<koz...@hdfgroup.org>, Fab Tillier >>>>> <ftill...@microsoft.com> >>>>> Message-ID:<20110805134158.ga28...@mcs.anl.gov> >>>>> Content-Type: text/plain; charset=us-ascii >>>>> >>>>> On Wed, Jul 27, 2011 at 06:13:05PM +0200, Troels Haugboelle wrote: >>>>> >>>>>> and we get good (+GB/s) performance when writing files from large runs. >>>>>> >>>>>> Interestingly, an alternative and conceptually simpler option is to >>>>>> use MPI_FILE_WRITE_ORDERED, but the performance of that function on >>>>>> Blue-Gene/P sucks - 20 MB/s instead of GB/s. I do not know why. >>>>>> >>>>> Ordered mode as implemented in ROMIO is awful. Entirely serialized. >>>>> We pass a token from process to process. Each process acquires the >>>>> token, updates the shared file pointer, does its i/o, then passes the >>>>> token to the next process. >>>>> >>>>> What we should do, and have done in test branches [1], is use MPI_SCAN >>>>> to look at the shared file pointer once, tell all the processors their >>>>> offset, then update the shared file pointer while all processes do I/O >>>>> in parallel. >>>>> >>>>> [1]: Robert Latham, Robert Ross, and Rajeev Thakur. "Implementing >>>>> MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided >>>>> Communication". International Journal of High Performance Computing >>>>> Applications, 21(2):132-143, 2007 >>>>> >>>>> Since no one uses the shared file pointers, and even fewer people use >>>>> ordered mode, we just haven't seen the need to do so. >>>>> >>>>> Do you want to rebuild your MPI library on BlueGene? I can pretty >>>>> quickly generate and send a patch that will make ordered mode go whip >>>>> fast. >>>>> >>>>> ==rob >>>>> >>>>> >>>>>> Troels >>>>>> >>>>>> On 6/7/11 15:04 , Jeff Squyres wrote: >>>>>> >>>>>>> On Jun 7, 2011, at 4:53 AM, Troels Haugboelle wrote: >>>>>>> >>>>>>> >>>>>>>> In principle yes, but the problem is we have an unequal amount of >>>>>>>> particles on each node, so the length of each array is not guaranteed >>>>>>>> to be divisible by 2, 4 or any other number. If I have understood the >>>>>>>> definition of MPI_TYPE_CREATE_SUBARRAY correctly the offset can be >>>>>>>> 64-bit, but not the global array size, so, optimally, what I am >>>>>>>> looking for is something that has unequal size for each thread, simple >>>>>>>> vector, and with 64-bit offsets and global array size. 
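(An aside on Rob Latham's remark above about ordered mode: expressed at the application level rather than inside ROMIO, the approach he describes -- one MPI_Scan to hand every rank its offset, then a fully parallel collective write -- looks roughly like the sketch below. This is only an illustration of the idea, not the patch he is offering; the function name and arguments are invented for the example.)

    #include <mpi.h>

    /* Illustrative sketch only -- not the ROMIO change Rob mentions.  Every
     * rank contributes 'count' elements of 'type'; one MPI_Scan gives each
     * rank the running total of bytes up to and including itself, so
     * subtracting its own contribution yields its start offset, and all
     * ranks then write their own region collectively, in rank order but
     * without any token passing. */
    void write_in_rank_order(MPI_File fh, void *buf, int count,
                             MPI_Datatype type, MPI_Offset base)
    {
        int tsize;
        long long mybytes, end;

        MPI_Type_size(type, &tsize);
        mybytes = (long long)count * (long long)tsize;

        /* Inclusive prefix sum of the byte counts, in rank order. */
        MPI_Scan(&mybytes, &end, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);

        MPI_File_write_at_all(fh, base + (MPI_Offset)(end - mybytes), buf,
                              count, type, MPI_STATUS_IGNORE);
    }

Each rank's start offset is just the prefix sum of the byte counts of the ranks before it, which is exactly the information the serialized token pass was computing one process at a time.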
>>>>>>>> >>>>>>> It's a bit awkward, but you can still make datatypes to give the offset >>>>>>> that you want. E.g., if you need an offset of 2B+31 bytes, you can >>>>>>> make datatype A with type contig of N=(2B/sizeof(int)) int's. Then >>>>>>> make datatype B with type struct, containing type A and 31 MPI_BYTEs. >>>>>>> Then use 1 instance of datatype B to get the offset that you want. >>>>>>> >>>>>>> You could make utility functions that, given a specific (64 bit) >>>>>>> offset, it makes an MPI datatype that matches the offset, and then >>>>>>> frees it (and all sub-datatypes). >>>>>>> >>>>>>> There is a bit of overhead in creating these datatypes, but it should >>>>>>> be dwarfed by the amount of data that you're reading/writing, right? >>>>>>> >>>>>>> It's awkward, but it should work. >>>>>>> >>>>>>> >>>>>>>> Another possible workaround would be to identify subsections that do >>>>>>>> not pass 2B elements, make sub communicators, and then let each of >>>>>>>> them dump their elements with proper offsets. It may work. The >>>>>>>> problematic architecture is a BG/P. On other clusters doing simple >>>>>>>> I/O, letting all threads open the file, seek to their position, and >>>>>>>> then write their chunk works fine, but somehow on BG/P performance >>>>>>>> drops dramatically. My guess is that there is some file locking, or we >>>>>>>> are overwhelming the I/O nodes.. >>>>>>>> >>>>>>>> >>>>>>>>> This ticket for the MPI-3 standard is a first step in the right >>>>>>>>> direction, but won't do everything you need (this is more FYI): >>>>>>>>> >>>>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/265 >>>>>>>>> >>>>>>>>> See the PDF attached to the ticket; it's going up for a "first >>>>>>>>> reading" in a month. It'll hopefully be part of the MPI-3 standard >>>>>>>>> by the end of the year (Fab Tillier, CC'ed, has been the chief >>>>>>>>> proponent of this ticket for the past several months). >>>>>>>>> >>>>>>>>> Quincey Koziol from the HDF group is going to propose a follow on to >>>>>>>>> this ticket, specifically about the case you're referring to -- large >>>>>>>>> counts for file functions and datatype constructors. Quincey -- can >>>>>>>>> you expand on what you'll be proposing, perchance? >>>>>>>>> >>>>>>>> Interesting, I think something along the lines of the note would be >>>>>>>> very useful and needed for large applications. 
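To make the datatype trick above concrete, here is a hedged sketch of the kind of utility function being suggested: it builds a type whose extent is an arbitrary 64-bit number of bytes out of a contiguous run of ints plus a few trailing MPI_BYTEs, and pins the extent with MPI_Type_create_resized so alignment padding cannot creep in. The function name is invented, error handling is omitted, and with a single level of nesting the int count must itself fit in an int (offsets up to roughly 2^31 * sizeof(int) bytes); larger offsets would need one more contiguous layer:

    #include <mpi.h>

    /* Hedged sketch: build a datatype whose extent is offset_bytes, from a
     * contiguous run of ints plus a few trailing MPI_BYTEs. */
    MPI_Datatype make_offset_type(MPI_Offset offset_bytes)
    {
        MPI_Datatype contig, combined, result;
        int nints = (int)(offset_bytes / (MPI_Offset)sizeof(int));
        int rem   = (int)(offset_bytes % (MPI_Offset)sizeof(int));

        MPI_Type_contiguous(nints, MPI_INT, &contig);

        int          blocklens[2] = { 1, rem };
        MPI_Aint     disps[2]     = { 0, (MPI_Aint)nints * (MPI_Aint)sizeof(int) };
        MPI_Datatype types[2]     = { contig, MPI_BYTE };
        MPI_Type_create_struct(2, blocklens, disps, types, &combined);

        /* Pin the extent to exactly offset_bytes so alignment padding cannot
         * affect later displacement arithmetic. */
        MPI_Type_create_resized(combined, 0, (MPI_Aint)offset_bytes, &result);
        MPI_Type_commit(&result);

        /* Safe to free the building blocks: derived types keep their own
         * copy of the description. */
        MPI_Type_free(&contig);
        MPI_Type_free(&combined);
        return result;
    }

One committed instance of a type like this can then be spliced in ahead of the real data when building a larger struct or filetype, which is how the "use 1 instance of datatype B" step plays out in practice.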
>>>>>>>> Thanks a lot for the pointers and your suggestions,
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>>
>>>>>>>> Troels
>>>>>
>>>>> --
>>>>> Rob Latham
>>>>> Mathematics and Computer Science Division
>>>>> Argonne National Lab, IL USA
>>>>>
>>>>> ------------------------------
>>>>>
>>>>> End of users Digest, Vol 1977, Issue 1
>>>>> **************************************
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/