Hi,

On 09.08.2011 at 08:46, Christopher Jones wrote:
> I changed the subject of my previous posting to reflect a new problem encountered when I changed my strategy to using SSH instead of Xgrid on two Mac Pros. I've set up password-less SSH communication between the two Macs (connected via direct ethernet, both running Open MPI 1.2.8 on OS X 10.6.8) per the instructions in the FAQ. I can type 'ssh computer-name.local' on either computer and connect without a password prompt. From what I can see, the ssh-agent is up and running - the following is listed in my ENV:
>
> SSH_AUTH_SOCK=/tmp/launch-5FoCc1/Listeners
> SSH_AGENT_PID=61058
>
> My host file simply lists 'localhost' and 'chrisjones2@allana-welshs-mac-pro.local'. When I run a simple hello_world test, I get what seems like a reasonable output:
>
> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./test_hello
> Hello world from process 0 of 8
> Hello world from process 1 of 8
> Hello world from process 2 of 8
> Hello world from process 3 of 8
> Hello world from process 4 of 8
> Hello world from process 7 of 8
> Hello world from process 5 of 8
> Hello world from process 6 of 8
>
> I can also run hostname and get what seems to be an OK response (unless I'm wrong about this):
>
> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile hostname
> allana-welshs-mac-pro.local
> allana-welshs-mac-pro.local
> allana-welshs-mac-pro.local
> allana-welshs-mac-pro.local
> quadcore.mikrob.slu.se
> quadcore.mikrob.slu.se
> quadcore.mikrob.slu.se
> quadcore.mikrob.slu.se
>
> However, when I run the ring_c test, it freezes:
>
> chris-joness-mac-pro:~ chrisjones$ mpirun -np 8 -hostfile hostfile ./ring_c
> Process 0 sending 10 to 1, tag 201 (8 processes in ring)
> Process 0 sent to 1
> Process 0 decremented value: 9
>
> (I noted that processors on both computers are active.)
>
> ring_c was compiled separately on each computer; however, both have the same version of Open MPI and OS X. I've gone through the FAQ and searched the user forum, but I can't quite seem to get this problem unstuck.

Do you have any firewall on the machines?

-- Reuti
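Since hello_world and hostname involve no MPI point-to-point traffic between the two hosts, while ring_c does, a blocked or mis-routed TCP connection between the machines would show up exactly as a hang like this. A minimal sketch of how to check the BSD firewall and pin Open MPI's TCP traffic to the direct-ethernet link while testing - the interface name en0 is an assumption, substitute whatever ifconfig reports for the point-to-point link:

    # any rules loaded in the BSD-level firewall?
    sudo ipfw list

    # force both the out-of-band and the MPI (BTL) TCP traffic onto one interface
    mpirun --mca btl_tcp_if_include en0 --mca oob_tcp_if_include en0 \
           -np 8 -hostfile hostfile ./ring_c

If the ring completes with the interfaces pinned, the hang was a firewall or routing issue on one of the other interfaces rather than anything in the application.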
> Many thanks for your time,
> Chris
>
> On Aug 5, 2011, at 6:00 PM, <users-requ...@open-mpi.org> <users-requ...@open-mpi.org> wrote:
>
>> Send users mailing list submissions to
>>     us...@open-mpi.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>>     http://www.open-mpi.org/mailman/listinfo.cgi/users
>> or, via email, send a message with subject or body 'help' to
>>     users-requ...@open-mpi.org
>>
>> You can reach the person managing the list at
>>     users-ow...@open-mpi.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of users digest..."
>>
>>
>> Today's Topics:
>>
>>   1. Re: OpenMPI causing WRF to crash (Jeff Squyres)
>>   2. Re: OpenMPI causing WRF to crash (Anthony Chan)
>>   3. Re: Program hangs on send when run with nodes on remote machine (Jeff Squyres)
>>   4. Re: openmpi 1.2.8 on Xgrid noob issue (Jeff Squyres)
>>   5. Re: parallel I/O on 64-bit indexed arays (Rob Latham)
>>
>> ----------------------------------------------------------------------
>>
>> Message: 1
>> Date: Thu, 4 Aug 2011 19:18:36 -0400
>> From: Jeff Squyres <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] OpenMPI causing WRF to crash
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <3f0e661f-a74f-4e51-86c0-1f84feb07...@cisco.com>
>> Content-Type: text/plain; charset=windows-1252
>>
>> Signal 15 is usually SIGTERM on Linux, meaning that some external entity probably killed the job.
>>
>> The OMPI error message you describe is also typical for that kind of scenario -- i.e., "a process exited without calling MPI_Finalize" could mean that it called exit() or that some external process killed it.
>>
>> On Aug 3, 2011, at 7:24 AM, BasitAli Khan wrote:
>>
>>> I am trying to run a rather heavy WRF simulation with spectral nudging, but the simulation crashes after 1.8 minutes of integration. The simulation has two domains with d01 = 601x601 and d02 = 721x721, and 51 vertical levels. I tried this simulation on two different systems but the result was more or less the same. For example:
>>>
>>> On our BlueGene/P with SUSE Linux Enterprise Server 10 (ppc) and the XLF compiler, I tried to run WRF on 2048 shared-memory nodes (1 compute node = 4 cores, 32 bit, 850 MHz). For the parallel run I used mpixlc, mpixlcxx and mpixlf90. I got the following error message in the wrf.err file:
>>>
>>> <Aug 01 19:50:21.244540> BE_MPI (ERROR): The error message in the job record is as follows:
>>> <Aug 01 19:50:21.244657> BE_MPI (ERROR): "killed with signal 15"
>>>
>>> I also tried to run the same simulation on our Linux cluster (Red Hat Enterprise Linux 5.4, x86_64, Intel compiler) with 8, 16 and 64 nodes (1 compute node = 8 cores). For the parallel run I used mpi/openmpi/1.4.2-intel-11. I got the following error message in the error log after a couple of minutes of integration:
>>>
>>> "mpirun has exited due to process rank 45 with PID 19540 on
>>> node ci118 exiting without calling "finalize". This may
>>> have caused other processes in the application to be
>>> terminated by signals sent by mpirun (as reported here)."
>>>
>>> I tried many things but nothing seems to be working. However, if I reduce the grid points below 200, the simulation runs fine. It appears that Open MPI may have a problem with a large number of grid points, but I have no idea how to fix it. I would greatly appreciate it if you could suggest a solution.
>>>
>>> Best regards,
>>> ---
>>> Basit A. Khan, Ph.D.
>>> Postdoctoral Fellow
>>> Division of Physical Sciences & Engineering
>>> Office# 3204, Level 3, Building 1,
>>> King Abdullah University of Science & Technology
>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
>>> Kingdom of Saudi Arabia.
>>>
>>> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
>>> E-mail: basitali.k...@kaust.edu.sa
>>> Skype name: basit.a.khan
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
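As a minimal illustration of the "exiting without calling finalize" report quoted above - a toy example (unrelated to WRF) in which one rank bails out with exit() before reaching MPI_Finalize:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 1)
            exit(1);   /* this rank never reaches MPI_Finalize */
        MPI_Finalize();
        return 0;
    }

Run under mpirun, rank 1's exit() produces the same kind of "exiting without calling finalize" report, which is why that message alone does not tell you whether the process died on its own or was killed externally.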
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ------------------------------
>>
>> Message: 2
>> Date: Thu, 4 Aug 2011 18:59:59 -0500 (CDT)
>> From: Anthony Chan <c...@mcs.anl.gov>
>> Subject: Re: [OMPI users] OpenMPI causing WRF to crash
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <660521091.191111.1312502399225.javamail.r...@zimbra.anl.gov>
>> Content-Type: text/plain; charset=utf-8
>>
>> If you want to debug this on BG/P, you could set BG_COREDUMPONERROR=1 and look at the backtrace in the lightweight core files (you probably need to recompile everything with -g).
>>
>> A.Chan
>>
>> ----- Original Message -----
>>> Hi Dmitry,
>>> Thanks for a prompt and fairly detailed response. I have also forwarded the email to the WRF community in the hope that somebody would have a straightforward solution. I will try to debug the error as you suggested if I don't have much luck with the WRF forum.
>>>
>>> Cheers,
>>> ---
>>> Basit A. Khan, Ph.D.
>>> Postdoctoral Fellow
>>> Division of Physical Sciences & Engineering
>>> Office# 3204, Level 3, Building 1,
>>> King Abdullah University of Science & Technology
>>> 4700 King Abdullah Blvd, Box 2753, Thuwal 23955-6900,
>>> Kingdom of Saudi Arabia.
>>>
>>> Office: +966(0)2 808 0276, Mobile: +966(0)5 9538 7592
>>> E-mail: basitali.k...@kaust.edu.sa
>>> Skype name: basit.a.khan
>>>
>>> On 8/3/11 2:46 PM, "Dmitry N. Mikushin" <maemar...@gmail.com> wrote:
>>>
>>>> 5 apparently means one of the WRF's MPI processes has been unexpectedly terminated, maybe by program decision. No matter, if it is OpenMPI-specifi
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ------------------------------
>>
>> Message: 3
>> Date: Thu, 4 Aug 2011 20:46:16 -0400
>> From: Jeff Squyres <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] Program hangs on send when run with nodes on remote machine
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <f344f301-ad7b-4e83-b0df-a6e001072...@cisco.com>
>> Content-Type: text/plain; charset=us-ascii
>>
>> I notice that in the worker, you have:
>>
>> eth2      Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d4
>>           inet addr:192.168.1.155  Bcast:192.168.1.255  Mask:255.255.255.0
>>           inet6 addr: fe80::21b:21ff:fe77:c5d4/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:9225846 errors:0 dropped:75175 overruns:0 frame:0
>>           TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:1336628768 (1.3 GB)  TX bytes:552 (552.0 B)
>>
>> eth3      Link encap:Ethernet  HWaddr 00:1b:21:77:c5:d5
>>           inet addr:192.168.1.156  Bcast:192.168.1.255  Mask:255.255.255.0
>>           inet6 addr: fe80::21b:21ff:fe77:c5d5/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:26481809 errors:0 dropped:75059 overruns:0 frame:0
>>           TX packets:18030236 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:70061260271 (70.0 GB)  TX bytes:11844181778 (11.8 GB)
>>
>> Two different NICs are on the same subnet -- that doesn't seem like a good idea...? I think this topic has come up on the users list before, and, IIRC, the general consensus is "don't do that" because it's not clear which NIC Linux will actually send outgoing traffic across bound for the 192.168.1.x subnet.
>>
>> On Aug 4, 2011, at 1:59 PM, Keith Manville wrote:
>>
>>> I am having trouble running my MPI program on multiple nodes. I can run multiple processes on a single node, and I can spawn processes on remote nodes, but when I call Send from a remote node, the call never returns, even though there is an appropriate Recv waiting. I'm pretty sure this is an issue with my configuration, not my code. I've tried some other sample programs I found and had the same problem of hanging on a send from one host to another.
>>>
>>> Here's an in-depth description:
>>>
>>> I wrote a quick test program where each process with rank > 0 sends an int to the master (rank 0), and the master receives until it gets something from every other process.
>>>
>>> My test program works fine when I run multiple processes on a single machine, either the local node:
>>>
>>> $ ./mpirun -n 4 ./mpi-test
>>> Hi I'm localhost:2
>>> Hi I'm localhost:1
>>> localhost:1 sending 11...
>>> localhost:2 sending 12...
>>> localhost:2 sent 12
>>> localhost:1 sent 11
>>> Hi I'm localhost:0
>>> localhost:0 received 11 from 1
>>> localhost:0 received 12 from 2
>>> Hi I'm localhost:3
>>> localhost:3 sending 13...
>>> localhost:3 sent 13
>>> localhost:0 received 13 from 3
>>> all workers checked in!
>>>
>>> or a remote one:
>>>
>>> $ ./mpirun -np 2 -host remotehost ./mpi-test
>>> Hi I'm remotehost:0
>>> remotehost:0 received 11 from 1
>>> all workers checked in!
>>> Hi I'm remotehost:1
>>> remotehost:1 sending 11...
>>> remotehost:1 sent 11
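For reference, a minimal sketch of the kind of check-in test described above - the names, tag and printed messages are made up, not Keith's actual code:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            int i, value;
            MPI_Status status;
            for (i = 1; i < size; i++) {
                /* accept check-ins from the workers in any order */
                MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                         MPI_COMM_WORLD, &status);
                printf("rank 0 received %d from %d\n", value, status.MPI_SOURCE);
            }
            printf("all workers checked in!\n");
        } else {
            int value = 10 + rank;   /* e.g. rank 1 sends 11 */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

When code of this shape hangs in Send/Recv only with the two ranks on different hosts, the TCP path between the hosts (interface selection, routing, firewall) is usually a more likely culprit than the program itself.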
>>>
>>> But when I try to run the master locally and the worker(s) remotely (this is the way I am actually interested in running it), Send never returns and it hangs indefinitely:
>>>
>>> $ ./mpirun -np 2 -host localhost,remotehost ./mpi-test
>>> Hi I'm localhost:0
>>> Hi I'm remotehost:1
>>> remotehost:1 sending 11...
>>>
>>> Just to see if it would work, I tried spawning the master on the remote host and the worker on the local host:
>>>
>>> $ ./mpirun -np 2 -host remotehost,localhost ./mpi-test
>>> Hi I'm localhost:1
>>> localhost:1 sending 11...
>>> localhost:1 sent 11
>>> Hi I'm remotehost:0
>>> remotehost:0 received 0 from 1
>>> all workers checked in!
>>>
>>> It doesn't hang on Send, but the wrong value is received.
>>>
>>> Any idea what's going on? I've attached my code, my config.log, ifconfig output, and ompi_info output.
>>>
>>> Thanks,
>>> Keith
>>> <mpi.tgz>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ------------------------------
>>
>> Message: 4
>> Date: Thu, 4 Aug 2011 20:48:30 -0400
>> From: Jeff Squyres <jsquy...@cisco.com>
>> Subject: Re: [OMPI users] openmpi 1.2.8 on Xgrid noob issue
>> To: Open MPI Users <us...@open-mpi.org>
>> Message-ID: <c2ea7fd0-badb-4d05-851c-c444be26f...@cisco.com>
>> Content-Type: text/plain; charset=us-ascii
>>
>> I'm afraid our Xgrid support has lagged, and Apple hasn't shown much interest in MPI + Xgrid support -- much less HPC. :-\
>>
>> Have you seen the FAQ items about Xgrid?
>>
>> http://www.open-mpi.org/faq/?category=osx#xgrid-howto
>>
>> On Aug 4, 2011, at 4:16 AM, Christopher Jones wrote:
>>
>>> Hi there,
>>>
>>> I'm currently trying to set up a small Xgrid between two Mac Pros (a single quad-core and a 2 x dual-core), where both are directly connected via an ethernet cable. I've set up Xgrid using password authentication (rather than Kerberos), and from what I can tell in the Xgrid admin tool it seems to be working. However, once I try a simple hello world program, I get this error:
>>>
>>> chris-joness-mac-pro:~ chrisjones$ mpirun -np 4 ./test_hello
>>> mpirun noticed that job rank 0 with PID 381 on node xgrid-node-0 exited on signal 15 (Terminated).
>>> 1 additional process aborted (not shown)
>>> 2011-08-04 10:02:16.329 mpirun[350:903] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[NSKVONotifying_XGConnection<0x1001325a0> finalize]: called when collecting not enabled'
>>> *** Call stack at first throw:
>>> (
>>> 0   CoreFoundation        0x00007fff814237b4 __exceptionPreprocess + 180
>>> 1   libobjc.A.dylib       0x00007fff84fe8f03 objc_exception_throw + 45
>>> 2   CoreFoundation        0x00007fff8143e631 -[NSObject(NSObject) finalize] + 129
>>> 3   mca_pls_xgrid.so      0x00000001002a9ce3 -[PlsXGridClient dealloc] + 419
>>> 4   mca_pls_xgrid.so      0x00000001002a9837 orte_pls_xgrid_finalize + 40
>>> 5   libopen-rte.0.dylib   0x000000010002d0f9 orte_pls_base_close + 249
>>> 6   libopen-rte.0.dylib   0x0000000100012027 orte_system_finalize + 119
>>> 7   libopen-rte.0.dylib   0x000000010000e968 orte_finalize + 40
>>> 8   mpirun                0x00000001000011ff orterun + 2042
>>> 9   mpirun                0x0000000100000a03 main + 27
>>> 10  mpirun                0x00000001000009e0 start + 52
>>> 11  ???                   0x0000000000000004 0x0 + 4
>>> )
>>> terminate called after throwing an instance of 'NSException'
>>> [chris-joness-mac-pro:00350] *** Process received signal ***
>>> [chris-joness-mac-pro:00350] Signal: Abort trap (6)
>>> [chris-joness-mac-pro:00350] Signal code: (0)
>>> [chris-joness-mac-pro:00350] [ 0] 2   libSystem.B.dylib    0x00007fff81ca51ba _sigtramp + 26
>>> [chris-joness-mac-pro:00350] [ 1] 3   ???                  0x00000001000cd400 0x0 + 4295808000
>>> [chris-joness-mac-pro:00350] [ 2] 4   libstdc++.6.dylib    0x00007fff830965d2 __tcf_0 + 0
>>> [chris-joness-mac-pro:00350] [ 3] 5   libobjc.A.dylib      0x00007fff84fecb39 _objc_terminate + 100
>>> [chris-joness-mac-pro:00350] [ 4] 6   libstdc++.6.dylib    0x00007fff83094ae1 _ZN10__cxxabiv111__terminateEPFvvE + 11
>>> [chris-joness-mac-pro:00350] [ 5] 7   libstdc++.6.dylib    0x00007fff83094b16 _ZN10__cxxabiv112__unexpectedEPFvvE + 0
>>> [chris-joness-mac-pro:00350] [ 6] 8   libstdc++.6.dylib    0x00007fff83094bfc _ZL23__gxx_exception_cleanup19_Unwind_Reason_CodeP17_Unwind_Exception + 0
>>> [chris-joness-mac-pro:00350] [ 7] 9   libobjc.A.dylib      0x00007fff84fe8fa2 object_getIvar + 0
>>> [chris-joness-mac-pro:00350] [ 8] 10  CoreFoundation       0x00007fff8143e631 -[NSObject(NSObject) finalize] + 129
>>> [chris-joness-mac-pro:00350] [ 9] 11  mca_pls_xgrid.so     0x00000001002a9ce3 -[PlsXGridClient dealloc] + 419
>>> [chris-joness-mac-pro:00350] [10] 12  mca_pls_xgrid.so     0x00000001002a9837 orte_pls_xgrid_finalize + 40
>>> [chris-joness-mac-pro:00350] [11] 13  libopen-rte.0.dylib  0x000000010002d0f9 orte_pls_base_close + 249
>>> [chris-joness-mac-pro:00350] [12] 14  libopen-rte.0.dylib  0x0000000100012027 orte_system_finalize + 119
>>> [chris-joness-mac-pro:00350] [13] 15  libopen-rte.0.dylib  0x000000010000e968 orte_finalize + 40
>>> [chris-joness-mac-pro:00350] [14] 16  mpirun               0x00000001000011ff orterun + 2042
>>> [chris-joness-mac-pro:00350] [15] 17  mpirun               0x0000000100000a03 main + 27
>>> [chris-joness-mac-pro:00350] [16] 18  mpirun               0x00000001000009e0 start + 52
>>> [chris-joness-mac-pro:00350] [17] 19  ???                  0x0000000000000004 0x0 + 4
>>> [chris-joness-mac-pro:00350] *** End of error message ***
>>> Abort trap
>>>
>>> I've seen this error in a previous mailing-list post, and it seems that the issue has something to do with forcing everything to use Kerberos (SSO). However, I noticed that on the computer being used as an agent this option is grayed out in the Xgrid sharing configuration (I have no idea why). I would therefore ask whether it is absolutely necessary to use SSO to get Open MPI to run with Xgrid, or whether I am missing something in the password setup. It seems that the Kerberos option is much more complicated, and I may even want to switch to just using Open MPI with SSH.
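For the plain-SSH route mentioned above, a minimal sketch of what that setup can look like once password-less SSH works in both directions - the hostnames and slot counts below are placeholders:

    # hostfile
    chris-joness-mac-pro.local slots=4
    chrisjones2@allana-welshs-mac-pro.local slots=4

    mpirun -np 8 -hostfile hostfile ./test_hello

With no batch system and no Xgrid controller configured, Open MPI falls back to its rsh/ssh launcher, so nothing beyond the hostfile and working key-based SSH logins is needed.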
>>>
>>> Many thanks,
>>> Chris
>>>
>>>
>>> Chris Jones
>>> Post-doctoral Research Assistant,
>>>
>>> Department of Microbiology
>>> Swedish University of Agricultural Sciences
>>> Uppsala, Sweden
>>> phone: +46 (0)18 67 3222
>>> email: chris.jo...@slu.se
>>>
>>> Department of Soil and Environmental Microbiology
>>> National Institute for Agronomic Research
>>> Dijon, France
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ------------------------------
>>
>> Message: 5
>> Date: Fri, 5 Aug 2011 08:41:58 -0500
>> From: Rob Latham <r...@mcs.anl.gov>
>> Subject: Re: [OMPI users] parallel I/O on 64-bit indexed arays
>> To: Open MPI Users <us...@open-mpi.org>
>> Cc: Quincey Koziol <koz...@hdfgroup.org>, Fab Tillier <ftill...@microsoft.com>
>> Message-ID: <20110805134158.ga28...@mcs.anl.gov>
>> Content-Type: text/plain; charset=us-ascii
>>
>> On Wed, Jul 27, 2011 at 06:13:05PM +0200, Troels Haugboelle wrote:
>>> and we get good (+GB/s) performance when writing files from large runs.
>>>
>>> Interestingly, an alternative and conceptually simpler option is to use MPI_FILE_WRITE_ORDERED, but the performance of that function on Blue-Gene/P sucks - 20 MB/s instead of GB/s. I do not know why.
>>
>> Ordered mode as implemented in ROMIO is awful. Entirely serialized. We pass a token from process to process. Each process acquires the token, updates the shared file pointer, does its I/O, then passes the token to the next process.
>>
>> What we should do, and have done in test branches [1], is use MPI_SCAN to look at the shared file pointer once, tell all the processors their offset, then update the shared file pointer while all processes do I/O in parallel.
>>
>> [1]: Robert Latham, Robert Ross, and Rajeev Thakur. "Implementing MPI-IO Atomic Mode and Shared File Pointers Using MPI One-Sided Communication". International Journal of High Performance Computing Applications, 21(2):132-143, 2007.
>>
>> Since no one uses the shared file pointers, and even fewer people use ordered mode, we just haven't seen the need to do so.
>>
>> Do you want to rebuild your MPI library on BlueGene? I can pretty quickly generate and send a patch that will make ordered mode go whip fast.
>>
>> ==rob
>>
>>>
>>> Troels
>>>
>>> On 6/7/11 15:04, Jeff Squyres wrote:
>>>> On Jun 7, 2011, at 4:53 AM, Troels Haugboelle wrote:
>>>>
>>>>> In principle yes, but the problem is we have an unequal amount of particles on each node, so the length of each array is not guaranteed to be divisible by 2, 4 or any other number. If I have understood the definition of MPI_TYPE_CREATE_SUBARRAY correctly, the offset can be 64-bit, but not the global array size, so, optimally, what I am looking for is something that has unequal size for each thread, simple vector, and with 64-bit offsets and global array size.
>>>>
>>>> It's a bit awkward, but you can still make datatypes to give the offset that you want. E.g., if you need an offset of 2B+31 bytes, you can make datatype A with type contig of N=(2B/sizeof(int)) int's. Then make datatype B with type struct, containing type A and 31 MPI_BYTEs. Then use 1 instance of datatype B to get the offset that you want.
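A rough sketch of that construction in C - the helper name is invented, error handling is omitted, and the extent is pinned explicitly with MPI_Type_create_resized so it is exactly the requested number of bytes:

    #include <mpi.h>

    /* Build a datatype whose extent is 'offset' bytes.  The large part is
     * expressed as a contiguous run of ints, so the int-typed count stays a
     * factor of sizeof(int) below the raw byte count. */
    static MPI_Datatype make_offset_type(MPI_Offset offset)
    {
        MPI_Datatype big, combined, resized;
        MPI_Offset n_ints  = offset / (MPI_Offset)sizeof(int);
        int        leftover = (int)(offset % (MPI_Offset)sizeof(int));

        /* datatype A: contiguous ints covering most of the offset */
        MPI_Type_contiguous((int)n_ints, MPI_INT, &big);

        /* datatype B: struct of [ A, leftover bytes ] */
        int          blocklens[2] = { 1, leftover };
        MPI_Aint     displs[2]    = { 0, (MPI_Aint)(n_ints * (MPI_Offset)sizeof(int)) };
        MPI_Datatype types[2]     = { big, MPI_BYTE };
        MPI_Type_create_struct(2, blocklens, displs, types, &combined);

        /* pin the extent to exactly 'offset' bytes and commit */
        MPI_Type_create_resized(combined, 0, (MPI_Aint)offset, &resized);
        MPI_Type_commit(&resized);

        MPI_Type_free(&big);
        MPI_Type_free(&combined);
        return resized;
    }

Calling MPI_Type_get_extent on the committed type is a cheap way to double-check that it really spans the intended 64-bit offset.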
>>>>
>>>> You could make utility functions that, given a specific (64-bit) offset, build an MPI datatype that matches the offset and then free it (and all sub-datatypes) afterwards.
>>>>
>>>> There is a bit of overhead in creating these datatypes, but it should be dwarfed by the amount of data that you're reading/writing, right?
>>>>
>>>> It's awkward, but it should work.
>>>>
>>>>> Another possible workaround would be to identify subsections that do not exceed 2B elements, make sub-communicators, and then let each of them dump their elements with proper offsets. It may work. The problematic architecture is a BG/P. On other clusters, doing simple I/O - letting all threads open the file, seek to their position, and then write their chunk - works fine, but somehow on BG/P performance drops dramatically. My guess is that there is some file locking, or we are overwhelming the I/O nodes.
>>>>>
>>>>>> This ticket for the MPI-3 standard is a first step in the right direction, but won't do everything you need (this is more FYI):
>>>>>>
>>>>>> https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/265
>>>>>>
>>>>>> See the PDF attached to the ticket; it's going up for a "first reading" in a month. It'll hopefully be part of the MPI-3 standard by the end of the year (Fab Tillier, CC'ed, has been the chief proponent of this ticket for the past several months).
>>>>>>
>>>>>> Quincey Koziol from the HDF Group is going to propose a follow-on to this ticket, specifically about the case you're referring to -- large counts for file functions and datatype constructors. Quincey -- can you expand on what you'll be proposing, perchance?
>>>>>
>>>>> Interesting - I think something along the lines of the note would be very useful and needed for large applications.
>>>>>
>>>>> Thanks a lot for the pointers and your suggestions,
>>>>>
>>>>> cheers,
>>>>>
>>>>> Troels
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> --
>> Rob Latham
>> Mathematics and Computer Science Division
>> Argonne National Lab, IL USA
>>
>>
>> ------------------------------
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> End of users Digest, Vol 1977, Issue 1
>> **************************************
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users