I have verified that disabling UAC does not fix the problem. xhlp.exe starts and threads spin up on both machines, CPU usage sits at 80-90%, but no progress is ever made. From this state, pressing Ctrl-Break on the head node yields the following output:
[REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to lifeline [[20816,0],0] lost

> From: users-requ...@open-mpi.org
> Subject: users Digest, Vol 1911, Issue 1
> To: us...@open-mpi.org
> Date: Fri, 20 May 2011 08:14:13 -0400
>
> Send users mailing list submissions to
>     us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>     users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>     users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>    1. Re: Error: Entry Point Not Found (Zhangping Wei)
>    2. Re: Problem with MPI_Request, MPI_Isend/recv and MPI_Wait/Test (George Bosilca)
>    3. Re: v1.5.3-x64 does not work on Windows 7 workgroup (Jeff Squyres)
>    4. Re: Error: Entry Point Not Found (Jeff Squyres)
>    5. Re: openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0) (Jeff Squyres)
>    6. Re: Openib with > 32 cores per node (Jeff Squyres)
>    7. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1 (Jeff Squyres)
>    8. Re: Trouble with MPI-IO (Jeff Squyres)
>    9. Re: Trouble with MPI-IO (Tom Rosmond)
>   10. Re: Problem with MPI_Request, MPI_Isend/recv and MPI_Wait/Test (David Büttner)
>   11. Re: Trouble with MPI-IO (Jeff Squyres)
>   12. Re: MPI_Alltoallv function crashes when np > 100 (Jeff Squyres)
>   13. Re: MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only sometimes... (Jeff Squyres)
>   14. Re: Trouble with MPI-IO (Jeff Squyres)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 May 2011 09:13:53 -0700 (PDT)
> From: Zhangping Wei <zhangping_...@yahoo.com>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: us...@open-mpi.org
> Message-ID: <101342.7961...@web111818.mail.gq1.yahoo.com>
> Content-Type: text/plain; charset="gb2312"
>
> Dear Paul,
>
> I checked the way 'mpirun -np N <cmd>' you mentioned, but it was the same problem.
>
> I guess it may related to the system I used, because I have used it correctly in another XP 32 bit system.
>
> I look forward to more advice. Thanks.
>
> Zhangping
>
>
> ________________________________
> From: "users-requ...@open-mpi.org" <users-requ...@open-mpi.org>
> To: us...@open-mpi.org
> Sent:
2011/5/19 (????) 11:00:02 ???? > ?? ???? users Digest, Vol 1910, Issue 2 > > Send users mailing list submissions to > us...@open-mpi.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.open-mpi.org/mailman/listinfo.cgi/users > or, via email, send a message with subject or body 'help' to > users-requ...@open-mpi.org > > You can reach the person managing the list at > users-ow...@open-mpi.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of users digest..." > > > Today's Topics: > > 1. Re: Error: Entry Point Not Found (Paul van der Walt) > 2. Re: Openib with > 32 cores per node (Robert Horton) > 3. Re: Openib with > 32 cores per node (Samuel K. Gutierrez) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Thu, 19 May 2011 16:14:02 +0100 > From: Paul van der Walt <p...@denknerd.nl> > Subject: Re: [OMPI users] Error: Entry Point Not Found > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <banlktinjz0cntchqjczyhfgsnr51jpu...@mail.gmail.com> > Content-Type: text/plain; charset=UTF-8 > > Hi, > > On 19 May 2011 15:54, Zhangping Wei <zhangping_...@yahoo.com> wrote: > > 4, I use command window to run it in this way: ?mpirun ?n 4 ?**.exe ?,then I > > Probably not the problem, but shouldn't that be 'mpirun -np N <cmd>' ? > > Paul > > -- > O< ascii ribbon campaign - stop html mail - www.asciiribbon.org > > > > ------------------------------ > > Message: 2 > Date: Thu, 19 May 2011 16:37:56 +0100 > From: Robert Horton <r.hor...@qmul.ac.uk> > Subject: Re: [OMPI users] Openib with > 32 cores per node > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <1305819476.9663.148.camel@moelwyn> > Content-Type: text/plain; charset="UTF-8" > > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: > > Hi, > > > > Try the following QP parameters that only use shared receive queues. > > > > -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 > > > > Thanks for that. If I run the job over 2 x 48 cores it now works and the > performance seems reasonable (I need to do some more tuning) but when I > go up to 4 x 48 cores I'm getting the same problem: > > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [compute-1-7.local:18106] *** An error occurred in MPI_Isend > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > abort) > > Any thoughts? > > Thanks, > Rob > -- > Robert Horton > System Administrator (Research Support) - School of Mathematical Sciences > Queen Mary, University of London > r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345 > > > > ------------------------------ > > Message: 3 > Date: Thu, 19 May 2011 09:59:13 -0600 > From: "Samuel K. Gutierrez" <sam...@lanl.gov> > Subject: Re: [OMPI users] Openib with > 32 cores per node > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <b3e83138-9af0-48c0-871c-dbbb2e712...@lanl.gov> > Content-Type: text/plain; charset=us-ascii > > Hi, > > On May 19, 2011, at 9:37 AM, Robert Horton wrote > > > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: > >> Hi, > >> > >> Try the following QP parameters that only use shared receive queues. 
> >> > >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 > >> > > > > Thanks for that. If I run the job over 2 x 48 cores it now works and the > > performance seems reasonable (I need to do some more tuning) but when I > > go up to 4 x 48 cores I'm getting the same problem: > > > >[compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > >] error creating qp errno says Cannot allocate memory > > [compute-1-7.local:18106] *** An error occurred in MPI_Isend > > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > >abort) > > > > Any thoughts? > > How much memory does each node have? Does this happen at startup? > > Try adding: > > -mca btl_openib_cpc_include rdmacm > > I'm not sure if your version of OFED supports this feature, but maybe using > XRC > may help. I **think** other tweaks are needed to get this going, but I'm not > familiar with the details. > > Hope that helps, > > Samuel K. Gutierrez > Los Alamos National Laboratory > > > > > > Thanks, > > Rob > > -- > > Robert Horton > > System Administrator (Research Support) - School of Mathematical Sciences > > Queen Mary, University of London > > r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345 > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > ------------------------------ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 1910, Issue 2 > ************************************** > -------------- next part -------------- > HTML attachment scrubbed and removed > > ------------------------------ > > Message: 2 > Date: Thu, 19 May 2011 08:48:03 -0800 > From: George Bosilca <bosi...@eecs.utk.edu> > Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and > MPI_Wait/Test > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <fcac66f9-fdb5-48bb-a800-263d8a4f9...@eecs.utk.edu> > Content-Type: text/plain; charset=iso-8859-1 > > David, > > I do not see any mechanism for protecting the accesses to the requests to a > single thread? What is the thread model you're using? > > >From an implementation perspective, your code is correct only if you > >initialize the MPI library with MPI_THREAD_MULTIPLE and if the library > >accepts. Otherwise, there is an assumption that the application is single > >threaded, or that the MPI behavior is implementation dependent. Please read > >the MPI standard regarding to MPI_Init_thread for more details. > > Regards, > george. > > On May 19, 2011, at 02:34 , David B?ttner wrote: > > > Hello, > > > > I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using > > MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check if > > it is done. I do this repeatedly in the outer loop of my code. The MPI_Test > > is used in the inner loop to check if some function can be called which > > depends on the received data. > > The program regularly crashed (only when not using printf...) and after > > debugging it I figured out the following problem: > > > > In MPI_Isend I have an invalid read of memory. 
I fixed the problem with not > > re-using a > > > > MPI_Request req_s, req_r; > > > > but by using > > > > MPI_Request* req_s; > > MPI_Request* req_r > > > > and re-allocating them before the MPI_Isend/recv. > > > > The documentation says, that in MPI_Wait and MPI_Test (if successful) the > > request-objects are deallocated and set to MPI_REQUEST_NULL. > > It also says, that in MPI_Isend and MPI_Irecv, it allocates the Objects and > > associates it with the request object. > > > > As I understand this, this either means I can use a pointer to MPI_Request > > which I don't have to initialize for this (it doesn't work but crashes), or > > that I can use a MPI_Request pointer which I have initialized with > > malloc(sizeof(MPI_REQUEST)) (or passing the address of a MPI_Request req), > > which is set and unset in the functions. But this version crashes, too. > > What works is using a pointer, which I allocate before the MPI_Isend/recv > > and which I free after MPI_Wait in every iteration. In other words: It only > > uses if I don't reuse any kind of MPI_Request. Only if I recreate one every > > time. > > > > Is this, what is should be like? I believe that a reuse of the memory would > > be a lot more efficient (less calls to malloc...). Am I missing something > > here? Or am I doing something wrong? > > > > > > Let me provide some more detailed information about my problem: > > > > I am running the program on a 30 node infiniband cluster. Each node has 4 > > single core Opteron CPUs. I am running 1 MPI Rank per node and 4 threads > > per rank (-> one thread per core). > > I am compiling with mpicc of OpenMPI using gcc below. > > Some pseudo-code of the program can be found at the end of this e-mail. > > > > I was able to reproduce the problem using different amount of nodes and > > even using one node only. The problem does not arise when I put > > printf-debugging information into the code. This pointed me into the > > direction that I have some memory problem, where some write accesses some > > memory it is not supposed to. > > I ran the tests using valgrind with --leak-check=full and > > --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait > > depending on whether I had the threads spin in a loop for MPI_Test to > > return success or used MPI_Wait respectively. > > > > I would appreciate your help with this. Am I missing something important > > here? Is there a way to re-use the request in the different iterations > > other than I thought it should work? > > Or is there a way to re-initialize the allocated memory before the > > MPI_Isend/recv so that I at least don't have to call free and malloc each > > time? > > > > Thank you very much for your help! > > Kind regards, > > David B?ttner > > > > _____________________ > > Pseudo-Code of program: > > > > MPI_Request* req_s; > > MPI_Request* req_w; > > OUTER-LOOP > > if(0 == threadid) > > { > > req_s = malloc(sizeof(MPI_Request)); > > req_r = malloc(sizeof(MPI_Request)); > > MPI_Isend(..., req_s) > > MPI_Irecv(..., req_r) > > } > > pthread_barrier > > INNER-LOOP (while NOT_DONE or RET) > > if(TRYLOCK && NOT_DONE) > > { > > if(MPI_TEST(req_r)) > > { > > Call_Function_A; > > NOT_DONE = 0; > > } > > > > } > > RET = Call_Function_B; > > } > > pthread_barrier_wait > > if(0 == threadid) > > { > > MPI_WAIT(req_s) > > MPI_WAIT(req_r) > > free(req_s); > > free(req_r); > > } > > _____________ > > > > > > -- > > David B?ttner, Informatik, Technische Universit?t M?nchen > > TUM I-10 - FMI 01.06.059 - Tel. 
089 / 289-17676 > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > "To preserve the freedom of the human mind then and freedom of the press, > every spirit should be ready to devote itself to martyrdom; for as long as we > may think as we will, and speak as we think, the condition of man will > proceed in improvement." > -- Thomas Jefferson, 1799 > > > > > ------------------------------ > > Message: 3 > Date: Thu, 19 May 2011 21:22:48 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 > workgroup > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <278274f0-bf00-4498-950f-9779e0083...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > Unfortunately, our Windows guy (Shiqing) is off getting married and will be > out for a little while. :-( > > All that I can cite is the README.WINDOWS.txt file in the top-level > directory. I'm afraid that I don't know much else about Windows. :-( > > > On May 18, 2011, at 8:17 PM, Jason Mackay wrote: > > > Hi all, > > > > My thanks to all those involved for putting together this Windows binary > > release of OpenMPI! I am hoping to use it in a small Windows based OpenMPI > > cluster at home. > > > > Unfortunately my experience so far has not exactly been trouble free. It > > seems that, due to the fact that this release is using WMI, there are a > > number of settings that must be configured on the machines in order to get > > this to work. These settings are not documented in the distribution at all. > > I have been experimenting with it for over a week on and off and as soon as > > I solve one problem, another one arises. > > > > Currently, after much searching, reading, and tinkering with DCOM settings > > etc..., I can remotely start processes on all my machines using mpirun but > > those processes cannot access network shares (e.g. for binary distribution) > > and HPL (which works on any one node) does not seem to work if I run it > > across multiple nodes, also indicating a network issue (CPU sits at 100% in > > all processes with no network traffic and never terminates). To eliminate > > premission issues that may be caused by UAC I tried the same setup on two > > domain machines using an administrative account to launch and the behavior > > was the same. I have read that WMI processes cannot access network > > resources and I am at a loss for a solution to this newest of problems. If > > anyone knows how to make this work I would appreciate the help. I assume > > that someone has gotten this working and has the answers. > > > > I have searched the mailing list archives and I found other users with > > similar problems but no clear guidance on the threads. Some threads make > > references to Microsoft KB articles but do not explicitly tell the user > > what needs to be done, leaving each new user to rediscover the tricks on > > their own. One thread made it appear that testing had only been done on > > Windows XP. Needless to say, security has changed dramatically in Windows > > since XP! > > > > I would like to see OpenMPI for Windows be usable by a newcomer without all > > of this pain. > > > > What would be fantastic would be: > > 1) a step-by-step procedure for how to get OpenMPI 1.5 working on Windows > > a) preferably in a bare Windows 7 workgroup environment with nothing else > > (i.e. no Microsoft Cluster Compute Pack, no domain etc...) 
> > 2) inclusion of these steps in the binary distribution > > 3) bonus points for a script which accomplishes these things automatically > > > > If someone can help with (1), I would happily volunteer my time to work on > > (3). > > > > Regards, > > Jason > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 4 > Date: Thu, 19 May 2011 21:26:43 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Error: Entry Point Not Found > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <f830ec35-fc9b-4801-b2a3-50f54d215...@cisco.com> > Content-Type: text/plain; charset=windows-1252 > > On May 19, 2011, at 10:54 AM, Zhangping Wei wrote: > > > 4, I use command window to run it in this way: ?mpirun ?n 4 **.exe ?,then I > > met the error: ?entry point not found: the procedure entry point inet_pton > > could not be located in the dynamic link library WS2_32.dll? > > Unfortunately our Windows developer/maintainer is out for a little while > (he's getting married); he pretty much did the Windows stuff by himself, so > none of the rest of us know much about it. :( > > inet_pton is a standard function call relating to IP addresses that we use in > the internals of OMPI; I'm not sure why it wouldn't be found on Windows XP > (Shiqing did cite that the OMPI Windows port should work on Windows XP). > > This post seems to imply that inet_ntop is only available on Vista and above: > > http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/e40465f2-41b7-4243-ad33-15ae9366f4e6/ > > So perhaps Shiqing needs to put in some kind of portability workaround for > OMPI, and the current binaries won't actually work for XP...? > > I can't say that for sure because I really know very little about Windows; > we'll unfortunately have to wait until he returns to get a definitive answer. > :-( > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 5 > Date: Thu, 19 May 2011 21:37:49 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer > XE 2011 (aka 12.0) > To: Open MPI Users <us...@open-mpi.org> > Cc: Giovanni Bracco <giovanni.bra...@enea.it>, Agostino Funel > <agostino.fu...@enea.it>, Fiorenzo Ambrosino > <fiorenzo.ambros...@enea.it>, Guido Guarnieri > <guido.guarni...@enea.it>, Roberto Ciavarella > <roberto.ciavare...@enea.it>, Salvatore Podda > <salvatore.po...@enea.it>, Giovanni Ponti <giovanni.po...@enea.it> > Message-ID: <45362608-b8b0-4ade-9959-b35c5690a...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > Sorry for the late reply. > > Other users have seen something similar but we have never been able to > reproduce it. Is this only when using IB? If you use "mpirun --mca > btl_openib_cpc_if_include rdmacm", does the problem go away? > > > On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote: > > > I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only > > when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the > > collectives hangs go away. 
I don't know what, if anything, the higher > > optimization buys you when compiling openmpi, so I'm not sure if that's an > > acceptable workaround or not. > > > > My system is similar to yours - Intel X5570 with QDR Mellanox IB running > > RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 > > with a single iteration of Barrier to reproduce the hang, and it happens > > 100% of the time for me when I invoke it like this: > > > > # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier > > > > The hang happens on the first Barrier (64 ranks) and each of the > > participating ranks have this backtrace: > > > > __poll (...) > > poll_dispatch () from [instdir]/lib/libopen-pal.so.0 > > opal_event_loop () from [instdir]/lib/libopen-pal.so.0 > > opal_progress () from [instdir]/lib/libopen-pal.so.0 > > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_recursivedoubling () from > > [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0 > > PMPI_Barrier () from [instdir]/lib/libmpi.so.0 > > IMB_barrier () > > IMB_init_buffers_iter () > > main () > > > > The one non-participating rank has this backtrace: > > > > __poll (...) > > poll_dispatch () from [instdir]/lib/libopen-pal.so.0 > > opal_event_loop () from [instdir]/lib/libopen-pal.so.0 > > opal_progress () from [instdir]/lib/libopen-pal.so.0 > > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0 > > PMPI_Barrier () from [instdir]/lib/libmpi.so.0 > > main () > > > > If I use more nodes I can get it to hang with 1ppn, so that seems to rule > > out the sm btl (or interactions with it) as a culprit at least. > > > > I can't reproduce this with openmpi 1.5.3, interestingly. > > > > -Marcus > > > > > > On 05/10/2011 03:37 AM, Salvatore Podda wrote: > >> Dear all, > >> > >> we succeed in building several version of openmpi from 1.2.8 to 1.4.3 > >> with Intel composer XE 2011 (aka 12.0). > >> However we found a threshold in the number of cores (depending from the > >> application: IMB, xhpl or user applications > >> and form the number of required cores) above which the application hangs > >> (sort of deadlocks). > >> The building of openmpi with 'gcc' and 'pgi' does not show the same limits. > >> There are any known incompatibilities of openmpi with this version of > >> intel compiilers? > >> > >> The characteristics of our computational infrastructure are: > >> > >> Intel processors E7330, E5345, E5530 e E5620 > >> > >> CentOS 5.3, CentOS 5.5. > >> > >> Intel composer XE 2011 > >> gcc 4.1.2 > >> pgi 10.2-1 > >> > >> Regards > >> > >> Salvatore Podda > >> > >> ENEA UTICT-HPC > >> Department for Computer Science Development and ICT > >> Facilities Laboratory for Science and High Performace Computing > >> C.R. Frascati > >> Via E. 
Fermi, 45 > >> PoBox 65 > >> 00044 Frascati (Rome) > >> Italy > >> > >> Tel: +39 06 9400 5342 > >> Fax: +39 06 9400 5551 > >> Fax: +39 06 9400 5735 > >> E-mail: salvatore.po...@enea.it > >> Home Page: www.cresco.enea.it > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 6 > Date: Thu, 19 May 2011 22:01:00 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Openib with > 32 cores per node > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <c18c4827-d305-484a-9dae-290902d40...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > What Sam is alluding to is that the OpenFabrics driver code in OMPI is > sucking up oodles of memory for each IB connection that you're using. The > receive_queues param that he sent tells OMPI to use all shared receive queues > (instead of defaulting to one per-peer receive queue and the rest shared > receive queues -- the per-peer RQ sucks up all the memory when you multiple > it by N peers). > > > On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote: > > > Hi, > > > > On May 19, 2011, at 9:37 AM, Robert Horton wrote > > > >> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: > >>> Hi, > >>> > >>> Try the following QP parameters that only use shared receive queues. > >>> > >>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 > >>> > >> > >> Thanks for that. If I run the job over 2 x 48 cores it now works and the > >> performance seems reasonable (I need to do some more tuning) but when I > >> go up to 4 x 48 cores I'm getting the same problem: > >> > >> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > >> error creating qp errno says Cannot allocate memory > >> [compute-1-7.local:18106] *** An error occurred in MPI_Isend > >> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > >> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > >> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > >> abort) > >> > >> Any thoughts? > > > > How much memory does each node have? Does this happen at startup? > > > > Try adding: > > > > -mca btl_openib_cpc_include rdmacm > > > > I'm not sure if your version of OFED supports this feature, but maybe using > > XRC may help. I **think** other tweaks are needed to get this going, but > > I'm not familiar with the details. > > > > Hope that helps, > > > > Samuel K. 
Gutierrez
> > Los Alamos National Laboratory
> >
> >> Thanks,
> >> Rob
> >> --
> >> Robert Horton
> >> System Administrator (Research Support) - School of Mathematical Sciences
> >> Queen Mary, University of London
> >> r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 19 May 2011 22:04:46 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <0dcf20b8-ca5c-4746-8187-a2dff39b1...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On May 13, 2011, at 8:31 AM, francoise.r...@obs.ujf-grenoble.fr wrote:
>
> > Here is the MUMPS portion of code (in zmumps_part1.F file) where the slaves
> > call MPI_COMM_DUP , id%PAR and MASTER are initialized to 0 before :
> >
> > CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>
> I re-indented so that I could read it better:
>
>       CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>       IF ( id%PAR .eq. 0 ) THEN
>          IF ( id%MYID .eq. MASTER ) THEN
>             color = MPI_UNDEFINED
>          ELSE
>             color = 0
>          END IF
>          CALL MPI_COMM_SPLIT( id%COMM, color, 0,
>      &        id%COMM_NODES, IERR )
>          id%NSLAVES = id%NPROCS - 1
>       ELSE
>          CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
>          id%NSLAVES = id%NPROCS
>       END IF
>
>       IF (id%PAR .ne. 0 .or. id%MYID .NE. MASTER) THEN
>          CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
>       ENDIF
>
> That doesn't look right -- both MPI_COMM_SPLIT and MPI_COMM_DUP are
> collective, meaning that all processes in the communicator must call them.
> In the first case, only some processes are calling MPI_COMM_SPLIT. Is there
> some other logic that forces the rest of the processes to call
> MPI_COMM_SPLIT, too?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
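As an aside on the point above: because MPI_Comm_split is collective over the parent communicator, the usual way to leave a rank out of the new communicator is to have that rank make the call anyway, passing MPI_UNDEFINED as its color so it simply receives MPI_COMM_NULL. A minimal C sketch of that pattern (hypothetical, not taken from the MUMPS sources):

  /* Every rank in MPI_COMM_WORLD calls MPI_Comm_split; rank 0 is excluded
   * from the new "slaves" communicator by passing MPI_UNDEFINED. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int color = (rank == 0) ? MPI_UNDEFINED : 0;
      MPI_Comm slaves;
      MPI_Comm_split(MPI_COMM_WORLD, color, 0, &slaves);  /* collective: all ranks call it */

      if (slaves != MPI_COMM_NULL) {
          /* ... slave-only work on the new communicator ... */
          MPI_Comm_free(&slaves);
      }

      MPI_Finalize();
      return 0;
  }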
>
> ------------------------------
>
> Message: 8
> Date: Thu, 19 May 2011 22:30:03 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <eefb638f-72f1-4208-8ea2-4f25f610c...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Props for that testio script. I think you win the award for "most easy to
> reproduce test case." :-)
>
> I notice that some of the lines went over 72 columns, so I renamed the file
> x.f90 and changed all the comments from "c" to "!" and joined the two
> &-split lines. The error about implicit type for lenr went away, but then
> when I enabled better type checking by using "use mpi" instead of "include
> 'mpif.h'", I got the following:
>
> x.f90:99.77:
>
>   call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
>                                                                         1
> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> I looked at our mpi F90 module and see the following:
>
>   interface MPI_Type_indexed
>     subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
>       integer, intent(in) :: count
>       integer, dimension(*), intent(in) :: array_of_blocklengths
>       integer, dimension(*), intent(in) :: array_of_displacements
>       integer, intent(in) :: oldtype
>       integer, intent(out) :: newtype
>       integer, intent(out) :: ierr
>     end subroutine MPI_Type_indexed
>   end interface
>
> I don't quite grok the syntax of the "allocatable" type ijdisp, so that might
> be the problem here...?
>
> Regardless, I'm not entirely sure if the problem is the >72 character lines,
> but then when that is gone, I'm not sure how the allocatable stuff fits in...
> (I'm not enough of a Fortran programmer to know)
>
>
> On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
>
> > I would appreciate someone with experience with MPI-IO look at the
> > simple fortran program gzipped and attached to this note. It is
> > imbedded in a script so that all that is necessary to run it is do:
> > 'testio' from the command line. The program generates a small 2-D input
> > array, sets up an MPI-IO environment, and write a 2-D output array
> > twice, with the only difference being the displacement arrays used to
> > construct the indexed datatype. For the first write, simple
> > monotonically increasing displacements are used, for the second the
> > displacements are 'shuffled' in one dimension. They are printed during
> > the run.
> >
> > For the first case the file is written properly, but for the second the
> > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > Although the program is compiled as an mpi program, I am running on a
> > single processor, which makes the problem more puzzling.
> >
> > The program should be relatively self-explanatory, but if more
> > information is needed, please ask. I am on an 8 core Xeon based Dell
> > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> >
> > T. Rosmond
> >
> > <testio.gz><info_ompi.gz>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 19 May 2011 20:24:25 -0700
> From: Tom Rosmond <rosm...@reachone.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <1305861865.4284.104.ca...@cedar.reachone.com>
> Content-Type: text/plain
>
> Thanks for looking at my problem. Sounds like you did reproduce my
> problem. I have added some comments below
>
> On Thu, 2011-05-19 at 22:30 -0400, Jeff Squyres wrote:
> > Props for that testio script. I think you win the award for "most easy to
> > reproduce test case." :-)
> >
> > I notice that some of the lines went over 72 columns, so I renamed the file
> > x.f90 and changed all the comments from "c" to "!"
and joined the two > > &-split lines. The error about implicit type for lenr went away, but then > > when I enabled better type checking by using "use mpi" instead of "include > > 'mpif.h'", I got the following: > > What fortran compiler did you use? > > In the original script my Intel compile used the -132 option, > allowing up to that many columns per line. I still think in > F77 fortran much of the time, and use 'c' for comments out > of habit. The change to '!' doesn't make any difference. > > > > x.f90:99.77: > > > > call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr) > > 1 > > Error: There is no specific subroutine for the generic 'mpi_type_indexed' > > at (1) > > Hmmm, very strange, since I am looking right at the MPI standard > documents with that routine documented. I too get this compile failure > when I switch to 'use mpi'. Could that be a problem with the Open MPI > fortran libraries??? > > > > I looked at our mpi F90 module and see the following: > > > > interface MPI_Type_indexed > > subroutine MPI_Type_indexed(count, array_of_blocklengths, > > array_of_displacements, oldtype, newtype, ierr) > > integer, intent(in) :: count > > integer, dimension(*), intent(in) :: array_of_blocklengths > > integer, dimension(*), intent(in) :: array_of_displacements > > integer, intent(in) :: oldtype > > integer, intent(out) :: newtype > > integer, intent(out) :: ierr > > end subroutine MPI_Type_indexed > > end interface > > > > I don't quite grok the syntax of the "allocatable" type ijdisp, so that > > might be the problem here...? > > Just a standard F90 'allocatable' statement. I've written thousands > just like it. > > > > Regardless, I'm not entirely sure if the problem is the >72 character > > lines, but then when that is gone, I'm not sure how the allocatable stuff > > fits in... (I'm not enough of a Fortran programmer to know) > > > Anyone else out that who can comment???? > > > T. Rosmond > > > > > > > On May 10, 2011, at 7:14 PM, Tom Rosmond wrote: > > > > > I would appreciate someone with experience with MPI-IO look at the > > > simple fortran program gzipped and attached to this note. It is > > > imbedded in a script so that all that is necessary to run it is do: > > > 'testio' from the command line. The program generates a small 2-D input > > > array, sets up an MPI-IO environment, and write a 2-D output array > > > twice, with the only difference being the displacement arrays used to > > > construct the indexed datatype. For the first write, simple > > > monotonically increasing displacements are used, for the second the > > > displacements are 'shuffled' in one dimension. They are printed during > > > the run. > > > > > > For the first case the file is written properly, but for the second the > > > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually. > > > Although the program is compiled as an mpi program, I am running on a > > > single processor, which makes the problem more puzzling. > > > > > > The program should be relatively self-explanatory, but if more > > > information is needed, please ask. I am on an 8 core Xeon based Dell > > > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and > > > OpenMPI 1.5.3. I have also attached output from 'ompi_info'. > > > > > > T. 
Rosmond > > > > > > > > > <testio.gz><info_ompi.gz>_______________________________________________ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > > ------------------------------ > > Message: 10 > Date: Fri, 20 May 2011 09:25:14 +0200 > From: David B?ttner <david.buett...@in.tum.de> > Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and > MPI_Wait/Test > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <4dd6175a.1080...@in.tum.de> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hello, > > thanks for the quick answer. I am sorry that I forgot to mention this: I > did compile OpenMPI with MPI_THREAD_MULTIPLE support and test if > required == provided after the MPI_Thread_init call. > > > I do not see any mechanism for protecting the accesses to the requests to a > > single thread? What is the thread model you're using? > > > Again I am sorry that this was not clear: In the pseudo code below I > wanted to indicate the access-protection I do by thread-id dependent > calls if(0 == thread-id) and by using the trylock(...) (using > pthread-mutexes). In the code all accesses concerning one MPI_Request > (which are pthread-global-pointers in my case) are protected and called > in sequential order, i.e. MPI_Isend/recv is returns before any thread is > allowed to call the corresponding MPI_Test and no-one can call MPI_Test > any more when a thread is allowed to call MPI_Wait. > I did this in the same manner before with other MPI implementations, but > also on the same machine with the same (untouched) OpenMPI > implementation, also using pthreads and MPI in combination, but I used > > MPI_Request req; > > instead of > > MPI_Request* req; > (and later) > req = (MPI_Request*)malloc(sizeof(MPI_Request)); > > > In my recent (problem) code, I also tried not using pointers, but got > the same problem. Also, as I described in the first mail, I tried > everything concerning the memory allocation of the MPI_Request objects. > I tried not calling malloc. This I guessed wouldn't work, but the > OpenMPI documentation says this: > > " Nonblocking calls allocate a communication request object and > associate it with the request handle the argument request). " > [http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php] and > > " [...] if the communication object was created by a nonblocking send or > receive, then it is deallocated and the request handle is set to > MPI_REQUEST_NULL." > [http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php] and (in slightly > different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php] > > So I thought that it might do some kind of optimized memory stuff > internally. > > I also tried allocating req (for each used MPI_Request) once before the > first use and deallocation after the last use (which I thought was the > way it was supposed to work), but that crashes also. > > I tried replacing the pointers through global variables > > MPI_Request req; > > which didn't do the job... > > The only thing that seems to work is what I mentioned below: Allocate > every time I am going to need it in the MPI_Isend/recv, use it in > MPI_Test/Wait and after that deallocate it by hand each time. > I don't think that this is supposed to be like this since I have to do a > call to malloc and free so often (for multiple MPI_Request objects in > each iteration) that it will most likely limit performance... 
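For reference, a minimal self-contained C sketch of the request-reuse pattern under discussion (hypothetical buffers, peers, and iteration count; not taken from the code in this thread): a plain MPI_Request variable is passed to MPI_Isend/MPI_Irecv, and a successful MPI_Test or MPI_Wait resets it to MPI_REQUEST_NULL so the same variable can be handed to the next nonblocking call, with no malloc/free per iteration.

  /* Sketch only: requests MPI_THREAD_MULTIPLE and reuses stack-resident
   * MPI_Request variables across iterations.  Peer, buffer, and loop count
   * are made up for illustration. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int provided, rank, size;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE) {
          fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int next = (rank + 1) % size;             /* hypothetical ring exchange */
      int prev = (rank + size - 1) % size;
      int sendbuf = rank, recvbuf = -1;
      MPI_Request req_s = MPI_REQUEST_NULL;     /* reused every iteration */
      MPI_Request req_r = MPI_REQUEST_NULL;

      for (int iter = 0; iter < 10; iter++) {
          MPI_Isend(&sendbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &req_s);
          MPI_Irecv(&recvbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &req_r);

          int done = 0;
          while (!done) {
              /* In a threaded code, only one thread at a time may touch req_r. */
              MPI_Test(&req_r, &done, MPI_STATUS_IGNORE);
              /* ... overlap other work here ... */
          }
          MPI_Wait(&req_s, MPI_STATUS_IGNORE);
          /* Both requests are now MPI_REQUEST_NULL and ready for reuse. */
      }

      MPI_Finalize();
      return 0;
  }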
> > Anyway I still have the same problem and am still unclear on what kind > of memory allocation I should be doing for the MPI_Requests. Is there > anything else (besides MPI_THREAD_MULTIPLE support, thread access > control, sequential order of MPI_Isend/recv, MPI_Test and MPI_Wait for > one MPI_Request object) I need to take care of? If not, what could I do > to find the source of my problem? > > Thanks again for any kind of help! > > Kind regards, > David > > > > > > From an implementation perspective, your code is correct only if you > > > initialize the MPI library with MPI_THREAD_MULTIPLE and if the library > > > accepts. Otherwise, there is an assumption that the application is single > > > threaded, or that the MPI behavior is implementation dependent. Please > > > read the MPI standard regarding to MPI_Init_thread for more details. > > > > Regards, > > george. > > > > On May 19, 2011, at 02:34 , David B?ttner wrote: > > > >> Hello, > >> > >> I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using > >> MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check > >> if it is done. I do this repeatedly in the outer loop of my code. The > >> MPI_Test is used in the inner loop to check if some function can be called > >> which depends on the received data. > >> The program regularly crashed (only when not using printf...) and after > >> debugging it I figured out the following problem: > >> > >> In MPI_Isend I have an invalid read of memory. I fixed the problem with > >> not re-using a > >> > >> MPI_Request req_s, req_r; > >> > >> but by using > >> > >> MPI_Request* req_s; > >> MPI_Request* req_r > >> > >> and re-allocating them before the MPI_Isend/recv. > >> > >> The documentation says, that in MPI_Wait and MPI_Test (if successful) the > >> request-objects are deallocated and set to MPI_REQUEST_NULL. > >> It also says, that in MPI_Isend and MPI_Irecv, it allocates the Objects > >> and associates it with the request object. > >> > >> As I understand this, this either means I can use a pointer to MPI_Request > >> which I don't have to initialize for this (it doesn't work but crashes), > >> or that I can use a MPI_Request pointer which I have initialized with > >> malloc(sizeof(MPI_REQUEST)) (or passing the address of a MPI_Request req), > >> which is set and unset in the functions. But this version crashes, too. > >> What works is using a pointer, which I allocate before the MPI_Isend/recv > >> and which I free after MPI_Wait in every iteration. In other words: It > >> only uses if I don't reuse any kind of MPI_Request. Only if I recreate one > >> every time. > >> > >> Is this, what is should be like? I believe that a reuse of the memory > >> would be a lot more efficient (less calls to malloc...). Am I missing > >> something here? Or am I doing something wrong? > >> > >> > >> Let me provide some more detailed information about my problem: > >> > >> I am running the program on a 30 node infiniband cluster. Each node has 4 > >> single core Opteron CPUs. I am running 1 MPI Rank per node and 4 threads > >> per rank (-> one thread per core). > >> I am compiling with mpicc of OpenMPI using gcc below. > >> Some pseudo-code of the program can be found at the end of this e-mail. > >> > >> I was able to reproduce the problem using different amount of nodes and > >> even using one node only. The problem does not arise when I put > >> printf-debugging information into the code. 
This pointed me into the > >> direction that I have some memory problem, where some write accesses some > >> memory it is not supposed to. > >> I ran the tests using valgrind with --leak-check=full and > >> --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait > >> depending on whether I had the threads spin in a loop for MPI_Test to > >> return success or used MPI_Wait respectively. > >> > >> I would appreciate your help with this. Am I missing something important > >> here? Is there a way to re-use the request in the different iterations > >> other than I thought it should work? > >> Or is there a way to re-initialize the allocated memory before the > >> MPI_Isend/recv so that I at least don't have to call free and malloc each > >> time? > >> > >> Thank you very much for your help! > >> Kind regards, > >> David B?ttner > >> > >> _____________________ > >> Pseudo-Code of program: > >> > >> MPI_Request* req_s; > >> MPI_Request* req_w; > >> OUTER-LOOP > >> if(0 == threadid) > >> { > >> req_s = malloc(sizeof(MPI_Request)); > >> req_r = malloc(sizeof(MPI_Request)); > >> MPI_Isend(..., req_s) > >> MPI_Irecv(..., req_r) > >> } > >> pthread_barrier > >> INNER-LOOP (while NOT_DONE or RET) > >> if(TRYLOCK&& NOT_DONE) > >> { > >> if(MPI_TEST(req_r)) > >> { > >> Call_Function_A; > >> NOT_DONE = 0; > >> } > >> > >> } > >> RET = Call_Function_B; > >> } > >> pthread_barrier_wait > >> if(0 == threadid) > >> { > >> MPI_WAIT(req_s) > >> MPI_WAIT(req_r) > >> free(req_s); > >> free(req_r); > >> } > >> _____________ > >> > >> > >> -- > >> David B?ttner, Informatik, Technische Universit?t M?nchen > >> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676 > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > "To preserve the freedom of the human mind then and freedom of the press, > > every spirit should be ready to devote itself to martyrdom; for as long as > > we may think as we will, and speak as we think, the condition of man will > > proceed in improvement." > > -- Thomas Jefferson, 1799 > > > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > David B?ttner, Informatik, Technische Universit?t M?nchen > TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676 > > > > ------------------------------ > > Message: 11 > Date: Fri, 20 May 2011 06:23:21 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Trouble with MPI-IO > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <a5b121e9-e664-49d0-ae54-2cfe52712...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > On May 19, 2011, at 11:24 PM, Tom Rosmond wrote: > > > What fortran compiler did you use? > > gfortran. > > > In the original script my Intel compile used the -132 option, > > allowing up to that many columns per line. > > Gotcha. > > >> x.f90:99.77: > >> > >> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr) > >> 1 > >> Error: There is no specific subroutine for the generic 'mpi_type_indexed' > >> at (1) > > > > Hmmm, very strange, since I am looking right at the MPI standard > > documents with that routine documented. I too get this compile failure > > when I switch to 'use mpi'. Could that be a problem with the Open MPI > > fortran libraries??? 
> > I think that that error is telling us that there's a compile-time mismatch -- > that the signature of what you've passed doesn't match the signature of > OMPI's MPI_Type_indexed subroutine. > > >> I looked at our mpi F90 module and see the following: > >> > >> interface MPI_Type_indexed > >> subroutine MPI_Type_indexed(count, array_of_blocklengths, > >> array_of_displacements, oldtype, newtype, ierr) > >> integer, intent(in) :: count > >> integer, dimension(*), intent(in) :: array_of_blocklengths > >> integer, dimension(*), intent(in) :: array_of_displacements > >> integer, intent(in) :: oldtype > >> integer, intent(out) :: newtype > >> integer, intent(out) :: ierr > >> end subroutine MPI_Type_indexed > >> end interface > > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 12 > Date: Fri, 20 May 2011 07:26:19 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] MPI_Alltoallv function crashes when np > 100 > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <f9f71854-b9dd-459f-999d-8a8aef8d6...@cisco.com> > Content-Type: text/plain; charset=GB2312 > > I missed this email in my INBOX, sorry. > > Can you be more specific about what exact error is occurring? You just say > that the application crashes...? Please send all the information listed here: > > http://www.open-mpi.org/community/help/ > > > On Apr 26, 2011, at 10:51 PM, ?????? wrote: > > > It seems that the const variable SOMAXCONN who used by listen() system call > > causes this problem. Can anybody help me resolve this question? > > > > 2011/4/25 ?????? <xjun.m...@gmail.com> > > Dear all, > > > > As I mentioned, when I mpiruned an application with the parameter "np = > > 150(or bigger)", the application who used the MPI_Alltoallv function would > > carsh. The problem would recur no matter how many nodes we used. > > > > The edition of OpenMPI: 1.4.1 or 1.4.3 > > The OS: linux redhat 2.6.32 > > > > BTW, my nodes had enough memory to run the application, and the > > MPI_Alltoall function worked well at my environment. > > Did anybody meet the same problem? Thanks. > > > > > > Best Regards > > > > > > > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 13 > Date: Fri, 20 May 2011 07:28:28 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Allreduce() error, > but only sometimes... > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <caef632e-757b-49ee-b545-5cccbc712...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > Sorry for the super-late reply. :-\ > > Yes, ERR_TRUNCATE means that the receiver didn't have a large enough buffer. > > Have you tried upgrading to a newer version of Open MPI? 1.4.3 is the current > stable release (I have a very dim and not guaranteed to be correct > recollection that we fixed something in the internals of collectives > somewhere with regards to ERR_TRUNCATE...?). > > > On Apr 25, 2011, at 4:44 PM, Wei Hao wrote: > > > Hi: > > > > I'm running openmpi 1.2.8. 
I'm working on a project where one part involves > > communicating an integer, representing the number of data points I'm > > keeping track of, to all the processors. The line is simple: > > > > MPI_Allreduce(&np,&geo_N,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD); > > > > where np and geo_N are integers, np is the result of a local calculation, > > and geo_N has been declared on all the processors. geo_N is nondecreasing. > > This line works the first time I call it (geo_N goes from 0 to some other > > integer), but if I call it later in the program, I get the following error: > > > > > > [woodhen-039:26189] *** An error occurred in MPI_Allreduce > > [woodhen-039:26189] *** on communicator MPI_COMM_WORLD > > [woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated > > [woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye) > > > > > > As I understand it, MPI_ERR_TRUNCATE means that the output buffer is too > > small, but I'm not sure where I've made a mistake. It's particularly > > frustrating because it seems to work fine the first time. Does anyone have > > any thoughts? > > > > Thanks > > Wei > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 14 > Date: Fri, 20 May 2011 08:14:07 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Trouble with MPI-IO > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <42db03b3-9cf4-4acb-aa20-b857e5f76...@cisco.com> > Content-Type: text/plain; charset="us-ascii" > > On May 20, 2011, at 6:23 AM, Jeff Squyres wrote: > > > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays? > > Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile > error (even though they're allocatable -- so allocate was a red herring, > sorry). That's all that "use mpi" is complaining about -- that the function > signatures didn't match. > > use mpi is your friend -- even if you don't use F90 constructs much. > Compile-time checking is Very Good Thing (you were effectively "getting > lucky" by passing in the 2D arrays, I think). > > Attached is my final version. And with this version, I see the hang when > running it with the "T" parameter. > > That being said, I'm not an expert on the MPI IO stuff -- your code *looks* > right to me, but I could be missing something subtle in the interpretation of > MPI_FILE_SET_VIEW. I tried running your code with MPICH 1.3.2p1 and it also > hung. > > Rob (ROMIO guy) -- can you comment this code? Is it correct? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: x.f90 > Type: application/octet-stream > Size: 3820 bytes > Desc: not available > URL: > <http://www.open-mpi.org/MailArchives/users/attachments/20110520/53a5461b/attachment.obj> > > ------------------------------ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 1911, Issue 1 > **************************************
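As a footnote to the MPI-IO thread above (messages 8, 9, 11 and 14): a minimal C sketch of the same call sequence, showing that the blocklength/displacement arguments to MPI_Type_indexed are flat 1-D integer arrays and that the committed indexed type then serves as the filetype for the view used by the collective write. The file name, sizes, and values are made up for illustration; the original test program is in Fortran and is not reproduced here.

  /* Hypothetical, single-rank illustration (like the original test):
   * build an indexed type with monotonically increasing displacements,
   * set it as the file view, and do a collective write. */
  #include <mpi.h>

  #define NBLK 4

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      float data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
      int blocklens[NBLK] = {2, 2, 2, 2};       /* 1-D arrays, not 2-D */
      int displs[NBLK]    = {0, 2, 4, 6};       /* monotonically increasing */

      MPI_Datatype filetype;
      MPI_Type_indexed(NBLK, blocklens, displs, MPI_FLOAT, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "testio.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

      /* Collective write at offset 0 of the view. */
      MPI_File_write_at_all(fh, 0, data, 8, MPI_FLOAT, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
      MPI_Finalize();
      return 0;
  }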