I have verified that disabling UAC does not fix the problem. xhlp.exe starts and threads spin up on both machines, CPU usage sits at 80-90%, but no progress is ever made. From this state, pressing Ctrl-Break on the head node yields the following output:
[REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to lifeline [[20816,0],0] lost

> From: users-requ...@open-mpi.org
> Subject: users Digest, Vol 1911, Issue 1
> To: us...@open-mpi.org
> Date: Fri, 20 May 2011 08:14:13 -0400
>
> Send users mailing list submissions to
>     us...@open-mpi.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     http://www.open-mpi.org/mailman/listinfo.cgi/users
> or, via email, send a message with subject or body 'help' to
>     users-requ...@open-mpi.org
>
> You can reach the person managing the list at
>     users-ow...@open-mpi.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
>
>
> Today's Topics:
>
>    1. Re: Error: Entry Point Not Found (Zhangping Wei)
>    2. Re: Problem with MPI_Request, MPI_Isend/recv and MPI_Wait/Test (George Bosilca)
>    3. Re: v1.5.3-x64 does not work on Windows 7 workgroup (Jeff Squyres)
>    4. Re: Error: Entry Point Not Found (Jeff Squyres)
>    5. Re: openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0) (Jeff Squyres)
>    6. Re: Openib with > 32 cores per node (Jeff Squyres)
>    7. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1 (Jeff Squyres)
>    8. Re: Trouble with MPI-IO (Jeff Squyres)
>    9. Re: Trouble with MPI-IO (Tom Rosmond)
>   10. Re: Problem with MPI_Request, MPI_Isend/recv and MPI_Wait/Test (David Büttner)
>   11. Re: Trouble with MPI-IO (Jeff Squyres)
>   12. Re: MPI_Alltoallv function crashes when np > 100 (Jeff Squyres)
>   13. Re: MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only sometimes... (Jeff Squyres)
>   14. Re: Trouble with MPI-IO (Jeff Squyres)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 19 May 2011 09:13:53 -0700 (PDT)
> From: Zhangping Wei <zhangping_...@yahoo.com>
> Subject: Re: [OMPI users] Error: Entry Point Not Found
> To: us...@open-mpi.org
> Message-ID: <101342.7961...@web111818.mail.gq1.yahoo.com>
> Content-Type: text/plain; charset="gb2312"
>
> Dear Paul,
>
> I checked the way 'mpirun -np N <cmd>' you mentioned, but it was the same problem.
>
> I guess it may related to the system I used, because I have used it correctly in another XP 32 bit system.
>
> I look forward to more advice. Thanks.
>
> Zhangping
>
>
> ________________________________
> From: "users-requ...@open-mpi.org" <users-requ...@open-mpi.org>
> To: us...@open-mpi.org
> Sent:
2011/5/19 (????) 11:00:02 ???? > ?? ???? users Digest, Vol 1910, Issue 2 > > Send users mailing list submissions to > us...@open-mpi.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.open-mpi.org/mailman/listinfo.cgi/users > or, via email, send a message with subject or body 'help' to > users-requ...@open-mpi.org > > You can reach the person managing the list at > users-ow...@open-mpi.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of users digest..." > > > Today's Topics: > > 1. Re: Error: Entry Point Not Found (Paul van der Walt) > 2. Re: Openib with > 32 cores per node (Robert Horton) > 3. Re: Openib with > 32 cores per node (Samuel K. Gutierrez) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Thu, 19 May 2011 16:14:02 +0100 > From: Paul van der Walt <p...@denknerd.nl> > Subject: Re: [OMPI users] Error: Entry Point Not Found > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <banlktinjz0cntchqjczyhfgsnr51jpu...@mail.gmail.com> > Content-Type: text/plain; charset=UTF-8 > > Hi, > > On 19 May 2011 15:54, Zhangping Wei <zhangping_...@yahoo.com> wrote: > > 4, I use command window to run it in this way: ?mpirun ?n 4 ?**.exe ?,then I > > Probably not the problem, but shouldn't that be 'mpirun -np N <cmd>' ? > > Paul > > -- > O< ascii ribbon campaign - stop html mail - www.asciiribbon.org > > > > ------------------------------ > > Message: 2 > Date: Thu, 19 May 2011 16:37:56 +0100 > From: Robert Horton <r.hor...@qmul.ac.uk> > Subject: Re: [OMPI users] Openib with > 32 cores per node > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <1305819476.9663.148.camel@moelwyn> > Content-Type: text/plain; charset="UTF-8" > > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: > > Hi, > > > > Try the following QP parameters that only use shared receive queues. > > > > -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 > > > > Thanks for that. If I run the job over 2 x 48 cores it now works and the > performance seems reasonable (I need to do some more tuning) but when I > go up to 4 x 48 cores I'm getting the same problem: > > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > error creating qp errno says Cannot allocate memory > [compute-1-7.local:18106] *** An error occurred in MPI_Isend > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > abort) > > Any thoughts? > > Thanks, > Rob > -- > Robert Horton > System Administrator (Research Support) - School of Mathematical Sciences > Queen Mary, University of London > r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345 > > > > ------------------------------ > > Message: 3 > Date: Thu, 19 May 2011 09:59:13 -0600 > From: "Samuel K. Gutierrez" <sam...@lanl.gov> > Subject: Re: [OMPI users] Openib with > 32 cores per node > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <b3e83138-9af0-48c0-871c-dbbb2e712...@lanl.gov> > Content-Type: text/plain; charset=us-ascii > > Hi, > > On May 19, 2011, at 9:37 AM, Robert Horton wrote > > > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: > >> Hi, > >> > >> Try the following QP parameters that only use shared receive queues. 
> >> > >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 > >> > > > > Thanks for that. If I run the job over 2 x 48 cores it now works and the > > performance seems reasonable (I need to do some more tuning) but when I > > go up to 4 x 48 cores I'm getting the same problem: > > > >[compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > >] error creating qp errno says Cannot allocate memory > > [compute-1-7.local:18106] *** An error occurred in MPI_Isend > > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > >abort) > > > > Any thoughts? > > How much memory does each node have? Does this happen at startup? > > Try adding: > > -mca btl_openib_cpc_include rdmacm > > I'm not sure if your version of OFED supports this feature, but maybe using > XRC > may help. I **think** other tweaks are needed to get this going, but I'm not > familiar with the details. > > Hope that helps, > > Samuel K. Gutierrez > Los Alamos National Laboratory > > > > > > Thanks, > > Rob > > -- > > Robert Horton > > System Administrator (Research Support) - School of Mathematical Sciences > > Queen Mary, University of London > > r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345 > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > ------------------------------ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 1910, Issue 2 > ************************************** > -------------- next part -------------- > HTML attachment scrubbed and removed > > ------------------------------ > > Message: 2 > Date: Thu, 19 May 2011 08:48:03 -0800 > From: George Bosilca <bosi...@eecs.utk.edu> > Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and > MPI_Wait/Test > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <fcac66f9-fdb5-48bb-a800-263d8a4f9...@eecs.utk.edu> > Content-Type: text/plain; charset=iso-8859-1 > > David, > > I do not see any mechanism for protecting the accesses to the requests to a > single thread? What is the thread model you're using? > > >From an implementation perspective, your code is correct only if you > >initialize the MPI library with MPI_THREAD_MULTIPLE and if the library > >accepts. Otherwise, there is an assumption that the application is single > >threaded, or that the MPI behavior is implementation dependent. Please read > >the MPI standard regarding to MPI_Init_thread for more details. > > Regards, > george. > > On May 19, 2011, at 02:34 , David B?ttner wrote: > > > Hello, > > > > I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using > > MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check if > > it is done. I do this repeatedly in the outer loop of my code. The MPI_Test > > is used in the inner loop to check if some function can be called which > > depends on the received data. > > The program regularly crashed (only when not using printf...) and after > > debugging it I figured out the following problem: > > > > In MPI_Isend I have an invalid read of memory. 
I fixed the problem with not > > re-using a > > > > MPI_Request req_s, req_r; > > > > but by using > > > > MPI_Request* req_s; > > MPI_Request* req_r > > > > and re-allocating them before the MPI_Isend/recv. > > > > The documentation says, that in MPI_Wait and MPI_Test (if successful) the > > request-objects are deallocated and set to MPI_REQUEST_NULL. > > It also says, that in MPI_Isend and MPI_Irecv, it allocates the Objects and > > associates it with the request object. > > > > As I understand this, this either means I can use a pointer to MPI_Request > > which I don't have to initialize for this (it doesn't work but crashes), or > > that I can use a MPI_Request pointer which I have initialized with > > malloc(sizeof(MPI_REQUEST)) (or passing the address of a MPI_Request req), > > which is set and unset in the functions. But this version crashes, too. > > What works is using a pointer, which I allocate before the MPI_Isend/recv > > and which I free after MPI_Wait in every iteration. In other words: It only > > uses if I don't reuse any kind of MPI_Request. Only if I recreate one every > > time. > > > > Is this, what is should be like? I believe that a reuse of the memory would > > be a lot more efficient (less calls to malloc...). Am I missing something > > here? Or am I doing something wrong? > > > > > > Let me provide some more detailed information about my problem: > > > > I am running the program on a 30 node infiniband cluster. Each node has 4 > > single core Opteron CPUs. I am running 1 MPI Rank per node and 4 threads > > per rank (-> one thread per core). > > I am compiling with mpicc of OpenMPI using gcc below. > > Some pseudo-code of the program can be found at the end of this e-mail. > > > > I was able to reproduce the problem using different amount of nodes and > > even using one node only. The problem does not arise when I put > > printf-debugging information into the code. This pointed me into the > > direction that I have some memory problem, where some write accesses some > > memory it is not supposed to. > > I ran the tests using valgrind with --leak-check=full and > > --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait > > depending on whether I had the threads spin in a loop for MPI_Test to > > return success or used MPI_Wait respectively. > > > > I would appreciate your help with this. Am I missing something important > > here? Is there a way to re-use the request in the different iterations > > other than I thought it should work? > > Or is there a way to re-initialize the allocated memory before the > > MPI_Isend/recv so that I at least don't have to call free and malloc each > > time? > > > > Thank you very much for your help! > > Kind regards, > > David B?ttner > > > > _____________________ > > Pseudo-Code of program: > > > > MPI_Request* req_s; > > MPI_Request* req_w; > > OUTER-LOOP > > if(0 == threadid) > > { > > req_s = malloc(sizeof(MPI_Request)); > > req_r = malloc(sizeof(MPI_Request)); > > MPI_Isend(..., req_s) > > MPI_Irecv(..., req_r) > > } > > pthread_barrier > > INNER-LOOP (while NOT_DONE or RET) > > if(TRYLOCK && NOT_DONE) > > { > > if(MPI_TEST(req_r)) > > { > > Call_Function_A; > > NOT_DONE = 0; > > } > > > > } > > RET = Call_Function_B; > > } > > pthread_barrier_wait > > if(0 == threadid) > > { > > MPI_WAIT(req_s) > > MPI_WAIT(req_r) > > free(req_s); > > free(req_r); > > } > > _____________ > > > > > > -- > > David B?ttner, Informatik, Technische Universit?t M?nchen > > TUM I-10 - FMI 01.06.059 - Tel. 
089 / 289-17676 > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > "To preserve the freedom of the human mind then and freedom of the press, > every spirit should be ready to devote itself to martyrdom; for as long as we > may think as we will, and speak as we think, the condition of man will > proceed in improvement." > -- Thomas Jefferson, 1799 > > > > > ------------------------------ > > Message: 3 > Date: Thu, 19 May 2011 21:22:48 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 > workgroup > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <278274f0-bf00-4498-950f-9779e0083...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > Unfortunately, our Windows guy (Shiqing) is off getting married and will be > out for a little while. :-( > > All that I can cite is the README.WINDOWS.txt file in the top-level > directory. I'm afraid that I don't know much else about Windows. :-( > > > On May 18, 2011, at 8:17 PM, Jason Mackay wrote: > > > Hi all, > > > > My thanks to all those involved for putting together this Windows binary > > release of OpenMPI! I am hoping to use it in a small Windows based OpenMPI > > cluster at home. > > > > Unfortunately my experience so far has not exactly been trouble free. It > > seems that, due to the fact that this release is using WMI, there are a > > number of settings that must be configured on the machines in order to get > > this to work. These settings are not documented in the distribution at all. > > I have been experimenting with it for over a week on and off and as soon as > > I solve one problem, another one arises. > > > > Currently, after much searching, reading, and tinkering with DCOM settings > > etc..., I can remotely start processes on all my machines using mpirun but > > those processes cannot access network shares (e.g. for binary distribution) > > and HPL (which works on any one node) does not seem to work if I run it > > across multiple nodes, also indicating a network issue (CPU sits at 100% in > > all processes with no network traffic and never terminates). To eliminate > > premission issues that may be caused by UAC I tried the same setup on two > > domain machines using an administrative account to launch and the behavior > > was the same. I have read that WMI processes cannot access network > > resources and I am at a loss for a solution to this newest of problems. If > > anyone knows how to make this work I would appreciate the help. I assume > > that someone has gotten this working and has the answers. > > > > I have searched the mailing list archives and I found other users with > > similar problems but no clear guidance on the threads. Some threads make > > references to Microsoft KB articles but do not explicitly tell the user > > what needs to be done, leaving each new user to rediscover the tricks on > > their own. One thread made it appear that testing had only been done on > > Windows XP. Needless to say, security has changed dramatically in Windows > > since XP! > > > > I would like to see OpenMPI for Windows be usable by a newcomer without all > > of this pain. > > > > What would be fantastic would be: > > 1) a step-by-step procedure for how to get OpenMPI 1.5 working on Windows > > a) preferably in a bare Windows 7 workgroup environment with nothing else > > (i.e. no Microsoft Cluster Compute Pack, no domain etc...) 
> > 2) inclusion of these steps in the binary distribution > > 3) bonus points for a script which accomplishes these things automatically > > > > If someone can help with (1), I would happily volunteer my time to work on > > (3). > > > > Regards, > > Jason > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 4 > Date: Thu, 19 May 2011 21:26:43 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Error: Entry Point Not Found > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <f830ec35-fc9b-4801-b2a3-50f54d215...@cisco.com> > Content-Type: text/plain; charset=windows-1252 > > On May 19, 2011, at 10:54 AM, Zhangping Wei wrote: > > > 4, I use command window to run it in this way: ?mpirun ?n 4 **.exe ?,then I > > met the error: ?entry point not found: the procedure entry point inet_pton > > could not be located in the dynamic link library WS2_32.dll? > > Unfortunately our Windows developer/maintainer is out for a little while > (he's getting married); he pretty much did the Windows stuff by himself, so > none of the rest of us know much about it. :( > > inet_pton is a standard function call relating to IP addresses that we use in > the internals of OMPI; I'm not sure why it wouldn't be found on Windows XP > (Shiqing did cite that the OMPI Windows port should work on Windows XP). > > This post seems to imply that inet_ntop is only available on Vista and above: > > http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/e40465f2-41b7-4243-ad33-15ae9366f4e6/ > > So perhaps Shiqing needs to put in some kind of portability workaround for > OMPI, and the current binaries won't actually work for XP...? > > I can't say that for sure because I really know very little about Windows; > we'll unfortunately have to wait until he returns to get a definitive answer. > :-( > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 5 > Date: Thu, 19 May 2011 21:37:49 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer > XE 2011 (aka 12.0) > To: Open MPI Users <us...@open-mpi.org> > Cc: Giovanni Bracco <giovanni.bra...@enea.it>, Agostino Funel > <agostino.fu...@enea.it>, Fiorenzo Ambrosino > <fiorenzo.ambros...@enea.it>, Guido Guarnieri > <guido.guarni...@enea.it>, Roberto Ciavarella > <roberto.ciavare...@enea.it>, Salvatore Podda > <salvatore.po...@enea.it>, Giovanni Ponti <giovanni.po...@enea.it> > Message-ID: <45362608-b8b0-4ade-9959-b35c5690a...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > Sorry for the late reply. > > Other users have seen something similar but we have never been able to > reproduce it. Is this only when using IB? If you use "mpirun --mca > btl_openib_cpc_if_include rdmacm", does the problem go away? > > > On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote: > > > I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only > > when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the > > collectives hangs go away. 
I don't know what, if anything, the higher > > optimization buys you when compiling openmpi, so I'm not sure if that's an > > acceptable workaround or not. > > > > My system is similar to yours - Intel X5570 with QDR Mellanox IB running > > RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 > > with a single iteration of Barrier to reproduce the hang, and it happens > > 100% of the time for me when I invoke it like this: > > > > # salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier > > > > The hang happens on the first Barrier (64 ranks) and each of the > > participating ranks have this backtrace: > > > > __poll (...) > > poll_dispatch () from [instdir]/lib/libopen-pal.so.0 > > opal_event_loop () from [instdir]/lib/libopen-pal.so.0 > > opal_progress () from [instdir]/lib/libopen-pal.so.0 > > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_recursivedoubling () from > > [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0 > > PMPI_Barrier () from [instdir]/lib/libmpi.so.0 > > IMB_barrier () > > IMB_init_buffers_iter () > > main () > > > > The one non-participating rank has this backtrace: > > > > __poll (...) > > poll_dispatch () from [instdir]/lib/libopen-pal.so.0 > > opal_event_loop () from [instdir]/lib/libopen-pal.so.0 > > opal_progress () from [instdir]/lib/libopen-pal.so.0 > > ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0 > > ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0 > > PMPI_Barrier () from [instdir]/lib/libmpi.so.0 > > main () > > > > If I use more nodes I can get it to hang with 1ppn, so that seems to rule > > out the sm btl (or interactions with it) as a culprit at least. > > > > I can't reproduce this with openmpi 1.5.3, interestingly. > > > > -Marcus > > > > > > On 05/10/2011 03:37 AM, Salvatore Podda wrote: > >> Dear all, > >> > >> we succeed in building several version of openmpi from 1.2.8 to 1.4.3 > >> with Intel composer XE 2011 (aka 12.0). > >> However we found a threshold in the number of cores (depending from the > >> application: IMB, xhpl or user applications > >> and form the number of required cores) above which the application hangs > >> (sort of deadlocks). > >> The building of openmpi with 'gcc' and 'pgi' does not show the same limits. > >> There are any known incompatibilities of openmpi with this version of > >> intel compiilers? > >> > >> The characteristics of our computational infrastructure are: > >> > >> Intel processors E7330, E5345, E5530 e E5620 > >> > >> CentOS 5.3, CentOS 5.5. > >> > >> Intel composer XE 2011 > >> gcc 4.1.2 > >> pgi 10.2-1 > >> > >> Regards > >> > >> Salvatore Podda > >> > >> ENEA UTICT-HPC > >> Department for Computer Science Development and ICT > >> Facilities Laboratory for Science and High Performace Computing > >> C.R. Frascati > >> Via E. 
Fermi, 45 > >> PoBox 65 > >> 00044 Frascati (Rome) > >> Italy > >> > >> Tel: +39 06 9400 5342 > >> Fax: +39 06 9400 5551 > >> Fax: +39 06 9400 5735 > >> E-mail: salvatore.po...@enea.it > >> Home Page: www.cresco.enea.it > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > >> > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 6 > Date: Thu, 19 May 2011 22:01:00 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Openib with > 32 cores per node > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <c18c4827-d305-484a-9dae-290902d40...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > What Sam is alluding to is that the OpenFabrics driver code in OMPI is > sucking up oodles of memory for each IB connection that you're using. The > receive_queues param that he sent tells OMPI to use all shared receive queues > (instead of defaulting to one per-peer receive queue and the rest shared > receive queues -- the per-peer RQ sucks up all the memory when you multiple > it by N peers). > > > On May 19, 2011, at 11:59 AM, Samuel K. Gutierrez wrote: > > > Hi, > > > > On May 19, 2011, at 9:37 AM, Robert Horton wrote > > > >> On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote: > >>> Hi, > >>> > >>> Try the following QP parameters that only use shared receive queues. > >>> > >>> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32 > >>> > >> > >> Thanks for that. If I run the job over 2 x 48 cores it now works and the > >> performance seems reasonable (I need to do some more tuning) but when I > >> go up to 4 x 48 cores I'm getting the same problem: > >> > >> [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] > >> error creating qp errno says Cannot allocate memory > >> [compute-1-7.local:18106] *** An error occurred in MPI_Isend > >> [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD > >> [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list > >> [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now > >> abort) > >> > >> Any thoughts? > > > > How much memory does each node have? Does this happen at startup? > > > > Try adding: > > > > -mca btl_openib_cpc_include rdmacm > > > > I'm not sure if your version of OFED supports this feature, but maybe using > > XRC may help. I **think** other tweaks are needed to get this going, but > > I'm not familiar with the details. > > > > Hope that helps, > > > > Samuel K. 
Gutierrez
> > Los Alamos National Laboratory
> >
> >> Thanks,
> >> Rob
> >> --
> >> Robert Horton
> >> System Administrator (Research Support) - School of Mathematical Sciences
> >> Queen Mary, University of London
> >> r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345
> >>
> >> _______________________________________________
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 19 May 2011 22:04:46 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] MPI_COMM_DUP freeze with OpenMPI 1.4.1
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <0dcf20b8-ca5c-4746-8187-a2dff39b1...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> On May 13, 2011, at 8:31 AM, francoise.r...@obs.ujf-grenoble.fr wrote:
>
> > Here is the MUMPS portion of code (in zmumps_part1.F file) where the slaves
> > call MPI_COMM_DUP , id%PAR and MASTER are initialized to 0 before :
> >
> > CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>
> I re-indented so that I could read it better:
>
>       CALL MPI_COMM_SIZE(id%COMM, id%NPROCS, IERR )
>       IF ( id%PAR .eq. 0 ) THEN
>          IF ( id%MYID .eq. MASTER ) THEN
>             color = MPI_UNDEFINED
>          ELSE
>             color = 0
>          END IF
>          CALL MPI_COMM_SPLIT( id%COMM, color, 0,
>      &        id%COMM_NODES, IERR )
>          id%NSLAVES = id%NPROCS - 1
>       ELSE
>          CALL MPI_COMM_DUP( id%COMM, id%COMM_NODES, IERR )
>          id%NSLAVES = id%NPROCS
>       END IF
>
>       IF (id%PAR .ne. 0 .or. id%MYID .NE. MASTER) THEN
>          CALL MPI_COMM_DUP( id%COMM_NODES, id%COMM_LOAD, IERR )
>       ENDIF
>
> That doesn't look right -- both MPI_COMM_SPLIT and MPI_COMM_DUP are
> collective, meaning that all processes in the communicator must call them.
> In the first case, only some processes are calling MPI_COMM_SPLIT. Is there
> some other logic that forces the rest of the processes to call
> MPI_COMM_SPLIT, too?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
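As an aside on the point above: because MPI_Comm_split is collective over the parent communicator, the usual way to leave a rank out of the new communicator is to have that rank make the call anyway, passing MPI_UNDEFINED as its color so it simply receives MPI_COMM_NULL. A minimal C sketch of that pattern (hypothetical, not taken from the MUMPS sources):

  /* Every rank in MPI_COMM_WORLD calls MPI_Comm_split; rank 0 is excluded
   * from the new "slaves" communicator by passing MPI_UNDEFINED. */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      int color = (rank == 0) ? MPI_UNDEFINED : 0;
      MPI_Comm slaves;
      MPI_Comm_split(MPI_COMM_WORLD, color, 0, &slaves);  /* collective: all ranks call it */

      if (slaves != MPI_COMM_NULL) {
          /* ... slave-only work on the new communicator ... */
          MPI_Comm_free(&slaves);
      }

      MPI_Finalize();
      return 0;
  }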
>
> ------------------------------
>
> Message: 8
> Date: Thu, 19 May 2011 22:30:03 -0400
> From: Jeff Squyres <jsquy...@cisco.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <eefb638f-72f1-4208-8ea2-4f25f610c...@cisco.com>
> Content-Type: text/plain; charset=us-ascii
>
> Props for that testio script. I think you win the award for "most easy to
> reproduce test case." :-)
>
> I notice that some of the lines went over 72 columns, so I renamed the file
> x.f90 and changed all the comments from "c" to "!" and joined the two
> &-split lines. The error about implicit type for lenr went away, but then
> when I enabled better type checking by using "use mpi" instead of "include
> 'mpif.h'", I got the following:
>
> x.f90:99.77:
>
>   call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
>                                                                         1
> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> I looked at our mpi F90 module and see the following:
>
>   interface MPI_Type_indexed
>     subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
>       integer, intent(in) :: count
>       integer, dimension(*), intent(in) :: array_of_blocklengths
>       integer, dimension(*), intent(in) :: array_of_displacements
>       integer, intent(in) :: oldtype
>       integer, intent(out) :: newtype
>       integer, intent(out) :: ierr
>     end subroutine MPI_Type_indexed
>   end interface
>
> I don't quite grok the syntax of the "allocatable" type ijdisp, so that might
> be the problem here...?
>
> Regardless, I'm not entirely sure if the problem is the >72 character lines,
> but then when that is gone, I'm not sure how the allocatable stuff fits in...
> (I'm not enough of a Fortran programmer to know)
>
>
> On May 10, 2011, at 7:14 PM, Tom Rosmond wrote:
>
> > I would appreciate someone with experience with MPI-IO look at the
> > simple fortran program gzipped and attached to this note. It is
> > imbedded in a script so that all that is necessary to run it is do:
> > 'testio' from the command line. The program generates a small 2-D input
> > array, sets up an MPI-IO environment, and write a 2-D output array
> > twice, with the only difference being the displacement arrays used to
> > construct the indexed datatype. For the first write, simple
> > monotonically increasing displacements are used, for the second the
> > displacements are 'shuffled' in one dimension. They are printed during
> > the run.
> >
> > For the first case the file is written properly, but for the second the
> > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually.
> > Although the program is compiled as an mpi program, I am running on a
> > single processor, which makes the problem more puzzling.
> >
> > The program should be relatively self-explanatory, but if more
> > information is needed, please ask. I am on an 8 core Xeon based Dell
> > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and
> > OpenMPI 1.5.3. I have also attached output from 'ompi_info'.
> >
> > T. Rosmond
> >
> > <testio.gz><info_ompi.gz>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 19 May 2011 20:24:25 -0700
> From: Tom Rosmond <rosm...@reachone.com>
> Subject: Re: [OMPI users] Trouble with MPI-IO
> To: Open MPI Users <us...@open-mpi.org>
> Message-ID: <1305861865.4284.104.ca...@cedar.reachone.com>
> Content-Type: text/plain
>
> Thanks for looking at my problem. Sounds like you did reproduce my
> problem. I have added some comments below
>
> On Thu, 2011-05-19 at 22:30 -0400, Jeff Squyres wrote:
> > Props for that testio script. I think you win the award for "most easy to
> > reproduce test case." :-)
> >
> > I notice that some of the lines went over 72 columns, so I renamed the file
> > x.f90 and changed all the comments from "c" to "!"
and joined the two > > &-split lines. The error about implicit type for lenr went away, but then > > when I enabled better type checking by using "use mpi" instead of "include > > 'mpif.h'", I got the following: > > What fortran compiler did you use? > > In the original script my Intel compile used the -132 option, > allowing up to that many columns per line. I still think in > F77 fortran much of the time, and use 'c' for comments out > of habit. The change to '!' doesn't make any difference. > > > > x.f90:99.77: > > > > call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr) > > 1 > > Error: There is no specific subroutine for the generic 'mpi_type_indexed' > > at (1) > > Hmmm, very strange, since I am looking right at the MPI standard > documents with that routine documented. I too get this compile failure > when I switch to 'use mpi'. Could that be a problem with the Open MPI > fortran libraries??? > > > > I looked at our mpi F90 module and see the following: > > > > interface MPI_Type_indexed > > subroutine MPI_Type_indexed(count, array_of_blocklengths, > > array_of_displacements, oldtype, newtype, ierr) > > integer, intent(in) :: count > > integer, dimension(*), intent(in) :: array_of_blocklengths > > integer, dimension(*), intent(in) :: array_of_displacements > > integer, intent(in) :: oldtype > > integer, intent(out) :: newtype > > integer, intent(out) :: ierr > > end subroutine MPI_Type_indexed > > end interface > > > > I don't quite grok the syntax of the "allocatable" type ijdisp, so that > > might be the problem here...? > > Just a standard F90 'allocatable' statement. I've written thousands > just like it. > > > > Regardless, I'm not entirely sure if the problem is the >72 character > > lines, but then when that is gone, I'm not sure how the allocatable stuff > > fits in... (I'm not enough of a Fortran programmer to know) > > > Anyone else out that who can comment???? > > > T. Rosmond > > > > > > > On May 10, 2011, at 7:14 PM, Tom Rosmond wrote: > > > > > I would appreciate someone with experience with MPI-IO look at the > > > simple fortran program gzipped and attached to this note. It is > > > imbedded in a script so that all that is necessary to run it is do: > > > 'testio' from the command line. The program generates a small 2-D input > > > array, sets up an MPI-IO environment, and write a 2-D output array > > > twice, with the only difference being the displacement arrays used to > > > construct the indexed datatype. For the first write, simple > > > monotonically increasing displacements are used, for the second the > > > displacements are 'shuffled' in one dimension. They are printed during > > > the run. > > > > > > For the first case the file is written properly, but for the second the > > > program hangs on MPI_FILE_WRITE_AT_ALL and must be aborted manually. > > > Although the program is compiled as an mpi program, I am running on a > > > single processor, which makes the problem more puzzling. > > > > > > The program should be relatively self-explanatory, but if more > > > information is needed, please ask. I am on an 8 core Xeon based Dell > > > workstation running Scientific Linux 5.5, Intel fortran 12.0.3, and > > > OpenMPI 1.5.3. I have also attached output from 'ompi_info'. > > > > > > T. 
Rosmond > > > > > > > > > <testio.gz><info_ompi.gz>_______________________________________________ > > > users mailing list > > > us...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > > > > > > ------------------------------ > > Message: 10 > Date: Fri, 20 May 2011 09:25:14 +0200 > From: David B?ttner <david.buett...@in.tum.de> > Subject: Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and > MPI_Wait/Test > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <4dd6175a.1080...@in.tum.de> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > Hello, > > thanks for the quick answer. I am sorry that I forgot to mention this: I > did compile OpenMPI with MPI_THREAD_MULTIPLE support and test if > required == provided after the MPI_Thread_init call. > > > I do not see any mechanism for protecting the accesses to the requests to a > > single thread? What is the thread model you're using? > > > Again I am sorry that this was not clear: In the pseudo code below I > wanted to indicate the access-protection I do by thread-id dependent > calls if(0 == thread-id) and by using the trylock(...) (using > pthread-mutexes). In the code all accesses concerning one MPI_Request > (which are pthread-global-pointers in my case) are protected and called > in sequential order, i.e. MPI_Isend/recv is returns before any thread is > allowed to call the corresponding MPI_Test and no-one can call MPI_Test > any more when a thread is allowed to call MPI_Wait. > I did this in the same manner before with other MPI implementations, but > also on the same machine with the same (untouched) OpenMPI > implementation, also using pthreads and MPI in combination, but I used > > MPI_Request req; > > instead of > > MPI_Request* req; > (and later) > req = (MPI_Request*)malloc(sizeof(MPI_Request)); > > > In my recent (problem) code, I also tried not using pointers, but got > the same problem. Also, as I described in the first mail, I tried > everything concerning the memory allocation of the MPI_Request objects. > I tried not calling malloc. This I guessed wouldn't work, but the > OpenMPI documentation says this: > > " Nonblocking calls allocate a communication request object and > associate it with the request handle the argument request). " > [http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php] and > > " [...] if the communication object was created by a nonblocking send or > receive, then it is deallocated and the request handle is set to > MPI_REQUEST_NULL." > [http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php] and (in slightly > different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php] > > So I thought that it might do some kind of optimized memory stuff > internally. > > I also tried allocating req (for each used MPI_Request) once before the > first use and deallocation after the last use (which I thought was the > way it was supposed to work), but that crashes also. > > I tried replacing the pointers through global variables > > MPI_Request req; > > which didn't do the job... > > The only thing that seems to work is what I mentioned below: Allocate > every time I am going to need it in the MPI_Isend/recv, use it in > MPI_Test/Wait and after that deallocate it by hand each time. > I don't think that this is supposed to be like this since I have to do a > call to malloc and free so often (for multiple MPI_Request objects in > each iteration) that it will most likely limit performance... 
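For reference, a minimal self-contained C sketch of the request-reuse pattern under discussion (hypothetical buffers, peers, and iteration count; not taken from the code in this thread): a plain MPI_Request variable is passed to MPI_Isend/MPI_Irecv, and a successful MPI_Test or MPI_Wait resets it to MPI_REQUEST_NULL so the same variable can be handed to the next nonblocking call, with no malloc/free per iteration.

  /* Sketch only: requests MPI_THREAD_MULTIPLE and reuses stack-resident
   * MPI_Request variables across iterations.  Peer, buffer, and loop count
   * are made up for illustration. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int provided, rank, size;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE) {
          fprintf(stderr, "MPI_THREAD_MULTIPLE not available (got %d)\n", provided);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int next = (rank + 1) % size;             /* hypothetical ring exchange */
      int prev = (rank + size - 1) % size;
      int sendbuf = rank, recvbuf = -1;
      MPI_Request req_s = MPI_REQUEST_NULL;     /* reused every iteration */
      MPI_Request req_r = MPI_REQUEST_NULL;

      for (int iter = 0; iter < 10; iter++) {
          MPI_Isend(&sendbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &req_s);
          MPI_Irecv(&recvbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &req_r);

          int done = 0;
          while (!done) {
              /* In a threaded code, only one thread at a time may touch req_r. */
              MPI_Test(&req_r, &done, MPI_STATUS_IGNORE);
              /* ... overlap other work here ... */
          }
          MPI_Wait(&req_s, MPI_STATUS_IGNORE);
          /* Both requests are now MPI_REQUEST_NULL and ready for reuse. */
      }

      MPI_Finalize();
      return 0;
  }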
> > Anyway I still have the same problem and am still unclear on what kind > of memory allocation I should be doing for the MPI_Requests. Is there > anything else (besides MPI_THREAD_MULTIPLE support, thread access > control, sequential order of MPI_Isend/recv, MPI_Test and MPI_Wait for > one MPI_Request object) I need to take care of? If not, what could I do > to find the source of my problem? > > Thanks again for any kind of help! > > Kind regards, > David > > > > > > From an implementation perspective, your code is correct only if you > > > initialize the MPI library with MPI_THREAD_MULTIPLE and if the library > > > accepts. Otherwise, there is an assumption that the application is single > > > threaded, or that the MPI behavior is implementation dependent. Please > > > read the MPI standard regarding to MPI_Init_thread for more details. > > > > Regards, > > george. > > > > On May 19, 2011, at 02:34 , David B?ttner wrote: > > > >> Hello, > >> > >> I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using > >> MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check > >> if it is done. I do this repeatedly in the outer loop of my code. The > >> MPI_Test is used in the inner loop to check if some function can be called > >> which depends on the received data. > >> The program regularly crashed (only when not using printf...) and after > >> debugging it I figured out the following problem: > >> > >> In MPI_Isend I have an invalid read of memory. I fixed the problem with > >> not re-using a > >> > >> MPI_Request req_s, req_r; > >> > >> but by using > >> > >> MPI_Request* req_s; > >> MPI_Request* req_r > >> > >> and re-allocating them before the MPI_Isend/recv. > >> > >> The documentation says, that in MPI_Wait and MPI_Test (if successful) the > >> request-objects are deallocated and set to MPI_REQUEST_NULL. > >> It also says, that in MPI_Isend and MPI_Irecv, it allocates the Objects > >> and associates it with the request object. > >> > >> As I understand this, this either means I can use a pointer to MPI_Request > >> which I don't have to initialize for this (it doesn't work but crashes), > >> or that I can use a MPI_Request pointer which I have initialized with > >> malloc(sizeof(MPI_REQUEST)) (or passing the address of a MPI_Request req), > >> which is set and unset in the functions. But this version crashes, too. > >> What works is using a pointer, which I allocate before the MPI_Isend/recv > >> and which I free after MPI_Wait in every iteration. In other words: It > >> only uses if I don't reuse any kind of MPI_Request. Only if I recreate one > >> every time. > >> > >> Is this, what is should be like? I believe that a reuse of the memory > >> would be a lot more efficient (less calls to malloc...). Am I missing > >> something here? Or am I doing something wrong? > >> > >> > >> Let me provide some more detailed information about my problem: > >> > >> I am running the program on a 30 node infiniband cluster. Each node has 4 > >> single core Opteron CPUs. I am running 1 MPI Rank per node and 4 threads > >> per rank (-> one thread per core). > >> I am compiling with mpicc of OpenMPI using gcc below. > >> Some pseudo-code of the program can be found at the end of this e-mail. > >> > >> I was able to reproduce the problem using different amount of nodes and > >> even using one node only. The problem does not arise when I put > >> printf-debugging information into the code. 
This pointed me into the > >> direction that I have some memory problem, where some write accesses some > >> memory it is not supposed to. > >> I ran the tests using valgrind with --leak-check=full and > >> --show-reachable=yes, which pointed me either to MPI_Isend or MPI_Wait > >> depending on whether I had the threads spin in a loop for MPI_Test to > >> return success or used MPI_Wait respectively. > >> > >> I would appreciate your help with this. Am I missing something important > >> here? Is there a way to re-use the request in the different iterations > >> other than I thought it should work? > >> Or is there a way to re-initialize the allocated memory before the > >> MPI_Isend/recv so that I at least don't have to call free and malloc each > >> time? > >> > >> Thank you very much for your help! > >> Kind regards, > >> David B?ttner > >> > >> _____________________ > >> Pseudo-Code of program: > >> > >> MPI_Request* req_s; > >> MPI_Request* req_w; > >> OUTER-LOOP > >> if(0 == threadid) > >> { > >> req_s = malloc(sizeof(MPI_Request)); > >> req_r = malloc(sizeof(MPI_Request)); > >> MPI_Isend(..., req_s) > >> MPI_Irecv(..., req_r) > >> } > >> pthread_barrier > >> INNER-LOOP (while NOT_DONE or RET) > >> if(TRYLOCK&& NOT_DONE) > >> { > >> if(MPI_TEST(req_r)) > >> { > >> Call_Function_A; > >> NOT_DONE = 0; > >> } > >> > >> } > >> RET = Call_Function_B; > >> } > >> pthread_barrier_wait > >> if(0 == threadid) > >> { > >> MPI_WAIT(req_s) > >> MPI_WAIT(req_r) > >> free(req_s); > >> free(req_r); > >> } > >> _____________ > >> > >> > >> -- > >> David B?ttner, Informatik, Technische Universit?t M?nchen > >> TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676 > >> > >> _______________________________________________ > >> users mailing list > >> us...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/users > > "To preserve the freedom of the human mind then and freedom of the press, > > every spirit should be ready to devote itself to martyrdom; for as long as > > we may think as we will, and speak as we think, the condition of man will > > proceed in improvement." > > -- Thomas Jefferson, 1799 > > > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > -- > David B?ttner, Informatik, Technische Universit?t M?nchen > TUM I-10 - FMI 01.06.059 - Tel. 089 / 289-17676 > > > > ------------------------------ > > Message: 11 > Date: Fri, 20 May 2011 06:23:21 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Trouble with MPI-IO > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <a5b121e9-e664-49d0-ae54-2cfe52712...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > On May 19, 2011, at 11:24 PM, Tom Rosmond wrote: > > > What fortran compiler did you use? > > gfortran. > > > In the original script my Intel compile used the -132 option, > > allowing up to that many columns per line. > > Gotcha. > > >> x.f90:99.77: > >> > >> call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr) > >> 1 > >> Error: There is no specific subroutine for the generic 'mpi_type_indexed' > >> at (1) > > > > Hmmm, very strange, since I am looking right at the MPI standard > > documents with that routine documented. I too get this compile failure > > when I switch to 'use mpi'. Could that be a problem with the Open MPI > > fortran libraries??? 
> > I think that that error is telling us that there's a compile-time mismatch -- > that the signature of what you've passed doesn't match the signature of > OMPI's MPI_Type_indexed subroutine. > > >> I looked at our mpi F90 module and see the following: > >> > >> interface MPI_Type_indexed > >> subroutine MPI_Type_indexed(count, array_of_blocklengths, > >> array_of_displacements, oldtype, newtype, ierr) > >> integer, intent(in) :: count > >> integer, dimension(*), intent(in) :: array_of_blocklengths > >> integer, dimension(*), intent(in) :: array_of_displacements > >> integer, intent(in) :: oldtype > >> integer, intent(out) :: newtype > >> integer, intent(out) :: ierr > >> end subroutine MPI_Type_indexed > >> end interface > > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 12 > Date: Fri, 20 May 2011 07:26:19 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] MPI_Alltoallv function crashes when np > 100 > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <f9f71854-b9dd-459f-999d-8a8aef8d6...@cisco.com> > Content-Type: text/plain; charset=GB2312 > > I missed this email in my INBOX, sorry. > > Can you be more specific about what exact error is occurring? You just say > that the application crashes...? Please send all the information listed here: > > http://www.open-mpi.org/community/help/ > > > On Apr 26, 2011, at 10:51 PM, ?????? wrote: > > > It seems that the const variable SOMAXCONN who used by listen() system call > > causes this problem. Can anybody help me resolve this question? > > > > 2011/4/25 ?????? <xjun.m...@gmail.com> > > Dear all, > > > > As I mentioned, when I mpiruned an application with the parameter "np = > > 150(or bigger)", the application who used the MPI_Alltoallv function would > > carsh. The problem would recur no matter how many nodes we used. > > > > The edition of OpenMPI: 1.4.1 or 1.4.3 > > The OS: linux redhat 2.6.32 > > > > BTW, my nodes had enough memory to run the application, and the > > MPI_Alltoall function worked well at my environment. > > Did anybody meet the same problem? Thanks. > > > > > > Best Regards > > > > > > > > > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 13 > Date: Fri, 20 May 2011 07:28:28 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Allreduce() error, > but only sometimes... > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <caef632e-757b-49ee-b545-5cccbc712...@cisco.com> > Content-Type: text/plain; charset=us-ascii > > Sorry for the super-late reply. :-\ > > Yes, ERR_TRUNCATE means that the receiver didn't have a large enough buffer. > > Have you tried upgrading to a newer version of Open MPI? 1.4.3 is the current > stable release (I have a very dim and not guaranteed to be correct > recollection that we fixed something in the internals of collectives > somewhere with regards to ERR_TRUNCATE...?). > > > On Apr 25, 2011, at 4:44 PM, Wei Hao wrote: > > > Hi: > > > > I'm running openmpi 1.2.8. 
I'm working on a project where one part involves > > communicating an integer, representing the number of data points I'm > > keeping track of, to all the processors. The line is simple: > > > > MPI_Allreduce(&np,&geo_N,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD); > > > > where np and geo_N are integers, np is the result of a local calculation, > > and geo_N has been declared on all the processors. geo_N is nondecreasing. > > This line works the first time I call it (geo_N goes from 0 to some other > > integer), but if I call it later in the program, I get the following error: > > > > > > [woodhen-039:26189] *** An error occurred in MPI_Allreduce > > [woodhen-039:26189] *** on communicator MPI_COMM_WORLD > > [woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated > > [woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye) > > > > > > As I understand it, MPI_ERR_TRUNCATE means that the output buffer is too > > small, but I'm not sure where I've made a mistake. It's particularly > > frustrating because it seems to work fine the first time. Does anyone have > > any thoughts? > > > > Thanks > > Wei > > _______________________________________________ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > > > ------------------------------ > > Message: 14 > Date: Fri, 20 May 2011 08:14:07 -0400 > From: Jeff Squyres <jsquy...@cisco.com> > Subject: Re: [OMPI users] Trouble with MPI-IO > To: Open MPI Users <us...@open-mpi.org> > Message-ID: <42db03b3-9cf4-4acb-aa20-b857e5f76...@cisco.com> > Content-Type: text/plain; charset="us-ascii" > > On May 20, 2011, at 6:23 AM, Jeff Squyres wrote: > > > Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays? > > Ok, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile > error (even though they're allocatable -- so allocate was a red herring, > sorry). That's all that "use mpi" is complaining about -- that the function > signatures didn't match. > > use mpi is your friend -- even if you don't use F90 constructs much. > Compile-time checking is Very Good Thing (you were effectively "getting > lucky" by passing in the 2D arrays, I think). > > Attached is my final version. And with this version, I see the hang when > running it with the "T" parameter. > > That being said, I'm not an expert on the MPI IO stuff -- your code *looks* > right to me, but I could be missing something subtle in the interpretation of > MPI_FILE_SET_VIEW. I tried running your code with MPICH 1.3.2p1 and it also > hung. > > Rob (ROMIO guy) -- can you comment this code? Is it correct? > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > -------------- next part -------------- > A non-text attachment was scrubbed... > Name: x.f90 > Type: application/octet-stream > Size: 3820 bytes > Desc: not available > URL: > <http://www.open-mpi.org/MailArchives/users/attachments/20110520/53a5461b/attachment.obj> > > ------------------------------ > > _______________________________________________ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users > > End of users Digest, Vol 1911, Issue 1 > **************************************
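As a footnote to the MPI-IO thread above (messages 8, 9, 11 and 14): a minimal C sketch of the same call sequence, showing that the blocklength/displacement arguments to MPI_Type_indexed are flat 1-D integer arrays and that the committed indexed type then serves as the filetype for the view used by the collective write. The file name, sizes, and values are made up for illustration; the original test program is in Fortran and is not reproduced here.

  /* Hypothetical, single-rank illustration (like the original test):
   * build an indexed type with monotonically increasing displacements,
   * set it as the file view, and do a collective write. */
  #include <mpi.h>

  #define NBLK 4

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      float data[8] = {0, 1, 2, 3, 4, 5, 6, 7};
      int blocklens[NBLK] = {2, 2, 2, 2};       /* 1-D arrays, not 2-D */
      int displs[NBLK]    = {0, 2, 4, 6};       /* monotonically increasing */

      MPI_Datatype filetype;
      MPI_Type_indexed(NBLK, blocklens, displs, MPI_FLOAT, &filetype);
      MPI_Type_commit(&filetype);

      MPI_File fh;
      MPI_File_open(MPI_COMM_WORLD, "testio.dat",
                    MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
      MPI_File_set_view(fh, 0, MPI_FLOAT, filetype, "native", MPI_INFO_NULL);

      /* Collective write at offset 0 of the view. */
      MPI_File_write_at_all(fh, 0, data, 8, MPI_FLOAT, MPI_STATUS_IGNORE);

      MPI_File_close(&fh);
      MPI_Type_free(&filetype);
      MPI_Finalize();
      return 0;
  }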