Re: [OMPI users] Problem with MPI_Request, MPI_Isend/recv and MPI_Wait/Test
Hello, thanks for the quick answer. I am sorry that I forgot to mention this: I did compile OpenMPI with MPI_THREAD_MULTIPLE support, and I do test whether required == provided after the MPI_Init_thread call.

> I do not see any mechanism for protecting the accesses to the requests to a single thread? What is the thread model you're using?

Again I am sorry that this was not clear: in the pseudo code below I wanted to indicate the access protection I do via thread-id-dependent calls (if(0 == thread_id)) and via trylock(...) on pthread mutexes. In the code, all accesses concerning one MPI_Request (which are pthread-global pointers in my case) are protected and happen in sequential order, i.e. MPI_Isend/recv returns before any thread is allowed to call the corresponding MPI_Test, and no one can call MPI_Test any more once a thread is allowed to call MPI_Wait.

I did this in the same manner before with other MPI implementations, and also on the same machine with the same (untouched) OpenMPI installation, likewise using pthreads and MPI in combination, but there I used

    MPI_Request req;

instead of

    MPI_Request* req;
    (and later) req = (MPI_Request*)malloc(sizeof(MPI_Request));

In my recent (problem) code I also tried not using pointers, but got the same problem.

Also, as I described in the first mail, I tried everything concerning the memory allocation of the MPI_Request objects. I tried not calling malloc. I guessed this wouldn't work, but the OpenMPI documentation says this:

"Nonblocking calls allocate a communication request object and associate it with the request handle (the argument request)." [http://www.open-mpi.org/doc/v1.4/man3/MPI_Isend.3.php]

and

"[...] if the communication object was created by a nonblocking send or receive, then it is deallocated and the request handle is set to MPI_REQUEST_NULL." [http://www.open-mpi.org/doc/v1.4/man3/MPI_Test.3.php]

and (in slightly different words) [http://www.open-mpi.org/doc/v1.4/man3/MPI_Wait.3.php]

So I thought that it might do some kind of optimized memory handling internally. I also tried allocating req (for each used MPI_Request) once before the first use and deallocating it after the last use (which I thought was the way it was supposed to work), but that crashes also. I tried replacing the pointers with global variables (MPI_Request req;), which didn't do the job either...

The only thing that seems to work is what I mentioned below: allocate it every time I am going to need it in the MPI_Isend/recv, use it in MPI_Test/Wait, and after that deallocate it by hand each time. I don't think it is supposed to work like this, since I would have to call malloc and free so often (for multiple MPI_Request objects in each iteration) that it would most likely limit performance...

Anyway, I still have the same problem and am still unclear on what kind of memory allocation I should be doing for the MPI_Requests. Is there anything else (besides MPI_THREAD_MULTIPLE support, thread access control, and sequential order of MPI_Isend/recv, MPI_Test and MPI_Wait for one MPI_Request object) I need to take care of? If not, what could I do to find the source of my problem?

Thanks again for any kind of help!

Kind regards,
David

> From an implementation perspective, your code is correct only if you initialize the MPI library with MPI_THREAD_MULTIPLE and if the library accepts. Otherwise, there is an assumption that the application is single threaded, or that the MPI behavior is implementation dependent. Please read the MPI standard regarding MPI_Init_thread for more details.
Regards,
george.

On May 19, 2011, at 02:34, David Büttner wrote:

Hello,

I am working on a hybrid MPI (OpenMPI 1.4.3) and Pthread code. I am using MPI_Isend and MPI_Irecv for communication and MPI_Test/MPI_Wait to check if it is done. I do this repeatedly in the outer loop of my code. The MPI_Test is used in the inner loop to check if some function can be called which depends on the received data.

The program regularly crashed (only when not using printf...) and after debugging it I figured out the following problem: in MPI_Isend I have an invalid read of memory. I worked around the problem by not re-using

    MPI_Request req_s, req_r;

but instead using

    MPI_Request* req_s;
    MPI_Request* req_r;

and re-allocating them before each MPI_Isend/recv.

The documentation says that in MPI_Wait and MPI_Test (if successful) the request objects are deallocated and set to MPI_REQUEST_NULL. It also says that MPI_Isend and MPI_Irecv allocate the objects and associate them with the request handle. As I understand this, it either means I can use a pointer to MPI_Request which I don't have to initialize for this (it doesn't work but crashes), or that I can use an MPI_Request pointer which I have initialized with malloc(sizeof(MPI_Request)) (or pass the address of an MPI_Request req), which is set and unset in the functions. But this version crashes, too. What
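To make the request lifecycle being discussed here concrete, below is a minimal single-threaded C sketch (not David's code; the rank pairing and iteration count are made up for illustration). The point it demonstrates: MPI_Isend/MPI_Irecv fill in a plain MPI_Request handle, and a successful MPI_Test or MPI_Wait releases the internal communication object and resets the handle to MPI_REQUEST_NULL, so the handle itself never needs to be malloc'd or freed.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, peer, i, done;
        int sendbuf, recvbuf;
        MPI_Request req_s, req_r;   /* plain handles on the stack, reused every iteration */

        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = rank ^ 1;            /* pair ranks 0-1, 2-3, ...; assumes an even number of ranks */

        for (i = 0; i < 10; i++) {
            sendbuf = rank * 100 + i;
            MPI_Irecv(&recvbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req_r);
            MPI_Isend(&sendbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &req_s);

            done = 0;
            while (!done) {         /* work that depends only on local data could go here */
                MPI_Test(&req_r, &done, MPI_STATUS_IGNORE);
            }
            MPI_Wait(&req_s, MPI_STATUS_IGNORE);
            /* both handles are now MPI_REQUEST_NULL and can simply be reused */
        }

        MPI_Finalize();
        return 0;
    }

In the multi-threaded setting of this thread, the extra requirement is only the ordering David already describes (the Isend/Irecv call returns before any thread calls MPI_Test, and MPI_Test is no longer called once a thread moves on to MPI_Wait); the handles themselves can still be ordinary variables.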
Re: [OMPI users] Trouble with MPI-IO
On May 19, 2011, at 11:24 PM, Tom Rosmond wrote:

> What fortran compiler did you use?

gfortran.

> In the original script my Intel compile used the -132 option, allowing up to that many columns per line.

Gotcha.

>> x.f90:99.77:
>>
>>    call mpi_type_indexed(lenij,ijlena,ijdisp,mpi_real,ij_vector_type,ierr)
>>                                                                             1
>> Error: There is no specific subroutine for the generic 'mpi_type_indexed' at (1)
>
> Hmmm, very strange, since I am looking right at the MPI standard documents with that routine documented. I too get this compile failure when I switch to 'use mpi'. Could that be a problem with the Open MPI fortran libraries???

I think that that error is telling us that there's a compile-time mismatch -- that the signature of what you've passed doesn't match the signature of OMPI's MPI_Type_indexed subroutine.

>> I looked at our mpi F90 module and see the following:
>>
>> interface MPI_Type_indexed
>>   subroutine MPI_Type_indexed(count, array_of_blocklengths, array_of_displacements, oldtype, newtype, ierr)
>>     integer, intent(in) :: count
>>     integer, dimension(*), intent(in) :: array_of_blocklengths
>>     integer, dimension(*), intent(in) :: array_of_displacements
>>     integer, intent(in) :: oldtype
>>     integer, intent(out) :: newtype
>>     integer, intent(out) :: ierr
>>   end subroutine MPI_Type_indexed
>> end interface

Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
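For readers hitting the same generic-interface error: the block-length and displacement arguments to MPI_Type_indexed are flat, one-dimensional integer arrays, exactly as the quoted F90 interface declares. A minimal C sketch of the equivalent call follows (illustrative values only, not Tom's code; MPI_FLOAT stands in for the mpi_real used in the Fortran call):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        /* Illustrative layout: 4 blocks of 2 elements at element displacements 0, 10, 20, 30 */
        int          lenij     = 4;
        int          ijlena[4] = { 2, 2, 2, 2 };      /* 1D array of block lengths */
        int          ijdisp[4] = { 0, 10, 20, 30 };   /* 1D array of displacements */
        MPI_Datatype ij_vector_type;

        MPI_Init(&argc, &argv);

        MPI_Type_indexed(lenij, ijlena, ijdisp, MPI_FLOAT, &ij_vector_type);
        MPI_Type_commit(&ij_vector_type);

        /* ... use ij_vector_type, e.g. as a file view or in point-to-point calls ... */

        MPI_Type_free(&ij_vector_type);
        MPI_Finalize();
        return 0;
    }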
Re: [OMPI users] MPI_Alltoallv function crashes when np > 100
I missed this email in my INBOX, sorry. Can you be more specific about what exact error is occurring? You just say that the application crashes...?

Please send all the information listed here: http://www.open-mpi.org/community/help/

On Apr 26, 2011, at 10:51 PM, 孟宪军 wrote:

> It seems that the const variable SOMAXCONN, which is used by the listen() system call, causes this problem. Can anybody help me resolve this question?
>
> 2011/4/25 孟宪军
> Dear all,
>
> As I mentioned, when I ran an application with mpirun with the parameter "np = 150 (or bigger)", the application that used the MPI_Alltoallv function would crash. The problem would recur no matter how many nodes we used.
>
> The version of OpenMPI: 1.4.1 or 1.4.3
> The OS: Linux Red Hat 2.6.32
>
> BTW, my nodes had enough memory to run the application, and the MPI_Alltoall function worked well in my environment. Did anybody meet the same problem? Thanks.
>
> Best Regards

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only sometimes...
Sorry for the super-late reply. :-\ Yes, ERR_TRUNCATE means that the receiver didn't have a large enough buffer. Have you tried upgrading to a newer version of Open MPI? 1.4.3 is the current stable release (I have a very dim and not guaranteed to be correct recollection that we fixed something in the internals of collectives somewhere with regards to ERR_TRUNCATE...?). On Apr 25, 2011, at 4:44 PM, Wei Hao wrote: > Hi: > > I'm running openmpi 1.2.8. I'm working on a project where one part involves > communicating an integer, representing the number of data points I'm keeping > track of, to all the processors. The line is simple: > >MPI_Allreduce(&np,&geo_N,1,MPI_INT,MPI_MAX,MPI_COMM_WORLD); > > where np and geo_N are integers, np is the result of a local calculation, and > geo_N has been declared on all the processors. geo_N is nondecreasing. This > line works the first time I call it (geo_N goes from 0 to some other > integer), but if I call it later in the program, I get the following error: > > >[woodhen-039:26189] *** An error occurred in MPI_Allreduce >[woodhen-039:26189] *** on communicator MPI_COMM_WORLD >[woodhen-039:26189] *** MPI_ERR_TRUNCATE: message truncated >[woodhen-039:26189] *** MPI_ERRORS_ARE_FATAL (goodbye) > > > As I understand it, MPI_ERR_TRUNCATE means that the output buffer is too > small, but I'm not sure where I've made a mistake. It's particularly > frustrating because it seems to work fine the first time. Does anyone have > any thoughts? > > Thanks > Wei > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
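For anyone puzzled by the error class itself: MPI_ERR_TRUNCATE is raised when a matched receive supplies a smaller buffer than the incoming message. A contrived two-rank C sketch (unrelated to Wei's program) that triggers it under the default MPI_ERRORS_ARE_FATAL handler, producing an abort much like the log above:

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        int data[4] = { 1, 2, 3, 4 };
        int small[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* 4 ints sent ... */
        } else if (rank == 1) {
            /* ... but only room for 2: the message is truncated, and the default
               error handler aborts the job with MPI_ERR_TRUNCATE. */
            MPI_Recv(small, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

In the MPI_Allreduce case, the most common application-level cause is different ranks passing different counts or datatypes to the same call, so verifying that every rank reaches the Allreduce with the same signature is a reasonable first check, alongside the upgrade Jeff suggests.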
Re: [OMPI users] Trouble with MPI-IO
On May 20, 2011, at 6:23 AM, Jeff Squyres wrote:

> Shouldn't ijlena and ijdisp be 1D arrays, not 2D arrays?

OK, if I convert ijlena and ijdisp to 1D arrays, I don't get the compile error (even though they're allocatable -- so allocate was a red herring, sorry). That's all that "use mpi" is complaining about -- that the function signatures didn't match.

use mpi is your friend -- even if you don't use F90 constructs much. Compile-time checking is a Very Good Thing (you were effectively "getting lucky" by passing in the 2D arrays, I think).

Attached is my final version. And with this version, I see the hang when running it with the "T" parameter. That being said, I'm not an expert on the MPI IO stuff -- your code *looks* right to me, but I could be missing something subtle in the interpretation of MPI_FILE_SET_VIEW.

I tried running your code with MPICH 1.3.2p1 and it also hung. Rob (ROMIO guy) -- can you comment on this code? Is it correct?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/

[Attachment: x.f90]
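For context on what MPI_FILE_SET_VIEW is supposed to do in code like this: the displacement, etype and filetype passed to it define which parts of the file each rank sees, and a subsequent collective write fills exactly those parts. A bare-bones C sketch of that pattern follows (not Tom's program -- the indexed layout, file name and counts are invented for illustration):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int          rank, i;
        int          buf[4];
        int          blocklens[2] = { 2, 2 };
        int          displs[2];
        MPI_Datatype filetype;
        MPI_File     fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 4; i++) buf[i] = rank;   /* each rank writes its own rank number */

        /* Two blocks of two ints per rank, at rank-dependent element offsets (made up). */
        displs[0] = 4 * rank;
        displs[1] = 4 * rank + 2;
        MPI_Type_indexed(2, blocklens, displs, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);

        MPI_File_open(MPI_COMM_WORLD, "out.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
        /* displacement 0, elementary type MPI_INT, file layout given by the indexed type */
        MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);
        MPI_File_write_all(fh, buf, 4, MPI_INT, MPI_STATUS_IGNORE);   /* collective write */
        MPI_File_close(&fh);

        MPI_Type_free(&filetype);
        MPI_Finalize();
        return 0;
    }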
[OMPI users] Issue with mpicc --showme in windows
Hello,

On my Windows machine, if I run mpicc --showme, I get erroneous output like the one below:

**
C:\>C:\Users\BAAMARNA5617\Programs\mpi\OpenMPI_v1.5.3-win32\bin\mpicc.exe --showme
Cannot open configuration file C:/Users/hpcfan/Documents/OpenMPI/openmpi-1.5.3/installed-32/share/openmpi\mpif77.exe-wrapper-data.txt
Error parsing data file mpif77.exe: Not found
**

I installed OpenMPI from http://www.open-mpi.org/software/ompi/v1.5/downloads/OpenMPI_v1.5.3-2_win32.exe and ended up with this error. (I read in a forum that the 1.4 version of OpenMPI does not support Fortran bindings, and hence obtained one of the recent releases.)

Hope to fix this soon.

With thanks and regards
Balachandar
Re: [OMPI users] TotalView Memory debugging and OpenMPI
Thanks Ralph. I've seen the messages generated in b...@open-mpi.org, so I figured something was up! I was going to provide the unified diff, but then ran into another issue in testing where we immediately ran into a seg fault, even with this fix. It turns out that pre-pending /lib64 (and maybe /usr/lib64) to LD_LIBRARY_PATH works around that one though, so I don't think it's directly related, but it threw me off, along with the beta testing we're doing...

Cheers,
PeterT

Ralph Castain wrote:
Okay, I finally had time to parse this and fix it. Thanks!

On May 16, 2011, at 1:02 PM, Peter Thompson wrote:
Hmmm? We're not removing the putenv() calls. Just adding a strdup() beforehand, and then calling putenv() with the string duplicated from env[j]. Of course, if the strdup fails, then we bail out. As for why it's suddenly a problem, I'm not quite as certain. The problem we do show is a double free, so someone has already freed that memory used by putenv(), and I do know that while that used to be just flagged as an event before, now we seem to be unable to continue past it. Not sure if that is our change or a library/system change.
PeterT

Ralph Castain wrote:
On May 16, 2011, at 12:45 PM, Peter Thompson wrote:
Hi Ralph,
We've had a number of user complaints about this. Since it seems on the face of it that it is a debugger issue, it may not have made its way back here. Is your objection that the patch basically aborts if it gets a bad value? I could understand that being a concern. Of course, it aborts on TotalView now if we attempt to move forward without this patch.

No - my concern is that you appear to be removing the "putenv" calls. OMPI places some values into the local environment so the user can control behavior. Removing those causes problems. What I need to know is why, after it has worked with TV for years, these putenv's are suddenly a problem. Is the problem occurring during shutdown? Or is this something that causes TV to break?

I've passed your comment back to the engineer, with a suspicion about the concerns about the abort, but if you have other objections, let me know.
Cheers,
PeterT

Ralph Castain wrote:
That would be a problem, I fear. We need to push those envars into the environment. Is there some particular problem causing what you see? We have no other reports of this issue, and orterun has had that code forever.
Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson wrote:
We've gotten a few reports of problems with memory debugging when using OpenMPI under TotalView. Usually, TotalView will attach to the processes started after an MPI_Init. However, in the case where memory debugging is enabled, things seemed to run away or fail. My analysis showed that we had a number of core files left over from the attempt, and all were mpirun (or orterun) cores. It seemed to be a regression on our part, since testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it to engineering. After giving our engineer a brief tutorial on how to build a debug version of OpenMPI, he found what appears to be a problem in the code for orterun.c. He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far. He doesn't subscribe to this list that I know of, so I offered to pass this by the group. Of course, I'm not sure if this is exactly the right place to submit patches, but I'm sure you'd tell me where to put it if I'm in the wrong here.
It's a short patch, so I'll cut and paste it, and attach it as well, since cut and paste can do weird things to formatting. Credit goes to Ariel Burton for this patch. Of course he used TotalView to find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview mpirun -a -np 4 ./foo'

Cheers,
PeterT

more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***************
*** 1578,1588 ****
      }
      if (NULL != env) {
          size1 = opal_argv_count(env);
          for (j = 0; j < size1; ++j) {
!             putenv(env[j]);
          }
      }
      /* All done */
--- 1578,1600 ----
      }
      if (NULL != env) {
          size1 = opal_argv_count(env);
          for (j = 0; j < size1; ++j) {
!             /* Use-after-Free error possible here. putenv does not copy
!                the string passed to it, and instead stores only the pointer.
!                env[j] may be freed later, in which case the pointer
!                in environ will now be left dangling into a deallocated
!                region.
!                So we make a copy of the variable.
!             */
!
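Since the quoted hunk is cut off above, here is the shape of the fix Peter describes (duplicate the string, bail out if the allocation fails, then hand the copy to putenv) as a small self-contained C sketch. This is a paraphrase of the description, not the literal Open MPI patch; put_env_copy and OMPI_EXAMPLE_VAR are made-up names for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* putenv() stores the pointer it is given, so pass it a private copy of the
       string instead of a buffer that may be freed later. */
    static int put_env_copy(const char *assignment)
    {
        char *copy = strdup(assignment);    /* a copy that putenv() can keep forever */
        if (NULL == copy) {
            return -1;                      /* bail out if the allocation fails */
        }
        return putenv(copy);
    }

    int main(void)
    {
        char transient[] = "OMPI_EXAMPLE_VAR=42";   /* imagine this buffer being freed later */
        if (put_env_copy(transient) != 0) {
            fprintf(stderr, "failed to set variable\n");
            return 1;
        }
        printf("OMPI_EXAMPLE_VAR=%s\n", getenv("OMPI_EXAMPLE_VAR"));
        return 0;
    }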
Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
We are still struggling with these problems. Actually, the new version of the Intel compilers does not seem to be the real issue: we clash against the same errors using the `gcc' compilers as well. We succeeded in building an openmpi-1.2.8 rpm (with different compiler flavours) from the installation of the cluster section where all seems to work well. We are now doing a severe IMB benchmark campaign.

However, yes, this happens only when we use --mca btl openib,self; on the contrary, if we use --mca btl_tcp_if_include ib0, all works well.

Yes, we can try the flag you suggest. I can check the FAQ and the open-mpi.org documentation, but can you be so kind as to explain the meaning of this flag?

Thanks

Salvatore Podda

On 20/mag/11, at 03:37, Jeff Squyres wrote:

Sorry for the late reply. Other users have seen something similar but we have never been able to reproduce it. Is this only when using IB? If you use "mpirun --mca btl_openib_cpc_if_include rdmacm", does the problem go away?

On May 11, 2011, at 6:00 PM, Marcus R. Epperson wrote:

I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the collectives hangs go away. I don't know what, if anything, the higher optimization buys you when compiling openmpi, so I'm not sure if that's an acceptable workaround or not.

My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a single iteration of Barrier to reproduce the hang, and it happens 100% of the time for me when I invoke it like this:

# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks) and each of the participating ranks has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems to rule out the sm btl (or interactions with it) as a culprit at least. I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus

On 05/10/2011 03:37 AM, Salvatore Podda wrote:

Dear all,

we succeeded in building several versions of openmpi, from 1.2.8 to 1.4.3, with Intel composer XE 2011 (aka 12.0). However, we found a threshold in the number of cores (depending on the application: IMB, xhpl or user applications, and on the number of required cores) above which the application hangs (a sort of deadlock). The building of openmpi with 'gcc' and 'pgi' does not show the same limits. Are there any known incompatibilities of openmpi with this version of the Intel compilers?

The characteristics of our computational infrastructure are:
Intel processors E7330, E5345, E5530 and E5620
CentOS 5.3, CentOS 5.5
Intel composer XE 2011
gcc 4.1.2
pgi 10.2-1

Regards

Salvatore Podda

ENEA UTICT-HPC
Department for Computer Science Development and ICT Facilities
Laboratory for Science and High Performance Computing
C.R. Frascati
Via E. Fermi, 45 PoBox 65
00044 Frascati (Rome) Italy

Tel: +39 06 9400 5342
Fax: +39 06 9400 5551
Fax: +39 06 9400 5735
E-mail: salvatore.po...@enea.it
Home Page: www.cresco.enea.it

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Openib with > 32 cores per node
Hi,

Thanks for getting back to me (and thanks to Jeff for the explanation too).

On Thu, 2011-05-19 at 09:59 -0600, Samuel K. Gutierrez wrote:
> Hi,
>
> On May 19, 2011, at 9:37 AM, Robert Horton wrote
>
> > On Thu, 2011-05-19 at 08:27 -0600, Samuel K. Gutierrez wrote:
> >> Hi,
> >>
> >> Try the following QP parameters that only use shared receive queues.
> >>
> >> -mca btl_openib_receive_queues S,12288,128,64,32:S,65536,128,64,32
> >>
> >
> > Thanks for that. If I run the job over 2 x 48 cores it now works and the
> > performance seems reasonable (I need to do some more tuning) but when I
> > go up to 4 x 48 cores I'm getting the same problem:
> >
> > [compute-1-7.local][[14383,1],86][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_oob.c:464:qp_create_one] error creating qp errno says Cannot allocate memory
> > [compute-1-7.local:18106] *** An error occurred in MPI_Isend
> > [compute-1-7.local:18106] *** on communicator MPI_COMM_WORLD
> > [compute-1-7.local:18106] *** MPI_ERR_OTHER: known error not in list
> > [compute-1-7.local:18106] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> >
> > Any thoughts?
>
> How much memory does each node have? Does this happen at startup?

Each node has 64GB of RAM. The error happens fairly soon after the job starts.

> Try adding:
>
> -mca btl_openib_cpc_include rdmacm

Ah - that looks much better. I can now run hpcc over all 15 x 48 cores. I need to look at the performance in a bit more detail but it seems to be "reasonable" at least :)

One thing is puzzling me - when I compile OpenMPI myself it seems to lack rdmacm support - however the one compiled by the OFED install process does include it. I'm compiling with:

'--prefix=/share/apps/openmpi/1.4.3/gcc' '--with-sge' '--with-openib' '--enable-openib-rdmacm'

Any idea what might be going on there?

> I'm not sure if your version of OFED supports this feature, but maybe using XRC may help. I **think** other tweaks are needed to get this going, but I'm not familiar with the details.

I'm using the QLogic (QLE7340) rather than Mellanox cards, so that doesn't seem to be an option for me (?). It would be interesting to know how much difference it would make though...

Thanks again for your help and have a good weekend.

Rob

-- 
Robert Horton
System Administrator (Research Support) - School of Mathematical Sciences
Queen Mary, University of London
r.hor...@qmul.ac.uk - +44 (0) 20 7882 7345
Re: [OMPI users] Openib with > 32 cores per node
If you're using QLogic, you might want to try the native PSM Open MPI support rather than the verbs support. QLogic cards only "sorta" support verbs in order to say that they're OFED-compliant; their native PSM interface is more performant than verbs for MPI. Assuming you built OMPI with PSM support:

mpirun --mca pml cm --mca mtl psm

(although probably just the pml/cm setting is sufficient -- the mtl/psm option will probably happen automatically)

See the OMPI README file for some more details about MTLs, PMLs, etc. (look for "psm"/i in the file)

On May 20, 2011, at 10:19 AM, Robert Horton wrote:

> Hi,
>
> Thanks for getting back to me (and thanks to Jeff for the explanation too).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
Hi Salvatore

Just in case ... You say you have problems when you use "--mca btl openib,self". Is this a typo in your email? I guess this will disable the shared memory btl intra-node, whereas your other choice "--mca btl_tcp_if_include ib0" will not. Could this be the problem?

Here we use "--mca btl openib,self,sm" to enable the shared memory btl intra-node as well, and it works just fine on programs that do use collective calls.

My two cents,
Gus Correa

Salvatore Podda wrote:
We are still struggling with these problems. Actually, the new version of the Intel compilers does not seem to be the real issue: we clash against the same errors using the `gcc' compilers as well. We succeeded in building an openmpi-1.2.8 rpm (with different compiler flavours) from the installation of the cluster section where all seems to work well. We are now doing a severe IMB benchmark campaign. However, yes, this happens only when we use --mca btl openib,self; on the contrary, if we use --mca btl_tcp_if_include ib0, all works well. Yes, we can try the flag you suggest. I can check the FAQ and the open-mpi.org documentation, but can you be so kind as to explain the meaning of this flag?
Thanks
Salvatore Podda
Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 workgroup
I have verified that disabling UAC does not fix the problem. xhlp.exe starts, threads spin up on both machines, CPU usage is at 80-90% but no progress is ever made.

From this state, Ctrl-break on the head node yields the following output:

[REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] mca_oob_tcp_msg_recv: readv failed: Unknown error (108)
[REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to lifeline [[20816,0],0] lost
[REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to lifeline [[20816,0],0] lost
Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 workgroup
MPI can get through your firewall, right?

Damien

On 20/05/2011 12:53 PM, Jason Mackay wrote:
> I have verified that disabling UAC does not fix the problem. xhlp.exe starts, threads spin up on both machines, CPU usage is at 80-90% but no progress is ever made.
Re: [OMPI users] users Digest, Vol 1911, Issue 4
"MPI can get through your firewall, right?" As far as I can tell the firewall is not the problem - have tried it with firewalls disabled, automatic fw polices based on port requests from MPI, and with manual exception policies. > From: users-requ...@open-mpi.org > Subject: users Digest, Vol 1911, Issue 4 > To: us...@open-mpi.org > Date: Fri, 20 May 2011 14:58:40 -0400 > > Send users mailing list submissions to > us...@open-mpi.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.open-mpi.org/mailman/listinfo.cgi/users > or, via email, send a message with subject or body 'help' to > users-requ...@open-mpi.org > > You can reach the person managing the list at > users-ow...@open-mpi.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of users digest..." > > > Today's Topics: > > 1. Re: v1.5.3-x64 does not work on Windows 7 workgroup (Damien) > > > -- > > Message: 1 > Date: Fri, 20 May 2011 12:58:21 -0600 > From: Damien > Subject: Re: [OMPI users] v1.5.3-x64 does not work on Windows 7 > workgroup > To: Open MPI Users > Message-ID: <4dd6b9cd.8060...@khubla.com> > Content-Type: text/plain; charset="iso-8859-1"; Format="flowed" > > MPI can get through your firewall, right? > > Damien > > On 20/05/2011 12:53 PM, Jason Mackay wrote: > > I have verified that disabling UAC does not fix the problem. xhlp.exe > > starts, threads spin up on both machines, CPU usage is at 80-90% but > > no progress is ever made. > > > > >From this state, Ctrl-break on the head node yields the following output: > > > > [REMOTEMACHINE:02032] [[20816,1],0]-[[20816,0],0] > > mca_oob_tcp_msg_recv: readv failed: Unknown error (108) > > [REMOTEMACHINE:05064] [[20816,1],1]-[[20816,0],0] > > mca_oob_tcp_msg_recv: readv failed: Unknown error (108) > > [REMOTEMACHINE:05420] [[20816,1],2]-[[20816,0],0] > > mca_oob_tcp_msg_recv: readv failed: Unknown error (108) > > [REMOTEMACHINE:03852] [[20816,1],3]-[[20816,0],0] > > mca_oob_tcp_msg_recv: readv failed: Unknown error (108) > > [REMOTEMACHINE:05436] [[20816,1],4]-[[20816,0],0] > > mca_oob_tcp_msg_recv: readv failed: Unknown error (108) > > [REMOTEMACHINE:04416] [[20816,1],5]-[[20816,0],0] > > mca_oob_tcp_msg_recv: readv failed: Unknown error (108) > > [REMOTEMACHINE:02032] [[20816,1],0] routed:binomial: Connection to > > lifeline [[20816,0],0] lost > > [REMOTEMACHINE:05064] [[20816,1],1] routed:binomial: Connection to > > lifeline [[20816,0],0] lost > > [REMOTEMACHINE:05420] [[20816,1],2] routed:binomial: Connection to > > lifeline [[20816,0],0] lost > > [REMOTEMACHINE:03852] [[20816,1],3] routed:binomial: Connection to > > lifeline [[20816,0],0] lost > > [REMOTEMACHINE:05436] [[20816,1],4] routed:binomial: Connection to > > lifeline [[20816,0],0] lost > > [REMOTEMACHINE:04416] [[20816,1],5] routed:binomial: Connection to > > lifeline [[20816,0],0] lost > > > > > > > > > From: users-requ...@open-mpi.org > > > Subject: users Digest, Vol 1911, Issue 1 > > > To: us...@open-mpi.org > > > Date: Fri, 20 May 2011 08:14:13 -0400 > > > > > > Send users mailing list submissions to > > > us...@open-mpi.org > > > > > > To subscribe or unsubscribe via the World Wide Web, visit > > > http://www.open-mpi.org/mailman/listinfo.cgi/users > > > or, via email, send a message with subject or body 'help' to > > > users-requ...@open-mpi.org > > > > > > You can reach the person managing the list at > > > users-ow...@open-mpi.org > > > > > > When replying, please edit your Subject line so it is more specific > > > than "Re: Contents of 
users digest..." > > > > > > > > > Today's Topics: > > > > > > 1. Re: Error: Entry Point Not Found (Zhangping Wei) > > > 2. Re: Problem with MPI_Request, MPI_Isend/recv and > > > MPI_Wait/Test (George Bosilca) > > > 3. Re: v1.5.3-x64 does not work on Windows 7 workgroup (Jeff Squyres) > > > 4. Re: Error: Entry Point Not Found (Jeff Squyres) > > > 5. Re: openmpi (1.2.8 or above) and Intel composer XE 2011 (aka > > > 12.0) (Jeff Squyres) > > > 6. Re: Openib with > 32 cores per node (Jeff Squyres) > > > 7. Re: MPI_COMM_DUP freeze with OpenMPI 1.4.1 (Jeff Squyres) > > > 8. Re: Trouble with MPI-IO (Jeff Squyres) > > > 9. Re: Trouble with MPI-IO (Tom Rosmond) > > > 10. Re: Problem with MPI_Request, MPI_Isend/recv and > > > MPI_Wait/Test (David B?ttner) > > > 11. Re: Trouble with MPI-IO (Jeff Squyres) > > > 12. Re: MPI_Alltoallv function crashes when np > 100 (Jeff Squyres) > > > 13. Re: MPI_ERR_TRUNCATE with MPI_Allreduce() error, but only > > > sometimes... (Jeff Squyres) > > > 14. Re: Trouble with MPI-IO (Jeff Squyres) > > > > > > > > > -- > > > > > > Message: 1 > > > Date: Thu, 19 May 2011 09:13:53 -0700 (PDT) > > > From: Zhangping Wei > > > Subject: Re: [OMPI users] Error: Entry Point Not Found > > > To: us...@open-mpi.org > > > Message-ID: <101342.7961...