Re: [OMPI users] Windows: MPI_Allreduce() crashes when using MPI_DOUBLE_PRECISION
Hi Jeff,

> Can you send the info listed on the help page?

From the HELP page...

***For run-time problems:
1) Check the FAQ first. Really. This can save you a lot of time; many common problems and solutions are listed there.
   I couldn't find a relevant reference in the FAQ.
2) The version of Open MPI that you're using.
   I am using the pre-built openmpi-1.5.3 64-bit and 32-bit binaries on Windows 7. I also tried a locally built openmpi-1.5.2 using the Visual Studio 2008 32-bit compilers. I tried various compilers: VS-9 32-bit and VS-10 64-bit and the corresponding Intel ifort compiler.
3) The config.log file from the top-level Open MPI directory, if available (please compress!).
   Don't have one.
4) The output of the "ompi_info --all" command from the node where you're invoking mpirun.
   See the output of pre-built openmpi-1.5.3_x64/bin/ompi_info --all in the attachments.
5) If running on more than one node --
   I am running the test program on a single node.
6) A detailed description of what is failing.
   Already described in this post.
7) Please include information about your network:
   As I am running the test program on a local, single machine, this should not be required.

> You forgot ierr in the call to MPI_Finalize. You also paired
> DOUBLE_PRECISION data with MPI_INTEGER in the call to allreduce. And you
> mixed sndbuf and rcvbuf in the call to allreduce, meaning that when you
> print rcvbuf afterwards, it'll always still be 0.

As I am not a Fortran programmer, these were my mistakes!

> program Test_MPI
>     use mpi
>     implicit none
>
>     DOUBLE PRECISION rcvbuf(5), sndbuf(5)
>     INTEGER nproc, rank, ierr, n, i, ret
>
>     n = 5
>     do i = 1, n
>         sndbuf(i) = 2.0
>         rcvbuf(i) = 0.0
>     end do
>
>     call MPI_INIT(ierr)
>     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
>     call MPI_COMM_SIZE(MPI_COMM_WORLD, nproc, ierr)
>     write(*,*) "size=", nproc, ", rank=", rank
>     write(*,*) "start --, rcvbuf=", rcvbuf
>     CALL MPI_ALLREDUCE(sndbuf, rcvbuf, n,
>      &   MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
>     write(*,*) "end --, rcvbuf=", rcvbuf
>
>     CALL MPI_Finalize(ierr)
> end
>
> (you could use "include 'mpif.h'", too -- I tried both)
>
> This program works fine for me.

I am observing the same crash as described in this thread (when executing "mpirun -np 2 mar_f_dp.exe"), even with the above correct and simple test program. I commented out 'use mpi' because it gave me an "Error in compiled module file" error, so I used the 'include "mpif.h"' statement instead (see attachment). It seems to be a Windows-specific issue (I can run this test program on Linux with openmpi-1.5.1). Can anybody try this test program on Windows?

Thank you in advance.
-Hiral

Package: Open MPI hpcfan@VISCLUSTER25 Distribution
Open MPI: 1.5.3
Open MPI SVN revision: r24532
Open MPI release date: Mar 16, 2011
Open RTE: 1.5.3
Open RTE SVN revision: r24532
Open RTE release date: Mar 16, 2011
OPAL: 1.5.3
OPAL SVN revision: r24532
OPAL release date: Mar 16, 2011
Ident string: 1.5.3
MCA backtrace: none (MCA v2.0, API v2.0, Component v1.5.3)
MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.3)
MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.3)
MCA timer: windows (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: windows (MCA v2.0, API v2.0, Component v1.5.3)
MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.3)
MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.3)
MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.3)
MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.3)
MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.3)
MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.3)
MCA btl: self (MCA v2.0, API v2.0, Component v1.5.3)
MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.3)
MCA osc: rdma (MCA v2.0, API v2.0, Comp
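For readers more at home in C, here is a minimal C equivalent of the corrected test above (my own sketch, not code from the original poster; the buffer size and values mirror the Fortran program):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        double sndbuf[5], rcvbuf[5];
        int nproc, rank, i, n = 5;

        for (i = 0; i < n; ++i) {
            sndbuf[i] = 2.0;   /* every rank contributes 2.0 per element */
            rcvbuf[i] = 0.0;
        }

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        /* MPI_DOUBLE matches the double buffers here, analogous to
           MPI_DOUBLE_PRECISION matching DOUBLE PRECISION in Fortran. */
        MPI_Allreduce(sndbuf, rcvbuf, n, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        printf("rank %d of %d: rcvbuf[0] = %f\n", rank, nproc, rcvbuf[0]);

        MPI_Finalize();
        return 0;
    }

If the Fortran version crashes under the Windows build while a C version like this runs cleanly, that would point toward the Fortran bindings rather than the underlying MPI_Allreduce implementation.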
[OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
Hi,

I am observing the following message on the Windows platform...

c:\Users\oza\Desktop\test>mpirun
--
orterun:executable-not-specified
But I couldn't open the help file:
    C:\Users\hpcfan\Documents\OpenMPI\openmpi-1.5.3\installed-64\share\openmpi\help-orterun.txt: No such file or directory.  Sorry!
--

I copied the pre-built, installed "OpenMPI_v1.5.3-x64" directory from one Windows machine to another Windows machine. As discussed in some mailing-list threads, I also tried to set OPAL_PKGDATA and other OPAL_* environment variables, but the above message still persists.

Please suggest.

Thank you in advance.
-Hiral
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
It's my mistake; it should be the OPAL_PKGDATADIR env var instead of OPAL_DATADIR. With this it is working fine.

Thank you.
-Hiral
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
After setting OPAL_PKGDATADIR, "mpirun" gives the proper help message.

But when executing a simple test program which calls MPI_ALLREDUCE(), it gives the following errors on the console...

c:\ompi_tests\win64>mpirun mar_f_i_op.exe
[nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.3\orte\mca\ras\base\ras_base_allocate.c at line 147
[nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.3\orte\mca\plm\base\plm_base_launch_support.c at line 99
[nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file ..\..\..\openmpi-1.5.3\orte\mca\plm\ccp\plm_ccp_module.c at line 186

Any idea on these errors?

Clarification: I installed the pre-built OpenMPI_v1.5.3-x64 on Windows 7 and copied this directory onto a Windows Server 2008 machine.

Thank you in advance.
-Hiral
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
I don't know a lot about the Windows port, but that error means that mpirun got an error when trying to discover the allocated nodes.

On May 11, 2011, at 6:10 AM, hi wrote:

> After setting OPAL_PKGDATADIR, "mpirun" gives proper help message.
>
> But when executing simple test program which calls MPI_ALLREDUCE() it
> gives following errors onto the console...
>
> c:\ompi_tests\win64>mpirun mar_f_i_op.exe
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\ras\base\ras_base_allocate.c at line 147
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\plm\base\plm_base_launch_support.c at line 99
> [nbld-w08:04820] [[30632,0],0] ORTE_ERROR_LOG: Error in file
> ..\..\..\openmpi-1.5.3\orte\mca\plm\ccp\plm_ccp_module.c at line 186
>
> Any idea on these errors???
>
> Clarification: I installed pre-built OpenMPI_v1.5.3-x64 on Windows 7
> and copied this directory into Windows Server 2008.
>
> Thank you in advance.
> -Hiral
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] is there an equiv of iprove for bcast?
I'm not so much worried about the "load" that N pending ibcasts would cause; the "load" will be zero until the broadcast actually fires. But I'm concerned about the pending resource usage (i.e., how many internal network and collective resources will be slurped up into hundreds or thousands of pending broadcasts).

You might want to have a tiered system, instead. Have a tree-based communication pattern where each worker has a "parent" who does the actual broadcasting; each broadcaster can have tens of children (for example). You could even have an N-level tree, perhaps grouping your children by server rack and/or network topology. That way, you can have a small number of processes at the top of the tree that do an actual broadcast. The rest can use a (relatively) small number of non-blocking sends and receives.

Or, when non-blocking collectives become available, you can have everyone in pending ibcasts with the small number of broadcasters (i.e., N broadcasters for M processes, where N << M), which wouldn't be nearly as resource-heavy as M pending ibcasts.

Or something like that... just throwing some ideas out there for you...

On May 10, 2011, at 7:14 PM, Randolph Pullen wrote:

> Thanks,
>
> The messages are small and frequent (they flash metadata across the cluster).
> The current approach works fine for small to medium clusters but I want it
> to be able to go big. Maybe up to several hundred or even a thousand nodes.
>
> It's these larger deployments that concern me. The current scheme may see the
> clearinghouse become overloaded in a very large cluster.
>
> From what you have said, a possible strategy may be to combine the listener
> and worker into a single process, using the non-blocking bcast just for that
> group, while each worker scanned its own port for an incoming request, which
> it would in turn bcast to its peers.
>
> As you have indicated though, this would depend on the load the non-blocking
> bcast would cause. - At least the load would be fairly even over the cluster.
>
>
> --- On Mon, 9/5/11, Jeff Squyres wrote:
>
> From: Jeff Squyres
> Subject: Re: [OMPI users] is there an equiv of iprove for bcast?
> To: randolph_pul...@yahoo.com.au
> Cc: "Open MPI Users"
> Received: Monday, 9 May, 2011, 11:27 PM
>
> On May 3, 2011, at 8:20 PM, Randolph Pullen wrote:
>
> > Sorry, I meant to say:
> > - on each node there is 1 listener and 1 worker.
> > - all workers act together when any of the listeners send them a request.
> > - currently I must use an extra clearinghouse process to receive from any
> > of the listeners and bcast to workers; this is unfortunate because of the
> > potential scaling issues.
> >
> > I think you have answered this in that I must wait for MPI-3's non-blocking
> > collectives.
>
> Yes and no. If each worker starts N non-blocking broadcasts just to be able
> to test for completion of any of them, you might end up consuming a bunch of
> resources for them (I'm *anticipating* that pending non-blocking collective
> requests may be more heavyweight than pending non-blocking point-to-point
> requests).
>
> But then again, if N is small, it might not matter.
>
> > Can anyone suggest another way? I don't like the serial clearinghouse
> > approach.
>
> If you only have a few workers and/or the broadcast message is small and/or
> the broadcasts aren't frequent, then MPI's built-in broadcast algorithms
> might not offer much more optimization than doing your own with
> point-to-point mechanisms. I don't usually recommend this, but it may be
> possible for your case.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
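To make the tiered idea concrete, here is a rough C sketch (my own illustration, not code from this thread) that splits MPI_COMM_WORLD into a small communicator of "parent" broadcasters plus one sub-communicator per parent. GROUP_SIZE and the use of blocking MPI_Bcast are assumptions for illustration; the thread itself is about non-blocking collectives, which are not yet available in these releases.

    #include <stdio.h>
    #include <mpi.h>

    #define GROUP_SIZE 32   /* hypothetical number of ranks per broadcaster */

    int main(int argc, char **argv)
    {
        int world_rank, world_size;
        MPI_Comm group_comm;    /* one broadcaster plus its children */
        MPI_Comm parents_comm;  /* only the broadcasters             */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* Ranks 0, GROUP_SIZE, 2*GROUP_SIZE, ... act as "parents". */
        int group_id  = world_rank / GROUP_SIZE;
        int is_parent = (world_rank % GROUP_SIZE == 0);

        MPI_Comm_split(MPI_COMM_WORLD, group_id, world_rank, &group_comm);
        MPI_Comm_split(MPI_COMM_WORLD, is_parent ? 0 : MPI_UNDEFINED,
                       world_rank, &parents_comm);

        int payload = 0;
        if (world_rank == 0)
            payload = 42;                   /* the metadata to distribute */

        /* Stage 1: broadcast among the (few) parents only. */
        if (is_parent)
            MPI_Bcast(&payload, 1, MPI_INT, 0, parents_comm);

        /* Stage 2: each parent re-broadcasts to its own children;
           the parent is rank 0 within its group_comm. */
        MPI_Bcast(&payload, 1, MPI_INT, 0, group_comm);

        printf("rank %d got %d\n", world_rank, payload);

        if (parents_comm != MPI_COMM_NULL)
            MPI_Comm_free(&parents_comm);
        MPI_Comm_free(&group_comm);
        MPI_Finalize();
        return 0;
    }

Each process then participates in at most two small collectives (one among N parents, one within its own group), which keeps the number of pending requests per process small compared to M simultaneous broadcasts across the whole job.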
Re: [OMPI users] Sorry! You were supposed to get help about: But couldn't open help-orterun.txt
On May 11, 2011, at 5:50 AM, Ralph Castain wrote:

>> Clarification: I installed pre-built OpenMPI_v1.5.3-x64 on Windows 7
>> and copied this directory into Windows Server 2008.

Did you copy OMPI to the same directory tree in which you built it? OMPI hard-codes some directory names when it builds, and it expects to find that directory structure when it runs. If you build OMPI with a --prefix of /foo, but then move it to /bar, various things may not work (like finding help messages, etc.) unless you set the OMPI/OPAL environment variables that tell OMPI where the files are actually located.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
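For a relocated Windows install like the one described in this thread, that might look roughly like the following before invoking mpirun (a sketch only: C:\OpenMPI_v1.5.3-x64 is a hypothetical location for the copied tree, and pointing OPAL_PREFIX at the new root is usually the simplest way to redirect the hard-coded paths):

    C:\> set OPAL_PREFIX=C:\OpenMPI_v1.5.3-x64
    C:\> set OPAL_PKGDATADIR=C:\OpenMPI_v1.5.3-x64\share\openmpi
    C:\> mpirun -np 2 mar_f_dp.exe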
[OMPI users] error with checkpoint in openmpi
Hi,

I am working with MPI. I have installed openmpi 1.4.3 with BLCR included. I ran a simple MPI application using a hostfile:

pc1 slots=2 max-slots=2
pc2 slots=2 max-slots=2

And I ran this command to run it with checkpoint support:

# mpirun --hostfile myhost -np 4 --am ft-enable-cr ./mpi_app

When I checkpointed, I got an error:

[pc1:04836] Error: expected_component: PID information unavailable!
--
Error: The local checkpoint contains invalid or incomplete metadata for Process 3411083265.2. This usually indicates that the local checkpoint is invalid. Check the metadata file (snapshot_meta.data) in the following directory:
/root/ompi_global_snapshot_4836.ckpt/0/opal_snapshot_2.ckpt
--
[pc1:04836] [[52049,0],0] ORTE_ERROR_LOG: Error in file snapc_full_global.c at line 1054

I'm glad if anyone can help me.
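For context on how the checkpoint itself is usually taken in a BLCR-enabled setup like this, a rough sketch of the commands (4836 is the mpirun PID visible in the error output above; the exact global snapshot name on your system may differ):

    # from another terminal, checkpoint the running mpirun
    ompi-checkpoint 4836

    # later, restart from the resulting global snapshot
    ompi-restart ompi_global_snapshot_4836.ckpt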
[OMPI users] TotalView Memory debugging and OpenMPI
We've gotten a few reports of problems with memory debugging when using OpenMPI under TotalView. Usually, TotalView will attach to the processes started after an MPI_Init. However, in the case where memory debugging is enabled, things seemed to run away or fail. My analysis showed that we had a number of core files left over from the attempt, and all were mpirun (or orterun) cores. It seemed to be a regression on our part, since testing seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an internal bug and passed it to engineering. After giving our engineer a brief tutorial on how to build a debug version of OpenMPI, he found what appears to be a problem in the code for orterun.c. He's made a slight change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being the versions he's tested with so far. He doesn't subscribe to this list that I know of, so I offered to pass this by the group. Of course, I'm not sure if this is exactly the right place to submit patches, but I'm sure you'd tell me where to put it if I'm in the wrong here. It's a short patch, so I'll cut and paste it, and attach as well, since cut and paste can do weird things to formatting.

Credit goes to Ariel Burton for this patch. Of course he used TotalView to find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview mpirun -a -np 4 ./foo'

Cheers,
PeterT

more ~/patches/anbs-patch
*** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
--- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
***
*** 1578,1588
      }

      if (NULL != env) {
          size1 = opal_argv_count(env);
          for (j = 0; j < size1; ++j) {
!             putenv(env[j]);
          }
      }

      /* All done */

--- 1578,1600
      }

      if (NULL != env) {
          size1 = opal_argv_count(env);
          for (j = 0; j < size1; ++j) {
!             /* Use-after-Free error possible here.  putenv does not copy
!                the string passed to it, and instead stores only the pointer.
!                env[j] may be freed later, in which case the pointer
!                in environ will now be left dangling into a deallocated
!                region.
!                So we make a copy of the variable.
!              */
!             char *s = strdup(env[j]);
!
!             if (NULL == s) {
!                 return OPAL_ERR_OUT_OF_RESOURCE;
!             }
!             putenv(s);
          }
      }

      /* All done */
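As a standalone illustration of the bug class the patch addresses (my own minimal sketch, not part of the patch; the helper name and the GREETING variable are invented for the example): putenv() stores the caller's pointer rather than a copy, so the orterun fix duplicates the string before handing it over.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void set_env_copy(const char *assignment)
    {
        /* putenv() keeps the pointer it is given, not a copy, so the
           caller's buffer must stay valid for the lifetime of the
           environment entry.  Duplicating first (as the orterun patch
           does) avoids a dangling pointer if the original buffer is
           later freed or reused. */
        char *s = strdup(assignment);
        if (NULL == s) {
            fprintf(stderr, "out of memory\n");
            exit(1);
        }
        putenv(s);   /* s is intentionally never freed */
    }

    int main(void)
    {
        /* A local buffer like this must not be passed to putenv()
           directly, because putenv() would keep this exact pointer. */
        char assignment[] = "GREETING=hello";

        set_env_copy(assignment);
        printf("GREETING=%s\n", getenv("GREETING"));
        return 0;
    }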
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:

> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>
>>> We managed to have another user hit the bug that causes collectives (this
>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>
>>> btl_openib_cpc_include rdmacm
>>
>> Could someone explain this? We also have problems with collective hangs
>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>> see any relevant issues filed. However, rdmacm isn't an available value
>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>> that I understand what these things are...).
>
> Sorry for the delay -- perhaps an IB vendor can reply here with more detail...
>
> We had a user-reported issue of some hangs that the IB vendors have been
> unable to replicate in their respective labs. We *suspect* that it may be an
> issue with the oob openib CPC, but that code is pretty old and pretty mature,
> so all of us would be at least somewhat surprised if that were the case. If
> anyone can reliably reproduce this error, please let us know and/or give us
> access to your machines -- we have not closed this issue, but are unable to
> move forward because the customers who reported this issue switched to rdmacm
> and moved on (i.e., we don't have access to their machines to test any more).

An update: we set all our ib0 interfaces to have IPs on a 172. network. This allowed the use of rdmacm to work and get the latencies that we would expect. That said, we are still getting hangs. I can very reliably reproduce it using IMB with a specific core count on a specific test case.

Just an update. Has anyone else had luck fixing the lockup issues on the openib BTL for collectives in some cases? Thanks!

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985

> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
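For anyone who wants to try the rdmacm workaround being discussed here, the connection manager can be selected at run time; a sketch of the two usual places to set it (the executable name and process count are just placeholders):

    # on the mpirun command line
    mpirun --mca btl_openib_cpc_include rdmacm -np 64 ./IMB-MPI1

    # or persistently, in $HOME/.openmpi/mca-params.conf
    btl_openib_cpc_include = rdmacm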
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Sent from my iPad

On May 11, 2011, at 2:05 PM, Brock Palen wrote:

> On May 9, 2011, at 9:31 AM, Jeff Squyres wrote:
>
>> On May 3, 2011, at 6:42 AM, Dave Love wrote:
>>
>>>> We managed to have another user hit the bug that causes collectives (this
>>>> time MPI_Bcast() ) to hang on IB that was fixed by setting:
>>>>
>>>> btl_openib_cpc_include rdmacm
>>>
>>> Could someone explain this? We also have problems with collective hangs
>>> with openib/mlx4 (specifically in IMB), but not with psm, and I couldn't
>>> see any relevant issues filed. However, rdmacm isn't an available value
>>> for that parameter with our 1.4.3 or 1.5.3 installations, only oob (not
>>> that I understand what these things are...).
>>
>> Sorry for the delay -- perhaps an IB vendor can reply here with more
>> detail...
>>
>> We had a user-reported issue of some hangs that the IB vendors have been
>> unable to replicate in their respective labs. We *suspect* that it may be
>> an issue with the oob openib CPC, but that code is pretty old and pretty
>> mature, so all of us would be at least somewhat surprised if that were the
>> case. If anyone can reliably reproduce this error, please let us know
>> and/or give us access to your machines -- we have not closed this issue,
>> but are unable to move forward because the customers who reported this
>> issue switched to rdmacm and moved on (i.e., we don't have access to their
>> machines to test any more).
>
> An update: we set all our ib0 interfaces to have IPs on a 172. network. This
> allowed the use of rdmacm to work and get latencies that we would expect.
> That said we are still getting hangs. I can very reliably reproduce it using
> IMB with a specific core count on a specific test case.
>
> Just an update. Has anyone else had luck fixing the lockup issues on openib
> BTL for collectives in some cases? Thanks!

I'll go back to my earlier comments. Users always claim that their code doesn't have the sync issue, but it has proved to help more often than not, and costs nothing to try. My $.0002

> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] TotalView Memory debugging and OpenMPI
That would be a problem, I fear. We need to push those envars into the environment. Is there some particular problem causing what you see? We have no other reports of this issue, and orterun has had that code forever.

Sent from my iPad

On May 11, 2011, at 2:05 PM, Peter Thompson wrote:

> We've gotten a few reports of problems with memory debugging when using
> OpenMPI under TotalView. Usually, TotalView will attach to the processes
> started after an MPI_Init. However, in the case where memory debugging is
> enabled, things seemed to run away or fail. My analysis showed that we had
> a number of core files left over from the attempt, and all were mpirun (or
> orterun) cores. It seemed to be a regression on our part, since testing
> seemed to indicate this worked okay before TotalView 8.9.0-0, so I filed an
> internal bug and passed it to engineering. After giving our engineer a
> brief tutorial on how to build a debug version of OpenMPI, he found what
> appears to be a problem in the code for orterun.c. He's made a slight
> change that fixes the issue in 1.4.2, 1.4.3, 1.4.4rc2 and 1.5.3, those being
> the versions he's tested with so far. He doesn't subscribe to this list
> that I know of, so I offered to pass this by the group. Of course, I'm not
> sure if this is exactly the right place to submit patches, but I'm sure you'd
> tell me where to put it if I'm in the wrong here. It's a short patch, so
> I'll cut and paste it, and attach as well, since cut and paste can do weird
> things to formatting.
>
> Credit goes to Ariel Burton for this patch. Of course he used TotalView to
> find this ;-) It shows up if you do 'mpirun -tv -np 4 ./foo' or 'totalview
> mpirun -a -np 4 ./foo'
>
> Cheers,
> PeterT
>
> more ~/patches/anbs-patch
> *** orte/tools/orterun/orterun.c    2010-04-13 13:30:34.0 -0400
> --- /home/anb/packages/openmpi-1.4.2/linux-x8664-iwashi/installation/bin/../../../src/openmpi-1.4.2/orte/tools/orterun/orterun.c    2011-05-09 20:28:16.588183000 -0400
> ***
> *** 1578,1588
>       }
>
>       if (NULL != env) {
>           size1 = opal_argv_count(env);
>           for (j = 0; j < size1; ++j) {
> !             putenv(env[j]);
>           }
>       }
>
>       /* All done */
>
> --- 1578,1600
>       }
>
>       if (NULL != env) {
>           size1 = opal_argv_count(env);
>           for (j = 0; j < size1; ++j) {
> !             /* Use-after-Free error possible here.  putenv does not copy
> !                the string passed to it, and instead stores only the pointer.
> !                env[j] may be freed later, in which case the pointer
> !                in environ will now be left dangling into a deallocated
> !                region.
> !                So we make a copy of the variable.
> !              */
> !             char *s = strdup(env[j]);
> !
> !             if (NULL == s) {
> !                 return OPAL_ERR_OUT_OF_RESOURCE;
> !             }
> !             putenv(s);
>           }
>       }
>
>       /* All done */
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openmpi (1.2.8 or above) and Intel composer XE 2011 (aka 12.0)
I've seen the same thing when I build openmpi 1.4.3 with Intel 12, but only when I have -O2 or -O3 in CFLAGS. If I drop it down to -O1 then the collectives hangs go away. I don't know what, if anything, the higher optimization buys you when compiling openmpi, so I'm not sure if that's an acceptable workaround or not.

My system is similar to yours - Intel X5570 with QDR Mellanox IB running RHEL 5, Slurm, and these openmpi btls: openib,sm,self. I'm using IMB 3.2.2 with a single iteration of Barrier to reproduce the hang, and it happens 100% of the time for me when I invoke it like this:

# salloc -N 9 orterun -n 65 ./IMB-MPI1 -npmin 64 -iter 1 barrier

The hang happens on the first Barrier (64 ranks), and each of the participating ranks has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_recursivedoubling () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
IMB_barrier ()
IMB_init_buffers_iter ()
main ()

The one non-participating rank has this backtrace:

__poll (...)
poll_dispatch () from [instdir]/lib/libopen-pal.so.0
opal_event_loop () from [instdir]/lib/libopen-pal.so.0
opal_progress () from [instdir]/lib/libopen-pal.so.0
ompi_request_default_wait_all () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_sendrecv_actual () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_bruck () from [instdir]/lib/libmpi.so.0
ompi_coll_tuned_barrier_intra_dec_fixed () from [instdir]/lib/libmpi.so.0
PMPI_Barrier () from [instdir]/lib/libmpi.so.0
main ()

If I use more nodes I can get it to hang with 1ppn, so that seems to rule out the sm btl (or interactions with it) as a culprit at least. I can't reproduce this with openmpi 1.5.3, interestingly.

-Marcus

On 05/10/2011 03:37 AM, Salvatore Podda wrote:
> Dear all,
>
> we succeeded in building several versions of openmpi from 1.2.8 to 1.4.3
> with Intel composer XE 2011 (aka 12.0).
> However we found a threshold in the number of cores (depending on the
> application: IMB, xhpl or user applications, and on the number of required
> cores) above which the application hangs (a sort of deadlock).
> Building openmpi with 'gcc' and 'pgi' does not show the same limits.
> Are there any known incompatibilities of openmpi with this version of
> the Intel compilers?
>
> The characteristics of our computational infrastructure are:
>
> Intel processors E7330, E5345, E5530 e E5620
>
> CentOS 5.3, CentOS 5.5.
>
> Intel composer XE 2011
> gcc 4.1.2
> pgi 10.2-1
>
> Regards
>
> Salvatore Podda
>
> ENEA UTICT-HPC
> Department for Computer Science Development and ICT
> Facilities Laboratory for Science and High Performace Computing
> C.R. Frascati
> Via E. Fermi, 45
> PoBox 65
> 00044 Frascati (Rome)
> Italy
>
> Tel: +39 06 9400 5342
> Fax: +39 06 9400 5551
> Fax: +39 06 9400 5735
> E-mail: salvatore.po...@enea.it
> Home Page: www.cresco.enea.it
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
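For anyone who wants to try the lower-optimization workaround Marcus describes, the flags are passed at configure time; a rough sketch only (the install prefix and the choice of Intel compiler wrappers are assumptions about your environment):

    ./configure CC=icc CXX=icpc F77=ifort FC=ifort \
                CFLAGS=-O1 CXXFLAGS=-O1 \
                --prefix=/opt/openmpi-1.4.3-intel12
    make all install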
Re: [OMPI users] Issue with Open MPI 1.5.3 Windows binary builds
Answer to my own question. It was simply a noob problem - not using mpiexec to run my 'application'. Once I did this, everything is running as expected. My bad for not reading more before jumping in. Later, Tyler
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Jeff Squyres writes:

> We had a user-reported issue of some hangs that the IB vendors have
> been unable to replicate in their respective labs. We *suspect* that
> it may be an issue with the oob openib CPC, but that code is pretty
> old and pretty mature, so all of us would be at least somewhat
> surprised if that were the case. If anyone can reliably reproduce
> this error, please let us know and/or give us access to your machines

We can reproduce it with IMB. We could provide access, but we'd have to negotiate with the owners of the relevant nodes to give you interactive access to them. Maybe Brock's would be more accessible? (If you contact me, I may not be able to respond for a few days.)

> -- we have not closed this issue,

Which issue? I couldn't find a relevant-looking one.

> but are unable to move forward
> because the customers who reported this issue switched to rdmacm and
> moved on (i.e., we don't have access to their machines to test any
> more).

For what it's worth, I figured out why I couldn't see rdmacm, but adding ipoib would be a bit of a pain.

--
Excuse the typping -- I have a broken wrist
Re: [OMPI users] btl_openib_cpc_include rdmacm questions
Ralph Castain writes:

> I'll go back to my earlier comments. Users always claim that their
> code doesn't have the sync issue, but it has proved to help more often
> than not, and costs nothing to try,

Could you point to that post, or tell us what to try exactly, given we're running IMB? Thanks. (As far as I know, this isn't happening with real codes, just IMB, but only a few have been in use.)

--
Excuse the typping -- I have a broken wrist