I'm afraid this is a more extensive rewrite than I had hoped - the revisions are most unlikely to make it for 1.3.2. Looks like it will be 1.3.3 at the earliest.
Ralph

On Mon, Apr 20, 2009 at 7:50 AM, Lenny Verkhovsky <lenny.verkhov...@gmail.com> wrote:
> Me too, sorry, it definitely seems like a bug. Somewhere in the code there is
> probably an undefined variable.
> I just never tested this code with such a "bizarre" command line :)
>
> Lenny.
>
> On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot <geopig...@gmail.com> wrote:
>
>> Thanks,
>>
>> I am not in a hurry, but it would be nice if I could benefit from this
>> feature in the next release.
>> Regards
>>
>> Geoffroy
>>
>> 2009/4/20 <users-requ...@open-mpi.org>
>>
>>> Message: 1
>>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>>> From: Ralph Castain <r...@open-mpi.org>
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>>
>>> Honestly haven't had time to look at it yet - hopefully in the next
>>> couple of days...
>>>
>>> Sorry for the delay
>>>
>>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>>
>>> > Do you have any news about this bug?
>>> > Thanks
>>> >
>>> > Geoffroy
>>> >
>>> > Message: 1
>>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>>> > From: Ralph Castain <r...@lanl.gov>
>>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> >
>>> > Ah now, I didn't say it -worked-, did I? :-)
>>> >
>>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>>> > Lenny doesn't get to it first), but it won't be until later in the
>>> > week.
>>> >
>>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>> >
>>> > > I agree with you, Ralph, and that's what I expect from Open MPI, but
>>> > > my second example shows that it's not working:
>>> > >
>>> > > cat hostfile.0
>>> > > r011n002 slots=4
>>> > > r011n003 slots=4
>>> > >
>>> > > cat rankfile.0
>>> > > rank 0=r011n002 slot=0
>>> > > rank 1=r011n003 slot=1
>>> > >
>>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> > > ### CRASHED
>>> > >
>>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > > > --------------------------------------------------------------------------
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
>>> > > > > --------------------------------------------------------------------------
>>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> > > > > launch so we are aborting.
>>> > > > >
>>> > > > > There may be more information reported by the environment (see above).
>>> > > > >
>>> > > > > This may be because the daemon was unable to find all the needed shared
>>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> > > > > location of the shared libraries on the remote nodes and this will
>>> > > > > automatically be forwarded to the remote nodes.
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun noticed that the job aborted, but has no info as to the process
>>> > > > > that caused that situation.
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun: clean termination accomplished
>>> > >
>>> > > Message: 4
>>> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
>>> > > From: Ralph Castain <r...@lanl.gov>
>>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > >
>>> > > The rankfile cuts across the entire job - it isn't applied on an
>>> > > app_context basis. So the ranks in your rankfile must correspond to
>>> > > the eventual rank of each process on the command line.
>>> > >
>>> > > Unfortunately, that means you have to count ranks. In your case, you
>>> > > only have four, so that makes life easier. Your rankfile would look
>>> > > something like this:
>>> > >
>>> > > rank 0=r001n001 slot=0
>>> > > rank 1=r001n002 slot=1
>>> > > rank 2=r001n001 slot=1
>>> > > rank 3=r001n002 slot=2
>>> > >
>>> > > HTH
>>> > > Ralph
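
A minimal sketch of how that would be wired up, assuming the four-rank file above is
saved as rankfile.global (a placeholder name) and that the master/slave command quoted
below is the one being converted; whether -host is still needed alongside -rf in 1.3.x
is not something this thread settles:

    mpirun -rf rankfile.global \
        -n 1 -host r001n001 master.x options1 : \
        -n 1 -host r001n002 master.x options2 : \
        -n 1 -host r001n001 slave.x options3 : \
        -n 1 -host r001n002 slave.x options4

The -rf option is given once for the whole job, and ranks 0-3 in the file are counted
left to right across the four app_contexts, which is exactly the correspondence Ralph
describes.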
>>> > >
>>> > > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > I agree that my examples are not very clear. What I want to do is to
>>> > > > launch a multi-executable application (masters-slaves) and benefit from
>>> > > > processor affinity.
>>> > > > Could you show me how to convert this command using the -rf option
>>> > > > (whatever the affinity is)?
>>> > > >
>>> > > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>>> > > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host
>>> > > > r001n002 slave.x options4
>>> > > >
>>> > > > Thanks for your help
>>> > > >
>>> > > > Geoffroy
>>> > > >
>>> > > > Message: 2
>>> > > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>>> > > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>>> > > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> > > >
>>> > > > Hi,
>>> > > >
>>> > > > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
>>> > > > while n=1, which means only rank 0 is present and can be allocated.
>>> > > >
>>> > > > np must be greater than the largest rank in the rankfile (ranks are
>>> > > > numbered from 0).
>>> > > >
>>> > > > What exactly are you trying to do?
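
(Restating that rule with the files from the original report, quoted further down:
rankfile.0 defines ranks 0 and 1, so the largest rank is 1 and at least two processes
must be launched in total. With a single app_context that is the invocation which does
work later in the thread:

    mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname

whereas a plain "-n 1" leaves rank 1 in the file with no process to map to.)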
>>> > > >
>>> > > > I tried to recreate your segv, but all I got was:
>>> > > >
>>> > > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
>>> > > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>>> > > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
>>> > > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>>> > > > supported MCA v2.0.0) -- ignored
>>> > > > --------------------------------------------------------------------------
>>> > > > It looks like opal_init failed for some reason; your parallel process is
>>> > > > likely to abort. There are many reasons that a parallel process can
>>> > > > fail during opal_init; some of which are due to configuration or
>>> > > > environment problems. This failure appears to be an internal failure;
>>> > > > here's some additional information (which may only be relevant to an
>>> > > > Open MPI developer):
>>> > > >
>>> > > >   opal_carto_base_select failed
>>> > > >   --> Returned value -13 instead of OPAL_SUCCESS
>>> > > > --------------------------------------------------------------------------
>>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/runtime/orte_init.c at line 78
>>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../orte/orted/orted_main.c at line 344
>>> > > > --------------------------------------------------------------------------
>>> > > > A daemon (pid 11629) died unexpectedly with status 243 while attempting
>>> > > > to launch so we are aborting.
>>> > > >
>>> > > > There may be more information reported by the environment (see above).
>>> > > >
>>> > > > This may be because the daemon was unable to find all the needed shared
>>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> > > > location of the shared libraries on the remote nodes and this will
>>> > > > automatically be forwarded to the remote nodes.
>>> > > > --------------------------------------------------------------------------
>>> > > > --------------------------------------------------------------------------
>>> > > > mpirun noticed that the job aborted, but has no info as to the process
>>> > > > that caused that situation.
>>> > > > --------------------------------------------------------------------------
>>> > > > mpirun: clean termination accomplished
>>> > > >
>>> > > > Lenny.
>>> > > >
>>> > > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>>> > > > >
>>> > > > > Hi,
>>> > > > >
>>> > > > > I am currently testing the process affinity capabilities of Open MPI,
>>> > > > > and I would like to know whether the rankfile behaviour I describe
>>> > > > > below is normal or not.
>>> > > > >
>>> > > > > cat hostfile.0
>>> > > > > r011n002 slots=4
>>> > > > > r011n003 slots=4
>>> > > > >
>>> > > > > cat rankfile.0
>>> > > > > rank 0=r011n002 slot=0
>>> > > > > rank 1=r011n003 slot=1
>>> > > > >
>>> > > > > ##################################################################################
>>> > > > >
>>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
>>> > > > > r011n002
>>> > > > > r011n003
>>> > > > >
>>> > > > > ##################################################################################
>>> > > > > but
>>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> > > > > ### CRASHED
>>> > > > >
>>> > > > > --------------------------------------------------------------------------
>>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > > > > --------------------------------------------------------------------------
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 404
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 985
>>> > > > > --------------------------------------------------------------------------
>>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> > > > > launch so we are aborting.
>>> > > > >
>>> > > > > There may be more information reported by the environment (see above).
>>> > > > >
>>> > > > > This may be because the daemon was unable to find all the needed shared
>>> > > > > libraries on the remote node.
>>> > > > > You may set your LD_LIBRARY_PATH to have the location of the shared
>>> > > > > libraries on the remote nodes and this will automatically be forwarded
>>> > > > > to the remote nodes.
>>> > > > > --------------------------------------------------------------------------
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun noticed that the job aborted, but has no info as to the process
>>> > > > > that caused that situation.
>>> > > > > --------------------------------------------------------------------------
>>> > > > > orterun: clean termination accomplished
>>> > > > >
>>> > > > > It seems that the rankfile option is not propagated to the second command
>>> > > > > line; there is no global understanding of the ranking inside an mpirun
>>> > > > > command.
>>> > > > >
>>> > > > > ##################################################################################
>>> > > > >
>>> > > > > Assuming that, I tried to provide a rankfile to each command line:
>>> > > > >
>>> > > > > cat rankfile.0
>>> > > > > rank 0=r011n002 slot=0
>>> > > > >
>>> > > > > cat rankfile.1
>>> > > > > rank 0=r011n003 slot=1
>>> > > > >
>>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
>>> > > > > -n 1 hostname   ### CRASHED
>>> > > > >
>>> > > > > [r011n002:28778] *** Process received signal ***
>>> > > > > [r011n002:28778] Signal: Segmentation fault (11)
>>> > > > > [r011n002:28778] Signal code: Address not mapped (1)
>>> > > > > [r011n002:28778] Failing at address: 0x34
>>> > > > > [r011n002:28778] [ 0] [0xffffe600]
>>> > > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
>>> > > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x117) [0x555842a7]
>>> > > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so [0x556098c0]
>>> > > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
>>> > > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
>>> > > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
>>> > > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
>>> > > > > [r011n002:28778] *** End of error message ***
>>> > > > > Segmentation fault (core dumped)
>>> > > > >
>>> > > > > I hope that I've found a bug, because it would be very important for me
>>> > > > > to have this kind of capability: launch a multi-executable mpirun command
>>> > > > > line and be able to bind my executables and sockets together.
>>> > > > >
>>> > > > > Thanks in advance for your help
>>> > > > >
>>> > > > > Geoffroy
>>> >
>>> > Message: 2
>>> > Date: Tue, 14 Apr 2009 10:30:58 -0400
>>> > From: Prentice Bisbal <prent...@ias.edu>
>>> > Subject: Re: [OMPI users] PGI Fortran pthread support
>>> >
>>> > Orion,
>>> >
>>> > I have no trouble getting thread support during configure with PGI 8.0-3.
>>> >
>>> > Are there any other compilers in your path before the PGI compilers?
>>> > Even if the PGI compilers come first, try specifying the PGI compilers
>>> > explicitly with these environment variables (bash syntax shown):
>>> >
>>> > export CC=pgcc
>>> > export CXX=pgCC
>>> > export F77=pgf77
>>> > export FC=pgf90
>>> >
>>> > Also check the values of CPPFLAGS and LDFLAGS, and make sure they are
>>> > correct for your PGI compilers.
>>> >
>>> > --
>>> > Prentice
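
(For illustration only: with those variables exported, a clean configure run would look
something like the sketch below; the source directory and --prefix path are placeholders,
not anything reported in this thread:

    export CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90
    cd openmpi-1.3.1
    ./configure --prefix=/opt/openmpi-1.3.1-pgi 2>&1 | tee configure.log
    grep -i "posix threads" configure.log

The final grep simply makes it easy to see whether the F77 POSIX-threads test now
passes.)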
>>> > Orion Poplawski wrote:
>>> > > Seeing the following building openmpi 1.3.1 on CentOS 5.3 with the PGI
>>> > > pgf90 8.0-5 fortran compiler:
>>> > >
>>> > > checking if C compiler and POSIX threads work with -Kthread... no
>>> > > checking if C compiler and POSIX threads work with -kthread... no
>>> > > checking if C compiler and POSIX threads work with -pthread... yes
>>> > > checking if C++ compiler and POSIX threads work with -Kthread... no
>>> > > checking if C++ compiler and POSIX threads work with -kthread... no
>>> > > checking if C++ compiler and POSIX threads work with -pthread... yes
>>> > > checking if F77 compiler and POSIX threads work with -Kthread... no
>>> > > checking if F77 compiler and POSIX threads work with -kthread... no
>>> > > checking if F77 compiler and POSIX threads work with -pthread... no
>>> > > checking if F77 compiler and POSIX threads work with -pthreads... no
>>> > > checking if F77 compiler and POSIX threads work with -mt... no
>>> > > checking if F77 compiler and POSIX threads work with -mthreads... no
>>> > > checking if F77 compiler and POSIX threads work with -lpthreads... no
>>> > > checking if F77 compiler and POSIX threads work with -llthread... no
>>> > > checking if F77 compiler and POSIX threads work with -lpthread... no
>>> > > checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
>>> > > checking for PTHREAD_MUTEX_ERRORCHECK... yes
>>> > > checking for working POSIX threads package... no
>>> > > checking if C compiler and Solaris threads work... no
>>> > > checking if C++ compiler and Solaris threads work... no
>>> > > checking if F77 compiler and Solaris threads work... no
>>> > > checking for working Solaris threads package... no
>>> > > checking for type of thread support... none found