Me too, sorry, it definitely seems like a bug. Somewhere in the code there is probably an undefined variable. I just never tested this code with such a "bizarre" command line :)
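For reference, once that is fixed, the single-rankfile form Ralph described below is what should cover the master/slave example - just a sketch of the intended usage (hosts and options taken from Geoffroy's earlier mail, "rankfile.all" is a placeholder name), not something that runs today:

# one rankfile for the whole job: ranks are the final MPI ranks,
# counted across all app contexts (0,1 = masters, 2,3 = slaves)
cat rankfile.all
rank 0=r001n001 slot=0
rank 1=r001n002 slot=1
rank 2=r001n001 slot=1
rank 3=r001n002 slot=2

# single -rf for the whole command line, no per-":"-section rankfiles
mpirun -rf rankfile.all -n 1 -host r001n001 master.x options1 : \
                        -n 1 -host r001n002 master.x options2 : \
                        -n 1 -host r001n001 slave.x options3 : \
                        -n 1 -host r001n002 slave.x options4

A --hostfile can be added as in the earlier tests if needed.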
Lenny.

On Mon, Apr 20, 2009 at 4:08 PM, Geoffroy Pignot <geopig...@gmail.com> wrote:

> Thanks,
>
> I am not in a hurry but it would be nice if I could benefit from this
> feature in the next release.
> Regards
>
> Geoffroy
>
> 2009/4/20 <users-requ...@open-mpi.org>
>
>> Message: 1
>> Date: Mon, 20 Apr 2009 05:59:52 -0600
>> From: Ralph Castain <r...@open-mpi.org>
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>
>> Honestly haven't had time to look at it yet - hopefully in the next
>> couple of days...
>>
>> Sorry for the delay
>>
>> On Apr 20, 2009, at 2:58 AM, Geoffroy Pignot wrote:
>>
>> > Do you have any news about this bug?
>> > Thanks
>> >
>> > Geoffroy
>> >
>> > Message: 1
>> > Date: Tue, 14 Apr 2009 07:57:44 -0600
>> > From: Ralph Castain <r...@lanl.gov>
>> > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> >
>> > Ah now, I didn't say it -worked-, did I? :-)
>> >
>> > Clearly a bug exists in the program. I'll try to take a look at it (if
>> > Lenny doesn't get to it first), but it won't be until later in the
>> > week.
>> >
>> > On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>> >
>> > > I agree with you Ralph, and that's what I expect from Open MPI, but
>> > > my second example shows that it's not working:
>> > >
>> > > cat hostfile.0
>> > > r011n002 slots=4
>> > > r011n003 slots=4
>> > >
>> > > cat rankfile.0
>> > > rank 0=r011n002 slot=0
>> > > rank 1=r011n003 slot=1
>> > >
>> > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> > > ### CRASHED
>> > >
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > rmaps_rank_file.c at line 404
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/rmaps_base_map_job.c at line 87
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > base/plm_base_launch_support.c at line 77
>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > plm_rsh_module.c at line 985
>> > >
>> > > Message: 4
>> > > Date: Tue, 14 Apr 2009 06:55:58 -0600
>> > > From: Ralph Castain <r...@lanl.gov>
>> > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > >
>> > > The rankfile cuts across the entire job - it isn't applied on an
>> > > app_context basis. So the ranks in your rankfile must correspond to
>> > > the eventual rank of each process in the cmd line.
>> > >
>> > > Unfortunately, that means you have to count ranks. In your case, you
>> > > only have four, so that makes life easier. Your rankfile would look
>> > > something like this:
>> > >
>> > > rank 0=r001n001 slot=0
>> > > rank 1=r001n002 slot=1
>> > > rank 2=r001n001 slot=1
>> > > rank 3=r001n002 slot=2
>> > >
>> > > HTH
>> > > Ralph
>> > >
>> > > On Apr 14, 2009, at 12:19 AM, Geoffroy Pignot wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I agree that my examples are not very clear. What I want to do is to
>> > > > launch a multi-executable application (masters-slaves) and benefit from
>> > > > processor affinity.
>> > > > Could you show me how to convert this command, using the -rf option
>> > > > (whatever the affinity is)?
>> > > >
>> > > > mpirun -n 1 -host r001n001 master.x options1 : -n 1 -host r001n002
>> > > > master.x options2 : -n 1 -host r001n001 slave.x options3 : -n 1 -host
>> > > > r001n002 slave.x options4
>> > > >
>> > > > Thanks for your help
>> > > >
>> > > > Geoffroy
>> > > >
>> > > > Message: 2
>> > > > Date: Sun, 12 Apr 2009 18:26:35 +0300
>> > > > From: Lenny Verkhovsky <lenny.verkhov...@gmail.com>
>> > > > Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> > > >
>> > > > Hi,
>> > > >
>> > > > The first "crash" is OK, since your rankfile has ranks 0 and 1 defined,
>> > > > while n=1, which means only rank 0 is present and can be allocated.
>> > > >
>> > > > NP must be >= the largest rank in the rankfile.
>> > > >
>> > > > What exactly are you trying to do?
>> > > >
>> > > > I tried to recreate your segv but all I got was
>> > > >
>> > > > ~/work/svn/ompi/trunk/build_x86-64/install/bin/mpirun --hostfile hostfile.0
>> > > > -rf rankfile.0 -n 1 hostname : -rf rankfile.1 -n 1 hostname
>> > > > [witch19:30798] mca: base: component_find: paffinity "mca_paffinity_linux"
>> > > > uses an MCA interface that is not recognized (component MCA v1.0.0 !=
>> > > > supported MCA v2.0.0) -- ignored
>> > > > --------------------------------------------------------------------------
>> > > > It looks like opal_init failed for some reason; your parallel process is
>> > > > likely to abort. There are many reasons that a parallel process can
>> > > > fail during opal_init; some of which are due to configuration or
>> > > > environment problems. This failure appears to be an internal failure;
>> > > > here's some additional information (which may only be relevant to an
>> > > > Open MPI developer):
>> > > >
>> > > >   opal_carto_base_select failed
>> > > >   --> Returned value -13 instead of OPAL_SUCCESS
>> > > > --------------------------------------------------------------------------
>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> > > > ../../orte/runtime/orte_init.c at line 78
>> > > > [witch19:30798] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>> > > > ../../orte/orted/orted_main.c at line 344
>> > > > --------------------------------------------------------------------------
>> > > > A daemon (pid 11629) died unexpectedly with status 243 while attempting
>> > > > to launch so we are aborting.
>> > > >
>> > > > There may be more information reported by the environment (see above).
>> > > >
>> > > > This may be because the daemon was unable to find all the needed shared
>> > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>> > > > the location of the shared libraries on the remote nodes and this will
>> > > > automatically be forwarded to the remote nodes.
>> > > > --------------------------------------------------------------------------
>> > > > mpirun noticed that the job aborted, but has no info as to the process
>> > > > that caused that situation.
>> > > > --------------------------------------------------------------------------
>> > > > mpirun: clean termination accomplished
>> > > >
>> > > > Lenny.
>> > > >
>> > > > On 4/10/09, Geoffroy Pignot <geopig...@gmail.com> wrote:
>> > > > >
>> > > > > Hi,
>> > > > >
>> > > > > I am currently testing the process affinity capabilities of Open MPI
>> > > > > and I would like to know if the rankfile behaviour I describe below
>> > > > > is normal or not.
>> > > > >
>> > > > > cat hostfile.0
>> > > > > r011n002 slots=4
>> > > > > r011n003 slots=4
>> > > > >
>> > > > > cat rankfile.0
>> > > > > rank 0=r011n002 slot=0
>> > > > > rank 1=r011n003 slot=1
>> > > > >
>> > > > > ##########################################################################
>> > > > >
>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 2 hostname   ### OK
>> > > > > r011n002
>> > > > > r011n003
>> > > > >
>> > > > > ##########################################################################
>> > > > > but
>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>> > > > > ### CRASHED
>> > > > >
>> > > > > --------------------------------------------------------------------------
>> > > > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > > > > --------------------------------------------------------------------------
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > > > rmaps_rank_file.c at line 404
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > > > base/rmaps_base_map_job.c at line 87
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > > > base/plm_base_launch_support.c at line 77
>> > > > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> > > > > plm_rsh_module.c at line 985
>> > > > > --------------------------------------------------------------------------
>> > > > > A daemon (pid unknown) died unexpectedly on signal 1 while attempting
>> > > > > to launch so we are aborting.
>> > > > >
>> > > > > There may be more information reported by the environment (see above).
>> > > > >
>> > > > > This may be because the daemon was unable to find all the needed shared
>> > > > > libraries on the remote node. You may set your LD_LIBRARY_PATH to have
>> > > > > the location of the shared libraries on the remote nodes and this will
>> > > > > automatically be forwarded to the remote nodes.
>> > > > > --------------------------------------------------------------------------
>> > > > > orterun noticed that the job aborted, but has no info as to the process
>> > > > > that caused that situation.
>> > > > > --------------------------------------------------------------------------
>> > > > > orterun: clean termination accomplished
>> > > > >
>> > > > > It seems that the rankfile option is not propagated to the second command
>> > > > > line; there is no global understanding of the ranking inside an mpirun
>> > > > > command.
>> > > > >
>> > > > > ##########################################################################
>> > > > >
>> > > > > Assuming that, I tried to provide a rankfile to each command line:
>> > > > >
>> > > > > cat rankfile.0
>> > > > > rank 0=r011n002 slot=0
>> > > > >
>> > > > > cat rankfile.1
>> > > > > rank 0=r011n003 slot=1
>> > > > >
>> > > > > mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -rf rankfile.1
>> > > > > -n 1 hostname   ### CRASHED
>> > > > >
>> > > > > [r011n002:28778] *** Process received signal ***
>> > > > > [r011n002:28778] Signal: Segmentation fault (11)
>> > > > > [r011n002:28778] Signal code: Address not mapped (1)
>> > > > > [r011n002:28778] Failing at address: 0x34
>> > > > > [r011n002:28778] [ 0] [0xffffe600]
>> > > > > [r011n002:28778] [ 1] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0
>> > > > > (orte_odls_base_default_get_add_procs_data+0x55d) [0x5557decd]
>> > > > > [r011n002:28778] [ 2] /tmp/HALMPI/openmpi-1.3.1/lib/libopen-rte.so.0
>> > > > > (orte_plm_base_launch_apps+0x117) [0x555842a7]
>> > > > > [r011n002:28778] [ 3] /tmp/HALMPI/openmpi-1.3.1/lib/openmpi/mca_plm_rsh.so
>> > > > > [0x556098c0]
>> > > > > [r011n002:28778] [ 4] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804aa27]
>> > > > > [r011n002:28778] [ 5] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x804a022]
>> > > > > [r011n002:28778] [ 6] /lib/libc.so.6(__libc_start_main+0xdc) [0x9f1dec]
>> > > > > [r011n002:28778] [ 7] /tmp/HALMPI/openmpi-1.3.1/bin/orterun [0x8049f71]
>> > > > > [r011n002:28778] *** End of error message ***
>> > > > > Segmentation fault (core dumped)
>> > > > >
>> > > > > I hope that I've found a bug, because it would be very important for me
>> > > > > to have this kind of capability: launch a multi-executable mpirun command
>> > > > > line and be able to bind my executables and sockets together.
>> > > > >
>> > > > > Thanks in advance for your help
>> > > > >
>> > > > > Geoffroy
>> >
>> > Message: 2
>> > Date: Tue, 14 Apr 2009 10:30:58 -0400
>> > From: Prentice Bisbal <prent...@ias.edu>
>> > Subject: Re: [OMPI users] PGI Fortran pthread support
>> >
>> > Orion,
>> >
>> > I have no trouble getting thread support during configure with PGI 8.0-3.
>> >
>> > Are there any other compilers in your path before the PGI compilers?
>> > Even if the PGI compilers come first, try specifying the PGI compilers
>> > explicitly with these environment variables (bash syntax shown):
>> >
>> > export CC=pgcc
>> > export CXX=pgCC
>> > export F77=pgf77
>> > export FC=pgf90
>> >
>> > Also check the values of CPPFLAGS and LDFLAGS, and make sure they are
>> > correct for your PGI compilers.
>> >
>> > --
>> > Prentice
>> >
>> > Orion Poplawski wrote:
>> > > Seeing the following building openmpi 1.3.1 on CentOS 5.3 with the PGI
>> > > pgf90 8.0-5 Fortran compiler:
>> > >
>> > > checking if C compiler and POSIX threads work with -Kthread... no
>> > > checking if C compiler and POSIX threads work with -kthread... no
>> > > checking if C compiler and POSIX threads work with -pthread... yes
>> > > checking if C++ compiler and POSIX threads work with -Kthread... no
>> > > checking if C++ compiler and POSIX threads work with -kthread... no
>> > > checking if C++ compiler and POSIX threads work with -pthread... yes
>> > > checking if F77 compiler and POSIX threads work with -Kthread... no
>> > > checking if F77 compiler and POSIX threads work with -kthread... no
>> > > checking if F77 compiler and POSIX threads work with -pthread... no
>> > > checking if F77 compiler and POSIX threads work with -pthreads... no
>> > > checking if F77 compiler and POSIX threads work with -mt... no
>> > > checking if F77 compiler and POSIX threads work with -mthreads... no
>> > > checking if F77 compiler and POSIX threads work with -lpthreads... no
>> > > checking if F77 compiler and POSIX threads work with -llthread... no
>> > > checking if F77 compiler and POSIX threads work with -lpthread... no
>> > > checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
>> > > checking for PTHREAD_MUTEX_ERRORCHECK... yes
>> > > checking for working POSIX threads package... no
>> > > checking if C compiler and Solaris threads work... no
>> > > checking if C++ compiler and Solaris threads work... no
>> > > checking if F77 compiler and Solaris threads work... no
>> > > checking for working Solaris threads package... no
>> > > checking for type of thread support... none found