Hi Lenny

After removing the max-slots entries, I could run
  mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
without any errors.
But can you explain what the max-slots entry means?
I checked the FAQs
  http://www.open-mpi.org/faq/?category=running#simple-spmd-run
  http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
but I couldn't find any explanation.
(Furthermore, the FAQ says "max-slots" in one place, but "max_slots" in the other.)

Thank You
  Jody

On Mon, Aug 17, 2009 at 3:29 PM, Lenny Verkhovsky<lenny.verkhov...@gmail.com> wrote:
> Can you try not specifying "max-slots" in the hostfile?
> If you are the only user of the nodes, there will be no oversubscribing of
> the processors.
> This one definitely looks like a bug,
> but as Ralph said, there is ongoing discussion and work on this
> component.
> Lenny.
> On Mon, Aug 17, 2009 at 2:37 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Is there an explanation for this?
>>
>> I believe the word is "bug". :-)
>>
>> The rank_file mapper has been substantially revised lately - we are
>> discussing now how much of that revision to bring to 1.3.4 versus the next
>> major release.
>>
>> Ralph
>>
>> On Aug 17, 2009, at 4:45 AM, jody wrote:
>>
>>> Hi Lenny
>>>
>>>> I think it has something to do with your environment, /etc/hosts, IT setup,
>>>> hostname function return value, etc.
>>>> I am not sure if it has something to do with Open MPI at all.
>>>
>>> OK. I just thought this was Open MPI related because I was able to use the
>>> aliases of the hosts (i.e. plankton instead of plankton.uzh.ch) in
>>> the host file...
>>>
>>> However, I encountered a new problem:
>>> if the rankfile lists all the entries which occur in the host file
>>> there is an error message.
>>> In the following example, the hostfile is
>>> [jody@plankton neander]$ cat th_02
>>> nano_00.uzh.ch slots=2 max-slots=2
>>> nano_02.uzh.ch slots=2 max-slots=2
>>>
>>> and the rankfile is:
>>> [jody@plankton neander]$ cat rf_02
>>> rank 0=nano_00.uzh.ch slot=0
>>> rank 2=nano_00.uzh.ch slot=1
>>> rank 1=nano_02.uzh.ch slot=0
>>> rank 3=nano_02.uzh.ch slot=1
>>>
>>> Here is the error:
>>> [jody@plankton neander]$ mpirun -np 4 -hostfile th_02 -rf rf_02 ./HelloMPI
>>>
>>> --------------------------------------------------------------------------
>>> There are not enough slots available in the system to satisfy the 4 slots
>>> that were requested by the application:
>>> ./HelloMPI
>>>
>>> Either request fewer slots for your application, or make more slots
>>> available for use.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>> launch so we are aborting.
>>>
>>> There may be more information reported by the environment (see above).
>>>
>>> This may be because the daemon was unable to find all the needed shared
>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>> location of the shared libraries on the remote nodes and this will
>>> automatically be forwarded to the remote nodes.
>>>
>>> --------------------------------------------------------------------------
>>>
>>> --------------------------------------------------------------------------
>>> mpirun noticed that the job aborted, but has no info as to the process
>>> that caused that situation.
>>>
>>> --------------------------------------------------------------------------
>>> mpirun: clean termination accomplished
>>>
>>> If I use a hostfile with one more entry
>>> [jody@aim-plankton neander]$ cat th_021
>>> aim-nano_00.uzh.ch slots=2 max-slots=2
>>> aim-nano_02.uzh.ch slots=2 max-slots=2
>>> aim-nano_01.uzh.ch slots=1 max-slots=1
>>>
>>> Then this works fine:
>>> [jody@aim-plankton neander]$ mpirun -np 4 -hostfile th_021 -rf rf_02 ./HelloMPI
>>>
>>> Is there an explanation for this?
>>>
>>> Thank You
>>> Jody
>>>
>>>> Lenny.
>>>> On Mon, Aug 17, 2009 at 12:59 PM, jody <jody....@gmail.com> wrote:
>>>>>
>>>>> Hi Lenny
>>>>>
>>>>> Thanks - using the full names makes it work!
>>>>> Is there a reason why the rankfile option treats
>>>>> host names differently than the hostfile option?
>>>>>
>>>>> Thanks
>>>>> Jody
>>>>>
>>>>> On Mon, Aug 17, 2009 at 11:20 AM, Lenny Verkhovsky<lenny.verkhov...@gmail.com> wrote:
>>>>>>
>>>>>> Hi
>>>>>> This message means that you are trying to use host "plankton", which
>>>>>> was not allocated via hostfile or hostlist.
>>>>>> But according to the files and command line, everything seems fine.
>>>>>> Can you try using the "plankton.uzh.ch" hostname instead of "plankton"?
>>>>>> Thanks
>>>>>> Lenny.
>>>>>> On Mon, Aug 17, 2009 at 10:36 AM, jody <jody....@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> When I use a rankfile, I get an error message which I don't
>>>>>>> understand:
>>>>>>>
>>>>>>> [jody@plankton tests]$ mpirun -np 3 -rf rankfile -hostfile testhosts ./HelloMPI
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> Rankfile claimed host plankton that was not allocated or
>>>>>>> oversubscribed it's slots:
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file rmaps_rank_file.c at line 108
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file base/rmaps_base_map_job.c at line 87
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file base/plm_base_launch_support.c at line 77
>>>>>>> [plankton.uzh.ch:24327] [[44857,0],0] ORTE_ERROR_LOG: Bad parameter in file plm_rsh_module.c at line 990
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
>>>>>>> launch so we are aborting.
>>>>>>>
>>>>>>> There may be more information reported by the environment (see above).
>>>>>>>
>>>>>>> This may be because the daemon was unable to find all the needed shared
>>>>>>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>>>>>>> location of the shared libraries on the remote nodes and this will
>>>>>>> automatically be forwarded to the remote nodes.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that the job aborted, but has no info as to the process
>>>>>>> that caused that situation.
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun: clean termination accomplished
>>>>>>>
>>>>>>> Without the '-rf rankfile' option everything works as expected.
>>>>>>>
>>>>>>> My hostfile:
>>>>>>> [jody@plankton tests]$ cat testhosts
>>>>>>> # The following node is a quad-processor machine, and we absolutely
>>>>>>> # want to disallow over-subscribing it:
>>>>>>> plankton slots=3 max-slots=3
>>>>>>> # The following nodes are dual-processor machines:
>>>>>>> nano_00 slots=2 max-slots=2
>>>>>>> nano_01 slots=2 max-slots=2
>>>>>>> nano_02 slots=2 max-slots=2
>>>>>>> nano_03 slots=2 max-slots=2
>>>>>>> nano_04 slots=2 max-slots=2
>>>>>>> nano_05 slots=2 max-slots=2
>>>>>>> nano_06 slots=2 max-slots=2
>>>>>>>
>>>>>>> My rankfile:
>>>>>>> [jody@plankton neander]$ cat rankfile
>>>>>>> rank 0=nano_00 slot=1
>>>>>>> rank 1=plankton slot=0
>>>>>>> rank 2=nano_01 slot=1
>>>>>>>
>>>>>>> My Open MPI version: 1.3.2
>>>>>>>
>>>>>>> I get the same error if I use a rankfile which has a single line
>>>>>>> rank 0=plankton slot=0
>>>>>>> (plankton is my local machine) and call mpirun with np 1
>>>>>>>
>>>>>>> What does the "Rankfile claimed..." message mean?
>>>>>>> Did I make an error in my rankfile?
>>>>>>> If yes, what would be the correct way to write it?
>>>>>>>
>>>>>>> Thank You
>>>>>>> Jody
>>>>>>> _______________________________________________
>>>>>>> users mailing list
>>>>>>> us...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
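
The two workarounds reported in this thread for Open MPI 1.3.2 (fully qualified host names in the rankfile, and no max-slots entries in the hostfile) can be checked before calling mpirun. Below is a minimal sketch of such a check, not part of Open MPI and not a description of how its rank_file mapper works. The function names (parse_hostfile, parse_rankfile, lint) are hypothetical, only the simple "rank N=host slot=S" rankfile form used above is parsed, and treating a host without a slots= entry as one slot is an assumption.

```python
import re
from collections import Counter

# Matches only the simple rankfile form used in this thread:
#   rank 0=nano_00.uzh.ch slot=0
RANK_LINE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot=(\S+)$")


def parse_hostfile(text):
    """Return {host: {option: value}}; '#' comments and blank lines are skipped."""
    hosts = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, *fields = line.split()
        hosts[name] = dict(f.split("=", 1) for f in fields if "=" in f)
    return hosts


def parse_rankfile(text):
    """Return a list of (rank, host, slot) tuples; slot is kept as a string."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognised rankfile line: {raw!r}")
        entries.append((int(match.group(1)), match.group(2), match.group(3)))
    return entries


def lint(hostfile_text, rankfile_text):
    """Return warnings for the situations that caused trouble in this thread."""
    hosts = parse_hostfile(hostfile_text)
    warnings = []
    used = Counter()
    for rank, host, _slot in parse_rankfile(rankfile_text):
        if host not in hosts:
            warnings.append(f"rank {rank}: host {host} is not listed in the hostfile")
        else:
            used[host] += 1
    for host in sorted(used):
        opts = hosts[host]
        if "." not in host:
            # Heuristic: a name without a dot is probably not fully qualified.
            warnings.append(f"{host}: short host name, try the fully qualified name")
        slots = int(opts.get("slots", 1))  # assumption: one slot if unspecified
        if used[host] > slots:
            warnings.append(f"{host}: {used[host]} ranks but only {slots} slots")
        if "max-slots" in opts or "max_slots" in opts:
            warnings.append(f"{host}: max-slots entry, try removing it when using -rf")
    return warnings


if __name__ == "__main__":
    with open("th_02") as hf, open("rf_02") as rf:
        for message in lint(hf.read(), rf.read()):
            print(message)
```

Run against th_02 and rf_02 as quoted above, this should flag only the max-slots entries, which matches what fixed the second problem. It cannot tell whether a host name resolves the way mpirun expects; the "no dot" test is only a hint.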