The 1.6 code always expects to find the default hostfile, even if it is empty. 
We always install one by default, so I don't know why yours isn't there. In 
future releases, we will just ignore the file if it isn't found.

You have two options:

1. create that file (at the path shown in the error message) and leave it empty

2. work around it by adding --default-hostfile none to your mpirun command 
line, or by setting OMPI_MCA_orte_default_hostfile=none in your environment. 
If you want to do this for everyone on the system, then add 
"orte_default_hostfile=none" to your default MCA param file. (Each approach 
is sketched below.)
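For example, a minimal sketch of each approach (the paths assume the 
/opt/openmpi-gcc/1.6 prefix from your error output, and ./my_app is just a 
stand-in for your application):

  # Option 1: create the expected hostfile and leave it empty
  touch /opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile

  # Option 2, per invocation: tell mpirun not to use a default hostfile
  mpirun --default-hostfile none ./my_app

  # Option 2, per user: set the MCA parameter in the environment
  export OMPI_MCA_orte_default_hostfile=none

  # Option 2, system-wide: add the parameter to the default MCA param file
  # (normally <prefix>/etc/openmpi-mca-params.conf)
  echo "orte_default_hostfile=none" >> /opt/openmpi-gcc/1.6/etc/openmpi-mca-params.conf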

HTH
Ralph

On Aug 23, 2012, at 4:03 PM, Jim Kusznir <jkusz...@gmail.com> wrote:

> Hi all:
> 
> I recently rebuilt my cluster from Rocks 5 to Rocks 6 (which is based
> on CentOS 6.2) using the official spec file and the same build options as
> before.  It all built successfully and everything appeared good.  That is,
> until someone tried to use it.  This is built with Torque integration, and
> it's run through Torque.  When a user's job runs, the following ends up in
> the error file and the program does not run successfully:
> 
> --------------------------------------------------------------------------
> Open RTE was unable to open the hostfile:
>    /opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile
> Check to make sure the path and filename are correct.
> --------------------------------------------------------------------------
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file base/rmaps_base_support_fns.c at line 88
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file rmaps_rr.c at line 82
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file base/rmaps_base_map_job.c at line 88
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file base/plm_base_launch_support.c at line 105
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in
> file plm_tm_module.c at line 194
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> 
> This has been confirmed with several different node assignments.  Any
> ideas on the cause, or how to fix it?
> 
> I built it with this command:
> 
>   rpmbuild -bb --define 'install_in_opt 1' \
>     --define 'install_modulefile 1' \
>     --define 'modules_rpm_name environment-modules' \
>     --define 'build_all_in_one_rpm 0' \
>     --define 'configure_options --with-tm=/opt/torque' \
>     --define '_name openmpi-gcc' \
>     --define 'makeopts -j8' \
>     openmpi.spec
> 
> (and the PGI version was built with:
> 
>   CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb \
>     --define 'install_in_opt 1' \
>     --define 'install_modulefile 1' \
>     --define 'modules_rpm_name environment-modules' \
>     --define 'build_all_in_one_rpm 0' \
>     --define 'configure_options --with-tm=/opt/torque' \
>     --define '_name openmpi-pgi' \
>     --define 'use_default_rpm_opt_flags 0' \
>     openmpi.spec
> )
> 
> --Jim