Re: [OMPI users] Gridengine + Open MPI
Pak Lui wrote:
> It was fixed at one point in the trunk before v1.3 went official, but while
> rolling the code from the gridengine PLM into the rsh PLM, this feature was
> left out because there were some lingering issues that I didn't resolve and
> I lost track of it. Sorry, but thanks for bringing it up; I will need to
> look at the issue again and reopen this ticket against v1.3:

OK, so do I have to wait for a 1.3 version for job suspend to work, or will it be back-ported to 1.2.6 or a later 1.2.x?

> So even though it is the rsh PLM that starts the parallel job under SGE,
> the rsh PLM can detect if the Open MPI job is started under the SGE
> Parallel Environment (by checking some SGE environment variables) and use
> the "qrsh --inherit" command to launch the parallel job the same way as it
> was before. You can check by setting an MCA parameter like
> "--mca plm_base_verbose 10" on your mpirun command line and looking at the
> launch commands that mpirun uses.

It looks like the shepherd cannot be started, for a reason I haven't figured out yet:

/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255 [hostname:16745]

Regards,
Romaric

--
R. David - da...@icps.u-strasbg.fr
Tel. : 03 90 24 45 48 (Fax 45 47)
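For readers who want to see what such an environment check amounts to, here is a minimal, hypothetical C sketch. It only illustrates the idea of detecting an SGE Parallel Environment through environment variables and then choosing between a qrsh-based and a plain rsh/ssh launch; it is not the actual Open MPI PLM code, and the specific variable names (SGE_ROOT, ARC, PE_HOSTFILE, JOB_ID) are assumptions about what such a check could look for. Running a check like this from inside a qsub job script is an easy way to confirm which of these variables a given SGE installation actually sets.

/* Hypothetical sketch, not the actual Open MPI PLM source: detect an SGE
 * parallel environment by looking for environment variables that SGE sets
 * inside a parallel job, then decide how a launcher would start remote
 * processes. The variable names checked here are an assumption. */
#include <stdio.h>
#include <stdlib.h>

static int running_under_sge(void)
{
    /* These are normally all present in the environment of an SGE job
     * running under a parallel environment (PE). */
    return getenv("SGE_ROOT")    != NULL &&
           getenv("ARC")         != NULL &&
           getenv("PE_HOSTFILE") != NULL &&
           getenv("JOB_ID")      != NULL;
}

int main(void)
{
    if (running_under_sge()) {
        /* A tight-integration launcher would build a qrsh-based command
         * line here, so SGE can account for and control the slave tasks. */
        printf("SGE parallel environment detected: use qrsh to launch\n");
    } else {
        /* Otherwise fall back to a plain rsh/ssh startup. */
        printf("no SGE environment detected: fall back to rsh/ssh\n");
    }
    return 0;
}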
Re: [OMPI users] Gridengine + Open MPI
Hi,

On 07.07.2008, at 11:31, Romaric David wrote:
> Pak Lui wrote:
>> It was fixed at one point in the trunk before v1.3 went official, but
>> while rolling the code from the gridengine PLM into the rsh PLM, this
>> feature was left out because there were some lingering issues that I
>> didn't resolve and I lost track of it. Sorry, but thanks for bringing it
>> up; I will need to look at the issue again and reopen this ticket against
>> v1.3:
>
> OK, so do I have to wait for a 1.3 version for job suspend to work, or
> will it be back-ported to 1.2.6 or a later 1.2.x?
>
>> So even though it is the rsh PLM that starts the parallel job under SGE,
>> the rsh PLM can detect if the Open MPI job is started under the SGE
>> Parallel Environment (by checking some SGE environment variables) and use
>> the "qrsh --inherit" command to launch the parallel job the same way as
>> it was before. You can check by setting an MCA parameter like
>> "--mca plm_base_verbose 10" on your mpirun command line and looking at
>> the launch commands that mpirun uses.
>
> It looks like the shepherd cannot be started, for a reason I haven't
> figured out yet:
>
> /opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
> reading exit code from shepherd ... 255 [hostname:16745]

You mean with the plain rsh startup, i.e. a loose integration? Isn't a proper hostlist necessary in this case, which for other MPI implementations is built by the routine defined in start_proc_args? AFAIK you can disregard the hostlist only with Open MPI's tight SGE support.

--
Reuti
Re: [OMPI users] Gridengine + Open MPI
Romaric David wrote:
> Pak Lui wrote:
>> It was fixed at one point in the trunk before v1.3 went official, but
>> while rolling the code from the gridengine PLM into the rsh PLM, this
>> feature was left out because there were some lingering issues that I
>> didn't resolve and I lost track of it. Sorry, but thanks for bringing it
>> up; I will need to look at the issue again and reopen this ticket against
>> v1.3:
>
> OK, so do I have to wait for a 1.3 version for job suspend to work, or
> will it be back-ported to 1.2.6 or a later 1.2.x?

I believe it will definitely be in the 1.3 series; I am not sure about v1.2 at this point.

>> So even though it is the rsh PLM that starts the parallel job under SGE,
>> the rsh PLM can detect if the Open MPI job is started under the SGE
>> Parallel Environment (by checking some SGE environment variables) and use
>> the "qrsh --inherit" command to launch the parallel job the same way as
>> it was before. You can check by setting an MCA parameter like
>> "--mca plm_base_verbose 10" on your mpirun command line and looking at
>> the launch commands that mpirun uses.
>
> It looks like the shepherd cannot be started, for a reason I haven't
> figured out yet:
>
> /opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
> reading exit code from shepherd ... 255 [hostname:16745]
>
> Regards,
> Romaric

How recent is the build that you used to generate the error above? I assume you are using a trunk build?

I didn't see your complete error messages, but I think I am running into the exact same error. It is an odd error which claims that the 'ssh' component was not found. I don't believe there is a component named 'ssh' here, because ssh and rsh share the same component. Well, it looks like something is broken in the PLM that is responsible for launching the tight-integration job for SGE. I checked that it used to work without problems with my earlier trunk build (r18645). I have to find out what has happened since...

Starting server daemon at host "burl-ct-v440-4"
Server daemon successfully started with task id "1.burl-ct-v440-4"
Establishing /opt/sge/utilbin/sol-sparc64/rsh session to host burl-ct-v440-4 ...
[burl-ct-v440-4:13749] mca: base: components_open: Looking for plm components
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.

Host:      burl-ct-v440-4
Framework: plm
Component: ssh
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 70
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_plm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 135
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 132
[burl-ct-v440-4:13749]
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 311
/opt/sge/utilbin/sol-sparc64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[burl-ct-v440-5:09789]
--------------------------------------------------------------------------
A daemon (pid 9790) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote node.
Re: [OMPI users] Gridengine + Open MPI
Reuti wrote:
> Hi,
>
> On 07.07.2008, at 11:31, Romaric David wrote:
>> [...]
>> It looks like the shepherd cannot be started, for a reason I haven't
>> figured out yet:
>>
>> /opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
>> reading exit code from shepherd ... 255 [hostname:16745]
>
> You mean with the plain rsh startup, i.e. a loose integration? Isn't a
> proper hostlist necessary in this case, which for other MPI
> implementations is built by the routine defined in start_proc_args?
> AFAIK you can disregard the hostlist only with Open MPI's tight SGE
> support.
>
> --
> Reuti

I think he's using the tight integration and not a plain rsh startup. The output shows that he's using the bundled rsh from SGE. From my run with a recent trunk, something is indeed broken for tight integration. I am looking at it now.

--
- Pak Lui
pak@sun.com
[OMPI users] Valgrind Functionality
Hi,

I was attempting to get Valgrind working with a simple MPI app (osu_latency) on Open MPI. While it appears to report uninitialized values, it fails to report any of the mallocs or frees that have been performed. I am using RHEL 5, gcc 4.2.3 and a drop from the repo labeled openmpi-1.3a1r18303, configured with:

$ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker --with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/

As the FAQs suggest, I am running a later version of Valgrind and enabling the memchecker and debug options. I tested a slightly modified osu_latency which has a simple char-buffer malloc and free, but the Valgrind summary shows no malloc/free activity whatsoever. This is running on a dual-node system using InfiniPath HCAs. Here is a trimmed output:

[tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile valgrind ./osu_latency1
==17839== Memcheck, a memory error detector.
==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17839== Using LibVEX rev 1658, a library for dynamic binary translation.
==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17839== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17839== For more details, rerun with: -v
==17839==
==17823== Memcheck, a memory error detector.
==17823== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17823== Using LibVEX rev 1658, a library for dynamic binary translation.
==17823== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17823== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17823== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17823== For more details, rerun with: -v
==17823==
==17839== Syscall param write(buf) points to uninitialised byte(s)
==17839==    at 0x3DD8C0CAA0: __write_nocancel (in /lib64/libpthread-2.5.so)
==17839==    by 0x7C283E8: ipath_userinit (ipath_proto.c:191)
==17839==    by 0x7AFCDF4: psmi_port_open (psm_port.c:116)
==17839==    by 0x7AFE1FB: psm_ep_open (psm_ep.c:535)
==17839==    by 0x78E842C: ompi_mtl_psm_module_init (mtl_psm.c:108)
==17839==    by 0x78E9137: ompi_mtl_psm_component_init (mtl_psm_component.c:125)
==17839==    by 0x4CE32D4: ompi_mtl_base_select (mtl_base_component.c:105)
==17839==    by 0x76D9EDE: mca_pml_cm_component_init (pml_cm_component.c:145)
==17839==    by 0x4CE7425: mca_pml_base_select (pml_base_select.c:122)
==17839==    by 0x4C50ED2: ompi_mpi_init (ompi_mpi_init.c:398)
==17839==    by 0x4C93EE4: PMPI_Init (pinit.c:88)
==17839==    by 0x400D78: main (in /home/tomr/osu_latency1)
==17839==  Address 0x7FEFFE9D4 is on thread 1's stack
==17839==
==17823==
==17823== Syscall param sched_setaffinity(mask) points to uninitialised byte(s)
==17823==    at 0x3EA36B8AD0: sched_setaffinity@@GLIBC_2.3.4 (in /lib64/libc-2.5.so)
==17823==    by 0x7C2775E: ipath_setaffinity (ipath_proto.c:160)
==17823==    by 0x7C28400: ipath_userinit (ipath_proto.c:198)
==17823==    by 0x7AFCDF4: psmi_port_open (psm_port.c:116)
==17823==    by 0x7AFE1FB: psm_ep_open (psm_ep.c:535)
==17823==    by 0x78E842C: ompi_mtl_psm_module_init (mtl_psm.c:108)
==17823==    by 0x78E9137: ompi_mtl_psm_component_init (mtl_psm_component.c:125)
==17823==    by 0x4CE32D4: ompi_mtl_base_select (mtl_base_component.c:105)
==17823==    by 0x76D9EDE: mca_pml_cm_component_init (pml_cm_component.c:145)
==17823==    by 0x4CE7425: mca_pml_base_select (pml_base_select.c:122)
==17823==    by 0x4C50ED2: ompi_mpi_init (ompi_mpi_init.c:398)
==17823==    by 0x4C93EE4: PMPI_Init (pinit.c:88)
==17823==  Address 0x7FEFFEC30 is on thread 1's stack
==17823==

# OSU MPI Latency Test v3.0
# Size            Latency (us)
384               61.78

==17839==
==17839== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 5 from 1)
==17839== malloc/free: in use at exit: 0 bytes in 0 blocks.
==17839== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==17839== For counts of detected errors, rerun with: -v
==17839== All heap blocks were freed -- no leaks are possible.
==17823==
==17823== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 5 from 1)
==17823== malloc/free: in use at exit: 0 bytes in 0 blocks.
==17823== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==17823== For counts of detected errors, rerun with: -v
==17823== All heap blocks were freed -- no leaks are possible.

Hopefully this was enough info to garner a bit of input. Thanks in advance.

Tom
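A minimal test case along the following lines can help separate the MPI side from the Valgrind side. This is only a sketch and is not Tom's actual modification of osu_latency; the buffer size and message pattern are arbitrary. If memcheck is tracking the heap at all, the malloc and free below should show up in its "malloc/free" summary even when run under mpirun.

/* Sketch of the kind of test described above (not the actual osu_latency
 * change): an MPI program that malloc()s a char buffer, exchanges it
 * between two ranks, and free()s it. A memcheck run that is intercepting
 * the heap should count one alloc and one free per process. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int n = 384;          /* message size in bytes (arbitrary) */
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    buf = malloc(n);            /* should appear in the alloc count */
    memset(buf, rank, n);       /* initialize to avoid "uninitialised" noise */

    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    free(buf);                  /* should appear in the free count */
    MPI_Finalize();
    return 0;
}

If a program like this, run under valgrind via mpirun in the same way as above, still reports "0 allocs, 0 frees", that would suggest the problem is on the Valgrind/interception side (for example, how Valgrind was built or which malloc is actually being intercepted) rather than in the MPI application code itself.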