Re: [OMPI users] Gridengine + Open MPI

2008-07-07 Thread Romaric David

Pak Lui wrote:



It was fixed at one point in the trunk before v1.3 went official, but 
while rolling the gridengine PLM code into the rsh PLM, this feature 
was left out because there were some lingering issues that I didn't 
resolve, and I lost track of it. Sorry, but thanks for bringing it up; 
I will need to look at the issue again and reopen this ticket 
against v1.3:

OK, so do I have to wait for a 1.3 version for job suspend to work, or
will it be back-ported to 1.2.6 or a later 1.2.x release?




So even though it is the rsh PLM that starts the parallel job under SGE, the 
rsh PLM can detect whether the Open MPI job is started under the SGE Parallel 
Environment (by checking some SGE environment variables) and use the "qrsh 
-inherit" command to launch the parallel job the same way as before. You 
can check by setting an MCA parameter like "--mca plm_base_verbose 10" on 
your mpirun command line and looking for the launch commands that mpirun 
uses.
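
For reference, a run along these lines (the application name and process
count are just placeholders) makes the rsh PLM print the launch commands it
uses, so you can see whether "qrsh -inherit" is invoked:

  $ mpirun --mca plm_base_verbose 10 -np 4 ./my_app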



It looks like the shepherd cannot be started, for a reason I haven't figured out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] 

Regards,
Romaric
--
--
  R. David - da...@icps.u-strasbg.fr
  Tel. : 03 90 24 45 48  (Fax 45 47)
--


Re: [OMPI users] Gridengine + Open MPI

2008-07-07 Thread Reuti

Hi,

On 07.07.2008, at 11:31, Romaric David wrote:


Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official,  
but while rolling the gridengine PLM code into the rsh PLM, this  
feature was left out because there were some lingering issues that  
I didn't resolve, and I lost track of it. Sorry, but thanks for  
bringing it up; I will need to look at the issue again and reopen  
this ticket against v1.3:

OK, so do I have to wait for a 1.3 version for job suspend to work, or
will it be back-ported to 1.2.6 or a later 1.2.x release?

So even though it is the rsh PLM that starts the parallel job under  
SGE, the rsh PLM can detect whether the Open MPI job is started  
under the SGE Parallel Environment (by checking some SGE environment  
variables) and use the "qrsh -inherit" command to launch the  
parallel job the same way as before. You can check by setting an  
MCA parameter like "--mca plm_base_verbose 10" on your mpirun  
command line and looking for the launch commands that mpirun uses.

It looks like the shepherd cannot be started, for a reason I haven't  
figured out yet.

/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] 


Do you mean with the plain rsh startup, i.e. a loose integration?  
Isn't a proper hostlist necessary in that case, which for other MPI  
implementations is built by the routine defined in start_proc_args?  
AFAIK you can disregard the hostlist only with Open MPI's tight SGE  
support.
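
(For comparison, a tight-integration PE usually looks something like the  
following; the PE name "orte", the slot count and the allocation rule are  
arbitrary here. No hostlist is built in start_proc_args, and control_slaves  
is enabled so that qrsh -inherit is allowed:)

  $ qconf -sp orte
  pe_name            orte
  slots              999
  user_lists         NONE
  xuser_lists        NONE
  start_proc_args    /bin/true
  stop_proc_args     /bin/true
  allocation_rule    $round_robin
  control_slaves     TRUE
  job_is_first_task  FALSE
  urgency_slots      min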


-- Reuti


Re: [OMPI users] Gridengine + Open MPI

2008-07-07 Thread Pak Lui

Romaric David wrote:

Pak Lui wrote:



It was fixed at one point in the trunk before v1.3 went official, but 
while rolling the gridengine PLM code into the rsh PLM, this feature 
was left out because there were some lingering issues that I didn't 
resolve, and I lost track of it. Sorry, but thanks for bringing it up; 
I will need to look at the issue again and reopen this ticket 
against v1.3:

OK, so do I have to wait for a 1.3 version for job suspend to work, or
will it be back-ported to 1.2.6 or a later 1.2.x release?


I believe it will definitely be in the 1.3 series; I am not sure about 
v1.2 at this point.







So even though it is the rsh PLM that starts the parallel job under SGE, the 
rsh PLM can detect whether the Open MPI job is started under the SGE 
Parallel Environment (by checking some SGE environment variables) and use 
the "qrsh -inherit" command to launch the parallel job the same way as 
before. You can check by setting an MCA parameter like "--mca 
plm_base_verbose 10" on your mpirun command line and looking for the launch 
commands that mpirun uses.



It looks like the shepherd cannot be started, for a reason I haven't figured out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] 

 Regards,
Romaric


How recent is the build that you used to generate the error above? I 
assume you are using a trunk build?


I didn't see the complete error messages that you are seeing, but I 
think I am running into the exact same error. It is a strange error 
that points out that the 'ssh' component was not found. I don't 
believe there should be a component named 'ssh' here, because ssh and 
rsh share the same component.


Well, it looks like something is broken in the PLM that is responsible 
for launching the tight-integration job under SGE.


I checked that this used to work without problems with my earlier trunk 
build (r18645). I have to find out what has happened since...
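
(As a quick sanity check, something like the following lists the PLM 
components an installation actually provides; on a build like this I would 
expect to see the rsh component and no separate ssh component:)

  $ ompi_info | grep plm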




Starting server daemon at host "burl-ct-v440-4"
Server daemon successfully started with task id "1.burl-ct-v440-4"
Establishing /opt/sge/utilbin/sol-sparc64/rsh session to host 
burl-ct-v440-4 ...
[burl-ct-v440-4:13749] mca: base: components_open: Looking for plm 
components

--
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  burl-ct-v440-4
Framework: plm
Component: ssh
--
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
base/ess_base_std_orted.c at line 70
[burl-ct-v440-4:13749] 
--

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_open failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
ess_env_module.c at line 135
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
runtime/orte_init.c at line 132
[burl-ct-v440-4:13749] 
--

It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--
[burl-ct-v440-4:13749] [[4569,0],1] ORTE_ERROR_LOG: Error in file 
orted/orted_main.c at line 311

/opt/sge/utilbin/sol-sparc64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[burl-ct-v440-5:09789] 
--

A daemon (pid 9790) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the sh

Re: [OMPI users] Gridengine + Open MPI

2008-07-07 Thread Pak Lui

Reuti wrote:

Hi,

On 07.07.2008, at 11:31, Romaric David wrote:


Pak Lui wrote:
It was fixed at one point in the trunk before v1.3 went official, but 
while rolling the gridengine PLM code into the rsh PLM, this feature 
was left out because there were some lingering issues that I didn't 
resolve, and I lost track of it. Sorry, but thanks for bringing it up; 
I will need to look at the issue again and reopen this ticket against 
v1.3:

OK, so do I have to wait for a 1.3 version for job suspend to work, or
will it be back-ported to 1.2.6 or a later 1.2.x release?

So even though it is the rsh PLM that starts the parallel job under SGE, 
the rsh PLM can detect whether the Open MPI job is started under the SGE 
Parallel Environment (by checking some SGE environment variables) and use 
the "qrsh -inherit" command to launch the parallel job the same way as 
before. You can check by setting an MCA parameter like "--mca 
plm_base_verbose 10" on your mpirun command line and looking for the 
launch commands that mpirun uses.

It looks like the shepherd cannot be started, for a reason I haven't figured out yet.
/opt/SGE/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ... 255
[hostname:16745] 


Do you mean with the plain rsh startup, i.e. a loose integration? Isn't a 
proper hostlist necessary in that case, which for other MPI 
implementations is built by the routine defined in start_proc_args? AFAIK 
you can disregard the hostlist only with Open MPI's tight SGE support.


I think he's using the tight integration, not a plain rsh startup. The 
output shows that he's using the rsh bundled with SGE. From my run with a 
recent trunk, something is indeed broken for tight integration. I am 
looking at it now.
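
(To double-check from the user side that the job really runs inside the SGE 
Parallel Environment, something like the following in the job script before 
mpirun can help; the application name is a placeholder, and the exact set of 
SGE variables Open MPI looks at may differ:)

  echo "JOB_ID=$JOB_ID PE=$PE NSLOTS=$NSLOTS"
  cat "$PE_HOSTFILE"
  mpirun --mca plm_base_verbose 10 -np $NSLOTS ./my_app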




-- Reuti



--

- Pak Lui
pak@sun.com


[OMPI users] Valgrind Functionality

2008-07-07 Thread Tom Riddle
Hi,

I was attempting to get valgrind working with a simple MPI app (osu_latency) on 
Open MPI. While it appears to report uninitialized values, it fails to report 
any mallocs or frees that have been performed.

I am using RHEL 5, gcc 4.2.3, and a drop from the repo labeled 
openmpi-1.3a1r18303, configured with:

 $ ../configure --prefix=/opt/wkspace/openmpi-1.3a1r18303 CC=gcc 
CXX=g++ --disable-mpi-f77 --enable-debug --enable-memchecker 
--with-psm=/usr/include --with-valgrind=/opt/wkspace/valgrind-3.3.0/


As the FAQs suggest, I am running a later version of valgrind and enabling the 
memchecker and debug options. I tested a slightly modified osu_latency which has 
a simple char buffer malloc and free, but the valgrind summary shows no 
malloc/free activity whatsoever. This is running on a dual-node system using 
InfiniPath HCAs. Here is a trimmed output.

[tom@lab01 ~]$ mpirun --mca pml cm -np 2 --hostfile my_hostfile valgrind 
./osu_latency1 
==17839== Memcheck, a memory error detector.
==17839== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17839== Using LibVEX rev 1658, a library for dynamic binary translation.
==17839== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17839== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17839== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17839== For more details, rerun with: -v
==17839== 
==17823== Memcheck, a memory error detector.
==17823== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
==17823== Using LibVEX rev 1658, a library for dynamic binary translation.
==17823== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
==17823== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
==17823== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
==17823== For more details, rerun with: -v
==17823== 
==17839== Syscall param write(buf) points to uninitialised byte(s)
==17839==    at 0x3DD8C0CAA0: __write_nocancel (in /lib64/libpthread-2.5.so)
==17839==    by 0x7C283E8: ipath_userinit (ipath_proto.c:191)
==17839==    by 0x7AFCDF4: psmi_port_open (psm_port.c:116)
==17839==    by 0x7AFE1FB: psm_ep_open (psm_ep.c:535)
==17839==    by 0x78E842C: ompi_mtl_psm_module_init (mtl_psm.c:108)
==17839==    by 0x78E9137: ompi_mtl_psm_component_init (mtl_psm_component.c:125)
==17839==    by 0x4CE32D4: ompi_mtl_base_select (mtl_base_component.c:105)
==17839==    by 0x76D9EDE: mca_pml_cm_component_init (pml_cm_component.c:145)
==17839==    by 0x4CE7425: mca_pml_base_select (pml_base_select.c:122)
==17839==    by 0x4C50ED2: ompi_mpi_init (ompi_mpi_init.c:398)
==17839==    by 0x4C93EE4: PMPI_Init (pinit.c:88)
==17839==    by 0x400D78: main (in /home/tomr/osu_latency1)
==17839==  Address 0x7FEFFE9D4 is on thread 1's stack
==17839== 
==17823== 
==17823== Syscall param sched_setaffinity(mask) points to uninitialised byte(s)
==17823==    at 0x3EA36B8AD0: sched_setaffinity@@GLIBC_2.3.4 (in 
/lib64/libc-2.5.so)
==17823==    by 0x7C2775E: ipath_setaffinity (ipath_proto.c:160)
==17823==    by 0x7C28400: ipath_userinit (ipath_proto.c:198)
==17823==    by 0x7AFCDF4: psmi_port_open (psm_port.c:116)
==17823==    by 0x7AFE1FB: psm_ep_open (psm_ep.c:535)
==17823==    by 0x78E842C: ompi_mtl_psm_module_init (mtl_psm.c:108)
==17823==    by 0x78E9137: ompi_mtl_psm_component_init (mtl_psm_component.c:125)
==17823==    by 0x4CE32D4: ompi_mtl_base_select (mtl_base_component.c:105)
==17823==    by 0x76D9EDE: mca_pml_cm_component_init (pml_cm_component.c:145)
==17823==    by 0x4CE7425: mca_pml_base_select (pml_base_select.c:122)
==17823==    by 0x4C50ED2: ompi_mpi_init (ompi_mpi_init.c:398)
==17823==    by 0x4C93EE4: PMPI_Init (pinit.c:88)
==17823==  Address 0x7FEFFEC30 is on thread 1's stack
==17823== 
# OSU MPI Latency Test v3.0
# Size    Latency (us)
384  61.78
==17839== 
==17839== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 5 from 1)
==17839== malloc/free: in use at exit: 0 bytes in 0 blocks.
==17839== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==17839== For counts of detected errors, rerun with: -v
==17839== All heap blocks were freed -- no leaks are possible.
==17823== 
==17823== ERROR SUMMARY: 9 errors from 9 contexts (suppressed: 5 from 1)
==17823== malloc/free: in use at exit: 0 bytes in 0 blocks.
==17823== malloc/free: 0 allocs, 0 frees, 0 bytes allocated.
==17823== For counts of detected errors, rerun with: -v
==17823== All heap blocks were freed -- no leaks are possible.
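
(A follow-up I can try: valgrind's --trace-malloc=yes option prints every 
malloc/free call valgrind intercepts, so rerunning the same command as above 
with that flag added, e.g.

  mpirun --mca pml cm -np 2 --hostfile my_hostfile valgrind --trace-malloc=yes ./osu_latency1

should show whether valgrind sees the allocations at all.)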

Hopefully this was enough info to garner a bit of input. Thanks in advance, Tom