[OMPI users] torque integration when tm ras/plm isn't compiled in.

2009-10-22 Thread Roy Dragseth
Hi all.

I'm trying to create a tight integration between torque and openmpi for cases 
where the tm ras and plm components aren't compiled into openmpi.  This scenario 
is common for linux distros that ship openmpi.  The ideal solution is of course 
to recompile openmpi with torque support, but that isn't always feasible, since 
I do not want to maintain my own openmpi build for the stuff I'm distributing 
to others.

We also see some proprietary applications shipping their own embedded openmpi 
libraries where the tm plm/ras is either missing or non-functional with the 
torque installation on our system.

So far I've created a pbsdshwrapper.py that mimics ssh behaviour closely enough 
that starting the orteds on all the hosts works as expected, and the application 
starts correctly when I use

setenv OMPI_MCA_plm_rsh_agent "pbsdshwrapper.py"
mpirun --hostfile $PBS_NODEFILE 
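
The idea of the wrapper is simply to translate the ssh-style invocation produced 
by the rsh plm (roughly "<agent> [options] <host> <command...>") into a pbsdsh 
call.  A stripped-down sketch of that idea (not the actual script; the argument 
handling below is an assumption) looks like this:

pbsdshwrapper.py (hypothetical sketch):

#!/usr/bin/env python
# Hypothetical ssh-lookalike launch agent for Open MPI's rsh plm.
# pbsdsh -h <host> starts the command on the given host through the TM
# interface, so torque keeps control of (and accounts for) the orteds.
import os
import sys

args = sys.argv[1:]

# Skip any ssh-style options that may be prepended (assumption).
while args and args[0].startswith("-"):
    args.pop(0)

if not args:
    sys.exit("usage: pbsdshwrapper.py <host> <command> [args...]")

host, command = args[0], args[1:]

# ssh hands the remote command to a shell, so mimic that under pbsdsh.
os.execvp("pbsdsh", ["pbsdsh", "-h", host, "/bin/sh", "-c", " ".join(command)])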

What I want now is a way to get rid of the --hostfile $PBS_NODEFILE in the 
mpirun command.  Is there an environment variable that I can set so that 
mpirun grabs the right nodelist?

By spelunking the code I found that the rsh plm has support for SGE: it 
automatically picks up $PE_HOSTFILE if it detects that it is launched within an 
SGE job.  Would it be possible to have the same functionality for torque?  The 
code looks a bit too complex at first sight for me to fix this myself.

Best regards,
Roy.

-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no



Re: [OMPI users] torque integration when tm ras/plm isn't compiled in.

2009-10-22 Thread Roy Dragseth
On Friday 23 October 2009 00:50:00 Ralph Castain wrote:
> Why not just
> 
> setenv OMPI_MCA_orte_default_hostfile $PBS_NODEFILE
> 
> assuming you are using 1.3.x, of course.
> 
> If not, then you can use the equivalent for 1.2 - ompi_info would tell
> you the name of it.

THANKS!

Just what I was looking for.  Been looking up and down for it, but couldn't 
find the right swear words.

Is it also possible to disable the backgrounding of the orted daemons?  When 
they fork into the background one loses the feedback about cpu usage in the job.  
Not really a big issue though...

Regards,
r.


-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no



[OMPI users] Memory corruption?

2009-12-22 Thread Roy Dragseth
Hi.

We have started to scale up one of our codes and sometimes we get messages 
like this:

[c9-13.local:31125] Memory 0x2aaab7b64000:217088 cannot be freed from 
the registration cache. Possible memory corruption.

It seems like the application runs normally and it does not crash because of 
this.  Should we be worried?  We have tested the code with up to 1700 cores 
and the message becomes more frequent as we scale up.

System details:

Rocks 5.2 (aka CentOS 5.3) x86_64
INTEL Compiler 11.1
OFED 1.4.1
OpenMPI 1.3.3

Best regards and Merry Christmas to all,
r.

-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no



Re: [OMPI users] Memory corruption?

2010-01-08 Thread Roy Dragseth
> Hi.
> 
> We have started to scale up one of our codes and sometimes we get messages
> like this:
> 
> [c9-13.local:31125] Memory 0x2aaab7b64000:217088 cannot be freed from
> the registration cache. Possible memory corruption.
> 
> It seems like the application runs normally and it does not crash because of
> this.  Should we be worried?  We have tested the code with up to 1700 cores
> and the message becomes more frequent as we scale up.

Never mind, this turned out to be an application bug: a buffer was freed before 
the Isend had completed.
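
For the archives: the general rule is that a buffer handed to a nonblocking send 
must not be freed or reused until that send has completed.  A minimal 
illustration of the right ordering (sketched here with mpi4py purely for 
brevity; this is not our application code):

isend_reuse.py (hypothetical sketch, run with: mpirun -np 2 python isend_reuse.py):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    buf = np.arange(1024, dtype="i")
    req = comm.Isend([buf, MPI.INT], dest=1, tag=0)
    # WRONG (our bug): freeing or overwriting buf at this point, before
    # the send has completed, is what produced the registration cache
    # warnings above.
    req.Wait()   # RIGHT: complete the send first...
    del buf      # ...and only then release or reuse the buffer
elif rank == 1:
    buf = np.empty(1024, dtype="i")
    comm.Recv([buf, MPI.INT], source=0, tag=0)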

r.


-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no



[OMPI users] Openmpi 1.3 problems with libtool-ltdl on CentOS 4 and 5

2009-01-23 Thread Roy Dragseth
Hi, all.

I do not know whether this should be considered a real bug or not; I'm just 
reporting it here so people can find it if they google for the error message it 
produces.  There is a backtrace at the end of this mail.

Problem description:

Openmpi 1.3 seems to be nonfunctional when used with the libltdl from libtool 
v1.5 that is installed on CentOS (aka RH EL) 4 and 5.  Upgrading to libtool 
v2.2.6a (and maybe earlier versions) solves the problem.  We saw this problem 
with both gcc and icc.

Here is a code snippet that is extracted from the real application.

nestcrash.c:

#include <mpi.h>
#include <ltdl.h>

int main(int argc, char *argv[])
{
  /* With the system libtool 1.5, merely referencing lt_dlopenext below
     (and thus linking against -lltdl) makes MPI_Init crash. */
  MPI_Init(&argc, &argv);

  char *dummy = "dummy";
  const lt_dlhandle hModule = lt_dlopenext(dummy);

  MPI_Finalize();
  return 0;
}

This will crash in MPI_Init when using libtool 1.5.X; if you comment out the 
lt_dlopenext call it runs normally.

I can provide a complete example if necessary.

As I said earlier, upgrading to libtool 2.2.6a solved the problem for us.

Here is the backtrace:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code:  (128)
Failing at address: (nil)
[ 0] /lib64/tls/libpthread.so.0 [0x3ffce0c4f0]
[ 1] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0 [0x2a95d4bce5]
[ 2] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(lt_dlopenadvise+0xf0) [0x2a95d4b470]
[ 3] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0 [0x2a95d56e1f]
[ 4] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(mca_base_component_find+0x58d) [0x2a95d5657d]
[ 5] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(mca_base_components_open+0x1ae) [0x2a95d581be]
[ 6] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(opal_paffinity_base_open+0xad) [0x2a95d73ddd]
[ 7] /global/apps/openmpi/1.3rc2/lib/libopen-pal.so.0(opal_init+0x64) [0x2a95d43e64]
[ 8] /global/apps/openmpi/1.3rc2/lib/libopen-rte.so.0(orte_init+0x1e) [0x2a95bdeb8e]
[ 9] /global/apps/openmpi/1.3rc2/lib/libmpi.so.0 [0x2a95a38fee]
[10] /global/apps/openmpi/1.3rc2/lib/libmpi.so.0(PMPI_Init_thread+0x72) [0x2a95a5b9c2]
[11] nest-ompi_1.3rc2/bin/nest(_ZN4nest12Communicator4initEPiPPPc+0x11f) [0x55440f]
[12] nest-ompi_1.3rc2/bin/nest(main+0x74) [0x4a7674]
[13] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x339271c3fb]
[14] nest-ompi_1.3rc2/bin/nest(_ZNSt8ios_base4InitD1Ev+0x5a) [0x4a756a]
*** End of error message ***



-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no




Re: [OMPI users] Openmpi 1.3 problems with libtool-ltdl on CentOS 4 and 5

2009-01-30 Thread Roy Dragseth
On Friday 23 January 2009 15:31:59 Jeff Squyres wrote:
> Ew.  Yes, I can see this being a problem.
>
> I'm guessing that the real issue is that OMPI embeds the libltdl from
> LT 2.2.6a inside libopen_pal (one of the internal OMPI libraries).
> Waving my hands a bit, but it's not hard to imagine some sort of clash
> is going on between the -lltdl you added to the command line and the
> libltdl that is embedded in OMPI's libraries.
>
> Can you verify that this is what is happening?

Hi, sorry for the delay.

I'm not very familiar with the workings of ltdl; I got this example from one of 
our users.  Are you suggesting that if one uses openmpi 1.3 together with ltdl, 
one should not explicitly link with -lltdl?  At least that seems to work 
correctly with the example I posted: I can link the program without specifying 
-lltdl, the symbol then apparently resolves to the copy embedded in the openmpi 
libraries, and the example runs without crashing.

Regards,
r.

-- 
  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
 Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no