[OMPI users] can't run mpi-jobs on remote host

2014-04-11 Thread Lubrano Francesco
Dear MPI users,

I have a problem with Open MPI (version 1.8).

I'm just beginning to understand how MPI works, and I can't find a solution to 
my problem on the FAQ page.

I have two machines (a local host and a remote host) running Linux openSUSE 
(latest version) and Open MPI 1.8.

I can run Open MPI jobs locally on each machine.

I have no connection problems: ssh from the first machine to the second works 
correctly, and I can run programs on the remote machine.

If I use mpirun to run a simple non-MPI program on the remote host from the 
local one, it doesn't work. I think it is a buffer pointer problem (status 1). 
I didn't change the Open MPI settings, and PATH is correct. I have just one 
Open MPI version on both machines. Open MPI doesn't show any error: it just 
returns to the prompt on the local machine.
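
One way to double-check the PATH claim is to ask a non-interactive ssh session 
on the remote host what it actually sees, since mpirun starts its remote daemon 
(orted) over ssh. The sketch below is only an illustration: the function name 
check_remote_env and the SSH_CMD override are hypothetical, and the address is 
the one used in the commands in this message.

```shell
# Sketch: show what a non-interactive ssh session on the remote host sees.
# SSH_CMD is a hypothetical override hook (defaults to plain ssh) so the
# function can be dry-run, for example with SSH_CMD=echo.
check_remote_env() {
    target="$1"   # e.g. Frank@158.110.39.110
    # Non-interactive shells may skip the startup files that set PATH,
    # so ask the remote shell directly whether it finds orted.
    ${SSH_CMD:-ssh} "$target" 'command -v orted; echo "PATH=$PATH"'
}
```

Calling check_remote_env Frank@158.110.39.110 should print the location of 
orted followed by the remote PATH. If no orted path appears before the PATH= 
line, PATH is probably set only for interactive logins.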

I can also ask it to run a nonexistent program: the behavior does not change. 
On the terminal I see:


francesco@linux-hldu:~> mpirun -host Frank@158.110.39.110 hostname

Password:

francesco@linux-hldu:~>


If I request debug, the result is:


francesco@linux-hldu:~> mpirun -d --host Frank@158.110.39.110 hostname
[linux-hldu.site:02537] sess_dir_finalize: job session dir not empty - leaving
[linux-hldu.site:02537] procdir: /tmp/openmpi-sessions-francesco@linux-hldu_0/33429/0/0
[linux-hldu.site:02537] jobdir: /tmp/openmpi-sessions-francesco@linux-hldu_0/33429/0
[linux-hldu.site:02537] top: openmpi-sessions-francesco@linux-hldu_0
[linux-hldu.site:02537] tmp: /tmp
Password:
[linux-o5sl.site:04273] sess_dir_finalize: job session dir not empty - leaving
[linux-o5sl.site:04273] procdir: /tmp/openmpi-sessions-Frank@linux-o5sl_0/33429/0/1
[linux-o5sl.site:04273] jobdir: /tmp/openmpi-sessions-Frank@linux-o5sl_0/33429/0
[linux-o5sl.site:04273] top: openmpi-sessions-Frank@linux-o5sl_0
[linux-o5sl.site:04273] tmp: /tmp
[linux-o5sl.site:04273] sess_dir_finalize: job session dir not empty - leaving
exiting with status 1
[linux-hldu.site:02537] sess_dir_finalize: job session dir not empty - leaving
exiting with status 1


Do you understand where the problem is? How can I get more information?
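
A minimal sketch of one way to collect more detail, assuming the verbosity 
flags that appear later in this thread (-mca plm_base_verbose 10 and 
--debug-daemons). The function name run_verbose and the MPIRUN override are 
hypothetical and exist only so the wrapper can be dry-run.

```shell
# Sketch: launch a command on a remote host with extra launcher output
# and report the exit status of mpirun afterwards.
# MPIRUN is a hypothetical override hook (defaults to plain mpirun).
run_verbose() {
    target="$1"   # e.g. Frank@158.110.39.110
    shift         # the remaining arguments are the program to run
    ${MPIRUN:-mpirun} -mca plm_base_verbose 10 --debug-daemons \
        --host "$target" "$@"
    status=$?
    echo "mpirun exit status: $status"
    return "$status"
}
```

For example, run_verbose Frank@158.110.39.110 hostname prints the launcher 
messages and then the exit status, which makes a silent failure easier to spot.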

Thank you for your cooperation


regards


Francesco



Re: [OMPI users] can't run mpi-jobs on remote host

2014-04-13 Thread Lubrano Francesco
Sorry for my late reply.

I previously tried with a passphrase and, in this particular case, that is not 
the problem: the error also occurs when no password is requested.

So I think there is something else.

Francesco


Re: [OMPI users] can't run mpi-jobs on remote host

2014-04-14 Thread Lubrano Francesco
I can't set --enable-debug (command not found: I only see --enable-recovery in 
the help output), but the other options work properly. The output is:

francesco@linux-hldu:~> mpirun -mca plm_base_verbose 10 --debug-daemons --host Frank@158.110.39.110 hostname
[linux-hldu.site:02234] mca: base: components_register: registering plm components
[linux-hldu.site:02234] mca: base: components_register: found loaded component isolated
[linux-hldu.site:02234] mca: base: components_register: component isolated has no register or open function
[linux-hldu.site:02234] mca: base: components_register: found loaded component rsh
[linux-hldu.site:02234] mca: base: components_register: component rsh register function successful
[linux-hldu.site:02234] mca: base: components_register: found loaded component slurm
[linux-hldu.site:02234] mca: base: components_register: component slurm register function successful
[linux-hldu.site:02234] mca: base: components_open: opening plm components
[linux-hldu.site:02234] mca: base: components_open: found loaded component isolated
[linux-hldu.site:02234] mca: base: components_open: component isolated open function successful
[linux-hldu.site:02234] mca: base: components_open: found loaded component rsh
[linux-hldu.site:02234] mca: base: components_open: component rsh open function successful
[linux-hldu.site:02234] mca: base: components_open: found loaded component slurm
[linux-hldu.site:02234] mca: base: components_open: component slurm open function successful
[linux-hldu.site:02234] mca:base:select: Auto-selecting plm components
[linux-hldu.site:02234] mca:base:select:(  plm) Querying component [isolated]
[linux-hldu.site:02234] mca:base:select:(  plm) Query of component [isolated] set priority to 0
[linux-hldu.site:02234] mca:base:select:(  plm) Querying component [rsh]
[linux-hldu.site:02234] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[linux-hldu.site:02234] mca:base:select:(  plm) Querying component [slurm]
[linux-hldu.site:02234] mca:base:select:(  plm) Skipping component [slurm]. Query failed to return a module
[linux-hldu.site:02234] mca:base:select:(  plm) Selected component [rsh]
[linux-hldu.site:02234] mca: base: close: component isolated closed
[linux-hldu.site:02234] mca: base: close: unloading component isolated
[linux-hldu.site:02234] mca: base: close: component slurm closed
[linux-hldu.site:02234] mca: base: close: unloading component slurm
Daemon was launched on linux-o5sl.site - beginning to initialize
[linux-o5sl.site:02271] mca: base: components_register: registering plm components
[linux-o5sl.site:02271] mca: base: components_register: found loaded component rsh
[linux-o5sl.site:02271] mca: base: components_register: component rsh register function successful
[linux-o5sl.site:02271] mca: base: components_open: opening plm components
[linux-o5sl.site:02271] mca: base: components_open: found loaded component rsh
[linux-o5sl.site:02271] mca: base: components_open: component rsh open function successful
[linux-o5sl.site:02271] mca:base:select: Auto-selecting plm components
[linux-o5sl.site:02271] mca:base:select:(  plm) Querying component [rsh]
[linux-o5sl.site:02271] mca:base:select:(  plm) Query of component [rsh] set priority to 10
[linux-o5sl.site:02271] mca:base:select:(  plm) Selected component [rsh]
Daemon [[33734,0],1] checking in as pid 2271 on host linux-o5sl
[linux-o5sl.site:02271] [[33734,0],1] orted: up and running - waiting for commands!
[linux-o5sl.site:02271] mca: base: close: component rsh closed
[linux-o5sl.site:02271] mca: base: close: unloading component rsh
[linux-hldu.site:02234] [[33734,0],0] orted_cmd: received exit cmd
[linux-hldu.site:02234] [[33734,0],0] orted_cmd: all routes and children gone - exiting
[linux-hldu.site:02234] mca: base: close: component rsh closed
[linux-hldu.site:02234] mca: base: close: unloading component rsh

Is orted on linux-o5sl receiving any command?
Thank you for your cooperation

(I don't know if it matters, but I have the same problem when using the first 
PC as the remote host and the second as the local one.)

regards

Francesco