Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Tena Sakai
Hi,

I have made a bit more progress.  I think I can say ssh authenti-
cation problem is behind me now.  I am still having a problem running
mpirun, but the latest discovery, which I can reproduce, is that
I can run mpirun as root.  Here's the session log:

  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
  total 16
  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
  [tsakai@ip-10-100-243-195 ~]$ hostname
  ip-10-100-243-195
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ exit
  logout
  Connection to ip-10-100-243-195.ec2.internal closed.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ hostname
  ip-10-195-198-31
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it as root
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ sudo su
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l /root/.ssh/config
  -rw--- 1 root root 103 Feb 11 00:56 /root/.ssh/config
  bash-3.2#
  bash-3.2# cat /root/.ssh/config
  Host *
  IdentityFile /root/.ssh/.derobee/.kagi
  IdentitiesOnly yes
  BatchMode yes
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  bash-3.2#
  bash-3.2# # now is the time for mpirun
  bash-3.2#
  bash-3.2# mpirun --app ./app.ac
  13 ip-10-100-243-195
  21 ip-10-100-243-195
  5 ip-10-195-198-31
  8 ip-10-195-198-31
  bash-3.2#
  bash-3.2# # It works (being root)!
  bash-3.2#
  bash-3.2# exit
  exit
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ exit
  logout
  [tsakai@vixen ec2]$

So, why does it say "pipe function call failed when setting up
I/O forwarding subsystem  Node: ip-10-195-198-31"?
The node it is referring to is not the remote machine.  It is
what I call machine A.  I first thought maybe this is a problem
with the PATH variable.  But I don't think so.  I compared root's
PATH to that of tsakai's and made them identical and retried.
I got the same behavior.

If you could enlighten me as to why this is happening, I would really
appreciate it.

Thank you.

Tena


On 2/10/11 4:12 PM, "Tena Sakai"  wrote:

> Hi Jeff,
>
> Thanks for the firewall tip.  I tried it while allowing all tcp traffic
> and got an interesting and perplexing result.  Here's what's interesting
> (BTW, I got rid of "LogLevel DEBUG3" from ./ss

Re: [OMPI users] runtime error

2011-02-11 Thread Marcela Castro León
Hello:

I have the same version of Ubuntu, 10.04, on both. The original version was Ubuntu
Server 9.1 (64) and I upgraded both of them to 10.04.
Yesterday I updated and upgraded to the same level again, but I got
the same error after that.
The machines are exactly the same, HP Compaq with an Intel Core i5.

Anyway, I've compared the versions of openmpi and gcc, and they are the same too:
1.4.1-2 and 4.4.4.3 respectively. I'm attaching the output of dpkg -l on the
two systems.

I would appreciate a lot any help to solve it.
Thank you.

Marcela.
2011/2/10 Jeff Squyres 

> I typically see these kinds of errors when there's an Open MPI version
> mismatch between the nodes, and/or if there are slightly different flavors
> of Linux installed on each node (i.e., you're technically in a heterogeneous
> situation, but you're trying to run a single application binary).  Can you
> verify:
>
> 1. that you have exactly the same version of Open MPI installed on all
> nodes?  (and that your application was compiled against that exact version)
>
> 2. that you have exactly the same OS/update level installed on all nodes
> (e.g., same versions of glibc, etc.)
>
>
> On Feb 10, 2011, at 3:13 AM, Marcela Castro León wrote:
>
> > Hello
> > I've a program that allways works fine, but i'm trying it on a new
> cluster and fails when I execute it on more than one machine.
> > I mean, if I execute alone on each host, everything works fine.
> > radic@santacruz:~/gaps/caso3-i1$ mpirun -np 3 ../test parcorto.txt
> >
> > But when I execute
> > radic@santacruz:~/gaps/caso3-i1$ mpirun -np 3 -machinefile
> /home/radic/mfile ../test parcorto.txt
> >
> > I get this error:
> >
> > mpirun has exited due to process rank 0 with PID 2132 on
> > node santacruz exiting without calling "finalize". This may
> > have caused other processes in the application to be
> > terminated by signals sent by mpirun (as reported here).
> >
> --
> >
> > Though the machinefile (mfile) had only one machine, the programs fails.
> > This is the current content:
> >
> > radic@santacruz:~/gaps/caso3-i1$ cat /home/radic/mfile
> > santacruz
> > chubut
> >
> > I've debug the program and the error occurs after proc0 do an
> > MPI_Recv(&nomproc,lennomproc,MPI_CHAR,i,tag,MPI_COMM_WORLD,&Stat);
> > from the remote process.
> >
> > I've done several test I'll mention:
> >
> > 1) Change the order on machinefile
> > radic@santacruz:~/gaps/caso3-i1$ cat /home/radic/mfile
> > chubut
> > santacruz
> >
> > In that case, I get this error:
> > [chubut:2194] *** An error occurred in MPI_Recv
> > [chubut:2194] *** on communicator MPI_COMM_WORLD
> > [chubut:2194] *** MPI_ERR_TRUNCATE: message truncated
> > [chubut:2194] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > and then
> >
> --
> > mpirun has exited due to process rank 0 with PID 2194 on
> > node chubut exiting without calling "finalize". This may
> > have caused other processes in the application to be
> > terminated by signals sent by mpirun (as reported here).
> >
> --
> >
> > 2) I've got the same error executing on host chubut intead of santacruz,
> > 3) a simple mpi programs like  MPI_Hello world are working fine, but I
> suppose that are very simple program.
> >
> > radic@santacruz:~/gaps$ mpirun -np 3 -machinefile /home/radic/mfile
> MPI_Hello
> > Hola Mundo Hola Marce 1
> > Hola Mundo Hola Marce 0
> > Hola Mundo Hola Marce 2
> >
> >
> > This is the information you ask for tuntime problem.
> > a) radic@santacruz:~$ mpirun -version
> > mpirun (Open MPI) 1.4.1
> > b) i'm using ubuntu 10,04. I'm installing the packages using apt-get
> install, so, I don't have a config.log
> > c) The ompi_info --all is on the file ompi_info.zip
> > d) These are PATH and LD_LIBRARY_PATH
> > radic@santacruz:~$ echo $PATH
> > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
> > radic@santacruz:~$ echo $LD_LIBRARY_PATH
> >
> >
> > Thank you very much.
> >
> > Marcela.
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] runtime error

2011-02-11 Thread Marcela Castro León
Excuse me. I forgot the attachments.

2011/2/11 Marcela Castro León 

> Hello:
>
> I have the same version of Ubuntu, 10.04, on both. The original version was Ubuntu
> Server 9.1 (64) and I upgraded both of them to 10.04.
> Yesterday I updated and upgraded to the same level again, but I got
> the same error after that.
> The machines are exactly the same, HP Compaq with an Intel Core i5.
>
> Anyway, I've compared the versions of openmpi and gcc, and they are the same too:
> 1.4.1-2 and 4.4.4.3 respectively. I'm attaching the output of dpkg -l on the
> two systems.
>
> I would appreciate a lot any help to solve it.
> Thank you.
>
> Marcela.
> 2011/2/10 Jeff Squyres 
>
> I typically see these kinds of errors when there's an Open MPI version
>> mismatch between the nodes, and/or if there are slightly different flavors
>> of Linux installed on each node (i.e., you're technically in a heterogeneous
>> situation, but you're trying to run a single application binary).  Can you
>> verify:
>>
>> 1. that you have exactly the same version of Open MPI installed on all
>> nodes?  (and that your application was compiled against that exact version)
>>
>> 2. that you have exactly the same OS/update level installed on all nodes
>> (e.g., same versions of glibc, etc.)
>>
>>
>> On Feb 10, 2011, at 3:13 AM, Marcela Castro León wrote:
>>
>> > Hello
>> > I've a program that allways works fine, but i'm trying it on a new
>> cluster and fails when I execute it on more than one machine.
>> > I mean, if I execute alone on each host, everything works fine.
>> > radic@santacruz:~/gaps/caso3-i1$ mpirun -np 3 ../test parcorto.txt
>> >
>> > But when I execute
>> > radic@santacruz:~/gaps/caso3-i1$ mpirun -np 3 -machinefile
>> /home/radic/mfile ../test parcorto.txt
>> >
>> > I get this error:
>> >
>> > mpirun has exited due to process rank 0 with PID 2132 on
>> > node santacruz exiting without calling "finalize". This may
>> > have caused other processes in the application to be
>> > terminated by signals sent by mpirun (as reported here).
>> >
>> --
>> >
>> > Though the machinefile (mfile) had only one machine, the programs fails.
>> > This is the current content:
>> >
>> > radic@santacruz:~/gaps/caso3-i1$ cat /home/radic/mfile
>> > santacruz
>> > chubut
>> >
>> > I've debug the program and the error occurs after proc0 do an
>> > MPI_Recv(&nomproc,lennomproc,MPI_CHAR,i,tag,MPI_COMM_WORLD,&Stat);
>> > from the remote process.
>> >
>> > I've done several test I'll mention:
>> >
>> > 1) Change the order on machinefile
>> > radic@santacruz:~/gaps/caso3-i1$ cat /home/radic/mfile
>> > chubut
>> > santacruz
>> >
>> > In that case, I get this error:
>> > [chubut:2194] *** An error occurred in MPI_Recv
>> > [chubut:2194] *** on communicator MPI_COMM_WORLD
>> > [chubut:2194] *** MPI_ERR_TRUNCATE: message truncated
>> > [chubut:2194] *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
>> > and then
>> >
>> --
>> > mpirun has exited due to process rank 0 with PID 2194 on
>> > node chubut exiting without calling "finalize". This may
>> > have caused other processes in the application to be
>> > terminated by signals sent by mpirun (as reported here).
>> >
>> --
>> >
>> > 2) I've got the same error executing on host chubut intead of santacruz,
>> > 3) a simple mpi programs like  MPI_Hello world are working fine, but I
>> suppose that are very simple program.
>> >
>> > radic@santacruz:~/gaps$ mpirun -np 3 -machinefile /home/radic/mfile
>> MPI_Hello
>> > Hola Mundo Hola Marce 1
>> > Hola Mundo Hola Marce 0
>> > Hola Mundo Hola Marce 2
>> >
>> >
>> > This is the information you ask for tuntime problem.
>> > a) radic@santacruz:~$ mpirun -version
>> > mpirun (Open MPI) 1.4.1
>> > b) i'm using ubuntu 10,04. I'm installing the packages using apt-get
>> install, so, I don't have a config.log
>> > c) The ompi_info --all is on the file ompi_info.zip
>> > d) These are PATH and LD_LIBRARY_PATH
>> > radic@santacruz:~$ echo $PATH
>> > /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
>> > radic@santacruz:~$ echo $LD_LIBRARY_PATH
>> >
>> >
>> > Thank you very much.
>> >
>> > Marcela.
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


scgcc
Description: Binary data


scompi
Description: Binary data


chgcc
Description: Binary data


chompi
Description: Binary data


[OMPI users] Collective communication API

2011-02-11 Thread Bibrak Qamar
I want to know if there is any other implementation of collective
communication (reduce and Bcast) available apart from what openMPI
provides.


Thanks

Bibrak Qamar
Undergraduate Student BIT-9
Member Center for High Performance Scientific Computing
NUST-School of Electrical Engineering and Computer Science.


Re: [OMPI users] Totalview not showing main program on startup with OpenMPI 1.3.x and 1.4.x

2011-02-11 Thread Terry Dontje
Sorry I have to ask this, but did you build your latest OMPI version, not 
just the application, with the -g flag too?


IIRC, when I ran into this issue I was actually able to do stepi's and 
eventually pop up the stack however that is really no way to debug a 
program :-).


Unless OMPI is somehow trashing the stack I don't see what OMPI could be 
doing to cause this type of an issue.  Again, when I ran into this issue, 
known working programs still worked; I just was unable to get a full 
stack.  So it was definitely an interfacing issue between TotalView and 
the executable (or the result of how the executable and libraries were 
compiled).  Another thing I noticed was that when using Solaris Studio dbx I 
was also able to see the full stack where I could not when using 
TotalView.  I am not sure if gdb could also see the full stack or not, but 
it might be worth a try to attach gdb to a running program and see if 
you get a full stack.


--td


On 02/09/2011 05:35 PM, Dennis McRitchie wrote:


Thanks Terry.

Unfortunately, -fno-omit-frame-pointer is the default for the Intel 
compiler when -g is used, which I am using since it is necessary for 
source level debugging. So the compiler kindly tells me that it is 
ignoring your suggested option when I specify it. :)


Also, since I can reproduce this problem by simply changing the 
OpenMPI version, without changing the compiler version, it strikes me 
as being more likely to be an OpenMPI-related issue: 1.2.8 works, but 
anything later does not (as described below).


I have tried different versions of TotalView from 8.1 to 8.9, but all 
behave the same.


I was wondering if a change to the openmpi-totalview.tcl script might 
be needed?


Dennis

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
On Behalf Of Terry Dontje

Sent: Wednesday, February 09, 2011 5:02 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Totalview not showing main program on 
startup with OpenMPI 1.3.x and 1.4.x


This sounds like something I ran into some time ago that involved the 
compiler omitting frame pointers.  You may want to try to compile your 
code with -fno-omit-frame-pointer.  I am unsure if you may need to do 
the same while building MPI though.


--td

On 02/09/2011 02:49 PM, Dennis McRitchie wrote:

Hi,
  
I'm encountering a strange problem and can't find it having been discussed on this mailing list.
  
When building and running my parallel program using any recent Intel compiler and OpenMPI 1.2.8, TotalView behaves entirely correctly, displaying the "Process mpirun is a parallel job. Do you want to stop the job now?" dialog box, and stopping at the start of the program. The code displayed is the source code of my program's function main, and the stack trace window shows that we are stopped in the poll function many levels "up" from my main function's call to MPI_Init. I can then set breakpoints, single step, etc., and the code runs appropriately.
  
But when building and running using Intel compilers with OpenMPI 1.3.x or 1.4.x, TotalView displays the usual dialog box, and stops at the start of the program; but my main program's source code is *not* displayed. The stack trace window again shows that we are stopped in the poll function several levels "up" from my main function's call to MPI_Init; but this time, the code displayed is the assembler code for the poll function itself.
  
If I click on 'main' in the stack trace window, the source code for my program's function main is then displayed, and I can now set breakpoints, single step, etc. as usual.
  
So why is the program's source code not displayed when using 1.3.x and 1.4.x, but is displayed when using 1.2.8. This change in behavior is fairly confusing to our users, and it would be nice to have it work as it used to, if possible.
  
Thanks,

Dennis
  
Dennis McRitchie

Computational Science and Engineering Support (CSES)
Academic Services Department
Office of Information Technology
Princeton University
  
  
___

users mailing list
us...@open-mpi.org  
http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 





[OMPI users] MPI_Win_create with size=0 expose memory anyway

2011-02-11 Thread Patrick Le Dot
Hi all,
I am testing the one-sided message passing (MPI_Put, MPI_Get)
and it seems to me that the size parameter of MPI_Win_create()
is definitely not taken into account.

Then I can put/get messages using a window created with size=0
(or put/get beyond any other limit between 0 and the original buffer size).

I know that size=0 is not a usual practice, but the man page says:
"A process may elect to expose no memory by specifying size = 0."

I might still have misunderstood something?

Hereafter is a simple test to reproduce the problem with Open MPI 1.5.

Thx,
Patrick


/*
 * compilation :
 * mpicc -o a.out a.c
 *
 * execution :
 * srun --resv-ports -n2 -N2 a.out
 *
 */

#include "mpi.h"

#define SIZE_10 10
#define RANK_1   1

int main(int argc, char *argv[]) {
  int i, rank, nprocs, A[SIZE_10], B[SIZE_10];
  MPI_Win win;
  int errs = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (nprocs != 2) {
printf("%d is a bad process number : srun -n2 -N2 a.out \n", nprocs);
MPI_Finalize();
return -1;
  }

  for (i=0; i

Re: [OMPI users] libmpi.so.0 not found during gdb debugging

2011-02-11 Thread Prentice Bisbal
swagat mishra wrote:
> hello everyone,
> i have a network of systems connected over lan with each computer
> running ubuntu. openmpi 1.4.x is installed on 1 machine and the
> installation is mounted on other nodes through Networking File
> System(NFS). the source program and compiled file(a.out) are present in
> the mounted directory
> i run my programs by the following command:
> /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ --hostfile
> hostfile a.out
> i have not set LD_LIBRARY_PATH but as i use --prefix mpirun works
> successfully
>  
> however as per the open mpi debugging faq:
> http://www.open-mpi.org/faq/?category=debugging
> when i run
> /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ --hostfile
> hostfile -x DISPLAY=10.0.0.1:0.0 xterm -e gdb a.out
>  
> 4 xterm windows are opened with gdb running as expected. however when i
> give the command start to gdb in the windows corresponding to remote
> nodes, i get the error:
> libmpi.so.0 not found: no such file/directory
>  
> as mentioned other mpi jobs run fine with mpirun
>  
> when i execute
> /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ -x
> DISPLAY=10.0.0.1:0.0 xterm -e gdb a.out ,the debugging continues succesfully
>  
> please help
> 

You need to set LD_LIBRARY_PATH to include the path to the OpenMPI
libraries. The --prefix option works for OpenMPI only; it has no effect
on other programs. You also need to make sure that the LD_LIBRARY_PATH
variable is correctly passed along to the other OpenMPI programs. For
processes on other hosts, this is usually done by editing your shell's
rc file for non-interactive logins (.bash_profile for bash).

-- 
Prentice



Re: [OMPI users] libmpi.so.0 not found during gdb debugging

2011-02-11 Thread swagat mishra
Yes, setting LD_LIBRARY_PATH solved the problem.
Thanks for the help.
On Fri, Feb 11, 2011 at 7:14 PM, Prentice Bisbal  wrote:

>  swagat mishra wrote:
> > hello everyone,
> > i have a network of systems connected over lan with each computer
> > running ubuntu. openmpi 1.4.x is installed on 1 machine and the
> > installation is mounted on other nodes through Networking File
> > System(NFS). the source program and compiled file(a.out) are present in
> > the mounted directory
> > i run my programs by the following command:
> > /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ --hostfile
> > hostfile a.out
> > i have not set LD_LIBRARY_PATH but as i use --prefix mpirun works
> > successfully
> >
> > however as per the open mpi debugging faq:
> > http://www.open-mpi.org/faq/?category=debugging
> > when i run
> > /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ --hostfile
> > hostfile -x DISPLAY=10.0.0.1:0.0 xterm -e gdb a.out
> >
> > 4 xterm windows are opened with gdb running as expected. however when i
> > give the command start to gdb in the windows corresponding to remote
> > nodes, i get the error:
> > libmpi.so.0 not found: no such file/directory
> >
> > as mentioned other mpi jobs run fine with mpirun
> >
> > when i execute
> > /opt/project/bin/mpirun -np 4 --prefix  /opt/project/ -x
> > DISPLAY=10.0.0.1:0.0 xterm -e gdb a.out ,the debugging continues
> succesfully
> >
> > please help
> >
>
> You need to set LD_LIBRARY_PATH to include the path to the OpenMPI
> libraries. The --prefix option works for OpenMPI only; it has no effect
> on other programs. You also need to make sure that the LD_LIBRARY_PATH
> variable is correctly passed along to the other OpenMPI programs. For
> processes on other hosts, this is usually done by editing your shell's
> rc file for non-interactive logins (.bash_profile for bash).
>
> --
> Prentice
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Win_create with size=0 expose memory anyway

2011-02-11 Thread Barrett, Brian W
Patrick -

Your program is erroneous, so the behavior of the MPI library is not defined.  The 
default implementation of RMA in Open MPI uses active-message-like semantics 
to deliver the message locally, and does not do bounds checking, so the error 
was not caught.
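
For what it's worth, here is a minimal sketch of a conforming variant (my own illustration, assuming the intent is to actually expose the A buffer rather than to test size = 0): create the window with the real buffer size, so that the 10-int MPI_Get stays inside the exposed region.

/*
 * Hypothetical corrected variant (not Patrick's original): the window
 * exposes all of A, so the SIZE_10-int MPI_Get is within bounds.
 * Run with two processes, e.g. srun --resv-ports -n2 -N2 a.out
 */
#include <stdio.h>
#include "mpi.h"

#define SIZE_10 10
#define RANK_1   1

int main(int argc, char *argv[]) {
  int i, rank, A[SIZE_10], B[SIZE_10];
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < SIZE_10; i++) {
    A[i] = i + 1;
    B[i] = -(i + 1);
  }

  /* size = SIZE_10 * sizeof(int) instead of size = 0 */
  MPI_Win_create(A, SIZE_10 * sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  if (rank == 0) {
    MPI_Win_lock(MPI_LOCK_SHARED, RANK_1, 0, win);
    MPI_Get(B, SIZE_10, MPI_INT, RANK_1, 0, SIZE_10, MPI_INT, win);
    MPI_Win_unlock(RANK_1, win);
    for (i = 0; i < SIZE_10; i++)
      if (B[i] != i + 1)
        printf("[%d] MPI_Get error: B[%d]=%d\n", rank, i, B[i]);
  }

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}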

Brian


On Feb 11, 2011, at 5:41 AM,  wrote:

> Hi all,
> I am testing the one-sided message passing (MPI_Put, MPI_Get)
> and it seems to me that the size parameter of MPI_Win_create()
> is definitly not taken into account.
> 
> Then I can put/get messages using a window created with size=0
> (or put/get after any others limits between 0 and the original buffer size).
> 
> I know that size=0 is not an usual practice but the man page say :
> "A process may elect to expose no memory by specifying size = 0."
> 
> I might still have misunderstood something ?
> 
> Hereafter a simple test to reproduce the  with Open MPI 1.5
> 
> Thx,
> Patrick
> 
> 
> /*
> * compilation :
> * mpicc -o a.out a.c
> *
> * execution :
> * srun --resv-ports -n2 -N2 a.out
> *
> */
> 
> #include "mpi.h"
> 
> #define SIZE_10 10
> #define RANK_1   1
> 
> int main(int argc, char *argv[]) {
>  int i, rank, nprocs, A[SIZE_10], B[SIZE_10];
>  MPI_Win win;
>  int errs = 0;
> 
>  MPI_Init(&argc, &argv);
>  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
>  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> 
>  if (nprocs != 2) {
>printf("%d is a bad process number : srun -n2 -N2 a.out \n", nprocs);
>MPI_Finalize();
>return -1;
>  }
> 
>  for (i=0; i<SIZE_10; i++) {
>A[i] = i+1;
>B[i] = (-1)*(i+1);
>  }
> 
>  printf("[%d] create a window on A[] with size=0 \n", rank);
>  MPI_Win_create(A, 0, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);
> 
>  if (rank == 0) {
>printf("[%d] call MPI_Get(B, %d, ...) \n", rank, SIZE_10);
>MPI_Win_lock(MPI_LOCK_SHARED, RANK_1, 0, win);
>MPI_Get(B, SIZE_10, MPI_INT, RANK_1, 0, SIZE_10, MPI_INT, win);
>MPI_Win_unlock(RANK_1, win);
> 
>for (i=0; i<SIZE_10; i++) {
>  if (B[i] != i+1) {
>printf("[%d] MPI_Get error: B[%d]=%d, should be %d \n", rank, i, B[i], i+1);
>errs++;
>  }
>}
> 
>if (errs == 0) {
>  printf("[%d] No Error! \n", rank);
>}
> 
>  }
> 
>  MPI_Barrier(MPI_COMM_WORLD);
>  MPI_Win_free(&win);
>  MPI_Finalize();
> }
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 





Re: [OMPI users] Collective communication API

2011-02-11 Thread David Zhang
there is alltoall, scatter, gather, and many more. check out
https://computing.llnl.gov/tutorials/mpi/#Collective_Communication_Routines
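
If what you are really after is rolling your own collective algorithms, a broadcast can also be written by hand on top of point-to-point calls. A minimal (linear, unoptimized) sketch, just to illustrate the idea:

/*
 * Illustration only: a naive linear broadcast built from point-to-point.
 * MPI_Bcast (and Open MPI's tuned collective components) will normally
 * perform much better; this just shows it can be done by hand.
 */
#include "mpi.h"

void my_bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
  int rank, size, i;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);

  if (rank == root) {
    for (i = 0; i < size; i++)
      if (i != root)
        MPI_Send(buf, count, type, i, 0, comm);   /* root sends to every other rank */
  } else {
    MPI_Recv(buf, count, type, root, 0, comm, MPI_STATUS_IGNORE);
  }
}

A tree-shaped version scales much better, and note that Open MPI itself ships several collective algorithms in its coll framework (e.g. the basic and tuned components), selectable through MCA parameters without changing your application.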

On Fri, Feb 11, 2011 at 3:26 AM, Bibrak Qamar  wrote:

> I want to know, if there is any other implementation of collective
> communication ( reduce and Bcast) available apart from what openMPI
> provides.
>
>
> Thanks
>
> Bibrak Qamar
> Undergraduate Student BIT-9
> Member Center for High Performance Scientific Computing
> NUST-School of Electrical Engineering and Computer Science.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
David Zhang
University of California, San Diego


Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Gus Correa

Hi Tena

Since root can but you can't,
is it a directory permission problem perhaps?
Check the execution directory permissions (on both machines,
if this is not an NFS-mounted dir).
I am not sure, but IIRC OpenMPI also uses /tmp for
under-the-hood stuff, so it is worth checking permissions there also.
Just a naive guess.

Congrats for all the progress with the cloudy MPI!

Gus Correa

Tena Sakai wrote:

Hi,

I have made a bit more progress.  I think I can say ssh authenti-
cation problem is behind me now.  I am still having a problem running
mpirun, but the latest discovery, which I can reproduce, is that
I can run mpirun as root.  Here's the session log:

  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
  total 16
  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
  [tsakai@ip-10-100-243-195 ~]$ hostname
  ip-10-100-243-195
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ exit
  logout
  Connection to ip-10-100-243-195.ec2.internal closed.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ hostname
  ip-10-195-198-31
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it as root
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ sudo su
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l /root/.ssh/config
  -rw--- 1 root root 103 Feb 11 00:56 /root/.ssh/config
  bash-3.2#
  bash-3.2# cat /root/.ssh/config
  Host *
  IdentityFile /root/.ssh/.derobee/.kagi
  IdentitiesOnly yes
  BatchMode yes
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  bash-3.2#
  bash-3.2# # now is the time for mpirun
  bash-3.2#
  bash-3.2# mpirun --app ./app.ac
  13 ip-10-100-243-195
  21 ip-10-100-243-195
  5 ip-10-195-198-31
  8 ip-10-195-198-31
  bash-3.2#
  bash-3.2# # It works (being root)!
  bash-3.2#
  bash-3.2# exit
  exit
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ exit
  logout
  [tsakai@vixen ec2]$

So, why does it say "pipe function call failed when setting up
I/O forwarding subsystem Node: ip-10-195-198-31" ?
The node it is referring to is not the remote machine.  It is
What I call machine A.  I first thought maybe this is a problem
With PATH variable.  But I don't think so.  I compared root's
Path to that of tsaki's and made them id

Re: [OMPI users] Totalview not showing main program on startup with OpenMPI 1.3.x and 1.4.x

2011-02-11 Thread Dennis McRitchie
Hi Terry,

Someone else at the University builds the packages that I use, and we've been 
experimenting for the last few days with different openmpi build options to see 
what might be causing this.

Re the stack, I can always see the entire stack in the TV stack pane, and I can 
always click on 'main' in the stack pane and thereby make my main program's 
source code appear. I can then debug as usual. But, as you said, this is still 
no way to debug a program...

The only thing that might point the finger at OpenMPI is that the same build 
options led to different behavior when running with OpenMPI 1.2.8 vs. anything 
later. But I imagine that it will turn out to be related to the availability 
(or the lack thereof) of OpenMPI symbols to TotalView as to whether it thinks 
it should be displaying assembler or not.

I'll keep you posted with our progress.

Thanks for the tips.

Dennis

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Terry Dontje
Sent: Friday, February 11, 2011 6:38 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Totalview not showing main program on startup with 
OpenMPI 1.3.x and 1.4.x

Sorry I have to ask this, did you build your lastest OMPI version, not just the 
application, with the -g flag too.

IIRC, when I ran into this issue I was actually able to do stepi's and 
eventually pop up the stack however that is really no way to debug a program 
:-).

Unless OMPI is somehow trashing the stack I don't see what OMPI could be doing 
to cause this type of an issue.  Again when I ran into this issue known working 
programs still worked I just was unable to get a full stack.  So it was 
definitely an interfacing issue between totalview and the executable (or the 
result of how the executable and libraries were compiled).   Another thing I 
noticed was when using Solaris Studio dbx I was also able to see the full stack 
where I could not when using totaview.  I am not sure if gdb could also see the 
full stack or not but it might be worth a try to attach gdb to a running 
program and see if you get a full stack.

--td


On 02/09/2011 05:35 PM, Dennis McRitchie wrote:
Thanks Terry.

Unfortunately, -fno-omit-frame-pointer is the default for the Intel compiler 
when -g  is used, which I am using since it is necessary for source level 
debugging. So the compiler kindly tells me that it is ignoring your suggested 
option when I specify it.  :)

Also, since I can reproduce this problem by simply changing the OpenMPI 
version, without changing the compiler version, it strikes me as being more 
likely to be an OpenMPI-related issue: 1.2.8 works, but anything later does not 
(as described below).

I have tried different versions of TotalView from 8.1 to 8.9, but all behave 
the same.

I was wondering if a change to the openmpi-totalview.tcl script might be needed?

Dennis


From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Terry Dontje
Sent: Wednesday, February 09, 2011 5:02 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Totalview not showing main program on startup with 
OpenMPI 1.3.x and 1.4.x

This sounds like something I ran into some time ago that involved the compiler 
omitting frame pointers.  You may want to try to compile your code with 
-fno-omit-frame-pointer.  I am unsure if you may need to do the same while 
building MPI though.

--td

On 02/09/2011 02:49 PM, Dennis McRitchie wrote:

Hi,



I'm encountering a strange problem and can't find it having been discussed on 
this mailing list.



When building and running my parallel program using any recent Intel compiler 
and OpenMPI 1.2.8, TotalView behaves entirely correctly, displaying the 
"Process mpirun is a parallel job. Do you want to stop the job now?" dialog 
box, and stopping at the start of the program. The code displayed is the source 
code of my program's function main, and the stack trace window shows that we 
are stopped in the poll function many levels "up" from my main function's call 
to MPI_Init. I can then set breakpoints, single step, etc., and the code runs 
appropriately.



But when building and running using Intel compilers with OpenMPI 1.3.x or 
1.4.x, TotalView displays the usual dialog box, and stops at the start of the 
program; but my main program's source code is *not* displayed. The stack trace 
window again shows that we are stopped in the poll function several levels "up" 
from my main function's call to MPI_Init; but this time, the code displayed is 
the assembler code for the poll function itself.



If I click on 'main' in the stack trace window, the source code for my 
program's function main is then displayed, and I can now set breakpoints, 
single step, etc. as usual.



So why is the program's source code not displayed when using 1.3.x and 1.4.x, 
but is displayed when using 1.2.8. This change in behavior is fairly confusing 
to our users, and it would be nice to have

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Jeff Squyres (jsquyres)
It is concerning if the pipe system call fails - I can't think of why that 
would happen. That's not usually a permissions issue but rather a deeper 
indication that something is either seriously wrong on your system or you are 
running out of file descriptors. Are file descriptors limited on a per-process 
basis, perchance?
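
For illustration only (this is not Open MPI code), pipe(2) starts failing with EMFILE once a process exhausts its descriptor limit, which is the kind of failure that error message reflects:

/* Sketch: keep creating pipes until the per-process fd limit is hit,
 * then show the errno that pipe(2) reports (typically EMFILE). */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
  int fds[2], n = 0;
  while (pipe(fds) == 0)   /* each successful call consumes two descriptors */
    n += 2;
  printf("pipe() failed after %d descriptors: %s\n", n, strerror(errno));
  return 0;
}

Comparing "ulimit -n" for root and for tsakai would be a quick first check.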

Sent from my PDA. No type good. 

On Feb 11, 2011, at 10:08 AM, "Gus Correa"  wrote:

> Hi Tena
> 
> Since root can but you can't,
> is is a directory permission problem perhaps?
> Check the execution directory permission (on both machines,
> if this is not NFS mounted dir).
> I am not sure, but IIRR OpenMPI also uses /tmp for
> under-the-hood stuff, worth checking permissions there also.
> Just a naive guess.
> 
> Congrats for all the progress with the cloudy MPI!
> 
> Gus Correa
> 
> Tena Sakai wrote:
>> Hi,
>> I have made a bit more progress.  I think I can say ssh authenti-
>> cation problem is behind me now.  I am still having a problem running
>> mpirun, but the latest discovery, which I can reproduce, is that
>> I can run mpirun as root.  Here's the session log:
>>  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ ll
>>  total 8
>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>  total 16
>>  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
>>  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>  [tsakai@ip-10-100-243-195 ~]$ hostname
>>  ip-10-100-243-195
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ ll
>>  total 8
>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ exit
>>  logout
>>  Connection to ip-10-100-243-195.ec2.internal closed.
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ hostname
>>  ip-10-195-198-31
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>  --
>>  mpirun was unable to launch the specified application as it encountered an
>> error:
>>  Error: pipe function call failed when setting up I/O forwarding subsystem
>>  Node: ip-10-195-198-31
>>  while attempting to start process rank 0.
>>  --
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ # try it as root
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ sudo su
>>  bash-3.2#
>>  bash-3.2# pwd
>>  /home/tsakai
>>  bash-3.2#
>>  bash-3.2# ls -l /root/.ssh/config
>>  -rw--- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>  bash-3.2#
>>  bash-3.2# cat /root/.ssh/config
>>  Host *
>>  IdentityFile /root/.ssh/.derobee/.kagi
>>  IdentitiesOnly yes
>>  BatchMode yes
>>  bash-3.2#
>>  bash-3.2# pwd
>>  /home/tsakai
>>  bash-3.2#
>>  bash-3.2# ls -l
>>  total 8
>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>  bash-3.2#
>>  bash-3.2# # now is the time for mpirun
>>  bash-3.2#
>>  bash-3.2# mpirun --app ./app.ac
>>  13 ip-10-100-243-195
>>  21 ip-10-100-243-195
>>  5 ip-10-195-198-31
>>  8 ip-10-195-198-31
>>  bash-3.2#
>>  bash-3.2# # It works (being root)!
>>  bash-3.2#
>>  bash-3.2# exit
>>  exit
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
>>  --
>>  mpirun was unable to launch the specified application as it encountered an
>> error:
>>  Error: pipe function call failed when setting up I/O forwarding subsystem
>>  Node: ip-10-195

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Tena Sakai
Hi Jeff,
Hi Gus,

Thanks for your replies.

I have pretty much ruled out PATH issues by setting tsakai's PATH
as identical to that of root.  In that setting I reproduced the
same result as before: root can run mpirun correctly and tsakai
cannot.

I have also checked out permission on /tmp directory.  tsakai has
no problem creating files under /tmp.

I am trying to come up with a strategy to show that each and every
program in the PATH has "world" executable permission.  It is a
stone to turn over, but I am not holding my breath.

> ... you are running out of file descriptors. Are file descriptors
> limited on a per-process basis, perchance?

I have never heard of such a restriction on Amazon EC2.  There
are folks who keep running instances for a long, long time.  Whereas
in my case, I launch 2 instances, check things out, and then turn
the instances off.  (Given that the state of California has huge
debts, our funding is very tight.)  So, I really doubt that's the
case.  I have run mpirun unsuccessfully as user tsakai and immediately
after successfully as root.  Still, I would be happy if you could tell
me a way to tell the number of file descriptors used or remaining.

Your mentioning file descriptors made me think of something under
/dev.  But I don't know exactly what I am fishing for.  Do you have
some suggestions?

I wish I could reproduce this (weird) behavior on a different
set of machines.  I certainly cannot in my local environment.  Sigh!

Regards,

Tena


On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)"  wrote:

> It is concerning if the pipe system call fails - I can't think of why that
> would happen. Thats not usually a permissions issue but rather a deeper
> indication that something is either seriously wrong on your system or you are
> running out of file descriptors. Are file descriptors limited on a per-process
> basis, perchance?
>
> Sent from my PDA. No type good.
>
> On Feb 11, 2011, at 10:08 AM, "Gus Correa"  wrote:
>
>> Hi Tena
>>
>> Since root can but you can't,
>> is is a directory permission problem perhaps?
>> Check the execution directory permission (on both machines,
>> if this is not NFS mounted dir).
>> I am not sure, but IIRR OpenMPI also uses /tmp for
>> under-the-hood stuff, worth checking permissions there also.
>> Just a naive guess.
>>
>> Congrats for all the progress with the cloudy MPI!
>>
>> Gus Correa
>>
>> Tena Sakai wrote:
>>> Hi,
>>> I have made a bit more progress.  I think I can say ssh authenti-
>>> cation problem is behind me now.  I am still having a problem running
>>> mpirun, but the latest discovery, which I can reproduce, is that
>>> I can run mpirun as root.  Here's the session log:
>>>  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ll
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>>  total 16
>>>  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>>  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
>>>  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>>  [tsakai@ip-10-100-243-195 ~]$ hostname
>>>  ip-10-100-243-195
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ ll
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ exit
>>>  logout
>>>  Connection to ip-10-100-243-195.ec2.internal closed.
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ hostname
>>>  ip-10-195-198-31
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>  --
>>>  mpirun was unable to launch the specified application as it encountered an
>>> error:
>>>  Error: pipe function call failed when setting up

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Gus Correa

Hi Tena

Please read one answer inline.

Tena Sakai wrote:

Hi Jeff,
Hi Gus,

Thanks for your replies.

I have pretty much ruled out PATH issues by setting tsakai's PATH
as identical to that of root.  In that setting I reproduced the
same result as before: root can run mpirun correctly and tsakai
cannot.

I have also checked out permission on /tmp directory.  tsakai has
no problem creating files under /tmp.

I am trying to come up with a strategy to show that each and every
programs in the PATH has "world" executable permission.  It is a
stone to turn over, but I am not holding my breath.


... you are running out of file descriptors. Are file descriptors
limited on a per-process basis, perchance?


I have never heard there is such restriction on Amazon EC2.  There
are folks who keep running instances for a long, long time.  Whereas
in my case, I launch 2 instances, check things out, and then turn
the instances off.  (Given that the state of California has a huge
debts, our funding is very tight.)  So, I really doubt that's the
case.  I have run mpirun unsuccessfully as user tsakai and immediately
after successfully as root.  Still, I would be happy if you can tell
me a way to tell number of file descriptors used or remmain.

Your mentioned file descriptors made me think of something under
/dev.  But I don't know exactly what I am fishing.  Do you have
some suggestions?



1) If the environment has anything to do with Linux,
check:

cat /proc/sys/fs/file-nr /proc/sys/fs/file-max


or

sysctl -a |grep fs.file-max

This max can be set (fs.file-max=whatever_is_reasonable)
in /etc/sysctl.conf

See 'man sysctl' and 'man sysctl.conf'

2) Another possible source of limits.

Check "ulimit -a" (bash) or "limit" (tcsh).

If you need to change look at:

/etc/security/limits.conf

(See also 'man limits.conf')

**

Since "root can but Tena cannot",
I would check 2) first,
as they are the 'per user/per group' limits,
whereas 1) is kernel/system-wise.
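
If you want to see the limit exactly as a process sees it, here is a tiny
sketch (plain C, nothing OpenMPI-specific) that prints the per-process
open-file limit, i.e. what "ulimit -n" and limits.conf control:

/* Sketch: print the soft and hard per-process open-file limits. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rlimit rl;
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
    printf("open files: soft=%ld hard=%ld\n",
           (long)rl.rlim_cur, (long)rl.rlim_max);
  return 0;
}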

I hope this helps,
Gus Correa

PS - I know you are a wise and careful programmer,
but here we had cases of programs that would
fail because of too many files that were open and never closed,
eventually exceeding the max available/permissible.
So, it does happen.


I wish I could reproduce this (weired) behavior on a different
set of machines.  I certainly cannot in my local environment.  Sigh!

Regards,

Tena


On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)"  wrote:


It is concerning if the pipe system call fails - I can't think of why that
would happen. Thats not usually a permissions issue but rather a deeper
indication that something is either seriously wrong on your system or you are
running out of file descriptors. Are file descriptors limited on a per-process
basis, perchance?

Sent from my PDA. No type good.

On Feb 11, 2011, at 10:08 AM, "Gus Correa"  wrote:


Hi Tena

Since root can but you can't,
is is a directory permission problem perhaps?
Check the execution directory permission (on both machines,
if this is not NFS mounted dir).
I am not sure, but IIRR OpenMPI also uses /tmp for
under-the-hood stuff, worth checking permissions there also.
Just a naive guess.

Congrats for all the progress with the cloudy MPI!

Gus Correa

Tena Sakai wrote:

Hi,
I have made a bit more progress.  I think I can say ssh authenti-
cation problem is behind me now.  I am still having a problem running
mpirun, but the latest discovery, which I can reproduce, is that
I can run mpirun as root.  Here's the session log:
 [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
 Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
 [tsakai@ip-10-195-198-31 ~]$
 [tsakai@ip-10-195-198-31 ~]$ ll
 total 8
 -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
 -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
 [tsakai@ip-10-195-198-31 ~]$
 [tsakai@ip-10-195-198-31 ~]$ ll .ssh
 total 16
 -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
 -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
 -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
 -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
 [tsakai@ip-10-195-198-31 ~]$
 [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
 Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
 [tsakai@ip-10-100-243-195 ~]$ hostname
 ip-10-100-243-195
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$ ll
 total 8
 -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
 -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$ cat app.ac
 -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
 -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
 -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
 -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
 [tsakai@ip-10-100-243-195 ~]$
 [tsaka

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Tena Sakai
Hi Gus,

Thank you for your tips.

I didn't find any smoking gun or anything that comes close.
Here's the upshot:

  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 61504
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size(512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) 61504
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo su
  bash-3.2#
  bash-3.2# ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 61504
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size(512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) unlimited
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
  bash-3.2#
  bash-3.2#
  bash-3.2# ulimit -a > root_ulimit-a
  bash-3.2# exit
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
  14c14
  < max user processes  (-u) unlimited
  ---
  > max user processes  (-u) 61504
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr
/proc/sys/fs/file-max
  480 0   762674
  762674
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo su
  bash-3.2#
  bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
  512 0   762674
  762674
  bash-3.2# exit
  exit
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
  -bash: sysctl: command not found
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ /sbin/!!
  /sbin/sysctl -a |grep fs.file-max
  error: permission denied on key 'kernel.cad_pid'
  error: permission denied on key 'kernel.cap-bound'
  fs.file-max = 762674
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
  fs.file-max = 762674
  [tsakai@ip-10-114-239-188 ~]$

I see a bit of difference between root and tsakai, but I cannot
believe such a small difference results in the somewhat catastrophic
failure I have reported.  Would you agree with me?

Regards,

Tena

On 2/11/11 6:06 PM, "Gus Correa"  wrote:

> Hi Tena
>
> Please read one answer inline.
>
> Tena Sakai wrote:
>> Hi Jeff,
>> Hi Gus,
>>
>> Thanks for your replies.
>>
>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>> as identical to that of root.  In that setting I reproduced the
>> same result as before: root can run mpirun correctly and tsakai
>> cannot.
>>
>> I have also checked out permission on /tmp directory.  tsakai has
>> no problem creating files under /tmp.
>>
>> I am trying to come up with a strategy to show that each and every
>> programs in the PATH has "world" executable permission.  It is a
>> stone to turn over, but I am not holding my breath.
>>
>>> ... you are running out of file descriptors. Are file descriptors
>>> limited on a per-process basis, perchance?
>>
>> I have never heard there is such restriction on Amazon EC2.  There
>> are folks who keep running instances for a long, long time.  Whereas
>> in my case, I launch 2 instances, check things out, and then turn
>> the instances off.  (Given that the state of California has a huge
>> debts, our funding is very tight.)  So, I really doubt that's the
>> case.  I have run mpirun unsuccessfully as user tsakai and immediately
>> after successfully as root.  Still, I would be happy if you can tell
>> me a way to tell number of file descriptors used or remmain.
>>
>> Your mentioned file descriptors made me think of something under
>> /dev.  But I don't know exactly what I am fishing.  Do you have
>> some suggestions?
>>
>
> 1) If the environment has anything to do with Linux,
> check:
>
> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>
>
> or
>
> sysctl -a |grep fs.file-max
>
> This max can be set (fs.file-max=whatever_is_reasonable)
> in /etc/sysctl.conf
>
> See 'man sysctl' and 'man s

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Jeff Squyres (jsquyres)
Sounds about right. I'm not near a keyboard to check the reasons why pipe(2) 
would fail. 

Specifically, OMPI is failing when it is trying to set up stdin/stdout/stderr 
forwarding for your job. Very strange. 
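
One way to test the descriptor theory is to rerun the job under a
deliberately tiny open-file limit and see whether the same error comes
back; a rough sketch, assuming bash on the launch node (the value 16
is arbitrary):

  # lower the soft limit in a subshell only, then launch as before
  ( ulimit -n 16; mpirun -app app.ac )

If the "pipe function call failed" message reappears under the low
limit, descriptor exhaustion becomes a much more plausible culprit.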

Sent from my PDA. No type good. 

On Feb 11, 2011, at 9:56 PM, "Tena Sakai"  wrote:

> Hi Gus,
> 
> Thank you for your tips.
> 
> I didn't find any smoking gun or anything that comes close.
> Here's the upshot:
> 
>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
>  core file size  (blocks, -c) 0
>  data seg size   (kbytes, -d) unlimited
>  scheduling priority (-e) 0
>  file size   (blocks, -f) unlimited
>  pending signals (-i) 61504
>  max locked memory   (kbytes, -l) 32
>  max memory size (kbytes, -m) unlimited
>  open files  (-n) 1024
>  pipe size   (512 bytes, -p) 8
>  POSIX message queues (bytes, -q) 819200
>  real-time priority  (-r) 0
>  stack size  (kbytes, -s) 8192
>  cpu time   (seconds, -t) unlimited
>  max user processes  (-u) 61504
>  virtual memory  (kbytes, -v) unlimited
>  file locks  (-x) unlimited
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# ulimit -a
>  core file size  (blocks, -c) 0
>  data seg size   (kbytes, -d) unlimited
>  scheduling priority (-e) 0
>  file size   (blocks, -f) unlimited
>  pending signals (-i) 61504
>  max locked memory   (kbytes, -l) 32
>  max memory size (kbytes, -m) unlimited
>  open files  (-n) 1024
>  pipe size   (512 bytes, -p) 8
>  POSIX message queues (bytes, -q) 819200
>  real-time priority  (-r) 0
>  stack size  (kbytes, -s) 8192
>  cpu time   (seconds, -t) unlimited
>  max user processes  (-u) unlimited
>  virtual memory  (kbytes, -v) unlimited
>  file locks  (-x) unlimited
>  bash-3.2#
>  bash-3.2#
>  bash-3.2# ulimit -a > root_ulimit-a
>  bash-3.2# exit
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>  14c14
>  < max user processes  (-u) unlimited
>  ---
>> max user processes  (-u) 61504
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>  480 0   762674
>  762674
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>  512 0   762674
>  762674
>  bash-3.2# exit
>  exit
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>  -bash: sysctl: command not found
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ /sbin/!!
>  /sbin/sysctl -a |grep fs.file-max
>  error: permission denied on key 'kernel.cad_pid'
>  error: permission denied on key 'kernel.cap-bound'
>  fs.file-max = 762674
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>  fs.file-max = 762674
>  [tsakai@ip-10-114-239-188 ~]$
> 
> I see a bit of difference between root and tsakai, but I cannot
> believe such a small difference results in the kind of catastrophic
> failure I have reported.  Would you agree with me?
> 
> Regards,
> 
> Tena
> 
> On 2/11/11 6:06 PM, "Gus Correa"  wrote:
> 
>> Hi Tena
>> 
>> Please read one answer inline.
>> 
>> Tena Sakai wrote:
>>> Hi Jeff,
>>> Hi Gus,
>>> 
>>> Thanks for your replies.
>>> 
>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>> as identical to that of root.  In that setting I reproduced the
>>> same result as before: root can run mpirun correctly and tsakai
>>> cannot.
>>> 
>>> I have also checked out permission on /tmp directory.  tsakai has
>>> no problem creating files under /tmp.
>>> 
>>> I am trying to come up with a strategy to show that each and every
>>> program in the PATH has "world" executable permission.  It is a
>>> stone to turn over, but I am not holding my breath.
>>> 
 ... you are running out of file descriptors. Are file descriptors
 limited on a per-process basis, perchance?
>>> 
>>> I have never heard of such a restriction on Amazon EC2.  There
>>> are folks who keep instances running for a long, long time.  Whereas
>>> in my case, I launch 2 instances, check things out, and then turn
>>> the instances off.  (Given that the state of California has huge
>>> debts, our funding is very tight.)  So, I really doubt that's the
>>> case.  I have run mpirun unsuccessfully as user tsakai and immediately
>>> after successfully as root.  Still, I would be happy if you can tell
>>> me a way to tell number of file descriptors use

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Gus Correa

Hi Tena

We set up the cluster nodes to run MPI programs
with stacksize unlimited,
memlock unlimited,
4096 max open files,
to avoid crashing on edge cases.
This is kind of typical for HPC, MPI, number crunching.

However, some of the codes we run are quite big,
and from what you said yours is not (or not yet).

Your stack limit sounds quite small, but when
we had problems with the stack the result was a segmentation fault.
1024 open files is, I guess, the default for 32-bit Linux distributions,
but some programs break there.

If you want to do this, put these lines on the bottom
of /etc/security/limits.conf:

# End of file
*   -   memlock -1
*   -   stack   -1
*   -   nofile  4096
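
Those lines only take effect for new login sessions (they are applied
by pam_limits when the session starts), so log in again afterwards and
check; a minimal verification, assuming bash:

  # run in a fresh ssh session on each node
  ulimit -n   # open files, should now report 4096
  ulimit -s   # stack size, should report unlimited
  ulimit -l   # max locked memory, should report unlimited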

I don't think you should give an unlimited number of processes to
regular users; keep that privilege for root (which is the one limit
where the two accounts differ).

You may want to monitor /proc/sys/fs/file-nr while the program runs.
The first number is the actual number of open files.
Top or vmstat can also help you see how you are doing in terms of memory,
although you suggested these are (small?) test programs, unlikely to run
out of memory.
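
A crude way to watch both counters while the job runs, assuming pgrep
is available and a single mpirun instance, might be:

  # system-wide: allocated / unused / maximum file handles, once a second
  watch -n 1 cat /proc/sys/fs/file-nr
  # per-process: descriptors currently held by mpirun
  ls /proc/$(pgrep -n mpirun)/fd | wc -l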

If you are using two nodes, check the same stuff on the other node too.
Also, the IP number you checked now is not the same as in your
message with the MPI failure/errors.
Not sure if I understand which computers we're talking about,
or where these computers are (at Amazon?),
or whether they change from one session to the next,
or whether they are identical machines with the same limits or differ.

One of the error messages mentions LD_LIBRARY_PATH.
Is it set to point to the OpenMPI lib directory?
Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH to be set properly.
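
For reference, a typical way to set them is in a startup file that
non-interactive ssh logins read (e.g. ~/.bashrc); the prefix below is
only a placeholder, substitute wherever OpenMPI is actually installed
on your instances:

  # hypothetical install prefix; adjust to the real location
  export PATH=/usr/local/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
  # sanity check on every node
  which mpirun && mpirun --version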

I hope this helps, although I am afraid I may be missing the point.

Gus Correa

Tena Sakai wrote:

Hi Gus,

Thank you for your tips.

I didn't find any smoking gun or anything that comes close.
Here's the upshot:

  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 61504
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size   (512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) 61504
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo su
  bash-3.2#
  bash-3.2# ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 61504
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size   (512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) unlimited
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
  bash-3.2#
  bash-3.2#
  bash-3.2# ulimit -a > root_ulimit-a
  bash-3.2# exit
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
  14c14
  < max user processes  (-u) unlimited
  ---
  > max user processes  (-u) 61504
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
  480 0   762674
  762674
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo su
  bash-3.2#
  bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
  512 0   762674
  762674
  bash-3.2# exit
  exit
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
  -bash: sysctl: command not found
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ /sbin/!!
  /sbin/sysctl -a |grep fs.file-max
  error: permission denied on key 'kernel.cad_pid'
  error: permission denied on key 'kernel.cap-bound'
  fs.file-max = 762674
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
  fs.file-max = 762674
  [tsakai@ip-10-114-239-188 ~]$

I see a bit of difference between root and tsakai, but I cannot
believe such a small difference results in the kind of catastrophic
failure I have reported.  Would you agree with me?

Regards,

Tena

On