[OMPI users] Segmentation fault with SLURM and non-local nodes
Hi,

I'm not sure whether this problem is with SLURM or Open MPI, but the stack trace (below) points to an issue within Open MPI. Whenever I try to launch an MPI job under SLURM, mpirun immediately segfaults -- but only if the machine that SLURM allocated is different from the one from which I launched the job. If I force SLURM to allocate only the local node (i.e. the one on which salloc was called), everything works fine.

Failing case:

michael@ipc ~ $ salloc -n8 mpirun --display-map ./mpi

 ========================   JOB MAP   ========================

 Data for node: Name: ipc4   Num procs: 8
        Process OMPI jobid: [21326,1] Process rank: 0
        Process OMPI jobid: [21326,1] Process rank: 1
        Process OMPI jobid: [21326,1] Process rank: 2
        Process OMPI jobid: [21326,1] Process rank: 3
        Process OMPI jobid: [21326,1] Process rank: 4
        Process OMPI jobid: [21326,1] Process rank: 5
        Process OMPI jobid: [21326,1] Process rank: 6
        Process OMPI jobid: [21326,1] Process rank: 7

 =============================================================
[ipc:16986] *** Process received signal ***
[ipc:16986] Signal: Segmentation fault (11)
[ipc:16986] Signal code: Address not mapped (1)
[ipc:16986] Failing at address: 0x801328268
[ipc:16986] [ 0] /lib/libpthread.so.0(+0xf8f0) [0x7ff85c7638f0]
[ipc:16986] [ 1] /usr/lib/libopen-rte.so.0(+0x3459a) [0x7ff85d4a059a]
[ipc:16986] [ 2] /usr/lib/libopen-pal.so.0(+0x1eeb8) [0x7ff85d233eb8]
[ipc:16986] [ 3] /usr/lib/libopen-pal.so.0(opal_progress+0x99) [0x7ff85d228439]
[ipc:16986] [ 4] /usr/lib/libopen-rte.so.0(orte_plm_base_daemon_callback+0x9d) [0x7ff85d4a002d]
[ipc:16986] [ 5] /usr/lib/openmpi/lib/openmpi/mca_plm_slurm.so(+0x211a) [0x7ff85bbc311a]
[ipc:16986] [ 6] mpirun() [0x403c1f]
[ipc:16986] [ 7] mpirun() [0x403014]
[ipc:16986] [ 8] /lib/libc.so.6(__libc_start_main+0xfd) [0x7ff85c3efc4d]
[ipc:16986] [ 9] mpirun() [0x402f39]
[ipc:16986] *** End of error message ***

Non-failing case:

michael@eng-ipc4 ~ $ salloc -n8 -w ipc4 mpirun --display-map ./mpi

 ========================   JOB MAP   ========================

 Data for node: Name: eng-ipc4.FQDN   Num procs: 8
        Process OMPI jobid: [12467,1] Process rank: 0
        Process OMPI jobid: [12467,1] Process rank: 1
        Process OMPI jobid: [12467,1] Process rank: 2
        Process OMPI jobid: [12467,1] Process rank: 3
        Process OMPI jobid: [12467,1] Process rank: 4
        Process OMPI jobid: [12467,1] Process rank: 5
        Process OMPI jobid: [12467,1] Process rank: 6
        Process OMPI jobid: [12467,1] Process rank: 7

 =============================================================
Process 1 on eng-ipc4.FQDN out of 8
Process 3 on eng-ipc4.FQDN out of 8
Process 4 on eng-ipc4.FQDN out of 8
Process 6 on eng-ipc4.FQDN out of 8
Process 7 on eng-ipc4.FQDN out of 8
Process 0 on eng-ipc4.FQDN out of 8
Process 2 on eng-ipc4.FQDN out of 8
Process 5 on eng-ipc4.FQDN out of 8

Using mpirun directly (outside SLURM) is fine, e.g.:

mpirun -H 'ipc3,ipc4' -np 8 ./mpi

works as expected.

This is a (small) homogeneous cluster: all Xeon-class machines with plenty of RAM and a shared filesystem over NFS, running 64-bit Ubuntu Server. I was running stock OpenMPI (1.4.1) and SLURM (2.1.1); I have since upgraded to the latest stable OpenMPI (1.4.3) and SLURM (2.2.0), with no effect. (The newer binaries were compiled from the respective upstream Debian packages.)

strace (not shown) shows that the job is launched via srun and a connection is received back from the child process over TCP/IP. Soon after this, mpirun crashes. Nodes communicate over a semi-dedicated TCP/IP GigE connection.

Is this a known bug? What is going wrong?

Regards,
Michael Curtis
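(For reference: the ./mpi test program above is essentially the standard MPI hello-world. The actual source was not posted; a minimal sketch consistent with the "Process N on host out of M" output would be:)

/* mpi.c - minimal sketch consistent with the output above
 * (assumed; the exact source was not posted) */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Process %d on %s out of %d\n", rank, name, size);
    MPI_Finalize();
    return 0;
}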
[OMPI users] Argument parsing issue
Dear Open MPI users and developers,

I'm using OpenMPI 1.4.3 and the Intel compiler. My simple application requires 3 command-line arguments to work. If I use the following command:

mpirun -np 2 ./a.out a b "c d"

it works well. Debugging my application with Totalview:

mpirun -np 2 --debug ./a.out a b "c d"

argument parsing doesn't work well. The arguments passed are:

a b c d

and not

a b "c d"

I think there is an issue in parsing the arguments when invoking Totalview. Is this a bug in mpirun, or do I need to do it another way?

Thanks in advance.

--
Ing. Gabriele Fatigati
Parallel programmer
CINECA Systems & Technologies Department
Supercomputing Group
Via Magnanelli 6/3, Casalecchio di Reno (BO) Italy
www.cineca.it   Tel: +39 051 6171722
g.fatigati [AT] cineca.it
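(For reference, a minimal stand-in for such a test application -- hypothetical, the actual a.out source was not posted -- that simply echoes each argument it receives on its own line:)

/* a.out stand-in (hypothetical sketch): rank 0 prints each
 * command-line argument on its own line */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int i = 1; i < argc; ++i)
            printf("%s\n", argv[i]);
    MPI_Finalize();
    return 0;
}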
Re: [OMPI users] Argument parsing issue
Hi,

Am 27.01.2011 um 09:48 schrieb Gabriele Fatigati:

> Debugging my application with Totalview:
>
> mpirun -np 2 --debug ./a.out a b "c d"
>
> Argument parsing doesn't work well. The arguments passed are:
>
> a b c d

This double expansion can happen with certain wrappers (it also happens sometimes with queuing systems). What you can try is:

$ mpirun -np 2 --debug ./a.out a b "'c d'"

$ mpirun -np 2 --debug ./a.out a b "\"c d\""

-- Reuti
Re: [OMPI users] Argument parsing issue
Mm, doing as you suggest the output is:

a
b
"c
d"

and not:

a
b
"c d"

-- Gabriele
Re: [OMPI users] Argument parsing issue
Am 27.01.2011 um 10:32 schrieb Gabriele Fatigati:

> doing as you suggest the output is:
>
> a
> b
> "c
> d"

Whoa -- your application runs fine without the debugger, so I don't think it's a problem with `mpirun` per se. Does the same happen with single quotes inside double quotes?

-- Reuti
Re: [OMPI users] Argument parsing issue
The problem is in how mpirun scans the input parameters when Totalview is invoked. There is some wrong behaviour somewhere in the middle :(

-- Gabriele
Re: [OMPI users] Argument parsing issue
The problem is that mpirun regenerates itself to exec a command of "totalview mpirun ...", and the quotes are lost in the process.

Just start your debugged job with "totalview mpirun ..." and it should work fine.

-- Ralph
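(To illustrate the failure mode: the following is a hedged sketch, not the actual Open MPI source, of how flattening argv into a single string and re-parsing it through a shell loses the original word boundaries. Here printf stands in for the re-exec'ed "totalview mpirun ..." command:)

/* flatten.c - hypothetical sketch of quote loss on re-exec.
 * printf '[%s]\n' prints each shell word it receives in brackets. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char cmd[4096] = "printf '[%s]\\n'";  /* stand-in for the debugger command */

    for (int i = 1; i < argc; ++i) {
        strcat(cmd, " ");
        strcat(cmd, argv[i]);  /* the quotes around "c d" were already
                                  consumed by the first shell, so this
                                  appends the bare characters: c d */
    }
    /* system() hands the flat string to /bin/sh, which re-splits on
       whitespace: the original argument grouping is gone */
    return system(cmd);
}

Running ./flatten a b "c d" prints four bracketed words ([a] [b] [c] [d]) instead of three, which is exactly the behaviour Gabriele sees.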
Re: [OMPI users] Argument parsing issue
The command "totalview mpirun ..." starts debugging mpirun itself, not my executable :( The code shown is the main.c of Open MPI.

-- Gabriele
Re: [OMPI users] Argument parsing issue
I found the code in OMPI that is dropping the quoting.

Specifically: it *is* OMPI that is dropping your quoting / splitting "foo bar" into 2 arguments when re-execing totalview.

Let me see if I can gin up a patch...

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
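(A hedged sketch of the kind of fix involved -- illustrative only; the real patch lives in Open MPI's totalview re-exec path and may differ: re-quote any argument containing whitespace before rebuilding the command line:)

/* requote.c - hypothetical sketch of the fix: wrap arguments that
 * contain whitespace in single quotes before rebuilding the command
 * line, so the re-exec'ed shell sees the original word boundaries.
 * (Naive: breaks if an argument itself contains a single quote;
 * a real fix has to escape those too.) */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void append_arg(char *cmd, size_t cap, const char *arg)
{
    int quote = 0;
    for (const char *p = arg; *p; ++p)
        if (isspace((unsigned char)*p)) { quote = 1; break; }

    strncat(cmd, " ", cap - strlen(cmd) - 1);
    if (quote) strncat(cmd, "'", cap - strlen(cmd) - 1);
    strncat(cmd, arg, cap - strlen(cmd) - 1);
    if (quote) strncat(cmd, "'", cap - strlen(cmd) - 1);
}

int main(int argc, char *argv[])
{
    char cmd[4096] = "printf '[%s]\\n'";  /* stand-in, as in the earlier sketch */

    for (int i = 1; i < argc; ++i)
        append_arg(cmd, sizeof cmd, argv[i]);
    return system(cmd);  /* ./requote a b "c d" now yields [a] [b] [c d] */
}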
Re: [OMPI users] Argument parsing issue
Ok Jeff, tell me where the code is and I'll try to fix it.

Thanks a lot.

-- Gabriele
[OMPI users] allow job to survive process death
Hi,

I was wondering what support Open MPI has for allowing a job to continue running when one or more processes in the job die unexpectedly? Is there a special mpirun flag for this? Any other ways?

It seems obvious that collectives will fail once a process dies, but would it be possible to create a new group (if you knew which ranks are dead) that excludes the dead processes -- then turn this group into a working communicator? (Something like the sketch below.)

Thanks,
Kirk
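(A sketch of the idea Kirk describes, in standard MPI calls, assuming the ranks in dead_ranks[] are somehow known; note the caveat in the comment:)

#include <mpi.h>

/* Derive a new communicator that excludes known-dead ranks. */
MPI_Comm shrink_comm(MPI_Comm comm, int dead_ranks[], int n_dead)
{
    MPI_Group world_grp, live_grp;
    MPI_Comm  live_comm;

    MPI_Comm_group(comm, &world_grp);
    MPI_Group_excl(world_grp, n_dead, dead_ranks, &live_grp);
    /* Caveat: MPI_Comm_create is itself collective over comm, so with a
       standard MPI this call would hang or fail once a member of comm is
       dead -- which is exactly the chicken-and-egg problem here. */
    MPI_Comm_create(comm, live_grp, &live_comm);

    MPI_Group_free(&live_grp);
    MPI_Group_free(&world_grp);
    return live_comm;  /* MPI_COMM_NULL on the excluded ranks */
}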
Re: [OMPI users] allow job to survive process death
The current version of Open MPI does not support continued operation of an MPI application after a process failure within a job. If a process dies, so will the MPI job. Note that this is true of many MPI implementations out there at the moment.

At Oak Ridge National Laboratory, we are working on a version of Open MPI that will be able to run through process failure, if the application wishes to do so. The semantics and interfaces needed to support this functionality are being actively developed by the MPI Forum's Fault Tolerance Working Group, and can be found at the wiki page below:

https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization

This work is ongoing, but once we have a stable prototype we will assess how to bring it back to the mainline Open MPI trunk. For the moment, there is no public release of this branch, but once there is, we will be sure to announce it on the appropriate Open MPI mailing list for folks to start playing around with it.

-- Josh

Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
Re: [OMPI users] allow job to survive process death
Am 27.01.2011 um 15:23 schrieb Joshua Hursey:

> The semantics and interfaces needed to support this functionality are being actively developed by the MPI Forum's Fault Tolerance Working Group, and can be found at the wiki page below:
>
> https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization

I had a look at this document, but what is really covered -- does the application have to react to the notification of a failed rank and act appropriately on its own?

Having a true ability to survive a dying process (i.e. rank) which might already have been computing for hours would mean having some kind of "rank RAID" or "rank Parchive". E.g. start 12 ranks when you need 10 -- whatever 2 ranks fail, your job will still be ready in time.

-- Reuti
Re: [OMPI users] allow job to survive process death
On Jan 27, 2011, at 7:47 AM, Reuti wrote:

> Having a true ability to survive a dying process (i.e. rank) [...] would mean having some kind of "rank RAID" or "rank Parchive". E.g. start 12 ranks when you need 10.

We have the run-time part of this done -- of course, figuring out the MPI part of the problem is harder ;-)
Re: [OMPI users] allow job to survive process death
On Jan 27, 2011, at 9:47 AM, Reuti wrote:

> I had a look at this document, but what is really covered -- does the application have to react to the notification of a failed rank and act appropriately on its own?

Yes. This is to support application-based fault tolerance (ABFT). Libraries could be developed on top of these semantics to hide some of the fault handling. The purpose is to enable fault-tolerant MPI applications and libraries to be built on top of MPI.

This document only covers run-through stabilization, not process recovery, at the moment. So the application will have well-defined semantics to allow it to continue processing without the failed process. Recovering the failed process is not specified in this document; that is the subject of a supplemental document in preparation -- the two proposals are meant to be complementary and build upon one another.

> Having a true ability to survive a dying process (i.e. rank) which might already have been computing for hours would mean having some kind of "rank RAID" or "rank Parchive". E.g. start 12 ranks when you need 10.

Yes, that is one possible technique. Once a process failure occurs, the application is notified via the existing error-handling mechanisms. The application is then responsible for determining how best to recover from that process failure. This could include using MPI_Comm_spawn to create new processes (useful in manager/worker applications), recovering the state from an in-memory checksum, using spare processes in the communicator, rolling back some/all ranks to an application-level checkpoint, ignoring the failure and allowing the residual error to increase, aborting the job or a single sub-communicator, ... the list goes on. But the purpose of the proposal is to allow an application or library to start building such techniques based on portable semantics and well-defined interfaces.

Does that help clarify?

If you would like to discuss the developing proposals further, or have input on how to make them better, I would suggest moving the discussion to the MPI3-ft mailing list, so that other groups that do not normally follow the Open MPI lists can participate. The mailing list information is below:

http://lists.mpi-forum.org/mailman/listinfo.cgi/mpi3-ft

-- Josh
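(For the manager/worker case Josh mentions, a minimal hedged sketch of respawning a replacement worker with MPI_Comm_spawn; "worker" is a hypothetical executable name, and failure detection/error handling are elided:)

#include <mpi.h>

/* Manager side: spawn one replacement worker after a failure has been
 * detected (detection itself is elided; "worker" is hypothetical). */
MPI_Comm respawn_worker(void)
{
    MPI_Comm child;

    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                   0 /* root */, MPI_COMM_SELF, &child,
                   MPI_ERRCODES_IGNORE);
    return child;  /* intercommunicator to the newly spawned worker */
}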
Re: [OMPI users] allow job to survive process death
Am 27.01.2011 um 16:10 schrieb Joshua Hursey:

> Does that help clarify?

Yes -- thx.

-- Reuti
Re: [OMPI users] Argument parsing issue
I did my patch against the development trunk; could you try the attached patch against a trunk nightly tarball and see if that works for you?

If it does, I can provide patches for v1.4 and v1.5 (the code moved a bit between these 3 versions, so I would need to adapt the patches a little).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
[OMPI users] Experiences with Mellanox Connect-X HCA ?
Just touting around for any experiences with the following combination (if it's already out there somewhere?) ahead of fully spec-ing a required software stack:

Mellanox Connect-X HCAs talking through a Voltaire ISR4036 IB QDR switch
RHEL (yep, not the usual NetBSD!)
OFED (built with Portland Group compilers)
OpenMPI (is the 1.4 series as high as I can go at present?)
SGE (as in, not the Oracle fork)

Alternatively, alternatives to the above are welcome!

Kevin

--
Kevin M. Buckley                Room: CO327, Phone: +64 4 463 5971
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand