Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...
On Tue, Jan 31, 2012 at 5:20 PM, Richard Walsh wrote:
> in the malloc.c routine in 1.5.5. Perhaps you should lower the optimization
> level to zero and see what you get.

Hi Richard,

thanks for the suggestion. I was able to solve the problem by upgrading the Intel Compiler to version 12.1.2 and recompiling the openmpi runtime with unchanged options. Now I cannot reproduce that crash. I'll have to test some more, but I think the problem is solved.

Thanks, Götz
Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2
On Tue, Jan 31, 2012 at 8:19 PM, Daniel Milroy wrote:
> Hello,
>
> I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC
> environment. We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon
> X5660 cpus. You can find my build options below. In an effort to
> test the OpenMPI build, I compiled "Hello world" with an mpi_init call
> in C and Fortran. Mpirun of both versions on a single node results in
> a segfault. I have attached the pertinent portion of gdb's output of
> the "Hello world" core dump.

Hi Daniel,

that looks like the problem I had with my Intel build of openmpi. I could solve it by upgrading the Intel Compiler version to 12.1.2.273:

% icc -v
icc version 12.1.2 (gcc version 4.4.5 compatibility)
% icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version 12.1 Build 2028
Copyright (C) 1985-2011 Intel Corporation. All rights reserved.

After a rebuild of the openmpi runtime, the crashes went away. I was using openmpi 1.5.3, but you could still have the same problem.

Regards, Götz
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
Am 31.01.2012 um 21:25 schrieb Ralph Castain:

> On Jan 31, 2012, at 12:58 PM, Reuti wrote:
>
>> Am 31.01.2012 um 20:38 schrieb Ralph Castain:
>>
>>> Not sure I fully grok this thread, but will try to provide an answer.
>>>
>>> When you start a singleton, it spawns off a daemon that is the equivalent of "mpirun". This daemon is created for the express purpose of allowing the singleton to use MPI dynamics like comm_spawn - without it, the singleton would be unable to execute those functions.
>>>
>>> The first thing the daemon does is read the local allocation, using the same methods as used by mpirun. So whatever allocation is present that mpirun would have read, the daemon will get. This includes hostfiles and SGE allocations.
>>
>> So it should honor also the default hostfile of Open MPI if running outside of SGE, i.e. from the command line?
>
> Yes

BTW: is there any default for a hostfile for Open MPI - I mean any in my home directory or /etc? When I check `man orte_hosts`, and all possible options are unset (like in a singleton run), it will only run local (Job is co-located with mpirun).

>>> The exception to this is when the singleton gets started in an altered environment - e.g., if SGE changes the environmental variables when launching the singleton process. We see this in some resource managers - you can get an allocation of N nodes, but when you launch a job, the envar in that job only indicates the number of nodes actually running processes in that job. In such a situation, the daemon will see the altered value as its "allocation", potentially causing confusion.
>>
>> Not sure whether I get it right. When I launch the same application with:
>>
>> "mpiexec -np 1 ./Mpitest" (and get an allocation of 2+2 on the two machines):
>>
>> 27422 ?  Sl    4:12 /usr/sge/bin/lx24-x86/sge_execd
>>  9504 ?  S     0:00  \_ sge_shepherd-3791 -bg
>>  9506 ?  Ss    0:00      \_ /bin/sh /var/spool/sge/pc15370/job_scripts/3791
>>  9507 ?  S     0:00          \_ mpiexec -np 1 ./Mpitest
>>  9508 ?  R     0:07              \_ ./Mpitest
>>  9509 ?  Sl    0:00              \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381 orted -mca
>>  9513 ?  S     0:00              \_ /home/reuti/mpitest/Mpitest --child
>>
>>  2861 ?  Sl   10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25434 ?  Sl    0:00  \_ sge_shepherd-3791 -bg
>> 25436 ?  Ss    0:00      \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/3791.1/1.pc15381
>> 25444 ?  S     0:00          \_ orted -mca ess env -mca orte_ess_jobid 821952512 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 25447 ?  S     0:01              \_ /home/reuti/mpitest/Mpitest --child
>> 25448 ?  S     0:01              \_ /home/reuti/mpitest/Mpitest --child
>>
>> This is what I expect (main + 1 child, the other node gets 2 children). Now I launch the singleton instead (nothing changed besides this, still 2+2 granted):
>>
>> "./Mpitest" and get:
>>
>> 27422 ?  Sl    4:12 /usr/sge/bin/lx24-x86/sge_execd
>>  9546 ?  S     0:00  \_ sge_shepherd-3793 -bg
>>  9548 ?  Ss    0:00      \_ /bin/sh /var/spool/sge/pc15370/job_scripts/3793
>>  9549 ?  R     0:00          \_ ./Mpitest
>>  9550 ?  Ss    0:00              \_ orted --hnp --set-sid --report-uri 6 --singleton-died-pipe 7
>>  9551 ?  Sl    0:00                  \_ /usr/sge/bin/lx24-x86/qrsh -inherit -nostdin -V pc15381 orted
>>  9554 ?  S     0:00              \_ /home/reuti/mpitest/Mpitest --child
>>  9555 ?  S     0:00              \_ /home/reuti/mpitest/Mpitest --child
>>
>>  2861 ?  Sl   10:47 /usr/sge/bin/lx24-x86/sge_execd
>> 25494 ?  Sl    0:00  \_ sge_shepherd-3793 -bg
>> 25495 ?  Ss    0:00      \_ /usr/sge/utilbin/lx24-x86/qrsh_starter /var/spool/sge/pc15381/active_jobs/3793.1/1.pc15381
>> 25502 ?  S     0:00          \_ orted -mca ess env -mca orte_ess_jobid 814940160 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>> 25503 ?  S     0:00              \_ /home/reuti/mpitest/Mpitest --child
>>
>> Only one child is going to the other node. The environment is the same in both cases. Is this the correct behavior?
>
> We probably aren't correctly marking the original singleton on that node, and so the mapper thinks there are still two slots available on the original node.

Okay. There is something to discuss/fix. BTW: if started as singleton I get an error at the end with the program the OP provided:

[pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline [[12435,0],0] lost

It's not the case if run by mpiexec.

-- Reuti
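As a point of reference, the following is a minimal C sketch of the kind of spawn test being discussed. It is an assumption rather than the OP's actual Mpitest source; the "--child" argument and the count of three spawned processes are simply inferred from the ps output above.

/* spawn_test.c - hedged sketch of a comm_spawn test started either under
 * mpiexec or as a singleton; NOT the OP's Mpitest, just an illustration. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (argc > 1 && strcmp(argv[1], "--child") == 0) {
        /* Spawned side: fetch the inter-communicator back to the parent. */
        MPI_Comm parent;
        MPI_Comm_get_parent(&parent);
        printf("child rank %d started\n", rank);
        MPI_Comm_disconnect(&parent);
    } else {
        /* Parent side: spawn three children; with tight SGE integration they
         * should be mapped according to the 2+2 slot allocation. */
        char *child_argv[] = { "--child", NULL };
        MPI_Comm children;
        int errcodes[3];
        MPI_Comm_spawn(argv[0], child_argv, 3, MPI_INFO_NULL, 0,
                       MPI_COMM_SELF, &children, errcodes);
        printf("parent spawned 3 children\n");
        MPI_Comm_disconnect(&children);
    }

    MPI_Finalize();
    return 0;
}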
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
On Feb 1, 2012, at 3:49 AM, Reuti wrote:

> BTW: is there any default for a hostfile for Open MPI - I mean any in my home directory or /etc? When I check `man orte_hosts`, and all possible options are unset (like in a singleton run), it will only run local (Job is co-located with mpirun).

Yep - it is /etc/openmpi-default-hostfile

>> We probably aren't correctly marking the original singleton on that node, and so the mapper thinks there are still two slots available on the original node.
>
> Okay. There is something to discuss/fix. BTW: if started as singleton I get an error at the end with the program the OP provided:
>
> [pc15381:25502] [[12435,0],1] routed:binomial: Connection to lifeline [[12435,0],0] lost

Okay, I'll take a look at it - but it may take awhile before I can address either issue as other priorities loom.
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
Am 01.02.2012 um 15:38 schrieb Ralph Castain:

> On Feb 1, 2012, at 3:49 AM, Reuti wrote:
>
>> BTW: is there any default for a hostfile for Open MPI - I mean any in my home directory or /etc? When I check `man orte_hosts`, and all possible options are unset (like in a singleton run), it will only run local (Job is co-located with mpirun).
>
> Yep - it is /etc/openmpi-default-hostfile

Thx for replying Ralph.

I spotted it too, but this is not working for me. Neither for mpiexec from the command line, nor any singleton. I also tried a plain /etc as the location of this file as well.

reuti@pc15370:~> which mpicc
/home/reuti/local/openmpi-1.4.4-thread/bin/mpicc
reuti@pc15370:~> cat /home/reuti/local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile
pc15370 slots=2
pc15381 slots=2
reuti@pc15370:~> mpicc -o mpihello mpihello.c
reuti@pc15370:~> mpiexec -np 4 ./mpihello
Hello World from Node 0.
Hello World from Node 1.
Hello World from Node 2.
Hello World from Node 3.

But all is local (no spawn here, traditional mpihello):

19503 ?      Ss   0:00 /usr/sbin/sshd -o PidFile=/var/run/sshd.init.pid
11583 ?      Ss   0:00  \_ sshd: reuti [priv]
11585 ?      S    0:00  |   \_ sshd: reuti@pts/6
11587 pts/6  Ss   0:00  |       \_ -bash
13470 pts/6  S+   0:00  |           \_ mpiexec -np 4 ./mpihello
13471 pts/6  R+   0:00  |               \_ ./mpihello
13472 pts/6  R+   0:00  |               \_ ./mpihello
13473 pts/6  R+   0:00  |               \_ ./mpihello
13474 pts/6  R+   0:00  |               \_ ./mpihello

-- Reuti
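The mpihello.c compiled above is not included in the thread; a minimal program producing the same "Hello World from Node N." output would look roughly like the sketch below (an assumption about the source, not a copy of it).

/* mpihello.c - assumed minimal version of the test program used above. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("Hello World from Node %d.\n", rank);   /* one line per rank */
    MPI_Finalize();
    return 0;
}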
[OMPI users] Mpirun: How to print STDOUT of just one process?
When running

mpirun -n 2

the STDOUT streams of both processes are combined and are displayed by the shell. In such an interleaved format it's hard to tell which line comes from which node.

Is there a way to have mpirun merge just the STDOUT of one process into its own STDOUT stream?

Best,
Frank

Cross-reference: http://stackoverflow.com/questions/9098781/mpirun-how-to-print-stdout-of-just-one-process
Re: [OMPI users] Mpirun: How to print STDOUT of just one process?
I don't know about using mpirun to do it, but you can actually call mpirun on a script, and have that script individually call a single instance of your program. Then that script could use shell redirection to redirect the output of the program's instance to a separate file.

I've used this technique to play with ulimit sorts of things in the script before. I'm not entirely sure what variables are exposed to you in the script, such that you could come up with a unique filename to output to, though.

Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

On 02/01/2012 08:59 AM, Frank wrote:
> When running mpirun -n 2 the STDOUT streams of both processes are combined and are displayed by the shell. In such an interleaved format it's hard to tell which line comes from which node.
>
> Is there a way to have mpirun merge just the STDOUT of one process into its own STDOUT stream?
Re: [OMPI users] Mpirun: How to print STDOUT of just one process?
man mpirun

. . .

-output-filename, --output-filename
      Redirect the stdout, stderr, and stddiag of all ranks to a rank-unique version of the specified filename. Any directories in the filename will automatically be created. Each output file will consist of filename.rank, where the rank will be left-filled with zero's for correct ordering in listings.

. . .
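As an alternative to the mpirun option quoted above, a similar per-rank separation can also be done inside the application itself. The sketch below is a hedged illustration, not an Open MPI feature; the file-name pattern "stdout.rankN.log" is made up for the example.

/* per_rank_stdout.c - hedged sketch: per-rank stdout redirection done in
 * the application rather than by mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank != 0) {
        /* Send everything this rank prints to its own file; the name is an
         * example, not an Open MPI convention. */
        char fname[64];
        snprintf(fname, sizeof(fname), "stdout.rank%d.log", rank);
        if (freopen(fname, "w", stdout) == NULL) {
            fprintf(stderr, "rank %d: could not redirect stdout\n", rank);
        }
    }

    printf("Hello from rank %d\n", rank);   /* only rank 0 reaches the terminal */

    MPI_Finalize();
    return 0;
}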
Re: [OMPI users] Mpirun: How to print STDOUT of just one process?
On 2/1/2012 7:59 AM, Frank wrote:
> When running mpirun -n 2 the STDOUT streams of both processes are combined and are displayed by the shell. In such an interleaved format it's hard to tell which line comes from which node.

As far as this part goes, there is also "mpirun --tag-output". Check the mpirun man page.

> Is there a way to have mpirun merge just the STDOUT of one process into its own STDOUT stream?
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
Could you add --display-allocation to your cmd line? This will tell us if it found/read the default hostfile, or if the problem is with the mapper.

On Feb 1, 2012, at 7:58 AM, Reuti wrote:

> I spotted it too, but this is not working for me. Neither for mpiexec from the command line, nor any singleton. I also tried a plain /etc as the location of this file as well.
Re: [OMPI users] Mpirun: How to print STDOUT of just one process?
Try out the attached wrapper:

$ mpiexec -np 2 masterstdout

> mpirun -n 2

> Is there a way to have mpirun merge just the STDOUT of one process into its own STDOUT stream?

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915

#!/bin/sh
ARGS=$@
if [[ $OMPI_COMM_WORLD_RANK == 0 ]]
then
  $ARGS
else
  $ARGS 1>/dev/null 2>/dev/null
fi
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
Am 01.02.2012 um 17:16 schrieb Ralph Castain:

> Could you add --display-allocation to your cmd line? This will tell us if it found/read the default hostfile, or if the problem is with the mapper.

Sure:

reuti@pc15370:~> mpiexec --display-allocation -np 4 ./mpihello

== ALLOCATED NODES ==

 Data for node: Name: pc15370   Num slots: 1    Max slots: 0

=
Hello World from Node 0.
Hello World from Node 1.
Hello World from Node 2.
Hello World from Node 3.

(Nothing in `strace` about accessing something with "default")

reuti@pc15370:~> mpiexec --default-hostfile local/openmpi-1.4.4-thread/etc/openmpi-default-hostfile --display-allocation -np 4 ./mpihello

== ALLOCATED NODES ==

 Data for node: Name: pc15370   Num slots: 2    Max slots: 0
 Data for node: Name: pc15381   Num slots: 2    Max slots: 0

=
Hello World from Node 0.
Hello World from Node 3.
Hello World from Node 2.
Hello World from Node 1.

Specifying it explicitly works fine, with the correct distribution in `ps`.

-- Reuti
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
Ah - crud. Looks like the default-hostfile mca param isn't getting set to the default value. Will resolve - thanks!

On Feb 1, 2012, at 9:28 AM, Reuti wrote:

> reuti@pc15370:~> mpiexec --display-allocation -np 4 ./mpihello
>
> == ALLOCATED NODES ==
>
>  Data for node: Name: pc15370   Num slots: 1    Max slots: 0
>
> (Nothing in `strace` about accessing something with "default")
>
> Specifying it explicitly works fine, with the correct distribution in `ps`.
>
> -- Reuti
Re: [OMPI users] Mpirun: How to print STDOUT of just one process?
Great, that works!! Many Thanks!

On Wed, Feb 1, 2012 at 4:17 PM, Paul Kapinos wrote:
> Try out the attached wrapper:
> $ mpiexec -np 2 masterstdout
Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2
Hi Jeff,

Pending further testing, your suggestion seems to have fixed the issue. Thank you very much.

Dan Milroy

2012/1/31 Jeff Squyres:
> We have heard reports of failures with the Intel 12.1 compilers.
>
> Can you try with rc4 (that was literally just released) with the --without-memory-manager configure option?
>
> On Jan 31, 2012, at 2:19 PM, Daniel Milroy wrote:
>
>> Hello,
>>
>> I have built OpenMPI 1.4.5rc2 with Intel 12.1 compilers in an HPC environment. We are running RHEL 5, kernel 2.6.18-238 with Intel Xeon X5660 cpus. You can find my build options below. In an effort to test the OpenMPI build, I compiled "Hello world" with an mpi_init call in C and Fortran. Mpirun of both versions on a single node results in a segfault. I have attached the pertinent portion of gdb's output of the "Hello world" core dump. Submitting a parallel "Hello world" job to torque results in segfaults across the respective nodes. However, if I execute mpirun of C or Fortran "Hello world" following a segfault, the program will exit successfully. Additionally, if I strace mpirun on either a single node or on multiple nodes in parallel, "Hello world" runs successfully. I am unsure how to proceed - any help would be greatly appreciated.
>>
>> Thank you in advance,
>>
>> Dan Milroy
>>
>> Build options:
>>
>> source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/iccvars.sh intel64
>> source /ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/ifortvars.sh intel64
>> export CC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icc
>> export CXX=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/icpc
>> export F77=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>> export F90=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>> export FC=/ics_2012.0.032/composer_xe_2011_sp1.6.233/bin/intel64/ifort
>> ./configure --prefix=/openmpi-1.4.5rc2_intel-12.1 --with-tm=/torque-2.5.8/ --enable-shared --enable-static --without-psm
Re: [OMPI users] Segfault on mpirun with OpenMPI 1.4.5rc2
Hi Götz,

I don't know whether we can implement your suggestion; it is dependent on the terms of our license with Intel. I will take this under advisement. Thank you very much.

Dan Milroy

2012/2/1 Götz Waschk:
> that looks like the problem I had with my intel build of openmpi. I
> could solve it by upgrading the Intel Compiler version to 12.1.2.273.
> After a rebuild of the openmpi runtime, the crashes went away. I was
> using openmpi 1.5.3, but you could still have the same problem.
Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...
I think we need to add something to the FAQ so that it's googleable: "The Intel 12.1 Linux compilers before 12.1.2 are busted. Upgrade to at least 12.1.2, and OMPI should compile and work fine."

On Feb 1, 2012, at 3:34 AM, Götz Waschk wrote:

> thanks for the suggestion. I was able to solve the problem by
> upgrading the Intel Compiler to version 12.1.2 and recompiling the
> openmpi runtime with unchanged options. Now I cannot reproduce that
> crash. I'll have to test some more, but I think the problem is solved.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Mpirun: How to print STDOUT of just one process?
Hi Frank, Lloyd

If all you want is to sort out which process the output is coming from, you can use the "-tag-output" switch to the [OpenMPI] mpirun. Check it out with 'man mpirun'.

I hope this helps,
Gus Correa

On Feb 1, 2012, at 11:04 AM, Lloyd Brown wrote:

> I don't know about using mpirun to do it, but you can actually call
> mpirun on a script, and have that script individually call a single
> instance of your program. Then that script could use shell redirection
> to redirect the output of the program's instance to a separate file.
Re: [OMPI users] Latest Intel Compilers (ICS, version 12.1.0.233 Build 20110811) issues ...
I just added it:

http://www.open-mpi.org/faq/?category=troubleshooting#intel-12.1-compiler

On Feb 1, 2012, at 12:41 PM, Jeff Squyres wrote:

> I think we need to add something to the FAQ so that it's googleable: "The
> Intel 12.1 Linux compilers before 12.1.2 are busted. Upgrade to at least
> 12.1.2, and OMPI should compile and work fine."

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Spawn_multiple with tight integration to SGE grid engine
FWIW: I have fixed this on the developer's trunk, and Jeff has scheduled it for release in the upcoming 1.6 release (when the 1.5 series rolls over). I don't expect we'll backport it to 1.4 unless someone really needs it there.

Thanks!
Ralph

On Feb 1, 2012, at 9:31 AM, Ralph Castain wrote:

> Ah - crud. Looks like the default-hostfile mca param isn't getting set to the default value. Will resolve - thanks!
Re: [OMPI users] OpenMPI / SLURM -> Send/Recv blocking
On Jan 31, 2012, at 11:16 AM, adrian sabou wrote:

> Like I said, a very simple program.
> When launching this application with SLURM (using "salloc -N2 mpirun ./"), it hangs at the barrier.

Are you able to run the MPI example programs in examples/ ?

> However, it passes the barrier if I launch it without SLURM (using "mpirun -np 2 ./"). I first noticed this problem when my application hung if I tried to send two successive messages from a process to another. Only the first MPI_Send would work. The second MPI_Send would block indefinitely. I was wondering whether any of you have encountered a similar problem, or may have an idea as to what is causing the Send/Receive pair to block when using SLURM. The exact output in my console is as follows:
>
> salloc: Granted job allocation 1138
> Process 0 - Sending...
> Process 1 - Receiving...
> Process 1 - Received.
> Process 1 - Barrier reached.
> Process 0 - Sent.
> Process 0 - Barrier reached.
> (it just hangs here)
>
> I am new to MPI programming and to OpenMPI and would greatly appreciate any help. My OpenMPI version is 1.4.4 (although I have also tried it on 1.5.4), my SLURM version is 0.3.3-1 (slurm-llnl 2.1.0-1),

I'm not sure what SLURM version that is -- my "srun --version" shows 2.2.4. 0.3.3 would be pretty ancient, no?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
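For reference, a minimal C sketch of the pattern adrian describes (an assumed reconstruction, not the original program): rank 0 sends one message to rank 1 and both ranks then enter a barrier. Since these MPI calls are trivially correct, a hang that appears only under salloc suggests the launch environment rather than the application code.

/* sendrecv_barrier.c - assumed reconstruction of the simple test described
 * above; not the OP's source. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, msg = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("Process 0 - Sending...\n");
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("Process 0 - Sent.\n");
    } else if (rank == 1) {
        printf("Process 1 - Receiving...\n");
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 - Received.\n");
    }

    MPI_Barrier(MPI_COMM_WORLD);
    printf("Process %d - Barrier reached.\n", rank);

    MPI_Finalize();
    return 0;
}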