[OMPI users] Performance does not scale at all when running jobs on the same single node (Rocks, AMD Barcelona, Torque, Maui, VASP, Open MPI, Gigabit Ethernet)
Dear all,

We just finished installing the first batch of nodes with the following configuration:

Machines: dual quad-core AMD 2350 + 16 GB of RAM
OS + apps: Rocks 4.3 + Torque (2.1.8-1) + Maui (3.2.6p19-1) + Open MPI (1.1.1-8) + VASP
Interconnect: Gigabit Ethernet ports + Extreme Summit x450a

We were able to compile VASP + Open MPI + ACML and ran a bunch of tests. For all the tests where we ran a _single_ job on ONE node (1/2/4/8-core jobs), VASP performance scaled as we expected.

The problem surfaced when we tried to run multiple VASP jobs on the same node (e.g. two 4-core jobs on one node): performance degraded by roughly a factor of 2. A sample 4-core VASP test run alone on a single node takes close to 900 seconds; if we run two instances of the same job on one node, we see around 1700-1800 seconds per job. On the compute node, I used the top command and saw all 8 threads running (~100 %CPU) and a load average around 8.0, occasionally up to 8.5.

I thought that processor and/or memory affinity needed to be specified:

# ompi_info | grep affinity
   MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
   MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
   MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1.1)

so in my job.txt file for qsub I added mpi_paffinity_alone:

mpiexec --mca mpi_paffinity_alone 1 --np $NPROCS vaspmpi_barcelona

However, with or without mpi_paffinity_alone, performance is still pretty bad and not acceptable. With mpi_paffinity_alone set it was actually worse: with top we observed that some threads sat idle a great deal of the time. We also tried running the jobs without qsub/PBS, invoking mpirun directly on the nodes, and then performance scaled well, just like running on an isolated node. Weird! What could Torque + Maui be doing to cause such problems?

I am wondering what I have misconfigured on my cluster: Torque? VASP? Maui? Open MPI? Aside from this scaling issue, jobs submitted through qsub/PBS run fine.

My users' .bashrc has these 2 lines:

export OMP_NUM_THREADS=1
export LD_LIBRARY_PATH=/opt/acml4.0.1/gfortran64/lib

and:

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 1024
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 135168
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

My Torque nodes file has a simple entry like this:

compute-0-0.local np=4

My Maui setup is a very simple one.

Please give your advice and suggestions on how to resolve these performance issues.

Thank you very much.
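For context, here is a minimal sketch of what the job.txt submission script described above might look like. The job name, the ppn request, and the cd into the working directory are assumptions added for completeness; the mpiexec line and the two environment variables come from the post:

    #!/bin/bash
    #PBS -N vasp_test            # hypothetical job name
    #PBS -l nodes=1:ppn=4        # request 4 cores on one node
    #PBS -j oe

    cd "$PBS_O_WORKDIR"

    # Number of cores Torque granted to this job
    NPROCS=$(wc -l < "$PBS_NODEFILE")

    # Keep the math library from spawning extra threads per MPI rank
    export OMP_NUM_THREADS=1
    export LD_LIBRARY_PATH=/opt/acml4.0.1/gfortran64/lib

    # Bind each rank to a core (Open MPI 1.1.x MCA parameter from the post)
    mpiexec --mca mpi_paffinity_alone 1 --np "$NPROCS" vaspmpi_barcelona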
Re: [OMPI users] Performance does not scale at all when running jobs on the same single node (Rocks, AMD Barcelona, Torque, Maui, VASP, Open MPI, Gigabit Ethernet)
Correction: the nodes file entry is compute-0-0.local np=8 (not np=4).

Also, when we set mpi_paffinity_alone 1, even though 8 threads were running, the total %CPU summed to only about 400%. For some reason, only half of the processing power of the node was being utilized; the 4 threads of the first job seemed to dominate and use most of that 400%.

Thank you.
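One way to investigate the 400% observation is to look at the CPU affinity mask of each running rank on the node. This is only a sketch, assuming the taskset utility is installed and that the VASP binary is still named vaspmpi_barcelona:

    # Run on the compute node while both 4-core jobs are active
    for pid in $(pgrep vaspmpi_barcelona); do
        printf 'PID %s: ' "$pid"
        taskset -p "$pid"       # prints the affinity mask of cores the process may use
    done

If both jobs report the same mask covering only four cores, the two jobs are being pinned on top of each other, which would be consistent with a combined 400% CPU.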
Re: [OMPI users] Performance does not scale at all when running jobs on the same single node (Rocks, AMD Barcelona, Torque, Maui, VASP, Open MPI, Gigabit Ethernet)
I think that ACML creates threads to fill all the available cores. If you run 2 instances of ACML, it will create twice as many threads as there are cores, and performance is obviously terrible. You should check the ACML documentation for the name of the environment variable that controls how many threads ACML creates; as an example, GotoBLAS uses GOTO_NUM_THREADS.

Hope it helps,
Aurelien

--
Dr. Aurélien Bouteiller
Sr. Research Associate - Innovative Computing Laboratory
Suite 350, 1122 Volunteer Boulevard
Knoxville, TN 37996
865 974 6321
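Following up on Aurelien's suggestion, a quick way to rule out (or confirm) extra library threads is to set the thread-count variables inside the job script itself, since .bashrc may not be sourced by the shells the MPI launcher starts, and then count threads while the job runs. This is a sketch, not a confirmed fix: GOTO_NUM_THREADS only matters if GotoBLAS were linked instead of ACML, and the assumption here is that a multithreaded ACML build would be OpenMP-based and therefore honor OMP_NUM_THREADS.

    # In the Torque job script, before the mpiexec line
    export OMP_NUM_THREADS=1      # assumed to be honored by OpenMP-threaded BLAS builds
    export GOTO_NUM_THREADS=1     # only relevant for GotoBLAS, mentioned as an analogy above

    # On the compute node while the job runs: count lightweight processes (threads)
    # belonging to the VASP ranks; more threads than MPI ranks would mean the
    # library is spawning its own.
    ps -eLf | grep '[v]aspmpi_barcelona' | wc -l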