Correction: the Torque nodes file entry should read compute-0-0.local np=8 (not np=4). Also, when we set mpi_paffinity_alone to 1, even though 8 threads were running, the total %CPU summed to only about 400%. For some reason only half of the node's processing power was being utilized; the 4 threads of the first job seemed to dominate and consume most of that 400%.
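One way to confirm what the ranks are actually pinned to is to read each rank's Cpus_allowed_list from /proc while both jobs are running. A minimal sketch (the binary name vaspmpi_barcelona is taken from the job script quoted below; adjust to match your processes):

```shell
# List the CPU affinity of every running VASP rank (Linux).
# If two 4-rank jobs both report "0-3", they are stacked on the same
# four cores while cores 4-7 sit idle -- consistent with ~400% total CPU.
for pid in $(pgrep -f vaspmpi_barcelona); do
    echo "pid $pid: $(grep Cpus_allowed_list /proc/$pid/status)"
done
```

taskset -cp <pid> (from util-linux) reports the same information. If mpi_paffinity_alone pins ranks by rank number starting at core 0, each job would do so independently of the other, which would explain two separately launched jobs stacking on the same four cores.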
Thank you.

On Mon, Feb 25, 2008 at 11:36 PM, Steven Truong <midai...@gmail.com> wrote:
> Dear all,
>
> We just finished installing the first batch of nodes with the
> following configuration:
>
> Machines: dual quad-core AMD 2350 + 16 GB of RAM
> OS + apps: Rocks 4.3 + Torque (2.1.8-1) + Maui (3.2.6p19-1) +
>   Open MPI (1.1.1-8) + VASP
> Interconnect: Gigabit Ethernet + Extreme Summit X450a
>
> We were able to compile VASP + Open MPI + ACML and ran a number of
> tests. For all tests that ran a _single_ job on ONE node (1/2/4/8-core
> jobs), VASP performance scaled as we expected.
>
> The problems surfaced when we tried to run multiple VASP jobs on the
> same node (e.g. two 4-core jobs on one node): performance degraded by
> roughly a factor of 2. A sample 4-core VASP test run on an otherwise
> idle node took close to 900 seconds; running two instances of the same
> job on a single node, we would see around 1700-1800 seconds per job.
> On the compute nodes, top showed all 8 threads running (~100% CPU
> each) and a load average around 8.0, occasionally up to 8.5.
>
> I thought processor and/or memory affinity needed to be specified:
>
> # ompi_info | grep affinity
>    MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
>    MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
>    MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1.1)
>
> so in my job.txt file for qsub I added mpi_paffinity_alone:
>
> ....
> mpiexec --mca mpi_paffinity_alone 1 --np $NPROCS vaspmpi_barcelona
> ....
>
> However, with or without mpi_paffinity_alone, performance is still
> unacceptably poor. With mpi_paffinity_alone set, performance was
> actually worse: we observed with top that some threads were idle a
> great deal of the time.
> We also tried running jobs without qsub and PBS, using mpirun directly
> on the nodes, and performance scaled well, just like running jobs on
> an isolated node. Weird? How could Torque + Maui cause such problems?
>
> I am wondering what I have misconfigured on my cluster: Torque? VASP?
> Maui? Open MPI? Apart from this scaling issue when jobs run through
> qsub and PBS, things are great.
>
> My users' .bashrc has these two lines:
>
> export OMP_NUM_THREADS=1
> export LD_LIBRARY_PATH=/opt/acml4.0.1/gfortran64/lib
>
> and
>
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1024
> max locked memory       (kbytes, -l) unlimited
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 4096
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> stack size              (kbytes, -s) unlimited
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 135168
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> My Torque nodes file has a single simple entry:
>
> compute-0-0.local np=4
>
> My Maui setup is a very simple one.
>
> Please give your advice and suggestions on how to resolve these
> performance issues.
>
> Thank you very much.
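A crude workaround while the scheduler/affinity interaction is being sorted out, offered only as a sketch: since child processes inherit the launcher's affinity mask, each job's mpiexec can be wrapped with taskset so that two 4-rank jobs land on disjoint cores. The core ranges below are illustrative, and mpi_paffinity_alone should be left unset here, since Open MPI's own pinning could otherwise override the inherited mask.

```shell
# Hypothetical job scripts: confine each 4-rank job to its own cores
# so two jobs cannot stack on cores 0-3 of the 8-core node.
# First job -- cores 0-3:
taskset -c 0-3 mpiexec --np 4 vaspmpi_barcelona
# Second job -- cores 4-7:
taskset -c 4-7 mpiexec --np 4 vaspmpi_barcelona
```

This only helps together with the np=8 nodes-file correction at the top of the thread, so that Torque actually schedules both 4-core jobs onto the node's eight slots.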