Correction:
compute-0-0.local np=8   (and not np=4)

Also, when we set mpi_paffinity_alone to 1, even though 8 processes
were running, the total %CPU summed to only about 400%.  For some
reason, only half of the node's processing power was being used, and
the 4 processes of the first job seemed to dominate and consume most
of that 400%.
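
A quick way to double-check the binding (just a sketch: it assumes
taskset and pgrep are installed, and uses the binary name from the
mpiexec line quoted below):

  # print the CPU affinity list of every running VASP rank
  for pid in $(pgrep -f vaspmpi_barcelona); do taskset -cp $pid; done

If both 4-core jobs report the same list (e.g. 0-3), the two jobs are
being pinned onto the same four cores, which would explain the ~400%
total.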

Thank you.

On Mon, Feb 25, 2008 at 11:36 PM, Steven Truong <midai...@gmail.com> wrote:
> Dear all.  We just finished installing the first batch of nodes with
>  the following configuration:
>  Machines: dual quad-core AMD 2350 + 16 GB of RAM
>  OS + apps: Rocks 4.3 + Torque (2.1.8-1) + Maui (3.2.6p19-1) + Open MPI
>  (1.1.1-8) + VASP
>  Interconnect: Gigabit Ethernet + Extreme Summit x450a switch
>
>  We were able to build VASP with Open MPI and ACML and ran a series of
>  tests.  In every test where we ran a _single_ job on ONE node (1-, 2-,
>  4-, or 8-core jobs), VASP performance scaled well, as we expected.
>
>  The problem surfaced when we tried to run multiple VASP jobs on the
>  same node (e.g. two 4-core jobs on one node): performance degraded by
>  roughly a factor of 2.  A sample 4-core VASP test run alone on a
>  single node (no other jobs) takes close to 900 seconds, but when we
>  run 2 instances of that same job on one node, each takes around
>  1700-1800 seconds.  On the compute node, top showed all 8 processes
>  running (~100 %CPU each) and a load average around 8.0, occasionally
>  up to 8.5.
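>
>  (In addition to top, a quick sketch of how to see which physical core
>  each rank is actually running on; "vasp" below simply matches the
>  vaspmpi_barcelona process name:
>
>    ps -eo pid,psr,pcpu,comm | grep vasp
>
>  where psr is the processor each process last ran on.)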
>
>  I thought that processor and/or memory affinity needed to be specified:
>   #ompi_info | grep affinity
>            MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
>            MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
>            MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.1.1)
>
>  and in my job.txt file for qsub, I modified it to include mpi_paffinity_alone:
>  ....
>  mpiexec --mca mpi_paffinity_alone 1  --np $NPROCS vaspmpi_barcelona
>  ....
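>
>  As a minimal sketch of what such a script looks like as a whole (the
>  #PBS line below is a generic placeholder rather than our actual
>  setting, and NPROCS is assumed to be derived from $PBS_NODEFILE):
>
>    #!/bin/bash
>    #PBS -l nodes=1:ppn=4
>    cd $PBS_O_WORKDIR
>    NPROCS=$(wc -l < $PBS_NODEFILE)
>    mpiexec --mca mpi_paffinity_alone 1  --np $NPROCS vaspmpi_barcelona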
>
>  However, with or without mpi_paffinity_alone, performance still sucks
>  pretty badly and is not acceptable.  With mpi_paffinity_alone set,
>  performance was actually worse: with top we observed that some
>  processes were idle a great deal of the time.  We also tried running
>  jobs without qsub/PBS, invoking mpirun directly on the nodes, and in
>  that case performance scaled just as well as on an isolated node.
>  Weird??  What in Torque + Maui could cause such problems?
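>
>  To compare the two cases, one debugging step we may try is dumping the
>  standard Torque environment from inside job.txt (nothing below is
>  specific to our setup):
>
>    echo "--- nodes allocated by Torque ---"
>    cat $PBS_NODEFILE
>    echo "--- limits inside the batch job ---"
>    ulimit -a
>    env | grep -E 'PBS|OMP'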
>
>  I am just wondering what I have misconfigured on my cluster: Torque?
>  VASP? Maui? Open MPI?  Apart from this scaling issue, everything works
>  great when jobs are run through qsub and PBS.
>
>  My users' .bashrc files have these 2 lines:
>  export OMP_NUM_THREADS=1
>  export LD_LIBRARY_PATH=/opt/acml4.0.1/gfortran64/lib
>
>  and
>
>  # ulimit -a
>  core file size          (blocks, -c) 0
>  data seg size           (kbytes, -d) unlimited
>  file size               (blocks, -f) unlimited
>  pending signals                 (-i) 1024
>  max locked memory       (kbytes, -l) unlimited
>  max memory size         (kbytes, -m) unlimited
>  open files                      (-n) 4096
>  pipe size            (512 bytes, -p) 8
>  POSIX message queues     (bytes, -q) 819200
>  stack size              (kbytes, -s) unlimited
>  cpu time               (seconds, -t) unlimited
>  max user processes              (-u) 135168
>  virtual memory          (kbytes, -v) unlimited
>  file locks                      (-x) unlimited
>
>  My Torque's nodes file has such a simple entry like this:
>
>  compute-0-0.local np=4
>
>  My Maui setup is a very simple one.
>
>  Please give your advice and suggestions on how to resolve these
>  performance issues.
>
>  Thank you very much.
>
