Hi,

I have been trying to run a simulation on a cluster consisting of 24 nodes with Intel(R) Xeon(R) X5670 CPUs @ 2.93 GHz. Each node has 12 processors, and the nodes are connected via 1 Gbit Ethernet and an InfiniBand interconnect. The batch system is TORQUE. However, due to some issues with the parallel queue, I have been trying to run the simulations directly on the cluster using mpdboot and mpirun (roughly as sketched below, after the mdp file). Following is the mdp.out file that I am using for the simulation:

; VARIOUS PREPROCESSING OPTIONS
; Preprocessor information: use cpp syntax.
; e.g.: -I/home/joe/doe -I/home/mary/roe
include = 
; e.g.: -DPOSRES -DFLEXIBLE (note these variable names are case sensitive)
define = -DPOSRES
; RUN CONTROL PARAMETERS
integrator = md
; Start time and timestep in ps
tinit = 0
dt = 0.002
nsteps = 250000
; For exact run continuation or redoing part of a run
init-step = 0
; Part index is updated automatically on checkpointing (keeps files separate)
simulation-part = 1
; mode for center of mass motion removal
comm-mode = Linear
; number of steps for center of mass motion removal
nstcomm = 100
; group(s) for center of mass motion removal
comm-grps = 

; LANGEVIN DYNAMICS OPTIONS
; Friction coefficient (amu/ps) and random seed
bd-fric = 0
ld-seed = 1993

; ENERGY MINIMIZATION OPTIONS
; Force tolerance and initial step-size
emtol = 10
emstep = 0.01
; Max number of iterations in relax-shells
niter = 20
; Step size (ps^2) for minimization of flexible constraints
fcstep = 0
; Frequency of steepest descents steps when doing CG
nstcgsteep = 1000
nbfgscorr = 10

; TEST PARTICLE INSERTION OPTIONS
rtpi = 0.05

; OUTPUT CONTROL OPTIONS
; Output frequency for coords (x), velocities (v) and forces (f)
nstxout = 100
nstvout = 100
nstfout = 0
; Output frequency for energies to log file and energy file
nstlog = 100
nstcalcenergy = 100
nstenergy = 100
; Output frequency and precision for .xtc file
nstxtcout = 0
xtc-precision = 1000
; This selects the subset of atoms for the .xtc file. You can
; select multiple groups. By default all atoms will be written.
xtc-grps = 
; Selection of energy groups
energygrps = 

; NEIGHBORSEARCHING PARAMETERS
; cut-off scheme (group: using charge groups, Verlet: particle based cut-offs)
cutoff-scheme = Group
; nblist update frequency
nstlist = 5
; ns algorithm (simple or grid)
ns_type = grid
; Periodic boundary conditions: xyz, no, xy
pbc = xyz
periodic-molecules = no
; Allowed energy drift due to the Verlet buffer in kJ/mol/ps per atom,
; a value of -1 means: use rlist
verlet-buffer-drift = 0.005
; nblist cut-off
rlist = 1.0
; long-range cut-off for switched potentials
rlistlong = -1
nstcalclr = -1

; OPTIONS FOR ELECTROSTATICS AND VDW
; Method for doing electrostatics
coulombtype = PME
coulomb-modifier = Potential-shift-Verlet
rcoulomb-switch = 0
rcoulomb = 1.0
; Relative dielectric constant for the medium and the reaction field
epsilon-r = 1
epsilon-rf = 0
; Method for doing Van der Waals
vdw-type = Cut-off
vdw-modifier = Potential-shift-Verlet
; cut-off lengths
rvdw-switch = 0
rvdw = 1.0
; Apply long range dispersion corrections for Energy and Pressure
DispCorr = EnerPres
; Extension of the potential lookup tables beyond the cut-off
table-extension = 1
; Separate tables between energy group pairs
energygrp-table = 
; Spacing for the PME/PPPM FFT grid
fourierspacing = 0.16
; FFT grid size, when a value is 0 fourierspacing will be used
fourier-nx = 0
fourier-ny = 0
fourier-nz = 0
; EWALD/PME/PPPM parameters
pme_order = 4
ewald-rtol = 1e-05
ewald-geometry = 3d
epsilon-surface = 0
optimize-fft = no

; IMPLICIT SOLVENT ALGORITHM
implicit-solvent = No

; GENERALIZED BORN ELECTROSTATICS
; Algorithm for calculating Born radii
gb-algorithm = Still
; Frequency of calculating the Born radii inside rlist
nstgbradii = 1
; Cutoff for Born radii calculation; the contribution from atoms
; between rlist and rgbradii is updated every nstlist steps
rgbradii = 1
; Dielectric coefficient of the implicit solvent
gb-epsilon-solvent = 80
; Salt concentration in M for Generalized Born models
gb-saltconc = 0
; Scaling factors used in the OBC GB model.
; Default values are OBC(II)
gb-obc-alpha = 1
gb-obc-beta = 0.8
gb-obc-gamma = 4.85
gb-dielectric-offset = 0.009
sa-algorithm = Ace-approximation
; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
; The value -1 will set default value for Still/HCT/OBC GB-models.
sa-surface-tension = -1

; OPTIONS FOR WEAK COUPLING ALGORITHMS
; Temperature coupling
tcoupl = V-rescale
nsttcouple = -1
nh-chain-length = 10
print-nose-hoover-chain-variables = no
; Groups to couple separately
tc-grps = Protein Non-Protein
; Time constant (ps) and reference temperature (K)
tau_t = 0.1 0.1
ref_t = 300 300
; pressure coupling
pcoupl = no
pcoupltype = Isotropic
nstpcouple = -1
; Time constant (ps), compressibility (1/bar) and reference P (bar)
tau-p = 1
compressibility = 
ref-p = 
; Scaling of reference coordinates, No, All or COM
refcoord-scaling = No

; OPTIONS FOR QMMM calculations
QMMM = no
; Groups treated Quantum Mechanically
QMMM-grps = 
; QM method
QMmethod = 
; QMMM scheme
QMMMscheme = normal
; QM basisset
QMbasis = 
; QM charge
QMcharge = 
; QM multiplicity
QMmult = 
; Surface Hopping
SH = 
; CAS space options
CASorbitals = 
CASelectrons = 
SAon = 
SAoff = 
SAsteps = 
; Scale factor for MM charges
MMChargeScaleFactor = 1
; Optimization of QM subsystem
bOPT = 
bTS = 

; SIMULATED ANNEALING
; Type of annealing for each temperature group (no/single/periodic)
annealing = 
; Number of time points to use for specifying annealing in each group
annealing-npoints = 
; List of times at the annealing points for each group
annealing-time = 
; Temp. at each annealing point, for each group.
annealing-temp = 

; GENERATE VELOCITIES FOR STARTUP RUN
gen_vel = yes
gen_temp = 300
gen_seed = -1

; OPTIONS FOR BONDS
constraints = all-bonds
; Type of constraint algorithm
constraint_algorithm = lincs
; Do not constrain the start configuration
continuation = no
; Use successive overrelaxation to reduce the number of shake iterations
Shake-SOR = no
; Relative tolerance of shake
shake-tol = 0.0001
; Highest order in the expansion of the constraint coupling matrix
lincs_order = 4
; Number of iterations in the final step of LINCS. 1 is fine for
; normal simulations, but use 2 to conserve energy in NVE runs.
; For energy minimization with constraints it should be 4 to 8.
lincs_iter = 1
; Lincs will write a warning to the stderr if in one step a bond
; rotates over more degrees than
lincs-warnangle = 30
; Convert harmonic bonds to morse potentials
morse = no

; ENERGY GROUP EXCLUSIONS
; Pairs of energy groups for which all non-bonded interactions are excluded
energygrp-excl = 

; WALLS
; Number of walls, type, atom types, densities and box-z scale factor for Ewald
nwall = 0
wall-type = 9-3
wall-r-linpot = -1
wall-atomtype = 
wall-density = 
wall-ewald-zfac = 3

; COM PULLING
; Pull type: no, umbrella, constraint or constant-force
pull = no

; ENFORCED ROTATION
; Enforced rotation: No or Yes
rotation = no

; NMR refinement stuff
; Distance restraints type: No, Simple or Ensemble
disre = No
; Force weighting of pairs in one distance restraint: Conservative or Equal
disre-weighting = Conservative
; Use sqrt of the time averaged times the instantaneous violation
disre-mixed = no
disre-fc = 1000
disre-tau = 0
; Output frequency for pair distances to energy file
nstdisreout = 100
; Orientation restraints: No or Yes
orire = no
; Orientation restraints force constant and tau for time averaging
orire-fc = 0
orire-tau = 0
orire-fitgrp = 
; Output frequency for trace(SD) and S to energy file
nstorireout = 100

; Free energy variables
free-energy = no
couple-moltype = 
couple-lambda0 = vdw-q
couple-lambda1 = vdw-q
couple-intramol = no
init-lambda = -1
init-lambda-state = -1
delta-lambda = 0
nstdhdl = 50
fep-lambdas = 
mass-lambdas = 
coul-lambdas = 
vdw-lambdas = 
bonded-lambdas = 
restraint-lambdas = 
temperature-lambdas = 
calc-lambda-neighbors = 1
init-lambda-weights = 
dhdl-print-energy = no
sc-alpha = 0
sc-power = 1
sc-r-power = 6
sc-sigma = 0.3
sc-coul = no
separate-dhdl-file = yes
dhdl-derivatives = yes
dh_hist_size = 0
dh_hist_spacing = 0.1

; Non-equilibrium MD stuff
acc-grps = 
accelerate = 
freezegrps = 
freezedim = 
cos-acceleration = 0
deform = 

; simulated tempering variables
simulated-tempering = no
simulated-tempering-scaling = geometric
sim-temp-low = 300
sim-temp-high = 300

; Electric fields
; Format is number of terms (int) and for all terms an amplitude (real)
; and a phase angle (real)
E-x = 
E-xt = 
E-y = 
E-yt = 
E-z = 
E-zt = 

; AdResS parameters
adress = no

; User defined thingies
user1-grps = 
user2-grps = 
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0

The system has 250853 atoms. I used g_tune_pme to check the performance with different numbers of processors.
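For reference, this is roughly how I start the runs by hand, since the TORQUE parallel queue is not usable (only a sketch; the host file name and the node count below are placeholders, not my actual settings):

# start an MPICH2 mpd ring on the nodes reserved for the run
mpdboot -n 4 -f ~/mpd.hosts     # mpd.hosts lists the node names, one per line
mpdtrace                        # verify that all mpds are up
# then launch mdrun_mpi directly over those nodes, e.g. on 48 cores:
mpirun -np 48 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -s 4icl.tpr -pin on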
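The tuning itself was invoked along these lines (again only a sketch; as far as I understand, g_tune_pme picks up the parallel launcher and the mdrun binary from the MPIRUN and MDRUN environment variables):

# tell g_tune_pme which launcher and which mdrun to use
export MPIRUN=$(which mpirun)
export MDRUN=/data1/shashi/localbin/gromacs/bin/mdrun_mpi
# benchmark with 48 and 160 processes; each run writes a perf.out summary
g_tune_pme -np 48 -s 4icl.tpr
g_tune_pme -np 160 -s 4icl.tpr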
Following are the perf.out files for 48 and 160 processors, respectively:

Summary of successful runs:
 Line tpr PME nodes  Gcycles Av.  Std.dev.  ns/day  PME/f  DD grid
    0   0         8      181.713     7.698   0.952  1.334   8  5  1
    1   0         6      156.720     4.086   1.104  1.420   6  7  1
    2   0         4      196.320    16.161   0.885  0.916   4 11  1
    3   0         3      195.312     1.127   0.886  0.840   3  5  3
    4   0         0      370.539    12.942   0.468      -   8  6  1
    5   0    -1( 8)      185.688     0.839   0.932  1.322   8  5  1
    6   1         8      185.651    14.798   0.934  1.294   8  5  1
    7   1         6      155.970     3.320   1.110  1.157   6  7  1
    8   1         4      177.021    15.459   0.980  1.005   4 11  1
    9   1         3      190.704    22.673   0.914  0.931   3  5  3
   10   1         0      293.676     5.460   0.589      -   8  6  1
   11   1    -1( 8)      188.978     3.686   0.915  1.266   8  5  1
   12   2         8      210.631    17.457   0.824  1.176   8  5  1
   13   2         6      171.926    10.462   1.008  1.186   6  7  1
   14   2         4      200.015     6.696   0.865  0.839   4 11  1
   15   2         3      215.013     5.881   0.804  0.863   3  5  3
   16   2         0      298.363     7.187   0.580      -   8  6  1
   17   2    -1( 8)      208.821    34.409   0.840  1.088   8  5  1
------------------------------------------------------------
Best performance was achieved with 6 PME nodes (see line 7)
Optimized PME settings:
   New Coulomb radius: 1.100000 nm (was 1.000000 nm)
   New Van der Waals radius: 1.100000 nm (was 1.000000 nm)
   New Fourier grid xyz: 80 80 80 (was 96 96 96)
Please use this command line to launch the simulation:

mpirun -np 48 mdrun_mpi -npme 6 -s tuned.tpr -pin on

Summary of successful runs:
 Line tpr PME nodes  Gcycles Av.  Std.dev.  ns/day  PME/f  DD grid
    0   0        25      283.628     2.191   0.610  1.749   5  9  3
    1   0        20      240.888     9.132   0.719  1.618   5  4  7
    2   0        16      166.570     0.394   1.038  1.239   8  6  3
    3   0         0      435.389     3.399   0.397      -  10  8  2
    4   0   -1( 20)      237.623     6.298   0.729  1.406   5  4  7
    5   1        25      286.990     1.662   0.603  1.813   5  9  3
    6   1        20      235.818     0.754   0.734  1.495   5  4  7
    7   1        16      167.888     3.028   1.030  1.256   8  6  3
    8   1         0      284.264     3.775   0.609      -   8  5  4
    9   1   -1( 16)      167.858     1.924   1.030  1.303   8  6  3
   10   2        25      298.637     1.660   0.579  1.696   5  9  3
   11   2        20      281.647     1.074   0.614  1.296   5  4  7
   12   2        16      184.012     4.022   0.941  1.244   8  6  3
   13   2         0      304.658     0.793   0.568      -   8  5  4
   14   2   -1( 16)      183.084     2.203   0.945  1.188   8  6  3
------------------------------------------------------------
Best performance was achieved with 16 PME nodes (see line 2) and original PME settings.
Please use this command line to launch the simulation:

mpirun -np 160 /data1/shashi/localbin/gromacs/bin/mdrun_mpi -npme 16 -s 4icl.tpr -pin on

Both of these outcomes (1.110 ns/day and 1.038 ns/day) are lower than what I get on my workstation with a Xeon W3550 @ 3.07 GHz using 8 threads (1.431 ns/day) for a similar system. The bench.log file generated by g_tune_pme shows very high load imbalance (>60% to 100%). I have tried several combinations of -np and -npme, but the performance always stays in this range. Can someone please tell me what I am doing wrong, or how I can decrease the simulation time?

--
Regards
Ashutosh Srivastava