Hi Don,

Somehow I thought it might not be so easy; otherwise it might have been spotted before!

Although we first spotted the problem with our own application, I did the most recent tests using the Intel MPI Benchmarks (intel_clustertools3.tar.gz) and saw the same behaviour. It might be interesting if I send you my compiled binary to test and vice versa (I can send the source too, but it should be easy to find on the Intel website). I'm not sure what the memory requirement is for the benchmark; I'd need to run it again to check. I don't think it's very big though.

As for our environment, we have 4 nodes: three X4100s, plus an X4200 as the head node. Each machine has two dual-core CPUs, so each runs 4 MPI processes. We use Sun Grid Engine for batch submission, and the queue is configured to only allow one job at a time. If I run 4 MPI processes all on the same node, I see the sustained memory loss.
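A single-node run of the benchmark would look something like this (the host name is just a placeholder; normally we let SGE pick the node, and the IMB options are the same ones used in the batch traces further down this thread):

   mpirun --mca btl self,sm,tcp,udapl -np 4 -host node1 ./IMB-MPI1.ct7.studio12 -npmin 4 -multi 1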

I would be interested to know the environment you are running on: what hardware, compiler versions, SGE version, OS & patch level.

Hope that helps. Sorry for the delay in getting back to you; I've been away for one reason or another. I'm going to be away as of tomorrow for another week. After the holiday season, it's conference season!

Best wishes,
                       Glenn


Glenn,

This will require some more investigation. I have verified that the
udapl btl is making the proper calls to free registered memory, and
although I have seen the free memory as listed by vmstat drop, I see it
come back as well. Additionally, if I run a basic bandwidth test
serially (one job at a time), I see the vmstat number go down and come
back. If I run the program many times simultaneously, the vmstat free
number goes down and then comes back up. Whether it comes back all the
way is not entirely clear, because there does seem to be a lazy
component to the releasing of locked pages, which is the part I need to
investigate further. I am not seeing a sustained, continuous drop.

I wonder if you could tell me more about the environment. Number of MPI
jobs running simultaneously? Size of the job(s)? Is your code something
you can share?  Reproducing what you are seeing is my intent.

-DON
p.s. I will not be checking email or working on this again until the
week of August 27 as I am taking a little vacation.

Glenn Carver wrote:

Don,

Following up on this, here are the results of the tests. All is well
until udapl is included. In addition, there are no MCA parameters set
in these jobs. As I reported to you before, if I add --mca
btl_udapl_flags=1, the memory problem goes away.
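For reference, the workaround invocation is just the same mpirun command as in the traces below with one extra --mca setting, i.e. something like:

mpirun --mca btl self,sm,tcp,udapl --mca btl_udapl_flags 1 -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1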

The batch jobs run vmstat before and after the mpirun command. Here's
the appropriate part of the batch output from the 3 tests. The
problem is highlighted by the difference in the 'free' column
reported by vmstat before and after mpirun. You'll notice a drop of
about 145 MB in the case with 'btl self,sm,tcp,udapl'.
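
For completeness, the batch scripts just bracket mpirun with vmstat under "set -x" (hence the '+' traces); roughly like this, where the SGE directives are illustrative rather than our exact ones:

#!/bin/sh
#$ -pe mpi 16    # illustrative parallel-environment request for 16 slots
#$ -cwd
set -x           # echo each command, producing the '+' lines below
vmstat 3 3       # free memory before the run
mpirun --mca btl self,sm,tcp,udapl -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
vmstat 3 3       # free memory after the run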

Regards,   Glenn.

======== btl self,tcp
+ vmstat 3 3
  kthr      memory            page            disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
 0 0 0 6923680 2189060 6 97 5 0 0 0 15 0 0 0 1 3809 369393 2324 27 10 62
 0 0 0 6803144 1964320 1 22 0 0 0 0 0 0 0 0 0 587 388 184 0 0 100
 0 0 0 6803112 1964292 0 0 0 0 0 0 0 0 0 0 0 442 329 144 0 0 100
+ mpirun --mca btl self,tcp -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
+ vmstat 3 3
  kthr      memory            page            disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
 0 0 0 6780740 2144660 6 98 5 0 0 0 14 0 0 0 1 5145 455335 3147 27 14 59
 0 0 0 6799020 1959984 3 31 0 0 0 0 0 0 0 0 0 640 358 268 0 0 100
 0 0 0 6799012 1959980 0 0 0 0 0 0 0 0 0 0 0 432 305 128 0 0 100

========== btl self,sm,tcp
+ vmstat 3 3
  kthr      memory            page            disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
 0 0 0 9038736 2291420 7 107 6 0 0 0 20 0 0 0 1 2445 164773 1373 28 7 65
 0 0 0 9084592 2149496 1 22 0 0 0 0 0 0 0 0 0 537 343 170 0 0 100
 0 0 0 9084580 2149488 0 0 0 0 0 0 0 0 0 0 0 527 357 168 0 0 100
+ mpirun --mca btl self,sm,tcp -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
+ vmstat 3 3
  kthr      memory            page            disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
 0 0 0 8879504 2239168 7 106 6 0 0 0 18 0 0 0 1 4205 416635 2470 29 12 60
 0 0 0 9079008 2143824 3 32 0 0 0 0 0 0 0 0 0 648 358 279 0 0 100
 0 0 0 9079000 2143820 0 0 0 0 0 0 0 0 0 0 0 433 327 133 0 0 100

========= btl self,sm,tcp,udapl
+ vmstat 3 3
  kthr      memory            page            disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
 0 0 0 6771044 2134784 6 101 5 0 0 0 14 0 0 0 1 5060 447191 3094 28 14 58
 0 0 0 6799340 1960104 1 22 0 0 0 0 0 0 0 0 0 538 320 164 0 0 100
 0 0 0 6799328 1960096 0 0 0 0 0 0 0 0 0 0 0 439 321 139 0 0 100
+ mpirun --mca btl self,sm,tcp,udapl -np 16 ./IMB-MPI1.ct7.studio12 -npmin 16 -map 4x4 -multi 1
+ vmstat 3 3
  kthr      memory            page            disk          faults      cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s5 in sy cs us sy id
 0 0 0 6726824 2120420 6 105 4 0 0 0 13 0 0 0 1 4967 438387 3035 29 14 57
 0 0 0 6654032 1814788 3 31 0 0 0 0 0 0 0 0 0 656 457 284 0 0 100
 0 0 0 6654024 1814784 0 0 0 0 0 0 0 0 0 0 0 453 336 146 0 0 100



Glenn,

While I look into the possibility of registered memory not being freed,
could you run your same tests but without shared memory or udapl:

"--mca btl self,tcp"

If this is successful, i.e. it frees memory as expected, the next step
would be to run including shared memory, "--mca btl self,sm,tcp". If
this is successful, the last step would be to add in udapl, "--mca btl
self,sm,udapl".

-DON

Glenn Carver wrote:

Just to clarify, the MPI applications exit cleanly. We have our own
f90 code (in various configurations) and I'm also testing using
Intel's IMB. If I watch the applications whilst they run, there is a
drop in free memory as our code begins, and free memory then steadily
drops as the code runs. When it exits normally, free memory increases
but falls short of where it was before the code started. The longer
we run the code, the bigger the final drop in memory. Taking the
machine down to single-user mode doesn't help, so it's not an issue of
processes still running. Neither can I find any files still open with
lsof.
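
In case it helps, the sort of checks I'm running are along these lines (the grep pattern is just a placeholder for our executable name):

vmstat 5                  # watch the 'free' column while the job runs and after it exits
lsof | grep <our_binary>  # look for files left open after the job has finished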

We installed Sun's ClusterTools 6 (not based on Open MPI) and we don't
see the same problem. I'm currently testing whether setting
btl_udapl_flags=1 makes a difference. I'm guessing that registered
memory is leaking? We're also trying some MCA parameters to turn off
features we don't need, to see if that makes a difference. I'll
report back on point 2 below and further tests later. If there are
specific MCA parameters you'd like to see set, let me know.

Thanks, Glenn




Guess I don't see how stale shared memory files would cause swapping to
occur. Besides, the user provided no indication that the applications were
abnormally terminating, which makes it likely we cleaned up the session
directories as we should.

However, we definitely leak memory (i.e., we don't free all memory we malloc
while supporting execution of an application), so if the OS isn't cleaning
up after us, it is quite possible we could be causing the problem as
described. It would appear exactly as described - a slow leak that gradually
builds up until the "dead" area is so big that it forces applications to
swap to find enough room to work.

So I guess we should ask for clarification:

1. are the Open MPI applications exiting cleanly? Do you see any stale
"orted" executables still in the process table?
2. can you check the temp directory where we would be operating? This is
usually your /tmp directory, unless you specified some other location. Look
for our session directories - they have a name that includes "openmpi" in
them. Are they being cleaned up (i.e., removed) when the applications exit?
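
Something along these lines should show both (the exact session-directory name varies, but it will contain "openmpi"):

ps -ef | grep orted      # any stale Open MPI daemons still in the process table?
ls -ld /tmp/*openmpi*    # session directories that should have been removed on exit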

Thanks
Ralph


On 8/6/07 5:53 PM, "Jeff Squyres" <jsquy...@cisco.com> wrote:



 Unless there's something weird going on in the Solaris kernel, the
 only memory that we should be leaking after MPI processes exit would
 be shared memory files that are [somehow] not getting removed properly.
 Right?


 On Aug 6, 2007, at 8:15 AM, Ralph H Castain wrote:

 Hmmm...just to clarify as I think there may be some confusion here.

 Orte-clean will kill any outstanding Open MPI daemons (which should
 kill their local apps) and will clean up their associated temporary
 file systems. If you are having problems with zombied processes or
 stale daemons, then this will hopefully help (it isn't perfect, but it
 helps).

 However, orte-clean will not do anything about releasing memory that
 has been "leaked" by Open MPI. We don't have any tools for doing that,
 I'm afraid.


 On 8/6/07 8:08 AM, "Don Kerr" <don.k...@sun.com> wrote:

 Glenn,

 With CT7 there is a utility which can be used to clean up left-over
 cruft from stale MPI processes.

 % man -M /opt/SUNWhpc/man -s 1 orte-clean

 Warning: this will remove currently running jobs as well. Use of
 "-v" for verbose output is recommended.

 I would be curious if this helps.

 -DON
 p.s. orte-clean does not exist in the ompi v1.2 branch; it is in the
 trunk, but I think there is an issue with it currently.

 Ralph H Castain wrote:

 On 8/5/07 6:35 PM, "Glenn Carver" <glenn.car...@atm.ch.cam.ac.uk> wrote:


 I'd appreciate some advice and help on this one.  We're having
 serious problems running parallel applications on our cluster.
 After each batch job finishes, we lose a certain amount of available
 memory. Additional jobs cause free memory to gradually go down until
 the machine starts swapping and becomes unusable or hangs. Taking the
 machine to single-user mode doesn't restore the memory; only a reboot
 returns all available memory. This happens on all our nodes.

 We've been doing some testing to try to pin the problems down,
 although we still don't fully know where the problem is coming from.
 We have ruled out our applications (Fortran codes); we see the same
 behaviour with Intel's IMB. We know it's not a network issue, as a
 parallel job running solely on the 4 cores on each node produces the
 same effect. All nodes have been brought up to the very latest OS
 patches and we still see the same problem.

 Details: we're running Solaris 10/06, Sun Studio 12, ClusterTools 7
 (Open MPI 1.2.1) and Sun Grid Engine 6.1. Hardware is Sun X4100/X4200.
 Kernel version: SunOS 5.10 Generic_125101-10 on all nodes.

 I read in the release notes that a number of memory leaks were fixed
 for the 1.2.1 release, but none have been noticed since, so I'm not
 sure where the problem might be.


 I'm not sure where that claim came from, but it is certainly not true
 that we haven't noticed any leaks since 1.2.1. We know we have quite a
 few memory leaks in the code base, many of which are small in
 themselves but can add up depending upon exactly what the application
 does (i.e., which code paths it travels). Running a simple hello_world
 app under valgrind will show significant unreleased memory.
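
 For example, on a platform where valgrind is available, something
 like this with any trivial MPI hello_world program will show the
 unreleased allocations:

 mpicc hello.c -o hello
 mpirun -np 2 valgrind --leak-check=full ./hello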

 I doubt you will see much, if any, improvement in 1.2.4. There have
 probably been a few patches applied, but a comprehensive effort to
 eradicate the problem has not been made. It is something we are trying
 to clean up over time, but it hasn't been a crash priority as most OSes
 do a fairly good job of cleaning up when the app completes.



 My next move is to try the very latest release (probably the 1.2.4
 pre-release). As CT7 is built with Sun Studio 11 rather than 12,
 which we're using, I might also try downgrading. At the moment we're
 rebooting our cluster nodes every day to keep things going. So any
 suggestions are appreciated.

 Thanks,        Glenn




 $ ompi_info
                 Open MPI: 1.2.1r14096-ct7b030r1838
    Open MPI SVN revision: 0
                 Open RTE: 1.2.1r14096-ct7b030r1838
    Open RTE SVN revision: 0
                     OPAL: 1.2.1r14096-ct7b030r1838
        OPAL SVN revision: 0
                   Prefix: /opt/SUNWhpc/HPC7.0
  Configured architecture: i386-pc-solaris2.10
            Configured by: root
            Configured on: Fri Mar 30 13:40:12 EDT 2007
           Configure host: burpen-csx10-0
                 Built by: root
                 Built on: Fri Mar 30 13:57:25 EDT 2007
               Built host: burpen-csx10-0
               C bindings: yes
             C++ bindings: yes
       Fortran77 bindings: yes (all)
       Fortran90 bindings: yes
  Fortran90 bindings size: trivial
               C compiler: cc
      C compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/cc
             C++ compiler: CC
    C++ compiler absolute: /ws/ompi-tools/SUNWspro/SOS11/bin/CC
       Fortran77 compiler: f77
   Fortran77 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f77
       Fortran90 compiler: f95
   Fortran90 compiler abs: /ws/ompi-tools/SUNWspro/SOS11/bin/f95
              C profiling: yes
            C++ profiling: yes
      Fortran77 profiling: yes
      Fortran90 profiling: yes
           C++ exceptions: yes
           Thread support: no
   Internal debug support: no
      MPI parameter check: runtime
 Memory profiling support: no
 Memory debugging support: no
          libltdl support: yes
    Heterogeneous support: yes
  mpirun default --prefix: yes
            MCA backtrace: printstack (MCA v1.0, API v1.0,
 Component v1.2.1)
            MCA paffinity: solaris (MCA v1.0, API v1.0, Component
 v1.2.1)
            MCA maffinity: first_use (MCA v1.0, API v1.0,
 Component v1.2.1)
               MCA timer: solaris (MCA v1.0, API v1.0, Component
 v1.2.1)
            MCA allocator: basic (MCA v1.0, API v1.0, Component
 v1.0)
            MCA allocator: bucket (MCA v1.0, API v1.0, Component
 v1.0)
                 MCA coll: basic (MCA v1.0, API v1.0, Component
 v1.2.1)
                 MCA coll: self (MCA v1.0, API v1.0, Component
 v1.2.1)
                 MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.1)
                 MCA coll: tuned (MCA v1.0, API v1.0, Component
 v1.2.1)
                   MCA io: romio (MCA v1.0, API v1.0, Component
 v1.2.1)
                MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.1)
                MCA mpool: udapl (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.1)
                  MCA pml: ob1 (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.1)
               MCA rcache: rb (MCA v1.0, API v1.0, Component v1.2.1)
               MCA rcache: vma (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA btl: self (MCA v1.0, API v1.0.1, Component
 v1.2.1)
                  MCA btl: sm (MCA v1.0, API v1.0.1, Component
 v1.2.1)
                  MCA btl: tcp (MCA v1.0, API v1.0.1, Component
 v1.0)
                  MCA btl: udapl (MCA v1.0, API v1.0, Component
 v1.2.1)
                 MCA topo: unity (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA osc: pt2pt (MCA v1.0, API v1.0, Component
 v1.2.1)
               MCA errmgr: hnp (MCA v1.0, API v1.3, Component
 v1.2.1)
               MCA errmgr: orted (MCA v1.0, API v1.3, Component
 v1.2.1)
               MCA errmgr: proxy (MCA v1.0, API v1.3, Component
 v1.2.1)
                  MCA gpr: null (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA gpr: proxy (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA gpr: replica (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA iof: proxy (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA iof: svc (MCA v1.0, API v1.0, Component
 v1.2.1)
                   MCA ns: proxy (MCA v1.0, API v2.0, Component
 v1.2.1)
                   MCA ns: replica (MCA v1.0, API v2.0, Component
 v1.2.1)
                  MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
                  MCA ras: dash_host (MCA v1.0, API v1.3,
 Component v1.2.1)
                  MCA ras: gridengine (MCA v1.0, API v1.3,
 Component v1.2.1)
                  MCA ras: localhost (MCA v1.0, API v1.3,
 Component v1.2.1)
                  MCA ras: tm (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA rds: hostfile (MCA v1.0, API v1.3,
 Component v1.2.1)
                  MCA rds: proxy (MCA v1.0, API v1.3, Component
 v1.2.1)
                  MCA rds: resfile (MCA v1.0, API v1.3, Component
 v1.2.1)
                MCA rmaps: round_robin (MCA v1.0, API v1.3,
 Component v1.2.1)
                 MCA rmgr: proxy (MCA v1.0, API v2.0, Component
 v1.2.1)
                 MCA rmgr: urm (MCA v1.0, API v2.0, Component
 v1.2.1)
                  MCA rml: oob (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA pls: gridengine (MCA v1.0, API v1.3,
 Component v1.2.1)
                  MCA pls: proxy (MCA v1.0, API v1.3, Component
 v1.2.1)
                  MCA pls: rsh (MCA v1.0, API v1.3, Component
 v1.2.1)
                  MCA pls: tm (MCA v1.0, API v1.3, Component v1.2.1)
                  MCA sds: env (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA sds: pipe (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA sds: seed (MCA v1.0, API v1.0, Component
 v1.2.1)
                  MCA sds: singleton (MCA v1.0, API v1.0,
 Component v1.2.1)