Hi Don,
Somehow I thought it might not be that easy... otherwise it would
have been spotted before!
Although we first spotted the problem with our own application, I did
the most recent tests using the Intel MPI Benchmarks
(intel_clustertools3.tar.gz) and saw the same behaviour. It might be ...
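For reference, the kind of invocation I have been using (paths and
process counts here are just examples; IMB-MPI1 is the benchmark
binary built from that tarball):
% mpirun -np 16 --mca btl self,sm,udapl ./IMB-MPI1 Sendrecv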
Glenn,
This will require some more investigation. I have verified that the
udapl btl is making the proper calls to free registered memory, and
though I have seen the free memory as listed by vmstat drop, I have
also seen it come back. Additionally, if I run a basic bandwidth test
serially (one ...
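In case it helps to compare notes, I have simply been watching the
"free" column of vmstat alongside the run, e.g. (interval in seconds,
standard Solaris vmstat output):
% vmstat 5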
Don,
Following up on this, here are the results of the tests. All is well
until udapl is included. In addition, there are no MCA parameters set
in these jobs. As I reported to you before, if I add --mca
btl_udapl_flags=1, the memory problem goes away.
The batch jobs run vmstat before and after ...
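Concretely, the run that behaves itself looks like the following (the
binary name is just a stand-in for our application; if I've understood
the parameter correctly, a flags value of 1 restricts udapl to
send/receive and disables RDMA):
% mpirun -np 16 --mca btl_udapl_flags 1 ./model.exe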
Glenn,
While I look into the possibility of registered memory not being
freed, could you run your same tests but without shared memory or
udapl:
"--mca btl self,tcp"
If this is successful, i.e. it frees memory as expected, the next step
would be to run including shared memory, "--mca btl self,s ...
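Concretely, something like the following two runs, with your usual
binary in place of ./a.out:
% mpirun -np 16 --mca btl self,tcp ./a.out
% mpirun -np 16 --mca btl self,sm,tcp ./a.out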
I will run some tests to check out this possibility.
-DON
Jeff Squyres wrote:
I guess this is a question for Sun: what happens if registered memory
is not freed after a process exits? Does the kernel leave it allocated?
On Aug 6, 2007, at 7:00 PM, Glenn Carver wrote:
Just to clarify, the MPI applications exit cleanly. We have our own
f90 code (in various configurations) and I'm also testing using
Intel's IMB. If I watch the applications whilst they run, there is a
drop in free memory as our code begins; free memory then continues to
fall steadily as the code runs.
Guess I don't see how stale shared memory files would cause swapping to
occur. Besides, the user provided no indication that the applications were
abnormally terminating, which makes it likely we cleaned up the session
directories as we should.
However, we definitely leak memory (i.e., we don't free ...
Unless there's something weird going on in the Solaris kernel, the
only memory that we should be leaking after MPI processes exit would
be shared memory files that are [somehow] not getting removed properly.
Right?
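A quick sanity check for that theory would be to look for stale
session directories on each node after a job exits; by default they
live under /tmp, though the exact naming varies by version:
% ls -ld /tmp/openmpi-sessions-*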
On Aug 6, 2007, at 8:15 AM, Ralph H Castain wrote:
Hmmm...just to clarify as I think there may be some confusion here.
Orte-clean will kill any outstanding Open MPI daemons (which should kill
their local apps) and will clean up their associated temporary file systems.
If you are having problems with zombied processes or stale daemons, then
this will ...
Glenn,
With CT7 there is a utility that can be used to clean up leftover
cruft from stale MPI processes.
% man -M /opt/SUNWhpc/man -s 1 orte-clean
Warning: this will remove currently running jobs as well. Use of "-v"
for verbose output is recommended.
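e.g., run it on each node (path assumed from a default CT7 install):
% /opt/SUNWhpc/HPC7.0/bin/orte-clean -v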
I would be curious if this helps.
-DON
p.s. or ...
On 8/5/07 6:35 PM, "Glenn Carver" wrote:
I'd appreciate some advice and help on this one. We're having
serious problems running parallel applications on our cluster. After
each batch job finishes, we lose a certain amount of available
memory. Additional jobs cause free memory to gradually go down until
the machine starts swapping and ...
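On Solaris 10, one way to see where the pages have actually gone is
the kernel debugger's memstat dcmd (needs root; the exact output
format varies by release):
% echo ::memstat | mdb -k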