On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
This script is not using mdrun -append.
-append is the default, it doesn't need to be explicitly listed.
Ah yes, very true.
Your original post suggested the use of -append was a problem. Why
aren't we seeing a script with mdrun -append? Also, please provide
the full script - it looks like there might be a loop around your
tpbconv-then-mdrun fragment.
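For concreteness, a minimal sketch of the kind of fragment I mean - the file
names, binary name and core count here are guesses, not taken from your script:
===========================
# Hypothetical restart fragment; run1.tpr, mdrun_mpi and -np 8 are assumed.
# Extend the run length recorded in the tpr, then continue from the checkpoint.
tpbconv -s run1.tpr -extend 100000 -o run1_ext.tpr
mpirun -np 8 mdrun_mpi -s run1_ext.tpr -cpi run1.cpt -deffnm run1
===========================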
There is no loop; this is a job script with PBS directives. The header
of it looks like:
===========================
#!/bin/bash
#$ -S /bin/bash
#$ -pe mpich 8
#$ -ckpt reloc
#$ -l mem_total=6G
===========================
as usual submitted by:
qsub -N aaaa myjob.q
Note that a useful trouble-shooting technique can be to construct
your command line in a shell variable, echo it to stdout
(redirected as suitable) and then execute the contents of the
variable. Now, nobody has to parse a shell script to know what
command line generated what output, and it can be co-located with
the command's stdout.
I somewhat understand your point, but could you give an example, if you
think it is really necessary?
It's just generally helpful if your stdout has "mpirun -np 8
/path/to/mdrun_mpi -deffnm run_4 -cpi run_4" at the top of it so that
you have a definitive record of what you did under the environment that
existed at the time of execution.
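A minimal sketch of that technique, reusing the names above:
===========================
# Build the command in a shell variable, record it, then execute it.
cmd="mpirun -np 8 /path/to/mdrun_mpi -deffnm run_4 -cpi run_4"
echo "$cmd"    # recorded in the job's stdout, next to mdrun's own output
eval "$cmd"
===========================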
As I said, the queue works like this: you submit the job, it finds an
empty node, and it starts there; seconds later another user with
higher privileges on that particular node submits a job, his job kicks
out mine, and my job goes back into the queue; it finds another empty
node, starts there, then another user with high privileges on that
node submits a job, which kicks out my job again, and the cycle
repeats itself ... theoretically, it could continue forever, depending
on how many empty nodes there are and where they are, if any.
You've said that *now* - but previously you've said nothing about why
you were getting lots of restarts. In my experience, PBS queues suspend
jobs rather than deleting them, in order that resources are not wasted.
Apparently other places do things this way. I think that this
information is highly relevant to explaining your observations.
These many restarts suggest that the queue was full of relatively
short jobs run by users with high privileges. Technically, I cannot
see why the same processes should be running simultaneously, because
at any instant my job is either running on one node or sitting in
the queue.
I/O can be buffered such that the termination of the process and the
completion of its I/O are asynchronous. Perhaps it *shouldn't* be that
way, but this is a problem for the administrators of your cluster to
address. They know how the file system works. If the next job executes
before the old one has finished output, then I think the symptoms you
observe might be possible.
Note that there is nothing GROMACS can do about that, unless somehow
GROMACS can apply a lock in the first mdrun that is respected by your
file system such that a subsequent mdrun cannot open the same file until
all pending I/O has completed. I'd expect proper HPC file systems do
that automatically, but I don't really know.
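If you wanted to experiment, a hedged sketch of such a guard in the job script
could use flock(1) - assuming your file system actually honours advisory locks
across nodes, which NFS and some parallel file systems do not guarantee:
===========================
# Untested sketch: hold an exclusive advisory lock on run1.lock while mdrun
# runs, so a restarted job blocks until the previous holder has released it.
(
  flock -x 9
  mpirun -np 8 mdrun_mpi -deffnm run1 -cpi run1
) 9> run1.lock
===========================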
From md-1-2360.out:
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)
Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
Loaded with Money
Making 2D domain decomposition 4 x 2 x 1
WARNING: This run will generate roughly 4915 Mb of data
starting mdrun 'run1'
100000000 steps, 200000.0 ps (continuing from step 51879590,
103759.2 ps).
=====================================
These aren't showing anything other than that the restart is
coming from the same point each time.
And from the last generated output md-1-2437.out (I think I
killed the job at that point because of the above observed behavior):
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)
=====================================
I have at least 5-6 additional examples like this one. In some of
them the *.xtc file does have a size greater than zero, yet is still
very small, and it starts from some random frame (for example, in
one of the cases it contains frames from ~91000 ps to ~104000 ps,
but all frames before 91000 ps are missing).
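For reference, the surviving frame range in such a file can be listed with
gmxcheck (file name illustrative):
===========================
# Report the times of the first/last frames, and any corruption, in an xtc:
gmxcheck -f run1.xtc
===========================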
I think that demonstrating a problem requires that the set of
output files were fine before one particular restart, and weird
afterwards. I don't think we've seen that yet.
I don't understand your point here. I am providing you with all the info I
have. I am showing the output files of 3 restarts, and they are
different in the sense that the last two did not progress far enough
before another job restart occurred. The first was fine before the
restart, and the others were not exactly fine after the restart. At
this point I realize that what I call a "restart" and what you call a
"restart" might be two different things. And that may be where the
problem lies.
I realize there might be another problem, but the bottom line is
that there is no mechanism that can prevent this from happening
if many restarts are required, and particularly if the time
between these restarts is likely to be short (distributed
computing could easily satisfy this condition).
Any suggestions, particularly on the path of least resistance
to regenerating the missing data? :)
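The best I can come up with myself (file names assumed, untested) is to
continue from the last intact checkpoint under fresh output names, so nothing
already on disk can be overwritten, and join the pieces afterwards:
===========================
# Continue from the surviving checkpoint, writing to new file names:
mpirun -np 8 mdrun_mpi -s run1.tpr -cpi run1.cpt -deffnm run1_recover
# Later, concatenate the trajectory pieces:
trjcat -f run1.xtc run1_recover.xtc -o run1_full.xtc
===========================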
Using the checkpoint capability & appending make sense when
many restarts are expected, but unfortunately it is exactly
then when these options completely fail! As a new user of
Gromacs, I must say I am disappointed, and would like to
obtain an explanation of why the usage of these options is
clearly stated to be safe when it is not, and why the append
option is the default, and why at least a single warning has
not been posted anywhere in the docs & manuals?
I can understand and sympathize with your frustration if
you've experienced the loss of a simulation. Do be careful
when suggesting that others' actions are blame-worthy, however.
I have never suggested this. As a user, I am entitled to ask.
Sure. However, talking about something that can "completely fail"
This is a fact, backed up by my evidence => I don't see anything bad
directed at anybody.
which makes you "disappointed"
This is me being honest => again not related to anybody else.
and wanting to "obtain an explanation"
Well, this one is even funny :) - many people want this, especially in
science. Is that bad?
about why something doesn't work as stated and lacks "a single
warning"
Again a fact => again nothing bad here.
suggests that someone has done something less than appropriate
This is a completely personal interpretation, and I am personally not
responsible for how people perceive information. For some reason unknown
to me, you have moved into a very defensive mode. What could I do?
, and so blame-worthy. It also assumes that the actions of a new
user were correct, and the actions of a developer with long
experience were not.
Sorry, this is too much. Where was this suggested? It seems to me you
took it too personally.
This may or may not prove to be true. Starting such a discussion
from a conciliatory (rather than antagonistic) stance is usually
more productive. The shared objective should be to fix the
problem, not prove that someone did something wrong.
Agreed, and I did. Again, your perception does not seem to be
correlated with my intended approach.
Words are open to interpretation. Communicating well requires that you
consider the impact of your words on your reader. You want people who
can address the problem to want to help. You don't want them to feel
defensive about the situation - whether you think that would be an
over-reaction or not.
An alternative way of wording your paragraph could have been:
"Using the checkpoint capability & appending make sense when many
restarts are expected, however I observe that under such
circumstances this capability can fail. I am a new user of
GROMACS, might I have been using them incorrectly? Are the
developers aware of any situations under which the capability is
unreliable? If so, should the default behaviour be different, and
should this issue be documented somewhere?"
This is helpful, but again a bit too much. I don't tell you how to
write; please do the same.
OK, but the tone of what you write determines whether people will
respond, no matter how important your message is.
Moreover, how could I ask questions whose answers were mostly known
to me before I sent my post?
These are the same ideas about which you asked in your original post.
And since my questions were not clearly answered, I will repeat
them in a structured way:
1. Why is the usage of these options (-cpi and -append) clearly
stated to be safe when in fact it is not?
Because they are believed to be safe. Jussi's suggestion about
file locking may have merit.
2. Why have you made the -append option the default in the most
current GMX versions?
Because it's the most convenient mode of operation.
3. Why has not a single warning been posted anywhere in the docs
& manuals? (This question is somewhat clear - because you did not
know about such a problem; but people say "ignorance of the law
excuses no one", meaning that failing to post a warning about
something you were not 100% certain would be error-free cannot be
an excuse.)
Because no-one is aware of a problem to warn about.
No, people are aware; they just do not think it is a problem, because
there is an easy work-around (-noappend), although it is not as
convenient and clean. Ask the users of Gromacs on the Condor
distributed grid.
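For example, a sketch of that work-around (names illustrative):
===========================
# Each (re)start writes numbered part files instead of appending:
mpirun -np 8 mdrun_mpi -deffnm run1 -cpi run1 -noappend
# giving run1.part0002.xtc, run1.part0003.xtc, ... to be joined at the end:
trjcat -f run1.part*.xtc -o run1_whole.xtc
===========================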
You asked why there was no warning in the documentation - that's because
no-one who can fix the documentation is aware of a problem. If
Condor-using people want to keep using a work-around and not
communicate, that's their prerogative. But if the issue isn't
communicated, then it isn't going to be documented, whether it's a real
issue or not.
I am blame-worthy - for blindly believing what was written in the
manual without taking the necessary precautions. Lesson learned.
However, developers' time rarely permits addressing "feature
X doesn't work, why not?" in a productive way. Solving bugs
can be hard, but will be easier (and solved faster!) if the
user who thinks a problem exists follows good procedure. See
http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
Implying that I did not follow a certain procedure for a certain
problem, without knowing what my initial intention was, is just
speculation.
I don't follow your point. If your intent is to get the problem
fixed, the advice on that web page is useful.
My intent was clearly stated before, but for the sake of
clarification, let me repeat it:
1. To let you know about the existence of such a problem.
Great. So far, I think it's an artefact of the combination of your PBS
and file system configuration.
2. To find out why I encountered the problem, although I have read and
followed all of the Gromacs documentation related to the features
I used.
As above - it's not really the fault of GROMACS. I don't know if a
better solution exists.
3. To somewhat improve the way the documentation is written.
OK, I will add a short note to mdrun -h noting that there exist
execution environments where timing of file access across separate
GROMACS processes might be a problem.
Mark