On 5/06/2011 12:31 PM, Dimitar Pachov wrote:
This script is not using mdrun -append.
-append is the default, it doesn't need to be explicitly listed.
Ah yes, very true.
Your original post suggested the use of -append was a problem. Why
aren't we seeing a script with mdrun -append? Also, please provide
the full script - it looks like there might be a loop around your
tpbconv-then-mdrun fragment.
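For concreteness, a minimal sketch of the kind of fragment I mean - the file
names, binary name and core count here are guesses, not taken from your script:
===========================
# Hypothetical restart fragment; run1.tpr, mdrun_mpi and -np 8 are assumed.
# Extend the run length recorded in the tpr, then continue from the checkpoint.
tpbconv -s run1.tpr -extend 100000 -o run1_ext.tpr
mpirun -np 8 mdrun_mpi -s run1_ext.tpr -cpi run1.cpt -deffnm run1
===========================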
There is no loop; this is a job script with PBS directives. The header
of it looks like:
===========================
#!/bin/bash
#$ -S /bin/bash
#$ -pe mpich 8
#$ -ckpt reloc
#$ -l mem_total=6G
===========================
as usual submitted by:
qsub -N aaaa myjob.q
Note that a useful trouble-shooting technique can be to construct
your command line in a shell variable, echo it to stdout
(redirected as suitable) and then execute the contents of the
variable. Now, nobody has to parse a shell script to know what
command line generated what output, and it can be co-located with
the command's stdout.
I somewhat understand your point, but could you give an example, if you
think it is really necessary?
It's just generally helpful if your stdout has "mpirun -np 8
/path/to/mdrun_mpi -deffnm run_4 -cpi run_4" at the top of it so that
you have a definitive record of what you did under the environment that
existed at the time of execution.
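A minimal sketch of that technique, reusing the names above:
===========================
# Build the command in a shell variable, record it, then execute it.
cmd="mpirun -np 8 /path/to/mdrun_mpi -deffnm run_4 -cpi run_4"
echo "$cmd"    # recorded in the job's stdout, next to mdrun's own output
eval "$cmd"
===========================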
As I said, the queue works like this: you submit the job, it finds an
empty node, and it starts there; seconds later another user with
higher privileges on that particular node submits a job, his job kicks
out mine, and my job goes back into the queue; it finds another empty
node, starts there, then another user with high privileges on that
node submits a job, which kicks out my job again, and the cycle
repeats itself ... theoretically, it could continue forever, depending
on how many empty nodes there are and where they are, if any.
You've said that *now* - but previously you've said nothing about why
you were getting lots of restarts. In my experience, PBS queues suspend
jobs rather than deleting them, in order that resources are not wasted.
Apparently other places do things this way. I think that this
information is highly relevant to explaining your observations.
These many restarts suggest that the queue was full of relatively
short jobs run by users with high privileges. Technically, I cannot
see why the same processes should be running simultaneously, because
at any instant my job is either running on one node or sitting in
the queue.
I/O can be buffered such that the termination of the process and the
completion of its I/O are asynchronous. Perhaps it *shouldn't* be that
way, but this is a problem for the administrators of your cluster to
address. They know how the file system works. If the next job executes
before the old one has finished output, then I think the symptoms you
observe might be possible.
Note that there is nothing GROMACS can do about that, unless somehow
GROMACS can apply a lock in the first mdrun that is respected by your
file system such that a subsequent mdrun cannot open the same file until
all pending I/O has completed. I'd expect proper HPC file systems do
that automatically, but I don't really know.
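If you wanted to experiment, a hedged sketch of such a guard in the job script
could use flock(1) - assuming your file system actually honours advisory locks
across nodes, which NFS and some parallel file systems do not guarantee:
===========================
# Untested sketch: hold an exclusive advisory lock on run1.lock while mdrun
# runs, so a restarted job blocks until the previous holder has released it.
(
  flock -x 9
  mpirun -np 8 mdrun_mpi -deffnm run1 -cpi run1
) 9> run1.lock
===========================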
From md-1-2360.out:
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)
Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011
Loaded with Money
Making 2D domain decomposition 4 x 2 x 1
WARNING: This run will generate roughly 4915 Mb of data
starting mdrun 'run1'
100000000 steps, 200000.0 ps (continuing from step 51879590,
103759.2 ps).
=====================================
These aren't showing anything other than that the restart is
coming from the same point each time.
And from the last generated output md-1-2437.out (I think I
killed the job at that point because of the above observed behavior):
=====================================
:::::::
Getting Loaded...
Reading file run1.tpr, VERSION 4.5.4 (single precision)
=====================================
I have at least 5-6 additional examples like this one. In some of
them the *.xtc file does have a size greater than zero, yet is still
very small, and it starts from some random frame (for example, in
one of the cases it contains frames from ~91000 ps to ~104000 ps,
but all frames before 91000 ps are missing).
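For reference, the surviving frame range in such a file can be listed with
gmxcheck (file name illustrative):
===========================
# Report the times of the first/last frames, and any corruption, in an xtc:
gmxcheck -f run1.xtc
===========================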
I think that demonstrating a problem requires that the set of
output files were fine before one particular restart, and weird
afterwards. I don't think we've seen that yet.
I don't understand your point here. I am providing you with all the info I
have. I am showing the output files of 3 restarts, and they are
different in the sense that the last two did not progress far enough
before another job restart occurred. The first was fine before the
restart, and the others were not exactly fine after the restart. At
this point I realize that what I call a "restart" and what you call a
"restart" might be two different things. And that may be where the
problem lies.
I realize there might be another problem, but the bottom line is
that there is no mechanism that can prevent this from happening
if many restarts are required, and particularly if the time
between these restarts is likely to be short (distributed
computing could easily satisfy this condition).
Any suggestions, particularly on the path of least resistance
to regenerating the missing data? :)
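The best I can come up with myself (file names assumed, untested) is to
continue from the last intact checkpoint under fresh output names, so nothing
already on disk can be overwritten, and join the pieces afterwards:
===========================
# Continue from the surviving checkpoint, writing to new file names:
mpirun -np 8 mdrun_mpi -s run1.tpr -cpi run1.cpt -deffnm run1_recover
# Later, concatenate the trajectory pieces:
trjcat -f run1.xtc run1_recover.xtc -o run1_full.xtc
===========================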
Using the checkpoint capability & appending make sense when
many restarts are expected, but unfortunately it is exactly
then when these options completely fail! As a new user of
Gromacs, I must say I am disappointed, and would like to
obtain an explanation of why the usage of these options is
clearly stated to be safe when it is not, and why the append
option is the default, and why at least a single warning has
not been posted anywhere in the docs & manuals?
I can understand and sympathize with your frustration if
you've experienced the loss of a simulation. Do be careful
when suggesting that others' actions are blame-worthy, however.
I have never suggested this. As a user, I am entitled to ask.
Sure. However, talking about something that can "completely fail"
This is a fact, backed up by my evidence => I don't see anything bad
directed at anybody.
which makes you "disappointed"
This is me being honest => again not related to anybody else.
and wanting to "obtain an explanation"
Well, this one is even funny :) - many people want this, especially in
science. Is that bad?
about why something doesn't work as stated and lacks "a single
warning"
Again a fact => again nothing bad here.
suggests that someone has done something less than appropriate
This is a completely personal interpretation, and I am personally not
responsible for how people perceive information. For some reason unknown
to me, you have moved into a very defensive mode. What could I do?
, and so blame-worthy. It also assumes that the actions of a new
user were correct, and the actions of a developer with long
experience were not.
Sorry, this is too much. Where was this suggested? It seems to me you
took it too personally.
This may or may not prove to be true. Starting such a discussion
from a conciliatory (rather than antagonistic) stance is usually
more productive. The shared objective should be to fix the
problem, not prove that someone did something wrong.
Agreed, and I did. Again, your perception does not seem to be
correlated with my intended approach.
Words are open to interpretation. Communicating well requires that you
consider the impact of your words on your reader. You want people who
can address the problem to want to help. You don't want them to feel
defensive about the situation - whether you think that would be an
over-reaction or not.
An alternative way of wording your paragraph could have been:
"Using the checkpoint capability & appending make sense when many
restarts are expected, however I observe that under such
circumstances this capability can fail. I am a new user of
GROMACS, might I have been using them incorrectly? Are the
developers aware of any situations under which the capability is
unreliable? If so, should the default behaviour be different, and
should this issue be documented somewhere?"
This is helpful, but again a bit too much. I don't tell you how to
write; please do the same.
OK, but the tone of what you write determines whether people will
respond, no matter how important your message is.
Moreover, how could I ask questions whose answers were mostly known
to me before I sent my post?
These are the same ideas about which you asked in your original post.
And since my questions were not clearly answered, I will repeat
them in a structured way:
1. Why is the usage of these options (-cpi and -append) clearly
stated to be safe when in fact it is not?
Because they are believed to be safe. Jussi's suggestion about
file locking may have merit.
2. Why have you made the -append option the default in the most
current GMX versions?
Because it's the most convenient mode of operation.
3. Why has not a single warning been posted anywhere in the docs
& manuals? (This question is somewhat clear - because you did not
know about such a problem; but people say "ignorance of the law
excuses no one", meaning that failing to post a warning about
something you were not 100% certain would be error-free cannot be
an excuse.)
Because no-one is aware of a problem to warn about.
No, people are aware; they just do not think it is a problem, because
there is an easy work-around (-noappend), although it is not as
convenient and clean. Ask the users of Gromacs on the Condor
distributed grid.
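For example, a sketch of that work-around (names illustrative):
===========================
# Each (re)start writes numbered part files instead of appending:
mpirun -np 8 mdrun_mpi -deffnm run1 -cpi run1 -noappend
# giving run1.part0002.xtc, run1.part0003.xtc, ... to be joined at the end:
trjcat -f run1.part*.xtc -o run1_whole.xtc
===========================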
You asked why there was no warning in the documentation - that's because
no-one who can fix the documentation is aware of a problem. If
Condor-using people want to keep using a work-around and not
communicate, that's their prerogative. But if the issue isn't
communicated, then it isn't going to be documented, whether it's a real
issue or not.
I am blame-worthy - for blindly believing what was written in the
manual without taking the necessary precautions. Lesson learned.
However, developers' time rarely permits addressing "feature
X doesn't work, why not?" in a productive way. Solving bugs
can be hard, but will be easier (and solved faster!) if the
user who thinks a problem exists follows good procedure. See
http://www.chiark.greenend.org.uk/~sgtatham/bugs.html
Implying that I did not follow a certain procedure for a certain
problem, without knowing what my initial intention was, is just
speculation.
I don't follow your point. If your intent is to get the problem
fixed, the advice on that web page is useful.
My intent was clearly stated before, but for the sake of
clarification, let me repeat it:
1. To let you know about the existence of such a problem.
Great. So far, I think it's an artefact of the combination of your PBS
and file system configuration.
2. To find out why I encountered the problem, although I have read and
followed all of the Gromacs documentation related to the features
I used.
As above - it's not really the fault of GROMACS. I don't know if a
better solution exists.
3. To somewhat improve the way the documentation is written.
OK, I will add a short note to mdrun -h noting that there exist
execution environments where timing of file access across separate
GROMACS processes might be a problem.
Mark