On 5/06/2011 12:31 PM, Dimitar Pachov wrote:

    This script is not using mdrun -append.


-append is the default; it doesn't need to be listed explicitly.

Ah yes, very true.

    Your original post suggested the use of -append was a problem. Why
    aren't we seeing a script with mdrun -append? Also, please provide
    the full script - it looks like there might be a loop around your
    tpbconv-then-mdrun fragment.


There is no loop; this is a job script with PBS directives. Its header looks like:
===========================
#!/bin/bash
#$ -S /bin/bash
#$ -pe mpich 8
#$ -ckpt reloc
#$ -l mem_total=6G
===========================

submitted as usual with:

qsub -N aaaa myjob.q
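
For completeness, the body of the script below that header is essentially the following (a sketch - the extension time and paths are illustrative, not copied verbatim from my script):
===========================
# Extend the run's end time by 100 ns; tpbconv backs up the old .tpr.
tpbconv -s run1.tpr -extend 100000 -o run1.tpr

# Restart from the last checkpoint; -append is implied by default, so
# output is appended to the existing run1.* files.
mpirun -np 8 /path/to/mdrun_mpi -deffnm run1 -cpi run1.cpt
===========================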


    Note that a useful trouble-shooting technique can be to construct
    your command line in a shell variable, echo it to stdout
    (redirected as suitable) and then execute the contents of the
    variable. Now, nobody has to parse a shell script to know what
    command line generated what output, and it can be co-located with
    the command's stdout.


I somewhat understand your point, but could you give an example if you think it is really necessary?

It's just generally helpful if your stdout has "mpirun -np 8 /path/to/mdrun_mpi -deffnm run_4 -cpi run_4" at the top of it so that you have a definitive record of what you did under the environment that existed at the time of execution.
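
For example, a minimal sketch of the technique:
===========================
# Build the command in a variable, record it in stdout, then execute
# it, so the log always shows exactly what was run.
cmd="mpirun -np 8 /path/to/mdrun_mpi -deffnm run_4 -cpi run_4"
echo "$cmd"
eval "$cmd"
===========================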

As I said, the queue works like this: I submit the job, it finds an empty node and starts there, but seconds later another user with higher privileges on that particular node submits a job, which kicks mine out. Mine goes back into the queue, finds another empty node, and starts there; then another user with high privileges on that node submits a job, which kicks mine out again, and the cycle repeats ... theoretically, it could continue forever, depending on how many empty nodes there are and where they are, if any.

You've said that *now* - but previously you said nothing about why you were getting lots of restarts. In my experience, PBS queues suspend jobs rather than deleting them, so that resources are not wasted. Apparently other places do things differently. I think this information is highly relevant to explaining your observations.

So many restarts suggest that the queue was full of relatively short jobs run by users with high privileges. Technically, I cannot see why the same processes should be running simultaneously, because at any instant my job is either running on a single node or sitting in the queue.

I/O can be buffered such that the termination of the process and the completion of its I/O are asynchronous. Perhaps it *shouldn't* be that way, but this is a problem for the administrators of your cluster to address. They know how the file system works. If the next job executes before the old one has finished output, then I think the symptoms you observe might be possible.

Note that there is nothing GROMACS can do about that, unless somehow GROMACS can apply a lock in the first mdrun that is respected by your file system, such that a subsequent mdrun cannot open the same file until all pending I/O has completed. I'd expect proper HPC file systems to do that automatically, but I don't really know.
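
For illustration only, a wrapper along these lines might serialize the two mdrun processes, assuming the cluster's file system honours advisory locks across nodes (NFS often does not) - the lock file name and timeout here are hypothetical:
===========================
# Take an exclusive advisory lock on a sentinel file before starting
# mdrun; a second job reaching this line on another node will block
# for up to 10 minutes until the first releases the lock at exit.
flock -w 600 run1.lock \
    mpirun -np 8 /path/to/mdrun_mpi -deffnm run1 -cpi run1.cpt
===========================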


    From md-1-2360.out:
    =====================================
    :::::::
    Getting Loaded...
    Reading file run1.tpr, VERSION 4.5.4 (single precision)

    Reading checkpoint file run1.cpt generated: Tue May 31 10:45:22 2011


    Loaded with Money

    Making 2D domain decomposition 4 x 2 x 1

    WARNING: This run will generate roughly 4915 Mb of data

    starting mdrun 'run1'
    100000000 steps, 200000.0 ps (continuing from step 51879590,
    103759.2 ps).
    =====================================

    These aren't showing anything other than that the restart is
    coming from the same point each time.


    And from the last generated output md-1-2437.out (I think I
    killed the job at that point because of the above observed behavior):
    =====================================
    :::::::
    Getting Loaded...
    Reading file run1.tpr, VERSION 4.5.4 (single precision)
    =====================================

    I have at least 5-6 additional examples like this one. In some of
    them the *xtc file does have size greater than zero yet still
    very small, but it starts from some random frame (for example, in
    one of the cases it contains frames from ~91000ps to ~104000ps,
    but all frames before 91000ps are missing).

    I think that demonstrating a problem requires that the set of
    output files were fine before one particular restart, and weird
    afterwards. I don't think we've seen that yet.


I don't understand your point here. I am providing you with all the info I have. I am showing the output files of three restarts, and they differ in the sense that the last two did not progress far enough before another restart occurred. The first was fine before the restart, and the others were not exactly fine after it. At this point I realize that what I call a "restart" and what you call a "restart" might be two different things. And that may be where the problem lies.
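
A quick way to see which frames a trajectory file actually contains is gmxcheck (4.5-era tool name), e.g.:
===========================
# Report the number of frames and their time stamps, making it obvious
# which part of the trajectory survived a given restart.
gmxcheck -f run1.xtc
===========================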


    I realize there might be another problem, but the bottom line is
    that there is no mechanism that can prevent this from happening
    if many restarts are required, and particularly if the timing
    between these restarts tends to be short (distributed computing
    could easily satisfy this condition).

    Any suggestions, particularly about the path of least resistance
    to regenerating the missing data? :)


        Using the checkpoint capability & appending make sense when
        many restarts are expected, but unfortunately it is exactly
        then when these options completely fail! As a new user of
        Gromacs, I must say I am disappointed, and would like to
        obtain an explanation of why the usage of these options is
        clearly stated to be safe when it is not, and why the append
        option is the default, and why at least a single warning has
        not been posted anywhere in the docs & manuals?

        I can understand and sympathize with your frustration if
        you've experienced the loss of a simulation. Do be careful
        when suggesting that others' actions are blame-worthy, however.


    I have never suggested this. As a user, I am entitled to ask.

    Sure. However, talking about something that can "completely fail"


This is a fact, backed up by my evidence => I don't see anything bad directed at anybody.

    which makes you "disappointed"


This is me being honest => again not related to anybody else.

    and wanting to "obtain an explanation"


Well, this one is even funny :) - many people want this, especially in science. Is that bad?

    about why something doesn't work as stated and lacks "a single
    warning"


Again a fact => again nothing bad here.

    suggests that someone has done something less than appropriate


This is a completely personal interpretation, and I am not responsible for how people perceive information. For reasons unknown to me, you have moved into a very defensive mode. What could I do?

    , and so blame-worthy. It also assumes that the actions of a new
    user were correct, and the actions of a developer with long
    experience were not.


Sorry, this is too much. Where was this suggested? It seems to me you took it too personally.

    This may or may not prove to be true. Starting such a discussion
    from a conciliatory (rather than antagonistic) stance is usually
    more productive. The shared objective should be to fix the
    problem, not prove that someone did something wrong.


Agreed, and I did. Again, your perception does not seem to match my intended approach.

Words are open to interpretation. Communicating well requires that you consider the impact of your words on your reader. You want people who can address the problem to want to help. You don't want them to feel defensive about the situation - whether you think that would be an over-reaction or not.


    An alternative way of wording your paragraph could have been:
    "Using the checkpoint capability & appending make sense when many
    restarts are expected, however I observe that under such
    circumstances this capability can fail. I am a new user of
    GROMACS, might I have been using them incorrectly? Are the
    developers aware of any situations under which the capability is
    unreliable? If so, should the default behaviour be different, and
    should this issue be documented somewhere?"


This is helpful, but again a bit too much. I don't tell you how to write; please do the same.

OK, but the tone of what you write determines whether people will respond, no matter how important your message is.

Moreover, how could I ask questions whose answers were mostly known to me before sending my post?

These are the same ideas about which you asked in your original post.

    And since my questions were not clearly answered, I will repeat
    them in a structured way:

    1. Why is the usage of these options (-cpi and -append) clearly
    stated to be safe when in fact it is not?

    Because they are believed to be safe. Jussi's suggestion about
    file locking may have merit.


    2. Why have you made the -append option the default in the most
    current GMX versions?

    Because it's the most convenient mode of operation.



    3. Why has not a single warning been posted anywhere in the docs
    & manuals? (this question is somewhat clear - because you did not
    know about such a problem, but people say "ignorance of the law
    excuses no one", which means that neglecting to post a warning
    about something you were not 100% certain would be error-free
    could not be an excuse)

    Because no-one is aware of a problem to warn about.


No, people are aware; they just do not think it is a problem, because there is an easy work-around (-noappend), although it is not as convenient and clean. Ask users of the Condor distributed grid who run GROMACS.
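
For anyone reading along, the work-around looks roughly like this with 4.5-era tool names (a sketch; file names are illustrative):
===========================
# Each restart writes new numbered files (run1.part0002.xtc, ...)
# instead of appending, so a bad restart cannot corrupt earlier output.
mpirun -np 8 /path/to/mdrun_mpi -deffnm run1 -cpi run1.cpt -noappend

# Afterwards, stitch the pieces back together in order.
trjcat -f run1.xtc run1.part*.xtc -o run1-full.xtc
eneconv -f run1.edr run1.part*.edr -o run1-full.edr
===========================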

You asked why there was no warning in the documentation - that's because no-one who can fix the documentation is aware of a problem. If Condor-using people want to keep using a work-around and not communicate it, that's their prerogative. But if the issue isn't communicated, then it isn't going to be documented, whether it's a real issue or not.

    I am blame-worthy - for blindly believing what was written in the
    manual without taking the necessary precautions. Lesson learned.

        However, developers' time rarely permits addressing "feature
        X doesn't work, why not?" in a productive way. Solving bugs
        can be hard, but will be easier (and solved faster!) if the
        user who thinks a problem exists follows good procedure. See
        http://www.chiark.greenend.org.uk/~sgtatham/bugs.html


    Implying that I did not follow a certain procedure related to a
    certain problem, without knowing what my initial intention was,
    is just speculation.

    I don't follow your point. If your intent is to get the problem
    fixed, the advice on that web page is useful.


My intent was clearly stated before, but for the sake of clarity, let me repeat it:

1. To let you know about the existence of such a problem.

Great. So far, I think it's an artefact of the combination of your PBS and file system configuration.

2. To find out why I encountered the problem, although I had read and followed all of the GROMACS documentation related to the features I used.

As above - it's not really the fault of GROMACS. I don't know if a better solution exists.

3. To somewhat improve the way the documentation is written.

OK, I will add a short note to mdrun -h pointing out that execution environments exist where the timing of file access across separate GROMACS processes might be a problem.

Mark
