On Monday, May 27, 2019 at 11:40:00 AM UTC-7, Tracy Hall wrote:
>
> Thanks for the reply. That's disappointing—I'm not crazy about the idea of 
> spawning a hundred million Sage processes, most of them to do about 1/50th 
> of a second's worth of calculation, just because a few of them will need to 
> be timed out before they take hours or days or run out of memory.
>

That shouldn't really be necessary. A normal way of setting up 
longer-running computations is to write checkpoints at regular 
intervals. In your case, that could consist of writing the result of 
each individual case to a file (and be sure to flush buffers when you 
do, so that the file is actually updated -- open file, append line, 
close file is the simplest way of ensuring that happens). When your 
long-running process crashes, you know which cases were completed, so 
you can restart at the point where things were left off. An advantage 
of such a setup is that it also lends itself to parallelization.
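
A minimal sketch of that pattern in Python -- the file name 
"results.txt", the case ids, and the squaring are all made-up 
stand-ins for your real setup:

    import os

    RESULTS = "results.txt"    # hypothetical checkpoint file

    def completed_cases():
        # On restart, read back the ids of the cases already logged.
        if not os.path.exists(RESULTS):
            return set()
        with open(RESULTS) as f:
            return set(line.split()[0] for line in f)

    def record(case_id, result):
        # Open, append one line, close: closing flushes the buffer,
        # so the file on disk is current after every single case.
        with open(RESULTS, "a") as f:
            f.write("%s %s\n" % (case_id, result))

    done = completed_cases()
    for n in range(100):           # stand-in for your real case list
        case_id = str(n)
        if case_id in done:
            continue               # finished before the last crash
        record(case_id, n * n)     # n*n stands in for the real work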

If there is a chance that some jobs won't finish, you could run the 
process with a CPU-time quota, so that it gets killed when it has used 
up its allotted time. The case it was working on at that moment was 
killed either because it was unlucky enough to be scheduled when 
almost no quota was left, or because it genuinely takes a long time 
itself. Which of the two is easily figured out by rerunning that one 
job by itself.
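
For example, on a unix system you can impose such a quota from Python 
itself by lowering the CPU rlimit in the child just before it starts. 
The worker script name and the 60-second figure below are placeholders:

    import resource, subprocess

    CPU_SECONDS = 60    # hypothetical per-job quota

    def limit_cpu():
        # Runs in the child just before exec. Once the child has used
        # CPU_SECONDS of CPU time, the kernel sends it SIGXCPU, and
        # SIGKILL a few seconds later if it is somehow still running.
        resource.setrlimit(resource.RLIMIT_CPU,
                           (CPU_SECONDS, CPU_SECONDS + 5))

    proc = subprocess.Popen(["sage", "one_case.sage"],  # hypothetical
                            preexec_fn=limit_cpu)
    proc.wait()
    if proc.returncode < 0:
        # A negative return code means the child died on a signal;
        # with this setup that almost certainly means it hit the quota.
        print("job was killed, most likely for exceeding its quota")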

You wouldn't be spawning a hundred million processes, just perhaps a 
few hundred over the course of the computation. Setting the quota 
gives you a trade-off between how many restarts you incur and how much 
time you are willing to waste on impossible jobs.

It's only after an unexpected interruption that we can't fully trust 
the state of various libraries (and quite frankly, I wouldn't trust 
any computer algebra system under those conditions). It takes a lot of 
work and overhead to ensure that interruptions leave the system in an 
entirely consistent state (and even then -- which one?), and I don't 
trust the programmers of high-performance computational software not 
to cut corners on that somewhere (if they even try).

The reason they probably don't try is that for long-running 
computations you need to checkpoint anyway, so the cost of restarting 
isn't huge; it's probably a price worth paying for the increased 
performance of not worrying too much about interrupt clean-up.

It's worth learning the general tools that unix provides for setting 
quotas and for scripting batch jobs in a restartable way, because they 
are applicable to all software. That's much more efficient than 
equipping each individual program with such facilities and requiring 
people to learn the particulars of that one program.
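
Tying the two together, a restartable batch driver could look 
something like this (it reuses completed_cases(), record() and the 
imports from the sketches above, and the worker script is again 
hypothetical):

    def run_all(case_ids, quota=60):
        # Skip the cases already logged, run the rest one at a time
        # under a CPU quota, and log every outcome so that a rerun
        # picks up exactly where this one stops.
        done = completed_cases()
        for cid in case_ids:
            if cid in done:
                continue
            rc = subprocess.call(
                ["sage", "one_case.sage", cid],
                preexec_fn=lambda: resource.setrlimit(
                    resource.RLIMIT_CPU, (quota, quota + 5)))
            # rc < 0 means the child died on a signal, i.e. it hit
            # the quota; logging that too keeps reruns from retrying
            # such cases blindly -- rerun them by hand to investigate.
            record(cid, "done" if rc == 0 else "killed")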



