On Monday, May 27, 2019 at 11:40:00 AM UTC-7, Tracy Hall wrote:
> Thanks for the reply. That's disappointing—I'm not crazy about the idea of
> spawning a hundred million Sage processes, most of them to do about 1/50th
> of a second's worth of calculation, just because a few of them will need to
> be timed out before they take hours or days or run out of memory.
That shouldn't be quite necessary. A normal way of setting up longer-running computations is to write "checkpoints" at regular intervals. In your case, that could consist of writing the result of each individual case to a file (and be sure to flush buffers when you do, so that the file is actually updated -- open file, append line, close file is the simple way of ensuring that happens). When your long-running process crashes, you know which cases were completed, so you can restart at the point where things were left off. An advantage of such a setup is that it also lends itself to parallelization. (Sketches of the checkpoint file, the quota, and a restart driver are at the end of this message.)

If there is a chance that some jobs won't finish, you could run the process with a CPU-time quota, so that it gets killed once it has used up its allotment. The case it was working on at that moment was killed either because it was unlucky enough to be scheduled when almost no quota was left, or because it genuinely takes a long time itself; rerunning that one job by itself tells you which. You'd not be spawning a hundred million processes, just perhaps a few hundred over the course of the computation. The size of the quota gives you a trade-off between the number of restarts and how much time you are willing to waste on impossible jobs.

It's only after an unexpected interruption that we can't fully trust the state of various libraries (and quite frankly, I wouldn't trust any computer algebra system under those conditions). It requires a lot of work and overhead to ensure that interruptions leave the system in an entirely consistent state (and even then -- which one?), and I don't trust programmers of high-performance computational software not to cut corners on that somewhere (if they even try). The reason they probably don't try is that long-running computations need checkpointing anyway, which keeps the cost of a restart small; that makes it a price worth paying for the performance gained by not worrying too much about interrupt clean-up.

It's worth learning the general tools that Unix provides for setting quotas and for scripting batch jobs in a restartable way, because they apply to all software. That is much more efficient than equipping each individual program with such facilities and requiring people to learn the particulars of that one program.
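Here is a minimal sketch of the checkpoint file in plain Python (it would run unchanged inside Sage). Everything named here is illustrative: run_case stands in for your real per-case computation, N_CASES for the real number of cases, and DONE_FILE is just an example filename. The point is only the append-a-line-and-close pattern and the skip-what's-done loop:

    import os

    DONE_FILE = "completed_cases.txt"   # one "case_id result" pair per line
    N_CASES = 100                       # stand-in for the real number of cases

    def run_case(case_id):
        """Stand-in for the actual per-case computation."""
        return case_id * case_id

    def load_done():
        """Ids of cases that finished on a previous run."""
        if not os.path.exists(DONE_FILE):
            return set()
        with open(DONE_FILE) as f:
            return {int(line.split()[0]) for line in f if line.strip()}

    def record_done(case_id, result):
        """Append one line and close immediately, so it really reaches the file."""
        with open(DONE_FILE, "a") as f:
            f.write(f"{case_id} {result}\n")

    done = load_done()
    for case_id in range(N_CASES):
        if case_id in done:
            continue                    # completed before the last crash or kill
        record_done(case_id, run_case(case_id))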
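For the CPU-time quota, one option on POSIX systems is Python's standard resource module; this is just one way to do it (ulimit -t in the shell achieves the same from the outside). When the soft limit is exceeded the process receives SIGXCPU, so it can exit cleanly and leave the checkpoint file in good shape; the hard limit then kills it outright if it lingers. The 600-second figure is an arbitrary example:

    import resource
    import signal
    import sys

    CPU_SECONDS = 600   # quota per run; pick something well above a typical case

    def on_sigxcpu(signum, frame):
        # Soft limit hit: stop now; the checkpoint file says where to resume.
        sys.exit("CPU quota exhausted; restart to resume from the checkpoint")

    signal.signal(signal.SIGXCPU, on_sigxcpu)
    # Soft limit triggers SIGXCPU; the hard limit kills the process outright.
    resource.setrlimit(resource.RLIMIT_CPU, (CPU_SECONDS, CPU_SECONDS + 30))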
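Finally, a driver that restarts the worker after each kill, which is where the "few hundred processes" rather than one per case comes from. worker.py is a hypothetical script containing the checkpointing loop from the first sketch, exiting 0 once every case is recorded as done:

    import subprocess

    MAX_RESTARTS = 1000   # safety valve so a truly stuck case can't loop forever

    for attempt in range(MAX_RESTARTS):
        proc = subprocess.run(["python3", "worker.py"])
        if proc.returncode == 0:
            break          # worker finished every remaining case
        # Otherwise the quota (or a crash) killed it; the checkpoint file
        # already contains everything completed, so just start it again.

A real driver might also parse the checkpoint file to notice when the same case keeps exhausting the quota, and set that case aside as a candidate "impossible job" instead of retrying it forever.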