So, the ports build infrastructure got a new cluster (two large amd64 machines with 8 cores each). (Reminder: we could never have afforded that without people giving money to the project, so if you want cool stuff to keep happening, think about contributing financially to the well-being of the project.)
And dpb got in trouble. You know that big port that dwarfs all the rest (libreoffice)? Well, with enough parallel juice, it defines a critical path, and THAT critical path ends *after* the rest of the ports, by 3 hours or so.

Incidentally, that answers the question I had two years ago: how many cores do we need before dpb starts getting starved? Somewhere above ten...

So, a different approach is needed. The first hint is that we know libreoffice can be built with MAKE_JOBS=n, using several cores at once.

Now, you're saying, why don't we ditch dpb and build everything with MAKE_JOBS=n? There are several reasons not to:

- some build clusters do not have MP machines. Those depend on dpb's ability to dispatch jobs to a lot of machines.
- most ports in the system do not build correctly with MAKE_JOBS=n. Partly because of makefile bugs, partly because of make bugs (I'm working on the latter, slowly, because the make code is full of gremlins).
- most ports in the system won't benefit from MAKE_JOBS=n. A lot of small ports spend most of their time in configure anyway. And some big ports don't have good parallel properties. For instance, *all* of kde3 does not benefit from MAKE_JOBS=n at all! That's because of automake and recursive properties, plus the fact that kde3 coalesces source files into one single shared object for runtime performance (incidentally, that's the major reason why the kde people switched to cmake).

So, we figured a mixed approach would work: allow selected ports to build with MAKE_JOBS=n, but keep most of them building sequentially.

There's a bit of cheating involved: when you start a job with MAKE_JOBS>1, it *will* spawn several processes, so the load of the machine will go up.
E.g., on a 4-core machine, you may end up running

  job1 job2 job3 job4 <- x2 processes

The idea there is that job4 will steal the slots of existing jobs as they end, so it actually shows up as

  job1 job2 job3 1*job4

and will (hopefully) soon end up as

  job1 job2 2*job4

when job3 terminates. That's an idea I took from make (which tags recursive makes to avoid starting new jobs exponentially) after some discussion with Theo.

So, there's a first part to the technique: figure out a parallel number that will speed things up, but not too high, because your machine WILL temporarily have a bigger number of processes running than it's supposed to.

For the 2 * 8-core cluster, we took MAKE_JOBS=4 as a base figure. This means that, temporarily, one machine may run 7 normal jobs plus one parallel job, for a grand total of 11 processes. That number cannot go higher, since no new job will be allowed to start until we're down to 4 normal jobs and a parallel job. At which point a new job may replace an existing one. That may or may not be a parallel job, so we can end up with the machine having two parallel jobs.

In practice, this seems to work. One thing working for us is that those big jobs are big ports, and the extract/patch/configure stage does not parallelize in most cases... So this gives smaller ports extra time to finish while the big port is setting up (likewise for fake/package, but since this is IO-intensive, actually having fewer processes running at that point may be a good idea too, since everything will be starved on disk buffers anyway).

And we've started looking at ports that fit a number of criteria:

- must be parallel-safe. Don't introduce work-arounds for make bugs at this point, please!
- must be a clear win with respect to build time. If starting the job with MAKE_JOBS=n does not "roughly" divide the build time by n, it's not worth it.
- must be noticeable in the build.
  Only stuff that "unlocks" a lot of ports and takes noticeable time to build is a correct candidate.

All the pieces are in place, and you're going to see commits related to that soon.

- dpb has a new option, -p, for parallel jobs. It translates into parallel=n in a host file. It should be set to a number smaller than the number of jobs on the host, and it should divide the number of jobs on the host! (Otherwise, say if you run -j 8 -p 3, you *will* end up with 9 processes running.)
- the ports tree has a new variable, DPB_PROPERTIES. It can be set to parallel. Yep, this is overly general. And I have no idea whether we're ever going to add something else in there.

For now, *please*, *please*, *please*, do not add DPB_PROPERTIES=parallel blindly to any port that seems to "like" it. There's still a small cost to this: the tree will run with too many processes for a bit of time, and some ports that "look" parallel-safe are not, if you throw enough cores at them. I estimate we might end up with about 30~50 ports with that property, tops.

Oh, and in case you wonder, this is loads of fun.
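
For those who like to check the arithmetic, here is a rough sketch of the process accounting described above. It is not dpb code (dpb is written in Perl), and the function names are made up for illustration. It only models the figures quoted in the text: the temporary peak when one parallel job starts on a full host, the number of normal jobs left once new jobs may start again, and the rule given for choosing -p. The lower bound of 1 on the parallel value is my own assumption.

```python
def peak_processes(jobs: int, parallel: int) -> int:
    """Worst-case temporary process count on one host.

    Assumes every one of the `jobs` slots is busy and exactly one of them
    has just started a parallel job, which spawns `parallel` processes
    while the other `jobs - 1` slots still run one process each.
    """
    return (jobs - 1) + parallel


def normal_jobs_before_refill(jobs: int, parallel: int) -> int:
    """Normal jobs still running once the parallel job owns `parallel`
    slots, i.e. the point where a new job may start again."""
    return jobs - parallel


def parallel_setting_ok(jobs: int, parallel: int) -> bool:
    """Rule from the text: -p must be smaller than the number of jobs
    on the host and must divide it (the `>= 1` check is an assumption)."""
    return 1 <= parallel < jobs and jobs % parallel == 0


if __name__ == "__main__":
    print(peak_processes(8, 4))             # 7 normal jobs + 4 processes = 11
    print(normal_jobs_before_refill(8, 4))  # 4
    print(parallel_setting_ok(8, 4))        # True
    print(parallel_setting_ok(8, 3))        # False: 3 does not divide 8
```

With 8 jobs and a parallel value of 4, this reproduces the 11 processes and the "4 normal jobs and a parallel job" threshold mentioned earlier. It does not try to model what happens when several parallel jobs run at once.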
