I wrote: > So I rearranged things to allocate arcs out of a common pool, and for > good measure made the state allocation code do the same thing. I was > pretty much blown away by the results: not only is the average-case > space usage about half what it is on HEAD, but the worst-case drops > by well more than a factor of ten.
BTW, I was initially a bit baffled by how this could be. Per previous measurements, the average number of arcs per state is around 4; so if the code is allocating ten arcs for each state right off the bat, it's pretty clear how we could have a factor-of-two-or-so bloat problem. And I think that does explain the average-case results. But it can't possibly explain bloats of more than 10x. After further study I think this is what explains it: * The "average number of arcs" is pretty misleading, because in a large NFA some of the states have hundreds of out-arcs, while most have only a couple. * The NFA is not static; the code moves arcs around all the time. There's actually a function (moveouts) that deletes all the out-arcs of a state and creates images of them on another state. That operation can be invoked a lot of times during NFA optimization. * Once a given state has acquired N out-arcs, it keeps that pool of arc storage, even if some or all of those arcs get deleted. Indeed, the state itself could be dropped and later recycled, but it still keeps its arc pool. Unfortunately, even if it does get recycled for re-use, it's likely to be resurrected as a state with only a couple of out-arcs. So I think the explanation for 20x or 30x bloat arises from the optimize pass resulting in having a bunch of states that have large but largely unused arc pools. Getting rid of the per-state arc pools in favor of one common pool fixes that nicely. I realized while looking at this that some cycles could be shaved from moveouts, because there's no longer a reason why it can't just scribble on the arcs in-place (cf. now-obsolete comment on changearctarget()). It's late but I'll see about improving that tomorrow. regards, tom lane