On 05/30/2015 11:47 PM, Andres Freund wrote:
I don't think it's primarily a problem of lack of review; although that
is a large problem. I think the biggest systematic problem is that the
compound complexity of postgres has increased dramatically over the
years. Features have added complexity little by little, each not
incrementally not looking that bad. But very little has been done to
manage complexity. Since 8.0 the codesize has roughly doubled, but
little has been done to manage the increased complexity. Few new
abstractions have been introduced and the structure of the code is
largely the same.
As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
was ~500 LOC, in master it's ~1500. The interactions in 8.0 were
complex, they have gotten much more complex since. It fullfills lots of
different roles, all in one function:
(roughly in the order things happen, but simplified)
* Read the control file/determine whether we crashed
* recovery.conf handling
* backup label handling
* tablespace map handling (huh, I missed that this was added directly to
StartupXLOG. What a bad idea)
* Determine whether we're doing archive recovery, read the relevant
checkpoint if so
* relcache init file removal
* timeline switch handling
* Loading the checkpoint we're starting from
* Initialization of a lot of subsystems
* crash recovery/replay
* Including pgstat, unlogged table, exported snapshot handling
* iff hot standby, some more subsystems are initialized here
* hot standby state handling
* replay process intialization
* crash replay itself, including
* progress tracking
* recovery pause handling
* nextxid tracking
* timeline increase handling
* hot standby state handling
* unlogged relations handling
* archive recovery handling
* creation/initialization of the end of recovery checkpoint
* timeline increment if failover
* subsystem initialization iff !hot_standby
* end of recovery actions
Yes. that's one routine. And, to make things even funnier, half of that
routine isn't exercised by our tests.
You can argue that this is an outlier, but I don't think so. Heapam, the
planner, etc. have similar cases.
And I think this, to some degree, explains a lot of the multixact
problems. While there were a few "simple bugs", most of them were
interactions between the various subsystems that are rather intricate.
I think this explanation is wrong. I agree that there are many places
that would be good to refactor - like StartupXLOG() - but the multixact
code was not too bad in that regard. IIRC the patch included some
refactoring, it added some new helper functions in heapam.c, for
example. You can argue that it didn't do enough of it, but that was not
the big issue.
The big issue was at the architecture level. Basically, we liked
vacuuming of XIDs and clog so much that we decided that it'd be nice if
you had to vacuum multixids too, in order to not lose data. Many of the
bugs and issues were not new - we had multixids before - but we upped
the ante and turned minor locking bugs into data loss. And that had
nothing to do with the code structure - we'd have similar issues if we
had rewritten everything java, with the same design.
So, I'm all for refactoring and adding abstractions where it makes
sense, but it's not going to solve design problems.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers