On 05/30/2015 11:47 PM, Andres Freund wrote:
I don't think it's primarily a problem of lack of review; although that
is a large problem.  I think the biggest systematic problem is that the
compound complexity of postgres has increased dramatically over the
years.  Features have added complexity little by little, each not
incrementally not looking that bad.  But very little has been done to
manage complexity. Since 8.0 the codesize has roughly doubled, but
little has been done to manage the increased complexity. Few new
abstractions have been introduced and the structure of the code is
largely the same.

As a somewhat extreme example, let's look at StartupXLOG(). In 8.0 it
was ~500 LOC, in master it's ~1500.  The interactions in 8.0 were
complex, they have gotten much more complex since.  It fullfills lots of
different roles, all in one function:

(roughly in the order things happen, but simplified)
* Read the control file/determine whether we crashed
* recovery.conf handling
* backup label handling
* tablespace map handling (huh, I missed that this was added directly to
   StartupXLOG. What a bad idea)
* Determine whether we're doing archive recovery, read the relevant
   checkpoint if so
* relcache init file removal
* timeline switch handling
* Loading the checkpoint we're starting from
* Initialization of a lot of subsystems
* crash recovery/replay
   * Including pgstat, unlogged table, exported snapshot handling
   * iff hot standby, some more subsystems are initialized here
   * hot standby state handling
   * replay process intialization
   * crash replay itself, including
     * progress tracking
     * recovery pause handling
     * nextxid tracking
     * timeline increase handling
     * hot standby state handling
   * unlogged relations handling
   * archive recovery handling
   * creation/initialization of the end of recovery checkpoint
   * timeline increment if failover
* subsystem initialization iff !hot_standby
* end of recovery actions

Yes. that's one routine. And, to make things even funnier, half of that
routine isn't exercised by our tests.

You can argue that this is an outlier, but I don't think so. Heapam, the
planner, etc. have similar cases.

And I think this, to some degree, explains a lot of the multixact
problems. While there were a few "simple bugs", most of them were
interactions between the various subsystems that are rather intricate.

I think this explanation is wrong. I agree that there are many places that would be good to refactor - like StartupXLOG() - but the multixact code was not too bad in that regard. IIRC the patch included some refactoring, it added some new helper functions in heapam.c, for example. You can argue that it didn't do enough of it, but that was not the big issue.

The big issue was at the architecture level. Basically, we liked vacuuming of XIDs and clog so much that we decided that it'd be nice if you had to vacuum multixids too, in order to not lose data. Many of the bugs and issues were not new - we had multixids before - but we upped the ante and turned minor locking bugs into data loss. And that had nothing to do with the code structure - we'd have similar issues if we had rewritten everything java, with the same design.

So, I'm all for refactoring and adding abstractions where it makes sense, but it's not going to solve design problems.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to