I've been getting some more flak lately about fixing the RPM-supplied init scripts to avoid the problem where the postmaster fails to start because there's a stale lockfile in the data directory. See for instance https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=134090 but it's hardly the first time the question has been brought up.
I've always resisted the standard solution of having the init script itself remove the lock file, because I think that is a great way to shoot yourself in the foot. But I finally had an idea that would make things better with no increased risk.

The postmaster does not abort simply because the lockfile is there. It checks whether the PID mentioned in the lockfile exists, and belongs to a postgres-owned process (the latter by seeing if "kill(pid, 0)" succeeds), and is not the postmaster itself. If any of these tests fail then it knows the lockfile is stale. So the scenario in which it gets fooled is where the reboot has used just one or two more processes than were used in the last boot cycle, so that the PID that belonged to the postmaster in the last cycle now belongs to either pg_ctl or the postgres-owned shell launched by "su".

So what occurred to me is that we could eliminate this scenario if we could get rid of those two processes. The init script is running as root, and so if its PID is the one mentioned in the old lockfile, the kill() test will fail and the postmaster will go on its merry way. It's only the other processes launched by "su" that could fool the kill() test.

After some experimentation, I have found that what will actually work requires two changes:

1. In the init script, do not use pg_ctl to launch the postmaster, but instead invoke it directly, i.e. something like

	su - postgres -c "/usr/bin/postmaster ...args... &"

pg_ctl is not really buying us any functionality or notational advantage here, so this seems like no loss. This brings us down to one extra postgres-owned process, namely the shell that su launches which in turn launches the postmaster. (Depending on timing, the shell might or might not still be around when the postmaster probes.)

2. In the postmaster, reject as bogus matches not only our own PID, but our parent's PID from getppid(). This is perfectly safe since whatever launched the postmaster is certainly not a competing postmaster.
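To make the proposed test concrete, here is a minimal C sketch of the stale-lockfile decision with the getppid() check added. It is not the actual postmaster code: the function name lock_pid_is_live_competitor is hypothetical, and it assumes the PID has already been read out of the lockfile. Following the description above, only a successful kill(pid, 0) counts as evidence of a live competitor, and any failure is treated as stale.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not the real postmaster source).
 * Returns true if the PID recorded in an old lockfile appears to belong to
 * a live, competing postmaster; false if the lockfile can be called stale.
 */
bool
lock_pid_is_live_competitor(pid_t lock_pid)
{
    /* A nonsensical PID cannot belong to a live postmaster. */
    if (lock_pid <= 0)
        return false;

    /* Existing rule: a match on our own PID is bogus. */
    if (lock_pid == getpid())
        return false;

    /*
     * Proposed rule: a match on our parent's PID is bogus too, since
     * whatever launched us (e.g. the shell started by su) is certainly
     * not a competing postmaster.
     */
    if (lock_pid == getppid())
        return false;

    /*
     * Signal 0 delivers nothing; it only checks that the process exists
     * and that we are allowed to signal it. Success therefore means a
     * live process we may signal, which for a non-root caller means one
     * running under our own user ID.
     */
    if (kill(lock_pid, 0) == 0)
        return true;

    /* No such process, or one we may not signal (e.g. root-owned). */
    return false;
}
```

With this logic, a lockfile PID that now belongs to the root-owned init script fails the kill() probe, and one that belongs to the su-launched shell is rejected by the getppid() comparison.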
Now we cannot be fooled by the parent shell either. AFAICS this is a bulletproof solution for typical users who only launch one postmaster during system boot.

If you launch two or more postmasters then it's still possible for the first-launched one to have the same PID that belonged to the second one during the previous boot cycle. But the odds of this seem pretty low, since it would imply a much greater change in the usage of PIDs during the boot cycle. The known failure cases involve a change of just one or two PIDs, whereas postmasters launched by different init scripts will surely be separated by dozens if not hundreds of PIDs. (Besides, if you are truly paranoid you'd be running separate postmasters under separate user IDs, which'd eliminate the problem again.)

I'd also want to remove the existing code in the init scripts that zaps the socket lockfile, since it would be unnecessary. I've forborne to remove it so far because it doesn't introduce a serious risk of data corruption the way zapping the datadir lockfile would, but it is definitely risky, especially in multi-postmaster scenarios.

Anyone see any downsides to these changes?

			regards, tom lane