+1 for both log messages and allowing connections. I believe these two complement each other.
In the cloud world, we oftentimes want to monitor the progress of the recovery without connecting to the server as the operators don't necessarily have the required permissions to connect and query. Secondly, having this information in the log helps going back in time and understand where Postgres spent time during recovery. The ability to query the server provides real time information and come handy. Thanks, Satya On Mon, Apr 19, 2021 at 10:55 AM Robert Haas <robertmh...@gmail.com> wrote: > Hi, > > I've noticed that customers not infrequently complain that they start > postgres and then the system doesn't come up for a while and they have > no idea what's going on and are (understandably) worried. There are > probably a number of reasons why this can happen, but the ones that > seem to come up most often in my experience are (1) SyncDataDirectory > takes a long time, (b) ResetUnloggedRelations takes a long time, and > (c) there's a lot of WAL to apply so that takes a long time. It's > possible to distinguish this last case from the other two by looking > at the output of 'ps', but that's not super-convenient if your normal > method of access to the server is via libpq, and it only works if you > are monitoring it as it's happening rather than looking at the logs > after-the-fact. I am not sure there's any real way to distinguish the > other two cases without using strace or gdb or similar. > > It seems to me that we could do better. One approach would be to try > to issue a log message periodically - maybe once per minute, or some > configurable interval, e.g. perhaps add messages something like this: > > LOG: still syncing data directory, elapsed time %ld.%03d ms, current path > %s > LOG: data directory sync complete after %ld.%03d ms > LOG: still resetting unlogged relations, elapsed time %ld.%03d ms, > current path %s > LOG: unlogged relations reset after %ld.%03d ms > LOG: still performing crash recovery, elapsed time %ld.%03d ms, > current LSN %08X/%08X > > We already have a message when redo is complete, so there's no need > for another one. The implementation here doesn't seem too hard either: > the startup process would set a timer, when the timer expires the > signal handler sets a flag, at a convenient point we notice the flag > is set and responding by printing a message and clearing the flag. > > Another possible approach would be to accept connections for > monitoring purposes even during crash recovery. We can't allow access > to any database at that point, since the system might not be > consistent, but we could allow something like a replication connection > (the non-database-associated variant). Maybe it would be precisely a > replication connection and we'd just refuse all but a subset of > commands, or maybe it would be some other kinds of thing. But either > way you'd be able to issue a command in some mini-language saying "so, > tell me how startup is going" and it would reply with a result set of > some kind. > > If I had to pick one of these two ideas, I'd pick the one the > log-based solution, since it seems easier to access and simplifies > retrospective analysis, but I suspect SQL access would be quite useful > for some users too, especially in cloud environments where "just log > into the machine and have a look" is not an option. > > Thoughts? > > -- > Robert Haas > EDB: http://www.enterprisedb.com > > >