Dear Hackers,

I want to propose the feature that checks the health of foreign servers. 
As a first step I want to add an infrastructure for periodical checking to 
PostgreSQL core.
Currently this is the WIP, it does not contain docs.

## Background 

Currently there is no way to check the status of an foreign server in 
PostgreSQL.
If an foreign server's postmaster goes down, or if the network between servers 
is lost,
the backend process will only detect these when it uses the connection 
corresponding to that foreign server.

Consider a workload that updates data on an foreign server only at the 
beginning of a transaction,
and runs a lot of local SQLs. Even if the network is disconnected immediately 
after accessing the foreign server,
the backend process will continue to execute local SQLs without realizing it.

The process will eventually finish to execute SQLs and try to commit.
Only then will it realize that the foreign server cannot be connect and will 
abort the transaction.
This situation should be detected as soon as possible
because it is impossible to commit a transaction when the foreign server goes 
down.
This can be more of a problem if you have system-wide downtime requirements.
That's why I want to implement the health-check feature to postgres.

## Design

In general, PostgreSQL can have a variety of RDBMSs as foreign servers,
so the core cannot support all of them.
Therefore, I propose a method to leave the monitoring of the foreign server to 
each FDW extensions
and register it as a callback function on the body side.
The attached patch adds this monitoring infrastructure to core.
Within the callback functions, it is expected
that each FDWs will check the state of the connection they hold and call 
ereport (ERROR)
if it cannot connect to someone.
Of course, you can also have the callback function return false.
There is no particular reason to choose the current method.
Callback functions will be called periodically.

## Implementation

This patch introduces a new timeout and a new GUC parameter. GUC controls the 
timeout interval.
The timeout takes effect when the callback function is first registered,
before each SQL command is executed, and at the end of the timeout.
This implementation is based on the client_connection_check_interval and other 
timeouts.

## Further work

As the next step I have a plan to implement the callback function to 
postgres_fdw.
I already made a PoC, but it deeply depends on the following thread:
https://commitfest.postgresql.org/35/3098/

I also want you to review the postgres_fdw part,
but I think it should not be attached because cfbot cannot understand such a 
dependency
and will throw build error. Do you know how to deal with them in this case?

Your comments and suggestions are very welcome.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachment: v01_add_checking_infrastracture.patch
Description: v01_add_checking_infrastracture.patch

Reply via email to