Much of the discussion above seems to revolve around where to store the error information, and how much information is needed for it to be useful.
As a summary, the 5 alternatives I have seen mentioned are:

#1. Store some simple message in pg_subscription ("I wasn't trying to capture everything, just enough to give the user a reasonable indication of what went wrong" [Mark-1]). Storing the error message was also seen as a convenience for writing TAP tests ("I originally ran into the motivation to write this patch when frustrated that TAP tests needed to parse the apply worker log file" [Mark-2]). It can also sometimes provide a simple clue for the error (e.g. a PK violation for table TBL), but the user will still have to look elsewhere for the details needed to resolve the error. So while this implementation seems good for simple scenarios, it appears to have been shot down because in the non-trivial scenarios the error message holds insufficient or wrong information. A DETAIL field could have been added to give more information, but that would maybe bloat the catalog ("I have not yet included the DETAIL field because I didn't want to bloat the catalog." [Mark-3]).

#2. Similarly, another idea was to use the other existing catalog table pg_subscription_rel. This could have the same problems ("It won't be sufficient to store information in either pg_subscription_rel or pg_susbscription." [Amit-1]).

#3. There was another suggestion to use the Stats Collector to hold the error message [Amit-2]. For me, this felt like blurring the distinction between "stats tracking/metrics" and "logs" too much. ERROR logs must be flushed, whereas for stats (IIUC) there is no guarantee that everything you need to see will be present. Indeed, Amit wrote "But in this case, if the stats collector missed updating the information, the user may have to manually update the subscription and let the error happen again to see it." [Amit-3]. Asking the user to cause the same error again, just in case it was not captured the first time, seems too strange to me.

#4. The next idea was to have an entirely new catalog for holding the subscription error information. I feel that storing/duplicating lots of error information in another table is a bridge too far. What about the risks of storing incorrect or insufficient information? What is the advantage of duplicating errors over just referring to the log files for the ERROR details?

#5. Document that the user should refer to the logs. All ERROR details are already in the logs, and this seems to me the intuitive place to look for them. Searching for specific errors programmatically becomes difficult (is this really a problem other than for complex TAP tests?), but there is no risk of missing or insufficient information being captured in the log files ("but still there will be some information related to ERROR which we wanted the user to see unless we ask them to refer to logs for that." [Amit-4]).

---

My preferred alternative is #5. ERRORs are logged in the log file, so there is nothing really for this patch to do in this regard (except documentation); there is no risk of missing any information, no ambiguity from duplicated errors, and it is the intuitive place the user would look. So I felt the current best combination is just this:

a) A tri-state indicating the state of the subscription, e.g. something like "enabled" ('e') / "disabled" ('d') / "auto-disabled" ('a') [Amit-5]

b) For "auto-disabled", the PG docs would be updated to tell the user to check the logs to resolve the problem before re-enabling the subscription (see the sketch below).
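To illustrate how (a) and (b) might combine, here is a minimal sketch of the user workflow, assuming (hypothetically) that pg_subscription exposed some char column "substate" holding those tri-state values. Only the ALTER SUBSCRIPTION part is existing syntax; the column name and values are just my assumptions for the example.

    -- Hypothetical: "substate" is an assumed column name for the proposed
    -- tri-state ('e' = enabled, 'd' = disabled, 'a' = auto-disabled).
    -- Find any subscriptions that were auto-disabled:
    SELECT subname FROM pg_subscription WHERE substate = 'a';

    -- Per (b), the user then checks the logs, fixes the reported
    -- problem, and re-enables (this part is existing syntax):
    ALTER SUBSCRIPTION mysub ENABLE;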
//////////

IMO it is not exactly clear to me what the main goal of this patch is. Because of this, I feel you can't really judge whether this new option is actually useful or not, except in hindsight. It seems like whatever you implement can be made to look good or bad just by citing different test scenarios. e.g.

* Is the goal mainly to help automated (TAP) testing? In that case, maybe you do want to store the error message somewhere other than the log files. But I still wonder if the results would be unpredictable anyway - e.g. if there are multiple tables all with errors, then which error you see as the cause of the auto-disable depends on the tablesync order of execution, right? And if it is not predictable, maybe it is less useful.

* Is the goal to prevent some *unattended* SUBSCRIPTION from going bad at some point in the future and then going into a relaunch loop for days/weeks, causing 1000's of errors without the user noticing? In that case this patch seems quite useful, but for this goal maybe you don't want to be checking the tablesync workers at all, and should only be checking the apply worker like your original v1 patch did.

* Is the goal just to be a convenient way to disable the subscription during the CREATE SUBSCRIPTION phase, so that the user can make corrections in peace without the workers re-launching and making more error logs? Here the patch is helpful, but only for simple scenarios like 1 faulty table. Imagine there are 10 tables (all with PK violations at the DATASYNC copy): you will encounter the errors one at a time and have to re-enable the subscription 10 times, after fixing each error in turn. So in this scenario the new option might be more of a hindrance than a help, because it would be easier for the user to just do "ALTER SUBSCRIPTION sub DISABLE" manually and fix all the problems in one sitting before re-enabling.

* etc.

//////////

Finally, here is one last (crazy?) thought-bubble, just for consideration. I might be wrong, but my gut feeling is that the Stats Collector is intended more for "tracking" and "metrics" than for holding duplicates of logged error messages. At the same time, I felt that disabling an entire subscription due to a single rogue error might sometimes be overkill. But I wonder if there is a way to combine those two ideas, so that the Stats Collector gets a new counter tracking the number of worker re-launches that have occurred, while a subscription option gives a threshold above which the subscription would be disabled. e.g. (a sketch of the user-facing syntax follows this list)

  "disable_on_error_threshold=0"   the default; relaunch forever
  "disable_on_error_threshold=1"   disable upon the first error encountered (this is how your patch behaves now, I think)
  "disable_on_error_threshold=500" disable if the re-launch errors go unattended and happen 500 times
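To make the thought-bubble concrete, the user-facing part might look something like the sketch below. To be clear, "disable_on_error_threshold" is only a hypothetical option name; nothing like it exists in any posted patch, and only the general CREATE SUBSCRIPTION ... WITH (...) shape is existing syntax.

    -- Hypothetical sketch only; "disable_on_error_threshold" is not an
    -- existing subscription option.
    CREATE SUBSCRIPTION mysub
        CONNECTION 'host=... dbname=...'
        PUBLICATION mypub
        WITH (disable_on_error_threshold = 500);

    -- i.e. tolerate up to 500 unattended worker re-launch errors before
    -- the subscription is automatically disabled.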
------
[Mark-1] https://www.postgresql.org/message-id/A539C848-670E-454F-B31C-82D3CBE9F5AC%40enterprisedb.com
[Mark-2] https://www.postgresql.org/message-id/DB35438F-9356-4841-89A0-412709EBD3AB%40enterprisedb.com
[Mark-3] https://www.postgresql.org/message-id/DE7E13B7-DC76-416A-A98F-3BC3F80E6BE9%40enterprisedb.com
[Amit-1] https://www.postgresql.org/message-id/CAA4eK1K_JFSFrAkr_fgp3VX6hTSmjK%3DwNs4Tw8rUWHGp0%2BBsaw%40mail.gmail.com
[Amit-2] https://www.postgresql.org/message-id/CAA4eK1%2BNoRbYSH1J08zi4OJ_EUMcjmxTwnmwVqZ6e_xzS0D6VA%40mail.gmail.com
[Amit-3] https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com
[Amit-4] https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com
[Amit-5] https://www.postgresql.org/message-id/CAA4eK1Kyx6U9yxC7OXoBD7pHC3bJ4LuNGd%3DOiABmiW6%2BqG%2BvEQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia