Re: Patroni vs pgpool II

2023-04-07 Thread Tatsuo Ishii
> I truly believe that this problem – HA – is PostgreSQL's, not 3rd
> party's. And it's a shame that Postgres itself doesn't solve this. So
> we're discussing it here.

Let's see what other subscribers on this forum say.

>> > What if pg1 is currently primary, pg0 is standby, both are healthy, but
>> > due to network issues, neither pg1 nor w2 is reachable from the other
>> > nodes? Will pg1 remain primary, and will w0 and w1 decide to promote pg0?
>>
>> pg1 will remain primary but it is set to "quarantine" state from
>> pgpool's point of view, which means clients cannot access pg1 via
>> pgpool.
> 
> So we have a split brain here – two primaries. Especially if some
> clients communicate with PG directly. 

Clients are not allowed to communicate with PostgreSQL
directly. That's a prerequisite of using Pgpool-II.

> And even if there are no such
> clients, archive_command is going to
> work on both nodes,

What's the problem with this? Moreover, you can write logic into the
failover command to disable it.

> monitoring will show two primaries confusing
> humans (e.g, SREs) and various systems,

That's why Pgpool-II provides its own monitoring tools. A clustering
system is different from standalone PostgreSQL. Existing PostgreSQL
tools usually only take standalone PostgreSQL into account. Users have
to realize the difference.

> if we have many standby nodes,
> some of them might continue replicating from the old primary if they
> happen to be in the same network partition, and so on.

As for existing standbys in the same network partition as pg0, you can
make them follow pg0, either manually or automatically; a sketch follows.
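As a rough illustration of what such a "follow" step could do for one
standby (host names, paths and the replication user are placeholders;
Pgpool-II can drive something similar per standby from its follow-primary
command in recent versions, or you can run it by hand):

#!/bin/bash
# follow_pg0.sh -- sketch only; all names and paths are placeholders.
STANDBY_PGDATA=/var/lib/postgresql/15/main
NEW_PRIMARY=pg0

# Stop the standby, resync it against the new primary with pg_rewind
# (requires wal_log_hints or data checksums on the standby),
# then point it at pg0 and restart it as a standby.
pg_ctl -D "$STANDBY_PGDATA" stop -m fast
pg_rewind --target-pgdata="$STANDBY_PGDATA" \
          --source-server="host=$NEW_PRIMARY user=postgres dbname=postgres"
echo "primary_conninfo = 'host=$NEW_PRIMARY user=replicator'" \
    >> "$STANDBY_PGDATA/postgresql.auto.conf"
touch "$STANDBY_PGDATA/standby.signal"
pg_ctl -D "$STANDBY_PGDATA" start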

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


Re: Patroni vs pgpool II

2023-04-07 Thread Jehan-Guillaume de Rorthais
On Fri, 07 Apr 2023 13:16:59 +0900 (JST)
Tatsuo Ishii  wrote:

> >> > But, I heard PgPool is still affected by Split brain syndrome.
> >> 
> >> Can you elaborate more? If more than 3 pgpool watchdog nodes (the
> >> number of nodes must be odd) are configured, a split brain can be
> >> avoided.  
> > 
> > Split brain is a hard situation to avoid. I suppose OP is talking about
> > PostgreSQL split brain situation. I'm not sure how PgPool's watchdog would
> > avoid that.  
> 
> Ok, "split brain" means here that there are two or more PostgreSQL
> primary serves exist.
> 
> Pgpool-II's watchdog has a feature called "quorum failover" to avoid
> the situation. To make this work, you need to configure 3 or more
> Pgpool-II nodes. Suppose they are w0, w1 and w2. Also suppose there
> are two PostgreSQL servers pg0 (primary) and pg1 (standby). The goal
> is to avoid that both pg0 and pg1 become primary servers.
> 
> Pgpool-II periodically monitors PostgreSQL healthiness by checking
> whether it can reach the PostgreSQL servers. Suppose w0 and w1
> detect that pg0 is healthy but pg1 is not, while w2 thinks the opposite,
> i.e. pg0 is unhealthy but pg1 is healthy (this could happen if w0, w1 and
> pg0 are in network A, but w2 and pg1 are in a different network B, and A
> and B cannot reach each other).
> 
> In this situation, if w2 promotes pg1 because pg0 seems to be down, the
> system ends up with two primary servers: split brain.
> 
> With quorum failover enabled, w0, w1 and w2 communicate with each other
> to vote on who is correct (a watchdog that cannot be reached is regarded
> as down). In the case above w0 and w1 form the majority and win. Thus
> w0 and w1 just detach pg1 and keep using pg0 as the primary. On the
> other hand, since w2 loses the vote, it gives up promoting pg1, and the
> split brain is avoided.
> 
> Note that in the configuration above, clients access the cluster via a
> VIP. The VIP is always controlled by the majority watchdogs, so clients
> will not access pg1, which is set to down status by w0 and w1.
> 
> > To avoid split brain, you need to implement a combination of quorum and
> > (self-)fencing.
> > 
> > Patroni quorum is in the DCS's hands. Patroni's self-fencing can be achieved
> > with the (hardware) watchdog. You can also implement node fencing through
> > the "pre_promote" script to fence the old primary node before promoting the
> > new one.
> > 
> > If you need HA with a high level of anti-split-brain security, you'll not be
> > able to avoid some sort of fencing, no matter what.
> > 
> > Good luck.  
> 
> Well, if you define fencing as STONITH (Shoot The Other Node in the
> Head), Pgpool-II does not have the feature.

And I believe that's part of what Cen was complaining about:

«
  It is basically a daemon glued together with scripts for which you are 
  entirely responsible for. Any small mistake in failover scripts and 
  cluster enters  a broken state.
»

If you want to build something clean, including fencing, you'll have to
handle/develop it yourself in scripts.
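Just as an illustration of the kind of step that ends up in those
hand-written scripts (the BMC address, credentials and tool are
placeholders; any fencing agent would do):

#!/bin/bash
# Illustrative fencing step for a hand-written failover script.
# All hosts and credentials below are placeholders.
OLD_PRIMARY_BMC=10.0.0.42

# Power the old primary off through its BMC *before* promoting the
# standby, and refuse to promote if fencing fails.
if ! ipmitool -I lanplus -H "$OLD_PRIMARY_BMC" -U admin -P secret chassis power off
then
    echo "fencing failed, refusing to promote" >&2
    exit 1
fi
# ...only now is it safe to promote the standby...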

> However I am not sure STONITH is always mandatory.

Sure, it really depends on how much risk you can accept and how much
complexity you can afford. Some clusters can live with a 10-minute split
brain where others cannot survive a 5-second one.

> I think that depends on what you want to avoid by using fencing. If the
> purpose is to avoid having two primary servers at the same time,
> Pgpool-II achieves that as described above.

How could you be so sure?

See https://www.alteeve.com/w/The_2-Node_Myth

«
  * Quorum is a tool for when things are working predictably
  * Fencing is a tool for when things go wrong
»

Regards,




Re: Patroni vs pgpool II

2023-04-07 Thread Tatsuo Ishii
> And I believe that's part of what Cen was complaining about:
> 
> «
>   It is basically a daemon glued together with scripts for which you are 
>   entirely responsible for. Any small mistake in failover scripts and 
>   cluster enters  a broken state.
> »
> 
> If you want to build something clean, including fencing, you'll have to
> handle/develop it yourself in scripts.

That's a design decision. This gives maximum flexibility to users.

Please note that we provide step-by-step installation/configuration
documents which have been used by production systems.

https://www.pgpool.net/docs/44/en/html/example-cluster.html

>> However I am not sure STONITH is always mandatory.
> 
> Sure, it really depends on how much risk you can accept and how much
> complexity you can afford. Some clusters can live with a 10-minute split
> brain where others cannot survive a 5-second one.
> 
>> I think that depends on what you want to avoid by using fencing. If the
>> purpose is to avoid having two primary servers at the same time,
>> Pgpool-II achieves that as described above.
> 
> How could you be so sure?
> 
> See https://www.alteeve.com/w/The_2-Node_Myth
> 
> «
>   * Quorum is a tool for when things are working predictably
>   * Fencing is a tool for when things go wrong

I think the article does not apply to Pgpool-II.

---
 3-Node

When node 1 stops responding, node 2 declares it lost, reforms a
cluster with the quorum node, node 3, and is quorate. It begins
recovery by mounting the filesystem under NFS, which replays journals
and cleans up, then starts NFS and takes the virtual IP address.

Later, node 1 recovers from its hang. At the moment of recovery, it
has no concept that time has passed and so has no reason to check to
see if it is still quorate or whether its locks are still valid. It
just finished doing whatever it was doing at the moment it hung.

In the best case scenario, you now have two machines claiming the same
IP address. At worse, you have uncoordinated writes to storage and you
corrupt your data.
---

> Later, node 1 recovers from its hang.

Pgpool-II does not allow automatic recovery. If node 1 hangs and is once
recognized as "down" by the other nodes, it will not be used again
without manual intervention. Thus the disaster described above will not
happen with Pgpool-II.
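The manual step is simply to re-attach the node once an administrator
has checked it is safe, along these lines (host and node id are
placeholders, and the option names are from recent Pgpool-II versions;
check the pcp_attach_node manual of your version):

# After verifying that the old node is consistent (or re-syncing it),
# an administrator explicitly re-attaches it; node id 1 is an example.
pcp_attach_node -h <pgpool_host> -p 9898 -U pgpool -n 1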

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Patroni vs pgpool II

2023-04-07 Thread Jehan-Guillaume de Rorthais
On Fri, 07 Apr 2023 18:04:05 +0900 (JST)
Tatsuo Ishii  wrote:

> > And I believe that's part of what Cen was complaining about:
> > 
> > «
> >   It is basically a daemon glued together with scripts for which you are 
> >   entirely responsible for. Any small mistake in failover scripts and 
> >   cluster enters  a broken state.
> > »
> > 
> > If you want to build something clean, including fencing, you'll have to
> > handle/develop it yourself in scripts.
> 
> That's a design decision. This gives maximum flexibility to users.

Sure, no problem with that. But people have to realize that the downside
is that it leaves the whole complexity and reliability of the cluster in
the hands of the administrator. And those scripts are much more
complicated and racy than a simple node promotion.

Even dealing with a simple vIP can become a nightmare if not done correctly.

> Please note that we provide step-by-step installation/configuration
> documents which has been used by production systems.
> 
> https://www.pgpool.net/docs/44/en/html/example-cluster.html

These scripts rely on SSH, which is really bad. What if you have an SSH
failure in the mix?

Moreover, even if SSH weren't a weakness by itself, the script doesn't
even try to shut down the old node or stop the old primary.

You can add to the mix that both Pgpool and SSH rely on TCP for
availability checks and actions. You'd better have very low TCP
timeouts/retries...
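To give an idea, every remote action in such scripts has to be bounded
explicitly, otherwise a single hung peer stalls the failover for the
kernel's default TCP retry time; the host, path and timeout values below
are arbitrary examples:

# Bound both the TCP connection attempt and the whole remote action.
timeout 10 ssh -o ConnectTimeout=3 \
               -o ServerAliveInterval=2 -o ServerAliveCountMax=2 \
    postgres@old-primary 'pg_ctl -D /var/lib/postgresql/15/main stop -m immediate'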

When a service loses quorum on a resource, it is supposed to shut down as
fast as possible... or even self-fence using a watchdog device if the
shutdown action doesn't return fast enough.

> >> However I am not sure STONITH is always mandatory.  
> > 
> > Sure, it really depends on how much risk you can accept and how much
> > complexity you can afford. Some clusters can live with a 10-minute split
> > brain where others cannot survive a 5-second one.
> >   
> >> I think that depends on what you want to avoid by using fencing. If the
> >> purpose is to avoid having two primary servers at the same time,
> >> Pgpool-II achieves that as described above.
> > 
> > How could you be so sure?
> > 
> > See https://www.alteeve.com/w/The_2-Node_Myth
> > 
> > «
> >   * Quorum is a tool for when things are working predictably
> >   * Fencing is a tool for when things go wrong  
> 
> I think the article does not apply to Pgpool-II.

It is a simple example using NFS. The point here is that when things get
unpredictable, quorum alone is just not enough. So yes, it does apply to
Pgpool.

Quorum is nice when nodes can communicate with each other, when they have
enough time and/or minimal load to complete actions correctly.

My point is that a proper cluster with an anti-split-brain solution
requires both quorum and fencing.

> [...]
> > Later, node 1 recovers from its hang.  
> 
> Pgpool-II does not allow automatic recovery.

Neither does this example. There's no automatic recovery. It just states
that node 1 was unable to answer in a timely fashion, just long enough
for a new quorum to form and elect a new primary. But node 1 was not
dead, and when node 1 is able to answer again, boom.

A service being mute for some period of time is really common. There are
various articles and conference talks about clusters failing over wrongly
because of e.g. a high load on the primary... The last one was at FOSDEM,
IIRC.

> If node 1 hangs and is once recognized as "down" by the other nodes, it
> will not be used again without manual intervention. Thus the disaster
> described above will not happen with Pgpool-II.

Ok, so I suppose **all** connections, scripts, software, backups,
maintenance and admins must go through Pgpool to be sure to hit the
correct primary.

This might be acceptable in some situations, but I wouldn't call that an
anti-split-brain solution. It's some kind of «software hiding the rogue
node behind a curtain and pretending it doesn't exist anymore».

Regards,




Re: "PANIC: could not open critical system index 2662" - twice

2023-04-07 Thread Laurenz Albe
On Thu, 2023-04-06 at 16:41 +, Evgeny Morozov wrote:
>  Our PostgreSQL 15.2 instance running on Ubuntu 18.04 has crashed with this 
> error: 
> 
> 2023-04-05 09:24:03.448 UTC [15227] ERROR:  index "pg_class_oid_index" 
> contains unexpected zero page at block 0
> [...]
> 
> We had the same thing happen about a month ago on a different database on 
> the same cluster.
> For a while PG actually ran OK as long as you didn't access that specific DB, 
> but when trying
> to back up that DB with pg_dump it would crash every time. At that time one 
> of the disks
> hosting the ZFS dataset with the PG data directory on it was reporting 
> errors, so we thought
> it was likely due to that.
> 
> Unfortunately, before we could replace the disks, PG crashed completely and 
> would not start
> again at all, so I had to rebuild the cluster from scratch and restore from 
> pg_dump backups
> (still onto the old, bad disks). Once the disks were replaced (all of them) I 
> just copied
> the data to them using zfs send | zfs receive and didn't bother restoring 
> pg_dump backups
> again - which was perhaps foolish in hindsight.
> 
> Well, yesterday it happened again. The server still restarted OK, so I took 
> fresh pg_dump
> backups of the databases we care about (which ran fine), rebuilt the cluster 
> and restored
> the pg_dump backups again - now onto the new disks, which are not reporting 
> any problems.
> 
> So while everything is up and running now this error has me rather concerned. 
> Could the
> error we're seeing now have been caused by some corruption in the PG data 
> that's been there
> for a month (so it could still be attributed to the bad disk), which should 
> now be fixed by
> having restored from backups onto good disks?

Yes, that is entirely possible.

> Could this be a PG bug?

It could be, but data corruption caused by bad hardware is much more likely.

> What can I do to figure out why this is happening and prevent it from 
> happening again?

No idea about the former, but bad hardware is a good enough explanation.

As to keeping it from happening: use good hardware.

Yours,
Laurenz Albe




Re: Patroni vs pgpool II

2023-04-07 Thread Ron

On 4/7/23 05:46, Jehan-Guillaume de Rorthais wrote:

On Fri, 07 Apr 2023 18:04:05 +0900 (JST)
Tatsuo Ishii  wrote:


And I believe that's part of what Cen was complaining about:

«
   It is basically a daemon glued together with scripts for which you are
   entirely responsible for. Any small mistake in failover scripts and
   cluster enters  a broken state.
»

If you want to build something clean, including fencing, you'll have to
handle/develop it yourself in scripts.

That's a design decision. This gives maximum flexibility to users.

Sure, no problem with that. But people have to realize that the downside is that
it leaves the whole complexity and reliability of the cluster in the hands of
the administrator. And those scripts are much more complicated and racy than
a simple node promotion.

Even dealing with a simple vIP can become a nightmare if not done correctly.


Please note that we provide step-by-step installation/configuration
documents which has been used by production systems.

https://www.pgpool.net/docs/44/en/html/example-cluster.html

These scripts rely on SSH, which is really bad. What if you have an SSH failure
in the mix?

Moreover, even if SSH weren't a weakness by itself, the script doesn't
even try to shut down the old node or stop the old primary.


That does not matter when only PgPool does the writing to the database.


You can add to the mix that both Pgpool and SSH rely on TCP for availability
checks and actions. You'd better have very low TCP timeouts/retries...

When a service loses quorum on a resource, it is supposed to shut down as fast as
possible... or even self-fence using a watchdog device if the shutdown
action doesn't return fast enough.


Scenario:
S0 - Running Postgresql as primary, and also PgPool.
S1 - Running Postgresql as secondary, and also PgPool.
S2 - Running only PgPool.  Has the VIP.

There's no /need/ for Postgresql or PgPool on server 0 to shut down if it 
loses contact with S1 and S2, since they'll also notice that S0 has 
disappeared.  In that case, they'll vote S0 into degraded state, and promote 
S1 to be the Postgresql primary.


A good question is what happens when S0 and S1 lose connection to S2 
(meaning that S2 loses connection to them, too).  S0 and S1 then "should" 
vote that S0 take over the VIP.  But, if S2 is still up and can connect to 
"the world", does it voluntarily decide to give up the VIP since it's all alone?


--
Born in Arizona, moved to Babylonia.

Re: "PANIC: could not open critical system index 2662" - twice

2023-04-07 Thread Michael Paquier
On Fri, Apr 07, 2023 at 01:04:34PM +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +, Evgeny Morozov wrote:
>> Could this be a PG bug?
> 
> It could be, but data corruption caused by bad hardware is much more likely.

There is no way to be completely sure here, unless we were able to put
our hands on a reproducible test case that breaks the cluster badly
enough to get into this state.  I don't recall seeing this error
pattern in recent history, though.
--
Michael




Re: Patroni vs pgpool II

2023-04-07 Thread Tatsuo Ishii
> Scenario:
> S0 - Running Postgresql as primary, and also PgPool.
> S1 - Running Postgresql as secondary, and also PgPool.
> S2 - Running only PgPool.  Has the VIP.
> 
> There's no /need/ for Postgresql or PgPool on server 0 to shut down if
> it loses contact with S1 and S2, since they'll also notice that that
> S1 has disappeared.  In that case, they'll vote S1 into degraded
> state, and promote S1 to be the Postgresql primary.
> 
> A good question is what happens when S0 and S1 lose connection to S2
> (meaning that S2 loses connection to them, too).  S0 and S1 then
> "should" vote that S0 take over the VIP.  But, if S2 is still up and
> can connect to "the world", does it voluntarily decide to give up the
> VIP since it's all alone?

Yes, because S2's pgpool is not the leader anymore. In this case S2
voluntarily gives up the VIP.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: Patroni vs pgpool II

2023-04-07 Thread Tatsuo Ishii
>> If node 1 hangs and once it is recognized as "down" by other nodes, it will
>> not be used without manual intervention. Thus the disaster described above
>> will not happen in pgpool.
> 
> Ok, so I suppose **all** connections, scripts, softwares, backups, 
> maintenances
> and admins must go through Pgpool to be sure to hit the correct primary.
> 
> This might be acceptable in some situation, but I wouldn't call that an
> anti-split-brain solution. It's some kind of «software hiding the rogue node
> behind a curtain and pretend it doesn't exist anymore»

You can call Pgpool-II whatever you like. The important thing for me (and
probably for users) is whether it can solve the user's problem or not.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp




Re: "PANIC: could not open critical system index 2662" - twice

2023-04-07 Thread Peter J. Holzer
On 2023-04-07 13:04:34 +0200, Laurenz Albe wrote:
> On Thu, 2023-04-06 at 16:41 +, Evgeny Morozov wrote:
> > What can I do to figure out why this is happening and prevent it from 
> > happening again?
> 
> No idea about the former, but bad hardware is a good enough explanation.
> 
> As to keeping it from happening: use good hardware.

Also: Use checksums. PostgreSQL offers data checksums[1]. Some filesystems
also offer checksums.

This doesn't prevent corruption but at least it will be detected early
and can't spread.
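For reference, roughly how to get them (pg_checksums needs the cluster to
be shut down cleanly, and is available since PostgreSQL 12; paths are
placeholders):

# On a new cluster:
initdb --data-checksums -D /path/to/data

# On an existing, cleanly stopped cluster (PostgreSQL 12 and later):
pg_checksums --enable -D /path/to/data

# To see whether a running cluster already has them:
psql -c "SHOW data_checksums;"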

hp

[1] For some reason I thought the Debian/Ubuntu packages enabled this by
default. But that doesn't seem to be the case.

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Patroni vs pgpool II

2023-04-07 Thread Jehan-Guillaume de Rorthais
On Fri, 07 Apr 2023 21:16:04 +0900 (JST)
Tatsuo Ishii  wrote:

> >> If node 1 hangs and is once recognized as "down" by the other nodes, it
> >> will not be used again without manual intervention. Thus the disaster
> >> described above will not happen with Pgpool-II.  
> > 
> > Ok, so I suppose **all** connections, scripts, software, backups,
> > maintenance and admins must go through Pgpool to be sure to hit the
> > correct primary.
> > 
> > This might be acceptable in some situations, but I wouldn't call that an
> > anti-split-brain solution. It's some kind of «software hiding the rogue
> > node behind a curtain and pretending it doesn't exist anymore».  
> 
> You can call Pgpool-II whatever you like.

I didn't mean to be rude here. Please, accept my apologies if my words offended
you.

I consider "proxy-based" fencing architecture fragile because you just don't
know what is happening on your rogue node as long as a meatware is coming along
to deal with it. Moreover, you must trust your scripts, configurations,
procedures, admins, applications, users, replication, network, Pgpool, etc to
not fail on you in the meantime...

In the Pacemaker world, where everything MUST be **predictable**, the only way
to predict the state of a rogue node is to fence it from the cluster. Either cut
it from the network, shut it down, or set up the watchdog so it resets itself if
needed. In the end, you know your old primary is off or idle or screaming in
the void with no one to hear it. It can't harm your other nodes, data or apps
anymore, no matter what.

> The important thing for me (and probably for users) is whether it can solve
> the user's problem or not.

In my humble (and biased) opinion, Patroni, PAF or shared-storage clusters
solve the user's problem with regard to HA. All with pros and cons. All rely on
strong, safe, well-known and well-developed clustering concepts.

Some consider them complex pieces of software to deploy and maintain, but
this is because HA is complex. No miracle here.

Solutions like Pgpool or repmgr try hard to re-implement HA concepts
but leave most of this complexity and safety to the user's discretion.
Unfortunately, it is not the user's role to deal with such things. This
kind of architecture probably answers a need, a gray zone, where it is good
enough. I've seen a similar approach in the past with pgbouncer + bash
scripting calling itself a "fencing" solution [1]. I'm fine with it as long
as people are clear about the limitations.

Kind regards,

[1] eg.
https://www.postgresql.eu/events/pgconfeu2016/sessions/session/1348-ha-with-repmgr-barman-and-pgbouncer/




Re: [EXTERNAL]: Re: UPSERT in Postgres

2023-04-07 Thread Adrian Klaver

On 4/6/23 18:27, Louis Tian wrote:

Hi Adrian,

Thank you. I think this is a better approach than a trigger-based
solution, at least for my taste.
That being said, it does require pushing some logic to the client side
(figuring out which required column value is missing and setting its value
to the existing one by referencing the table).
I still wish there were an UPSERT statement that could handle this and make
the dev experience better.


It does what is advertised on the tin:

https://www.postgresql.org/docs/current/sql-insert.html

The optional ON CONFLICT clause specifies an alternative action to 
raising a unique violation or exclusion constraint violation error


[...]

ON CONFLICT DO UPDATE guarantees an atomic INSERT or UPDATE outcome; 
provided there is no independent error, one of those two outcomes is 
guaranteed, even under high concurrency. This is also known as UPSERT — 
“UPDATE or INSERT”.


You got caught by the '...independent error...' part. The same thing 
would have happened if you had just done:


insert into person (id, is_active) values(0, true);
ERROR:  null value in column "name" of relation "person" violates 
not-null constraint



The insert has to be valid on its own before you get to the 'alternative 
action to raising a unique violation or exclusion constraint violation 
error' part. Otherwise you are asking Postgres to override this 'insert 
into person (id, is_active)' and guess you really wanted something like:


insert into person (id, name, is_active) values(0, , true)

I would not like the server making those guesses on my behalf.


Cheers,
Louis Tian



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: [EXTERNAL]: Re: UPSERT in Postgres

2023-04-07 Thread Adrian Klaver

On 4/6/23 18:27, Louis Tian wrote:

Hi Adrian,

Thank you. I think this is a better approach than a trigger-based
solution, at least for my taste.
That being said, it does require pushing some logic to the client side
(figuring out which required column value is missing and setting its value
to the existing one by referencing the table).
I still wish there were an UPSERT statement that could handle this and make
the dev experience better.


Another way to make the experience easier:

alter table person alter COLUMN name set default 'cat';

 \d person
  Table "public.person"
  Column    |  Type   | Collation | Nullable |   Default
------------+---------+-----------+----------+-------------
 id         | integer |           | not null |
 name       | text    |           | not null | 'cat'::text
 is_active  | boolean |           |          |

select * from person;
 id | name | is_active
----+------+-----------
  0 | foo  | NULL


insert into person(id, is_active) values (0,  true) on conflict ("id") 
do update set (id, is_active) = (excluded.id,  excluded.is_active);

INSERT 0 1

select * from person;
 id | name | is_active
----+------+-----------
  0 | foo  | t




Cheers,
Louis Tian


--
Adrian Klaver
adrian.kla...@aklaver.com





pg_ctlcluster is not stopping cluster

2023-04-07 Thread Telium Technical Support
I am trying to stop my PostgreSQL server (on Debian 11) using the following
command

 

root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m fast
-D /var/lib/postgresql/13/main

Notice: extra pg_ctl/postgres options given, bypassing systemctl for stop
operation

pg_ctl: PID file "/var/lib/postgresql/13/main/postmaster.pid" does not exist

Is server running?

 

The notice is correct, there is no such postmaster.pid file in that directory,
but yes, the service is running. I can confirm it's running with:

 

root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status -- -D
/var/lib/postgresql/13/main

pg_ctl: server is running (PID: 2701882)

/usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main" "-c"
"config_file=/etc/postgresql/15/main/postgresql.conf"

 

So why is my stop command being ignored? (I tried without the -m fast option
but no change, and I'd like to keep that for other reasons). I confirmed
with 'ps ax' that postgresql 15 is running, despite the directory suggesting
it might be 13.  Coincidentally, there is a postmaster.pid file in a
directory OTHER than the data directory:

 

/var/lib/postgresql/15/main/postmaster.pid

 

(and notice the 15). Is this a clue?

 



Re: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Boris Epstein
I wonder if the best way to proceed would be to go on to individual nodes
in the cluster and use OS level commands (such as ps) to track individual
processes and stop them individually.

On Fri, Apr 7, 2023 at 6:27 PM Telium Technical Support 
wrote:

> I am trying to stop my PostgreSQL server (on Debian 11) using the
> following command
>
>
>
> root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m
> fast -D /var/lib/postgresql/13/main
>
> Notice: extra pg_ctl/postgres options given, bypassing systemctl for stop
> operation
>
> pg_ctl: PID file "/var/lib/postgresql/13/main/postmaster.pid" does not
> exist
>
> Is server running?
>
>
>
> The notice is correct, there is no such postmaster.pid file in that
> directory, but yes, the service is running. I can confirm it's running with:
>
>
>
> root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status -- -D
> /var/lib/postgresql/13/main
>
> pg_ctl: server is running (PID: 2701882)
>
> /usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main"
> "-c" "config_file=/etc/postgresql/15/main/postgresql.conf"
>
>
>
> So why is my stop command being ignored? (I tried without the -m fast
> option but no change, and I'd like to keep that for other reasons). I
> confirmed with 'ps ax' that postgresql 15 is running, despite the directory
> suggesting it might be 13.  Coincidentally, there is a postmaster.pid file
> in a directory OTHER than the data directory:
>
>
>
> /var/lib/postgresql/15/main/postmaster.pid
>
>
>
> (and notice the 15). Is this a clue?
>
>
>


Re: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Adrian Klaver

On 4/7/23 15:27, Telium Technical Support wrote:
I am trying to stop my PostgreSQL server (on Debian 11) using the 
following command


root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m 
fast -D /var/lib/postgresql/13/main


Notice: extra pg_ctl/postgres options given, bypassing systemctl for 
stop operation


pg_ctl: PID file "/var/lib/postgresql/13/main/postmaster.pid" does not exist

Is server running?

The notice is correct, there is no such postmaster.pid file in that 
directory, but yes, the service is running. I can confirm it's running with:


root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status -- -D 
/var/lib/postgresql/13/main


pg_ctl: server is running (PID: 2701882)

/usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main" 
"-c" "config_file=/etc/postgresql/15/main/postgresql.conf"


So why is my stop command being ignored? (I tried without the -m fast 
option but no change, and I'd like to keep that for other reasons). I 
confirmed with 'ps ax' that postgresql 15 is running, despite the 
directory suggesting it might be 13.  Coincidentally, there is a 
postmaster.pid file in a directory OTHER than the data directory:


/var/lib/postgresql/15/main/postmaster.pid

(and notice the 15). Is this a clue?


Yes, that this:

 sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m fast -D 
/var/lib/postgresql/13/main


is not correct.

First do:

pg_lsclusters

to determine what is actually running.

Then do

sudo -u postgres /usr/bin/pg_ctlcluster <version> main stop -- -m fast 
-D /var/lib/postgresql/13/main

for whatever version is running.





--
Adrian Klaver
adrian.kla...@aklaver.com





RE: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Telium Technical Support
These commands are actually run from a C++ program (which does a lot of other 
things)… so not easy to change. I’m assuming something is misconfigured on the 
host that’s causing this unusual behavior… and that’s what I need to understand.

 

From: Boris Epstein [mailto:borepst...@gmail.com] 
Sent: Friday, April 7, 2023 6:46 PM
To: Telium Technical Support 
Cc: pgsql-general@lists.postgresql.org
Subject: Re: pg_ctlcluster is not stopping cluster

 

I wonder if the best way to proceed would be to go on to individual nodes in 
the cluster and use OS level commands (such as ps) to track individual 
processes and stop them individually.

 

On Fri, Apr 7, 2023 at 6:27 PM Telium Technical Support <supp...@telium.io> wrote:

I am trying to stop my PostgreSQL server (on Debian 11) using the following 
command

 

root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m fast -D 
/var/lib/postgresql/13/main

Notice: extra pg_ctl/postgres options given, bypassing systemctl for stop 
operation

pg_ctl: PID file "/var/lib/postgresql/13/main/postmaster.pid" does not exist

Is server running?

 

The notice is correct, there is no such postmaster.pid file in that directory, 
but yes, the service is running. I can confirm it's running with:

 

root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status -- -D 
/var/lib/postgresql/13/main

pg_ctl: server is running (PID: 2701882)

/usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main" "-c" 
"config_file=/etc/postgresql/15/main/postgresql.conf"

 

So why is my stop command being ignored? (I tried without the -m fast option 
but no change, and I'd like to keep that for other reasons). I confirmed with 
'ps ax' that postgresql 15 is running, despite the directory suggesting it 
might be 13.  Coincidentally, there is a postmaster.pid file in a directory 
OTHER than the data directory:

 

/var/lib/postgresql/15/main/postmaster.pid

 

(and notice the 15). Is this a clue?

 



Re: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Thorsten Glaser
On Fri, 7 Apr 2023, Telium Technical Support wrote:

>Notice: extra pg_ctl/postgres options given, bypassing systemctl for stop

>it might be 13.  Coincidentally, there is a postmaster.pid file in a
>directory OTHER than the data directory:
>
>/var/lib/postgresql/15/main/postmaster.pid
>
>(and notice the 15). Is this a clue?

Maybe the pidfile is written into the cluster version-based directory
instead of the data directory. Best figure out what exactly writes the
pidfile, whether systemd is used for starting (which would of course be
a prime suspect as it says it’s explicitly not used for stopping), etc.

Maybe this is indeed a bug in whatever determines the pidfile path,
perhaps not; is it supposed to live within the data directory?

bye,
//mirabilos
-- 
15:41⎜ Somebody write a testsuite for helloworld :-)




RE: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Thorsten Glaser
On Fri, 7 Apr 2023, Telium Technical Support wrote:

>I’m assuming something is misconfigured on the host that’s causing this
>unusual behavior….and that’s what I need to understand

The mix between 13 and 15 here is what I’d consider a misconfiguration.

Also, please don’t top-post and full-quote.

bye,
//mirabilos
-- 
15:41⎜ Somebody write a testsuite for helloworld :-)




RE: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Telium Technical Support
I tried the command you suggested, and it shows the data directory status as 
"down". Yet when I ask pg_ctlcluster for its status, it says the server is 
running.

What does this mean?

root@d11:/var/tmp/myapp# pg_lsclusters 
Ver Cluster Port Status OwnerData directory  Log file
13  main5432 down   postgres /var/lib/postgresql/13/main 
/var/log/postgresql/postgresql-13-main.log
15  main5433 down   postgres /var/lib/postgresql/15/main 
/var/log/postgresql/postgresql-15-main.log
root@d11:/var/tmp/myapp# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status 
-- -D /var/lib/postgresql/13/main
pg_ctl: server is running (PID: 2701882)
/usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main" "-c" 
"config_file=/etc/postgresql/15/main/postgresql.conf"
root@d11:/var/tmp/myapp#

-Original Message-
From: Adrian Klaver [mailto:adrian.kla...@aklaver.com] 
Sent: Friday, April 7, 2023 6:47 PM
To: Telium Technical Support ; 
pgsql-general@lists.postgresql.org
Subject: Re: pg_ctlcluster is not stopping cluster

On 4/7/23 15:27, Telium Technical Support wrote:
> I am trying to stop my PostgreSQL server (on Debian 11) using the 
> following command
> 
> root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m 
> fast -D /var/lib/postgresql/13/main
> 
> Notice: extra pg_ctl/postgres options given, bypassing systemctl for 
> stop operation
> 
> pg_ctl: PID file "/var/lib/postgresql/13/main/postmaster.pid" does not 
> exist
> 
> Is server running?
> 
> The notice is correct, there is no such postmaster.pid file in that 
> directory, but yes, the service is running. I can confirm it's running with:
> 
> root@d11:/# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status -- 
> -D /var/lib/postgresql/13/main
> 
> pg_ctl: server is running (PID: 2701882)
> 
> /usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main" 
> "-c" "config_file=/etc/postgresql/15/main/postgresql.conf"
> 
> So why is my stop command being ignored? (I tried without the -m fast 
> option but no change, and I'd like to keep that for other reasons). I 
> confirmed with 'ps ax' that postgresql 15 is running, despite the 
> directory suggesting it might be 13.  Coincidentally, there is a 
> postmaster.pid file in a directory OTHER than the data directory:
> 
> /var/lib/postgresql/15/main/postmaster.pid
> 
> (and notice the 15). Is this a clue?

Yes that this:

  sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m fast -D 
/var/lib/postgresql/13/main

is not correct.

First do:

pg_lsclusters

to determine what is actually running.

Then do

sudo -u postgres /usr/bin/pg_ctlcluster <version> main stop -- -m fast -D 
/var/lib/postgresql/13/main

for whatever version is running.

> 

--
Adrian Klaver
adrian.kla...@aklaver.com






RE: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Telium Technical Support
>The mix between 13 and 15 here is what I’d consider a misconfiguration.

As I inherited this (and I'm somewhat new to pgsql), I'm trying to understand 
this.  From the docs I read online, the postmaster.pid file is supposed to 
reside in the data directory.  Which it does.  So that's ok.

Does the fact that the database resides in /var/lib/postgresql/13 mean I 
have multiple pgsql servers running?  (Does the directory name make a big 
difference?)  Can I just kill all pgsql processes, move the directory into 
the /15 directory, and call the problem solved?

It feels like I'm missing something obvious... why would the directory matter so 
much (since pgsql is clearly tracking it in the right dir)?






Re: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Adrian Klaver

On 4/7/23 15:51, Telium Technical Support wrote:

I tried the command you suggested, and it shows the data directory status as 
"down". Yet when I ask pg_ctlcluster for its status, it says the server is 
running.

What does this mean?

root@d11:/var/tmp/myapp# pg_lsclusters
Ver Cluster Port Status OwnerData directory  Log file
13  main5432 down   postgres /var/lib/postgresql/13/main 
/var/log/postgresql/postgresql-13-main.log
15  main5433 down   postgres /var/lib/postgresql/15/main 
/var/log/postgresql/postgresql-15-main.log


Neither cluster is running.

To confirm do:

ps ax | grep postgres


root@d11:/var/tmp/myapp# sudo -u postgres /usr/bin/pg_ctlcluster 15 main status 
-- -D /var/lib/postgresql/13/main
pg_ctl: server is running (PID: 2701882)


Best guess is that, because of this:

sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -- -m fast -D 
/var/lib/postgresql/13/main


the Postgres PID file did not get removed on shutdown.

Bottom line: you should not use

pg_ctlcluster 15 main stop

to

shut down a 13 cluster located at:

-D /var/lib/postgresql/13/main

Just do:

sudo -u postgres /usr/bin/pg_ctlcluster 15 main stop -m fast

or

sudo -u postgres /usr/bin/pg_ctlcluster 13 main stop -- -m fast

depending on which cluster you want to shut down.



/usr/lib/postgresql/15/bin/postgres "-D" "/var/lib/postgresql/13/main" "-c" 
"config_file=/etc/postgresql/15/main/postgresql.conf"
root@d11:/var/tmp/myapp#



--
Adrian Klaver
adrian.kla...@aklaver.com





Re: pg_ctlcluster is not stopping cluster

2023-04-07 Thread Jerry Sievers
Thorsten Glaser  writes:

> On Fri, 7 Apr 2023, Telium Technical Support wrote:
>
>>Notice: extra pg_ctl/postgres options given, bypassing systemctl for stop
>
>>it might be 13.  Coincidentally, there is a postmaster.pid file in a
>>directory OTHER than the data directory:
>>
>>/var/lib/postgresql/15/main/postmaster.pid
>>
>>(and notice the 15). Is this a clue?
>
> Maybe the pidfile is written into the cluster version-based directory
> instead of the data directory. Best figure out what exactly writes the
> pidfile, whether systemd is used for starting (which would of course be



Some distros, in their wrapper foo (I believe including Debian), use this
setting to put a PID file under /var/run/postgresql, IIRC, but YMMV...

external_pid_file (string)

 Specifies the name of an additional process-ID (PID) file that the
 server should create for use by server administration programs. This
 parameter can only be set at server start.


> a prime suspect as it says it’s explicitly not used for stopping), etc.
>
> Maybe this is indeed a bug in whatever determines the pidfile path,
> perhaps not; is it supposed to live within the data directory?
>
> bye,
> //mirabilos