Re: how to slow down parts of Pg

2020-04-22 Thread Laurenz Albe
On Tue, 2020-04-21 at 20:30 +, Kevin Brannen wrote:
> I have an unusual need:  I need Pg to slow down. I know, we all want our DB 
> to go faster,
> but in this case its speed is working against me in one area.
> 
> We have systems that are geo-redundant for HA, with the redundancy being 
> handled by DRBD to keep the disks in sync,
> which it does at the block level. For normal operations, it actually works 
> out fairly well. [...]
> The part that hurts so bad is when we do maintenance operations that are DB 
> heavy, like deleting really old records out of
> archives (weekly), moving older records from current tables to archive tables 
> plus an analyze (every night),
> running pg_backup (every night), other archiving (weekly), and vacuum full to 
> remove bloat (once a quarter).
> All of this generates a lot of disk writes, to state the obvious.
> 
> The local server can handle it all just fine, but the network can't handle it 
> as it tries to sync to the other server.

The obvious and best answer is: get a faster network, or choose a different
storage solution.

Other than that, you can try to make the maintenance operations less
resource-intensive:

- partition the tables so that you can get rid of old data with DROP TABLE.
  The ANALYZE won't hurt, if you treat only the required tables.
- use "pg_basebackup" with the "--max-rate" option
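
As a sketch of the partitioning idea (table, column, and partition names here
are hypothetical, not from the original post):

```sql
-- Hypothetical archive table partitioned by month.  Dropping a partition
-- is a cheap metadata-only operation that generates almost no block-level
-- write traffic, unlike DELETE + VACUUM on one big table.
CREATE TABLE archive (
    id         bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE archive_2020_03 PARTITION OF archive
    FOR VALUES FROM ('2020-03-01') TO ('2020-04-01');

-- The weekly "delete really old records" job then becomes:
DROP TABLE archive_2020_03;
```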

About VACUUM, you may have a problem.  Avoid the need for VACUUM (FULL) at any 
price.
That usually requires tuning autovacuum to be faster, which means using more 
I/O.
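
For example, a postgresql.conf sketch that makes autovacuum more aggressive
(the values are purely illustrative, not recommendations; the right settings
depend entirely on the workload):

```
autovacuum_vacuum_scale_factor = 0.05   # vacuum after ~5% of a table changes
                                        # (default is 0.2)
autovacuum_vacuum_cost_delay = 2ms      # throttle workers less, i.e. more I/O
autovacuum_naptime = 30s                # check for work more often (default 1min)
```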

If you cannot find a sweet spot there, you have no alternative but getting 
better I/O
(which, as I said in the beginning, would be the correct solution anyway).

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





Using GIN index to retrieve distinct values

2020-04-22 Thread Alexander Hill
Hi all,

My understanding of a GIN index is that it maps each unique key to the set
of items containing that key. That suggests to me that enumerating the
distinct keys efficiently using the index should be possible in some cases.
I understand that given the nature of the GIN framework, the original key
won't always be able to be retrieved. But in certain cases like int array,
for example, would it be possible?

While on the topic: what's the canonical way to do that kind of query?

CREATE TABLE tagged_items
(
id serial PRIMARY KEY,
tag_ids integer[]
);

select array_agg(tag_id)
from (select distinct unnest(tag_ids) as tag_id from tagged_items) s;

Thanks!
Alex


How do tercile, percentile & function percentile_cont() work?

2020-04-22 Thread PALAYRET Jacques
Hello, 

From a table x(c1) containing 30 rows with integer values (column c1) from 1
to 30:

SELECT percentile_cont(1./3) WITHIN GROUP (ORDER BY c1) FROM x ; 
 percentile_cont 
-----------------
            10.7 
(1 row)

SELECT percentile_cont(2./3) WITHIN GROUP (ORDER BY c1) FROM x ; 
 percentile_cont 
-----------------
            20.3 
(1 row)

When ordered, 
the first 10 rows (c1 from 1 to 10) have ntile(3) OVER(ORDER BY c1) = 1, 
the next 10 rows (c1 from 11 to 20) have ntile(3) OVER(ORDER BY c1) = 2, 
the last 10 rows (c1 from 21 to 30) have ntile(3) OVER(ORDER BY c1) = 3. 
So I thought it would be: 
percentile_cont(1./3) = 10.5 (the arithmetic average of 10 and 11) and not 
10.7 
percentile_cont(2./3) = 20.5 (the arithmetic average of 20 and 21) and not 
20.3 

Thanks in advance for an explanation 

- Météo-France - 
PALAYRET JACQUES 
DCSC/MBD 
jacques.palay...@meteo.fr 
Fixe : +33 561078319 


Re: How do tercile, percentile & function percentile_cont() work?

2020-04-22 Thread PALAYRET Jacques
Of course, the same result with this table x(c1), for example: 
SELECT c1, ntile(3) OVER(ORDER BY c1) FROM x ORDER BY c1 ; 
 c1 | ntile 
----+-------
  1 |     1 
  2 |     1 
  3 |     1 
  4 |     1 
  5 |     1 
  6 |     1 
  7 |     1 
  8 |     1 
  9 |     1 
 10 |     1 
 11 |     2 
 12 |     2 
 13 |     2 
 14 |     2 
 15 |     2 
 16 |     2 
 17 |     2 
 18 |     2 
 19 |     2 
 20 |     2 
 21 |     3 
 22 |     3 
 23 |     3 
 24 |     3 
 25 |     3 
 26 |     3 
 27 |     3 
 28 |     3 
 29 |     3 
 30 |     3 
(30 rows)

SELECT percentile_cont(1./3) WITHIN GROUP (ORDER BY c1) FROM x ; 
 percentile_cont 
-----------------
            10.7 
(1 row)

SELECT percentile_cont(2./3) WITHIN GROUP (ORDER BY c1) FROM x ; 
 percentile_cont 
-----------------
            20.3 
(1 row)







Re: How do tercile, percentile & function percentile_cont() work?

2020-04-22 Thread PALAYRET Jacques
With LibreOffice Calc, the same result as with PostgreSQL: 
=CENTILE({1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30};
 0.33) 
gives 
10.57 (and NOT 10.5) 

I did not expect that. 




Re: How do tercile, percentile & function percentile_cont() work?

2020-04-22 Thread Tom Lane
PALAYRET Jacques  writes:
> SELECT percentile_cont(1./3) WITHIN GROUP (ORDER BY c1) FROM x ; 
> percentile_cont 
> -- 
> 10.7 
> (1 row) 

> So I thought it would be: 
> percentile_cont(1./3) = 10.5 (the arithmetic average of 10 and 11) and 
> not 10.7 

As far as I can see from looking at the SQL spec, this is correct;
at least, it's clear that we're *not* supposed to just interpolate
halfway between the nearest rows.  The result is defined to be

T0.Y + FACTOR * (T1.Y - T0.Y)

where T0.Y and T1.Y are the nearest row values and FACTOR is

NVE*(N-1) - floor(NVE*(N-1))

where NVE is the argument of percentile_cont and N is the number of rows.
(In SQL:2003, see 10.9  general rule 7)h)i).)
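
Applied to the example (values 1..30, NVE = 1/3): RN = NVE*(N-1) = 29/3 is
about 9.667, so T0.Y = 10, T1.Y = 11, FACTOR is about 0.667, giving 10.667.
A quick sketch comparing the spec formula with the built-in, assuming a
generate_series() stand-in for the original table x:

```sql
-- Spec formula vs. the built-in, on values 1..30.
WITH x(c1) AS (SELECT generate_series(1, 30))
SELECT percentile_cont(1.0/3) WITHIN GROUP (ORDER BY c1) AS builtin,
       10 + (29/3.0 - floor(29/3.0)) * (11 - 10)        AS by_hand
FROM x;
-- both columns come out around 10.667, matching the 10.7 in the question
```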

regards, tom lane




RE: how to slow down parts of Pg

2020-04-22 Thread Kevin Brannen
>From: Laurenz Albe 
>
>>On Tue, 2020-04-21 at 20:30 +, Kevin Brannen wrote:
>> I have an unusual need:  I need Pg to slow down. I know, we all want
>> our DB to go faster, but in this case its speed is working against me in one 
>> area.
>>
>> [...] The part that hurts so bad is when we do maintenance operations
>> that are DB heavy, like deleting really old records out of archives 
>> (weekly), moving older records from current tables
>> to archive tables plus an analyze (every night), running pg_backup (every 
>> night), other archiving (weekly), and vacuum full to remove bloat (once a 
>> quarter).
>> All of this generates a lot of disk writes, to state the obvious.
>>
>> The local server can handle it all just fine, but the network can't handle 
>> it as it tries to sync to the other server.
>
>The obvious and best answer is: get a faster network, or choose a different 
>storage solution.

I believe I mentioned originally that the network is controlled by others (the 
customer). I've pointed out the results of their choice
repeatedly, but their reply is always "budgets", and I reply as politely as I 
can, "faster network, or live with the slowness, as I've
done all I can for now". It's a somewhat frustrating conversation, as you can 
imagine.


>Other than that, you can try to make the maintenance operations less
>resource-intensive:
>
>- partition the tables so that you can get rid of old data with DROP TABLE.
>  The ANALYZE won't hurt, if you treat only the required tables.
>- use "pg_basebackup" with the "--max-rate" option

Yes, this was the heart of the post, how to use less resources. I'd always 
thought of partitioning the larger tables as an optimization
for running reports, but my eyes have been opened that it has other benefits 
too.

I'm not sure changing the backup program will help when it's sitting on top of 
DRBD, but I can limit DRBD's rate to create the same effect.
Still, it doesn't hurt to spend a little time researching this. OTOH, you did 
just prompt an idea, so that's helpful too.


>About VACUUM, you may have a problem.  Avoid the need for VACUUM (FULL) at any 
>price.
>That usually requires tuning autovacuum to be faster, which means using more 
>I/O.

OK, I've never really liked doing a "full", but I perceived it as helpful to 
us. I'll see about making autovacuum more aggressive.

Thanks!
Kevin
.
This e-mail transmission, and any documents, files or previous e-mail messages 
attached to it, may contain confidential information. If you are not the 
intended recipient, or a person responsible for delivering it to the intended 
recipient, you are hereby notified that any disclosure, distribution, review, 
copy or use of any of the information contained in or attached to this message 
is STRICTLY PROHIBITED. If you have received this transmission in error, please 
immediately notify us by reply e-mail, and destroy the original transmission 
and its attachments without reading them or saving them to disk. Thank you.


Re: how to slow down parts of Pg

2020-04-22 Thread Adrian Klaver

On 4/21/20 7:43 PM, Virendra Kumar wrote:

Hi Adrian,

Here is a test case. Basically, when autovacuum ran it did release the 
space to disk, since it may have had contiguous blocks that could be 
returned, but the space used by the index was still being held until I 
ran REINDEX on the table (I assume REINDEX on the index would work as 
well). Subsequent INSERT statements did not reuse the blocks in the index 
segment, as we can see below, so index bloat is still not addressed, or 
maybe I am doing something wrong:





Well I learned something. I replicated your commands on my 12 instance 
and got the exact same results. I should have read this:


https://www.postgresql.org/docs/12/routine-reindex.html

before.


--
Adrian Klaver
adrian.kla...@aklaver.com




PG based ETL tools

2020-04-22 Thread brajmohan saxena
Hi,

What ETL tools are there based on PostgreSQL?
I am looking for PG-based MPP + ETL open-source databases.

Thanks
Braj


Re: how to slow down parts of Pg

2020-04-22 Thread Ron

What you need is async replication instead of synchronous replication.

On 4/21/20 3:30 PM, Kevin Brannen wrote:


I have an unusual need:  I need Pg to slow down. I know, we all want our 
DB to go faster, but in this case its speed is working against me in one area.


We have systems that are geo-redundant for HA, with the redundancy being 
handled by DRBD to keep the disks in sync, which it does at the block 
level. For normal operations, it actually works out fairly well. That 
said, we recognize that what we really need to do is one of the forms of 
streaming (ch 26 of the manual) which I believe would help this problem a 
lot if not solve it -- but we don't have the time to do that at the 
moment. I plan and hope to get there by the end of the year. The part that 
hurts so bad is when we do maintenance operations that are DB heavy, like 
deleting really old records out of archives (weekly), moving older records 
from current tables to archive tables plus an analyze (every night), 
running pg_backup (every night), other archiving (weekly), and vacuum full 
to remove bloat (once a quarter). All of this generates a lot of disk 
writes, to state the obvious.


The local server can handle it all just fine, but the network can't handle 
it as it tries to sync to the other server. Sometimes we can add network 
bandwidth, many times we can't as it depends on others. To borrow a phrase 
from the current times, we need to flatten the curve. 😊


A few parts of our maintenance process I've tamed by doing "nice -20" on 
the process (e.g. log rotation); but I can't really do that for Pg because 
the work gets handed off to a background process that's not a direct child 
process … and I don't want to slow the DB as a whole because other work is 
going on (like handling incoming data).


Part of the process I've slowed down by doing the work in chunks of 10K 
rows at a time with a pause between each chunk to allow the network to 
catch up (instead of an entire table in 1 statement). This sort of works, 
but some work/SQL is somewhere between hard and next-to-impossible to break up like 
that. That also produces some hard spikes, but that's better than the 
alternative (next sentence). Still, large portions of the process are hard 
to control and just punch the network to full capacity and hold it there 
for far too long.
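
A sketch of that chunked approach (table name, column, batch size, and pause
are all hypothetical; COMMIT inside a DO block requires PG 11+ and no
surrounding explicit transaction):

```sql
-- Delete in 10K-row batches, pausing between chunks so DRBD can catch up.
DO $$
DECLARE
    deleted integer;
BEGIN
    LOOP
        DELETE FROM old_archive
        WHERE ctid IN (SELECT ctid FROM old_archive
                       WHERE created_at < now() - interval '2 years'
                       LIMIT 10000);
        GET DIAGNOSTICS deleted = ROW_COUNT;
        EXIT WHEN deleted = 0;
        COMMIT;               -- release the batch's WAL/disk writes
        PERFORM pg_sleep(5);  -- let the network drain between chunks
    END LOOP;
END $$;
```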


So, do I have any other options to help slow down some of the Pg 
operations? Or maybe some other short-term mitigations we can do with Pg 
configurations? Or is this a case where we've already done all we can do 
and the only answer is move to WAL streaming as fast as possible?


If it matters, this is being run on Linux servers. Pg 12.2 is in final 
testing and will be rolled out to production soon -- so feel free to offer 
suggestions that only apply to 12.x.


Thanks,

Kevin



--
Angular momentum makes the world go 'round.


Can I tell libpq to connect to the primary?

2020-04-22 Thread Christian Ramseyer
Hi

I'm setting up a new environment with a primary/hot standby replication
pair.

For read-only clients, I have found the host=host1,host2 connection
string[1] which allows a connection when any one of the servers is up.

However I'm unsure how to achieve something similar for read/write
clients. If I'm providing this list, the client will connect to the
first host, which might have become the standby in the meantime.

I see that in pgjdbc there are additional options for targetServerType =
any, primary, secondary, preferSlave and preferSecondary[2]. However
this seems to be java-specific and not implemented in libpq? Is there a
way to get this behaviour in PHP/Psycopg/Perl DBI?

If not, what is the best alternative to achieve this? My primary &
standby are not in the same L3 network, so moving an IP address around
with Linux OS clustering is not an option. I'm leaning toward scripting the
REST API of our DNS service to point a CNAME at the new primary during
promotion, but maybe there is an easier way I haven't found yet? Maybe
something in pgpool/pgbouncer etc.?


Cheers
Christian

[1] https://www.postgresql.org/docs/current/libpq-connect.html
[2] https://jdbc.postgresql.org/documentation/head/connect.html




Re: how to slow down parts of Pg

2020-04-22 Thread Peter J. Holzer
On 2020-04-21 21:16:57 +, Kevin Brannen wrote:
> From: Michael Loftis 
> > drbdsetup allows you to control the sync rates.
> 
> I was hoping not to have to do that, but the more I think about this I'm
> realizing that it won't hurt because the network cap is effectively limiting 
> me
> anyway. :)

Alternatively you might consider traffic shaping. DRBD can only set a
fixed limit (because it knows only about its own traffic). Traffic
shaping can adjust the limit depending on other traffic (it can also
prioritize traffic, etc.). However, to be effective, it needs to run on
a router as close to the bottleneck as possible - typically that means
either the border router or the firewall. So it is something the
customer's network guy should set up.
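
A rough Linux traffic-shaping sketch using HTB (interface name, DRBD port,
link speed, and rate cap are all assumptions for illustration; a real setup
belongs on the border router or firewall, as noted above):

```shell
# Cap DRBD replication traffic (TCP dport 7789 in this sketch) at 50 Mbit/s
# on the WAN-facing interface, leaving the rest of a ~1 Gbit link alone.
tc qdisc add dev eth1 root handle 1: htb default 10
tc class add dev eth1 parent 1: classid 1:10 htb rate 950mbit
tc class add dev eth1 parent 1: classid 1:20 htb rate 50mbit ceil 50mbit
tc filter add dev eth1 parent 1: protocol ip u32 \
    match ip dport 7789 0xffff flowid 1:20
```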

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: How do tercile, percentile & function percentile_cont() work?

2020-04-22 Thread Peter J. Holzer
On 2020-04-22 10:00:49 +0200, PALAYRET Jacques wrote:
> From a table x(c1) containing 30 lines with integer values (column c1) from 1
> to 30 :
> 
> SELECT percentile_cont(1./3) WITHIN GROUP (ORDER BY c1) FROM x ;
>  percentile_cont
> --
>  10.7
> (1 row)
> SELECT percentile_cont(2./3) WITHIN GROUP (ORDER BY c1) FROM x ;
>  percentile_cont
> --
>  20.3
> (1 row)

Think of the 30 values as points on a graph: the x values run from 0 to
29, the y values are your values. The points are connected with straight
lines. Since your y values just increase by 1, you have a straight line
from (0, 1) to (29, 30).

Then percentile_cont(1./3) is the y value 1/3 of the way along that line and
percentile_cont(2./3) is the y value 2/3 of the way along it. So you need to
find the y values corresponding to x = 29*(1/3) = 9.667 and x = 29*(2/3) = 19.333.
These are obviously 10.667 and 20.333 respectively.

hp

-- 
   _  | Peter J. Holzer| Story must make more sense than reality.
|_|_) ||
| |   | h...@hjp.at |-- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |   challenge!"




Re: Can I tell libpq to connect to the primary?

2020-04-22 Thread Christian Ramseyer
On 22.04.20 21:10, Christian Ramseyer wrote:
> 
> I see that in pgjdbc there are additional options for targetServerType =
> any, primary, secondary, preferSlave and preferSecondary[2]. However
> this seems to be java-specific and not implemented in libpq? Is there a
> way to get this behaviour in PHP/Psycopg/Perl DBI?
> 

Never mind, after RTFM'ing to the very end of
https://www.postgresql.org/docs/current/libpq-connect.html I have
discovered target_session_attrs=read-write|any which seems to do exactly
what I want.
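
For reference, a conninfo sketch (host names are placeholders): this works
from any libpq-based driver, so PHP, Psycopg, and Perl DBD::Pg inherit it,
provided the client links against libpq 10 or newer.

```
host=db1.example.com,db2.example.com dbname=app user=app
target_session_attrs=read-write
```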

Cheers
Christian