Re: [PERFORM] Query planner wants to use seq scan

2015-10-27 Thread Alex Ignatov

On 27.10.2015 12:35, Bertrand Paquet wrote:

Hi all,

We have a slow query. After analyzing it, the planner's decision seems 
questionable: the query is faster when seqscan is disabled. See below the 
two query plans, and an extract from pg_stats.


Any idea what to change to help the planner?

One piece of information that may be useful: the number of distinct values 
of organization_id is very low; maybe the planner does not know that, and 
makes the wrong decision.
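
If the distinct-count estimate really were wrong, it can be pinned by hand; 
a minimal sketch, assuming the true count matches the 22 that pg_stats 
reports below (so this is illustrative only):

    ALTER TABLE external_sync_messages
        ALTER COLUMN organization_id SET (n_distinct = 22);
    ANALYZE external_sync_messages;  -- re-gather stats so the override is picked up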


Regards,

Bertrand

# explain analyze SELECT  1 AS one FROM "external_sync_messages"  
WHERE "external_sync_messages"."organization_id" = 1612 AND 
("external_sync_messages"."status" NOT IN ('sent_to_proxy', 
'in_progress', 'ok')) AND "external_sync_messages"."handled_by" IS 
NULL LIMIT 1;


                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..12.39 rows=1 width=0) (actual time=232.212..232.213 rows=1 loops=1)
   ->  Seq Scan on external_sync_messages  (cost=0.00..79104.69 rows=6385 width=0) (actual time=232.209..232.209 rows=1 loops=1)
         Filter: ((handled_by IS NULL) AND (organization_id = 1612) AND ((status)::text <> ALL ('{sent_to_proxy,in_progress,ok}'::text[])))
         Rows Removed by Filter: 600140
 Planning time: 0.490 ms
 Execution time: 232.246 ms
(6 rows)

# set enable_seqscan = off;

SET

# explain analyze SELECT  1 AS one FROM "external_sync_messages"  
WHERE "external_sync_messages"."organization_id" = 1612 AND 
("external_sync_messages"."status" NOT IN ('sent_to_proxy', 
'in_progress', 'ok')) AND "external_sync_messages"."handled_by" IS 
NULL LIMIT 1;


                                                                      QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..39.88 rows=1 width=0) (actual time=0.030..0.030 rows=1 loops=1)
   ->  Index Scan using index_external_sync_messages_on_organization_id on external_sync_messages  (cost=0.42..251934.05 rows=6385 width=0) (actual time=0.028..0.028 rows=1 loops=1)
         Index Cond: (organization_id = 1612)
         Filter: ((handled_by IS NULL) AND ((status)::text <> ALL ('{sent_to_proxy,in_progress,ok}'::text[])))
 Planning time: 0.103 ms
 Execution time: 0.052 ms
(6 rows)
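
A note on the estimates above: the planner expects 6385 matching rows spread 
evenly through the table, so it prices fetching LIMIT 1 from the seq scan at 
about 79104.69 / 6385 ≈ 12.39, versus 0.42 + 251934.05 / 6385 ≈ 39.88 for the 
index scan, which is why the seq scan wins on cost. The 600140 rows removed by 
the filter show the matching rows are not evenly distributed, so that 
assumption is what hurts here.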

# SELECT attname, inherited, 
n_distinct, array_to_string(most_common_vals, E'\n') as 
most_common_vals FROM pg_stats WHERE tablename = 
'external_sync_messages' and attname IN ('status', 'organization_id', 
'handled_by');


     attname     | inherited | n_distinct | most_common_vals
-----------------+-----------+------------+------------------
 handled_by      | f         |          3 | 3               +
                 |           |            | 236140          +
                 |           |            | 54413
 organization_id | f         |         22 | 1612            +
                 |           |            | 287             +
                 |           |            | 967             +
                 |           |            | 1223            +
                 |           |            | 1123            +
                 |           |            | 1930            +
                 |           |            | 841             +
                 |           |            | 1814            +
                 |           |            | 711             +
                 |           |            | 1513            +
                 |           |            | 1794            +
                 |           |            | 1246            +
                 |           |            | 1673            +
                 |           |            | 1552            +
                 |           |            | 1747            +
                 |           |            | 2611            +
                 |           |            | 2217            +
                 |           |            | 2448            +
                 |           |            | 2133            +
                 |           |            | 1861            +
                 |           |            | 2616            +
                 |           |            | 2796
 status          | f         |          6 | ok              +
                 |           |            | ignored         +
                 |           |            | channel_error   +
                 |           |            | in_progress     +
                 |           |            | error           +
                 |           |            | sent_to_proxy
(3 rows)
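
If the MCV list above were too short to capture the skew, raising the 
per-column statistics target is another knob; a minimal sketch (the target 
value 1000 is an arbitrary example):

    ALTER TABLE external_sync_messages
        ALTER COLUMN organization_id SET STATISTICS 1000;
    ANALYZE external_sync_messages;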

# select count(*) from external_sync_mes

Re: [PERFORM] Query planner wants to use seq scan

2015-10-27 Thread Alex Ignatov

On 27.10.2015 14:10, Bertrand Paquet wrote:

Yes, I have run VACUUM ANALYZE, no effect.

Bertrand

2015-10-27 12:08 GMT+01:00 Alex Ignatov <a.igna...@postgrespro.ru>:

    On 27.10.2015 12:35, Bertrand Paquet wrote:
    [...]

Re: [PERFORM] Query planner wants to use seq scan

2015-10-27 Thread Alex Ignatov



On 27.10.2015 14:19, Bertrand Paquet wrote:


        relname         | n_live_tup | n_dead_tup |          last_vacuum          |        last_autovacuum        |         last_analyze          |       last_autoanalyze
------------------------+------------+------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------
 external_sync_messages |     998105 |      11750 | 2015-10-26 20:15:17.484771+00 | 2015-10-02 15:04:25.944479+00 | 2015-10-26 20:15:19.465308+00 | 2015-10-22 12:24:26.947616+00
(1 row)


2015-10-27 12:17 GMT+01:00 Alex Ignatov <a.igna...@postgrespro.ru>:

    On 27.10.2015 14:10, Bertrand Paquet wrote:

        Yes, I have run VACUUM ANALYZE, no effect.
        [...]

Re: [PERFORM] Query planner wants to use seq scan

2015-10-29 Thread Alex Ignatov



On 27.10.2015 23:56, Bertrand Paquet wrote:

So,

Tonight, the index on the three fields is used; maybe yesterday's vacuum 
updated the stats.

Thank you for your help.

Regards,

Bertrand




2015-10-27 18:33 GMT+01:00 Bertrand Paquet <bertrand.paq...@doctolib.fr>:


Hi Tom,

I did the test yesterday with an index on the three fields, and with a 
partial index on organization_id and status with a WHERE handled_by IS NULL 
condition. I saw no change in the query plan; maybe I forgot to run VACUUM 
ANALYZE afterwards. I will retry tonight.

I used a btree index. Is that the right choice, even with the IN 
clause?
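
For reference, the partial index described above would presumably look 
something like this (the index name is made up):

    CREATE INDEX index_esm_org_status_unhandled
        ON external_sync_messages (organization_id, status)
        WHERE handled_by IS NULL;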

Regards,

Bertrand

On Tuesday, 27 October 2015, Tom Lane <t...@sss.pgh.pa.us> wrote:

Bertrand Paquet  writes:
> We have a slow query. After analyzing it, the planner's decision seems
> questionable: the query is faster when seqscan is disabled. See below the
> two query plans, and an extract from pg_stats.

> Any idea what to change to help the planner?

Neither one of those plans is very good: you're just hoping that the
Filter condition will let a tuple through sooner rather than later.

If you care about the performance of this type of query, I'd consider
creating an index on (organization_id, status, handled_by) so that all
the conditions can be checked in the index.

regards, tom lane
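
A minimal sketch of the index Tom suggests (the index name is made up):

    CREATE INDEX index_esm_org_status_handled_by
        ON external_sync_messages (organization_id, status, handled_by);
    ANALYZE external_sync_messages;

As to the btree question above: a plain btree is what Tom means here. The 
equality on the leading organization_id column drives the scan, and the 
status and handled_by conditions can then be checked in the index, so the 
NOT IN clause does not prevent its use.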



Hello Bertrand once again!
What's your status? Did the plan change after deploying the three-field 
index?


--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [PERFORM] Scalability to more than 64 cores With PG 9.4 and RHEL 7.1 Kernel 3.10

2015-10-29 Thread Alex Ignatov


On 28.10.2015 8:04, Javier Muro wrote:
Does PostgreSQL 9.4 officially support a system with more than 64 physical 
cores? (72 cores, 4 × Intel(R) Xeon(R) CPU E7-8890)




The work Robert Haas did to fix the CPU locking way back when showed
significant improvements up to 64 cores, but beyond that I don't know of 
any published results.


Thanks in Advance.

Hello Javier!
Our tests show that PG 9.4 scales well up to 60 Intel cores: pgbench -S with 
the DB on tmpfs gave us 700,000 tps. Beyond 60 cores, s_lock dominates the 
CPU usage. 9.5 scales much better.
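
For reference, a sketch of that kind of read-only test (scale factor and 
duration are arbitrary examples; assumes a database named bench exists, 
with its data directory on tmpfs to take I/O out of the picture):

    pgbench -i -s 1000 bench              # initialize the test tables
    pgbench -S -c 60 -j 60 -T 300 bench   # select-only, 60 clients, 60 worker threads, 5 minutes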


--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [PERFORM] Query that took a lot of time in Postgresql when not using trim in order by

2015-11-25 Thread Alex Ignatov
                           ->  Seq Scan on public.dim_cliente  (cost=0.00..618.90 rows=16890 width=86) (actual time=0.008..16.537 rows=16890 loops=1)
                                 Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_cliente.key_cliente
               ->  Hash  (cost=18.90..18.90 rows=590 width=59) (actual time=0.747..0.747 rows=590 loops=1)
                     Output: dim_vendedor.a3_nome, dim_vendedor.key_vendedor
                     Buckets: 1024  Batches: 1  Memory Usage: 56kB
                     ->  Seq Scan on public.dim_vendedor  (cost=0.00..18.90 rows=590 width=59) (actual time=0.026..0.423 rows=590 loops=1)
                           Output: dim_vendedor.a3_nome, dim_vendedor.key_vendedor
 Total runtime: 3845.895 ms
(25 rows)



___


Query without trim

SELECT "dim_cliente"."tipocliente" AS "c0",

 "dim_cliente"."a1_ibge" AS "c1",

 "dim_cliente"."a1_cod" AS "c2",

 "dim_cliente"."a1_nome" AS "c3",

 "dim_vendedor"."a3_nome" AS "c4"

  FROM "public"."dim_cliente" AS "dim_cliente",

 "public"."fato_ventas_productos" AS "fato_ventas_productos",

 "public"."dim_vendedor" AS "dim_vendedor"

 WHERE "fato_ventas_productos"."key_cliente" = 
"dim_cliente"."key_cliente"


   AND "fato_ventas_productos"."key_vendedor" = 
"dim_vendedor"."key_vendedor"


 GROUP

BY "dim_cliente"."tipocliente" ,

 "dim_cliente"."a1_ibge",

 "dim_cliente"."a1_cod",

 "dim_cliente"."a1_nome",

 "dim_vendedor"."a3_nome"

 ORDER

BY "dim_cliente"."tipocliente" ASC NULLS LAST,

 "dim_cliente"."a1_ibge" ASC NULLS LAST,

 "dim_cliente"."a1_cod" ASC NULLS LAST,

 "dim_cliente"."a1_nome" ASC NULLS LAST;

-- this query took 37249.268 ms


___


Query Plan when not using trim

                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Group  (cost=170417.48..184027.84 rows=907357 width=129) (actual time=36649.329..37235.158 rows=43615 loops=1)
   Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_vendedor.a3_nome
   ->  Sort  (cost=170417.48..172685.88 rows=907357 width=129) (actual time=36649.315..36786.760 rows=907357 loops=1)
         Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_vendedor.a3_nome
         Sort Key: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_vendedor.a3_nome
         Sort Method: quicksort  Memory: 265592kB
         ->  Hash Join  (cost=856.30..80628.56 rows=907357 width=129) (actual time=26.719..1593.693 rows=907357 loops=1)
               Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_vendedor.a3_nome
               Hash Cond: (fato_ventas_productos.key_vendedor = dim_vendedor.key_vendedor)
               ->  Hash Join  (cost=830.02..68126.13 rows=907357 width=86) (actual time=25.980..1203.775 rows=907357 loops=1)
                     Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, fato_ventas_productos.key_vendedor
                     Hash Cond: (fato_ventas_productos.key_cliente = dim_cliente.key_cliente)
                     ->  Seq Scan on public.fato_ventas_productos  (cost=0.00..46880.57 rows=907357 width=16) (actual time=0.004..680.283 rows=907357 loops=1)
                           Output: fato_ventas_productos.key_cliente, fato_ventas_productos.key_vendedor
                     ->  Hash  (cost=618.90..618.90 rows=16890 width=86) (actual time=25.931..25.931 rows=16890 loops=1)
                           Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_cliente.key_cliente
                           Buckets: 2048  Batches: 1  Memory Usage: 1980kB
                           ->  Seq Scan on public.dim_cliente  (cost=0.00..618.90 rows=16890 width=86) (actual time=0.005..13.736 rows=16890 loops=1)
                                 Output: dim_cliente.tipocliente, dim_cliente.a1_ibge, dim_cliente.a1_cod, dim_cliente.a1_nome, dim_cliente.key_cliente
               ->  Hash  (cost=18.90..18.90 rows=590 width=59) (actual time=0.715..0.715 rows=590 loops=1)
                     Output: dim_vendedor.a3_nome, dim_vendedor.key_vendedor
                     Buckets: 1024  Batches: 1  Memory Usage: 56kB
                     ->  Seq Scan on public.dim_vendedor  (cost=0.00..18.90 rows=590 width=59) (actual time=0.024..0.405 rows=590 loops=1)
                           Output: dim_vendedor.a3_nome, dim_vendedor.key_vendedor
 Total runtime: 37249.268 ms
(25 rows)


___


Is there anything I can do to solve this problem? Is it a bug or a 
config problem?



Here the link with a dump of the tables

https://drive.google.com/file/d/0Bwupj61i9BtWZ1NiVXltaWc0dnM/view?usp=sharing


I appreciate your help


Hello!
What is your Postgres version?
Do you have up-to-date statistics on these tables?
Please show your execution plans with buffers, i.e. EXPLAIN 
(ANALYZE, BUFFERS) ...
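
Spelled out as a sketch, using the tables from this thread:

    SHOW server_version;

    SELECT relname, last_analyze, last_autoanalyze
      FROM pg_stat_user_tables
     WHERE relname IN ('dim_cliente', 'dim_vendedor', 'fato_ventas_productos');

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ... ;  -- the same query as above, unchanged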


--
Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [PERFORM] Many-to-many performance problem

2016-06-10 Thread Alex Ignatov
 involving the array column, but that doesn't appear to be
very useful for these ordered queries because it's not an ordered
index.
  * I tried adding created_on to msgs_message_labels table but I
couldn't find a way of avoiding the in-memory sort.
  * Have thought about dynamically creating partial indexes for each
label using an array column on msgs_message to hold label ids, and
index condition like WHERE label_ids && ARRAY[123] but not sure
what other problems I'll run into with hundreds of indexes on the
same table.
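
A sketch of that last idea, using the names from this message (the label_ids 
array column and the index name are hypothetical):

    ALTER TABLE msgs_message ADD COLUMN label_ids integer[];
    CREATE INDEX msgs_message_label_123
        ON msgs_message (created_on DESC)
        WHERE label_ids && ARRAY[123];   -- one such partial index per label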

Server is an Amazon RDS instance with default settings and Postgres 
9.3.10, with one other database in the instance.


All advice very much appreciated, thanks

--
*Rowan Seymour* | +260 964153686 

Hello! What do you mean by
"Server is an Amazon RDS instance with default settings and Postgres 
9.3.10, with one other database in the instance."?

Is PG running with the default config or something else? If it is still the 
default compile-time config, you should definitely do some tuning on it.
Looking at the plan, I see a lot of disk reads. That can be linked to the 
small amount of shared memory dedicated to PG, exactly what Tom said.

Can you share the pg config, or raise, for example, the shared_buffers 
parameter?
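
For a first look, a sketch of the settings worth checking (a common starting 
point is shared_buffers around 25% of RAM, but the right values depend on 
the instance; on RDS these are changed via the parameter group rather than 
postgresql.conf):

    SHOW shared_buffers;
    SHOW effective_cache_size;
    SHOW work_mem;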


Alex Ignatov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [PERFORM] Millions of tables

2016-09-29 Thread Alex Ignatov (postgrespro)

From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Greg Spiegelberg
Sent: Tuesday, September 27, 2016 7:28 PM
To: Terry Schmitt
Cc: pgsql-performa.
Subject: Re: [PERFORM] Millions of tables

 

On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt <tschm...@schmittworks.com> wrote:

 

 

On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg <gspiegelb...@gmail.com> wrote:

Hey all,

 

Obviously, everyone who's been around PostgreSQL or almost any RDBMS for a 
while has said not to have millions of tables. I too long believed it, until 
recently.

 

AWS d2.8xlarge instance with 9.5 is my test rig, using XFS on EBS (io1) for 
PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those 
tables.  Table creation initially took 0.018031 secs, averaged 0.027467 secs, 
and after tossing out outliers (qty 5) the maximum creation time found was 
0.66139 seconds.  Total time: 30 hours, 31 minutes and 8.435049 seconds.  
Tables were created by a single process.  Do note that table creation is done 
via a plpgsql function, as there are other housekeeping tasks necessary, 
though minimal.

 

No system tuning, but here is a list of PostgreSQL knobs and switches:

shared_buffers = 2GB
work_mem = 48MB
max_stack_depth = 4MB
synchronous_commit = off
effective_cache_size = 200GB
pg_xlog is on its own file system

 

Still, there are some obvious problems.  General DBA functions such as VACUUM 
and ANALYZE should not be done.  Each will run forever and cause much grief.  
Backups are problematic in the traditional pg_dump and PITR space.  Large 
JOINs by VIEW, SELECT or via table inheritance (I am abusing it in my test 
case) are no-no's.  A system or database crash could take potentially hours to 
days to recover.  There are likely other issues ahead.

 

You may wonder, "why is Greg attempting such a thing?"  I looked at DynamoDB, 
BigTable, and Cassandra.  I like Greenplum but, let's face it, it's antiquated 
and don't get me started on "Hadoop".  I looked at many others and ultimately 
the recommended use of each vendor was to have one table for all data.  That 
overcomes the millions of tables problem, right?

 

Problem with the "one big table" solution is I anticipate 1,200 trillion 
records.  Random access is expected and the customer expects <30ms reads for a 
single record fetch.

 

No data is loaded... yet.  Table and index creation only.  I am interested in 
the opinions of all, including tests I may perform.  If you had this setup, 
what would you capture / analyze?  I have a job running preparing data.  I did 
this on a much smaller scale (50k tables) and data load via function allowed 
close to 6,000 records/second.  The schema has been simplified since, and the 
last test reached just over 20,000 records/second with 300k tables.

 

I'm not looking for alternatives yet, but for input on my test.  Takers?

 

I can't promise immediate feedback but will do my best to respond with results.

 

TIA,

-Greg

 

I have not seen transaction ID wraparound mentioned in this thread yet. With 
the numbers that you are looking at, I could see this being a major issue.

 

T

 

Thank you Terry.  You get the gold star.  :)   I was waiting for that to come 
up.

 

Success means handling this condition.  A whole database vacuum and 
dump-restore is out of the question.  Can a properly tuned autovacuum prevent 
the situation?

 

-Greg

 

Hi!

With millions of tables you have to set autovacuum_max_workers sky-high =). 
We have seen situations where, even with only thousands of tables, autovacuum 
cannot vacuum all the tables that need it: it simply vacuums some of the most 
heavily modified tables and never reaches the others. Only a manual vacuum 
helps in that situation. Combined with the wraparound issue, it can be a 
nightmare.
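
One way to keep an eye on that, sketched with a standard catalog query:

    -- ordinary tables closest to transaction ID wraparound
    SELECT relname, age(relfrozenxid) AS xid_age
      FROM pg_class
     WHERE relkind = 'r'
     ORDER BY age(relfrozenxid) DESC
     LIMIT 10;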

 

--

Alex Ignatov 
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company