JSON performance down when id:1

2022-12-16 Thread Render Comunicacion S.L.
Hello everyone,

I'm new here, and I hope my question is on the correct mailing list.

We use PostgreSQL to store JSONB in different tables; all tables have the same
schema, and all of them have a GIN index on the JSON data.

We use two properties of the JSON to locate data:

{
"section_id":"1",
"section_tipo":"numisdata3"
}

The issue:
When we search for our locator with section_id: 1 (or any number < 4), PostgreSQL
takes around 4000, 5000, 8000 ms or more.
When we search for our locator with section_id: 4 (or any bigger number),
PostgreSQL takes around 100 ms (~ the expected time).

The queries below were run on a database with roughly 1 million rows in total,
and we tested on PostgreSQL 13, 14, and 15 with similar results.

_

The query for section_id: 1 (13 rows)

EXPLAIN ANALYZE SELECT section_tipo, section_id, datos
FROM "matrix"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"1","section_tipo":"numisdata3"}]'::jsonb)
UNION ALL
SELECT section_tipo, section_id, datos
FROM "matrix_activities"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"1","section_tipo":"numisdata3"}]'::jsonb)
UNION ALL
SELECT section_tipo, section_id, datos
FROM "matrix_hierarchy"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"1","section_tipo":"numisdata3"}]'::jsonb)
UNION ALL
SELECT section_tipo, section_id, datos
FROM "matrix_list"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"1","section_tipo":"numisdata3"}]'::jsonb)
ORDER BY section_tipo, section_id ASC
LIMIT ALL;

QUERY PLAN
Sort  (cost=8984.49..8991.16 rows=2669 width=1357) (actual time=8752.794..8752.797 rows=13 loops=1)
  Sort Key: matrix.section_tipo, matrix.section_id
  Sort Method: quicksort  Memory: 47kB
  ->  Append  (cost=92.21..8832.59 rows=2669 width=1357) (actual time=415.709..8741.565 rows=13 loops=1)
        ->  Bitmap Heap Scan on matrix  (cost=92.21..199.36 rows=27 width=1144) (actual time=415.708..8325.296 rows=11 loops=1)
              Recheck Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
              Rows Removed by Index Recheck: 272037
              Heap Blocks: exact=34164 lossy=33104
              ->  Bitmap Index Scan on matrix_relations_idx  (cost=0.00..92.20 rows=27 width=0) (actual time=61.462..61.462 rows=155031 loops=1)
                    Index Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
        ->  Seq Scan on matrix_activities  (cost=0.00..0.00 rows=1 width=68) (actual time=0.012..0.012 rows=0 loops=1)
              Filter: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
        ->  Bitmap Heap Scan on matrix_hierarchy  (cost=52.26..8492.67 rows=2614 width=1362) (actual time=269.624..414.954 rows=2 loops=1)
              Recheck Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
              Rows Removed by Index Recheck: 5043
              Heap Blocks: exact=3034
              ->  Bitmap Index Scan on matrix_hierarchy_relations_idx  (cost=0.00..51.61 rows=2614 width=0) (actual time=9.529..9.529 rows=5049 loops=1)
                    Index Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
        ->  Bitmap Heap Scan on matrix_list  (cost=12.21..100.53 rows=27 width=1161) (actual time=1.260..1.260 rows=0 loops=1)
              Recheck Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
              ->  Bitmap Index Scan on matrix_list_relations_idx  (cost=0.00..12.21 rows=27 width=0) (actual time=1.258..1.258 rows=0 loops=1)
                    Index Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
Planning Time: 33.625 ms
Execution Time: 8753.461 ms

_

The query for section_id: 2 (18 rows)


EXPLAIN ANALYZE SELECT section_tipo, section_id, datos
FROM "matrix"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"2","section_tipo":"numisdata3"}]'::jsonb)
UNION ALL
SELECT section_tipo, section_id, datos
FROM "matrix_activities"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"2","section_tipo":"numisdata3"}]'::jsonb)
UNION ALL
SELECT section_tipo, section_id, datos
FROM "matrix_hierarchy"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"2","section_tipo":"numisdata3"}]'::jsonb)
UNION ALL
SELECT section_tipo, section_id, datos
FROM "matrix_list"
WHERE (
datos#>'{relations}' @> 
'[{"section_id":"2","section_tipo":"numisdata3"}]'::jsonb)
ORDER BY section_tipo, section_id ASC
LIMIT ALL;

Sort  (cost=8984.49..8991.16 rows=2669 width=1357) (actual time=5236.090..5236.097 rows=18 loops=1)
  Sort Key: matrix.section_tipo, matrix.section_id
  Sort Method: quicksort  Memory: 57kB
  ->  Append  (cost=92.21..8832.59 rows=2669 width=135

Re: JSON performance down when id:1

2022-12-16 Thread Tom Lane
"Render Comunicacion S.L."  writes:
> The issue:
> When we search for our locator with section_id: 1 (or any number < 4), PostgreSQL
> takes around 4000, 5000, 8000 ms or more.
> When we search for our locator with section_id: 4 (or any bigger number),
> PostgreSQL takes around 100 ms (~ the expected time).

Your index is providing pretty awful performance:

> ->  Bitmap Heap Scan on matrix  (cost=92.21..199.36 rows=27 width=1144) (actual time=415.708..8325.296 rows=11 loops=1)
>       Recheck Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)
>       Rows Removed by Index Recheck: 272037
>       Heap Blocks: exact=34164 lossy=33104
>       ->  Bitmap Index Scan on matrix_relations_idx  (cost=0.00..92.20 rows=27 width=0) (actual time=61.462..61.462 rows=155031 loops=1)
>             Index Cond: ((datos #> '{relations}'::text[]) @> '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb)

I read that as 155K hits delivered by the index, of which only 11 were
real matches.  To make matters worse, with so many hits the bitmap was
allowed to become "lossy" (ie track some hits at page-level not
tuple-level) to conserve memory, so that the executor actually had to
check even more than 155K rows.
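[Editor's note: the lossy degradation described above is bounded by work_mem; the bitmap falls back to page granularity when the tuple-level bitmap would not fit. A diagnostic sketch, not a fix for the poor index selectivity; the setting value is illustrative:]

```sql
-- Lossy bitmap pages appear when the TID bitmap exceeds work_mem.
-- Raising work_mem for one session will shrink the 'lossy=' count if
-- memory was the constraint; the bad selectivity remains either way.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT section_tipo, section_id, datos
FROM matrix
WHERE datos #> '{relations}' @>
      '[{"section_id":"1","section_tipo":"numisdata3"}]'::jsonb;
```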

You need a better index.  It might be that switching to a jsonb_path_ops
index would be enough to fix it, or you might need to build an expression
index matched specifically to this type of query.  See

https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
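[Editor's note: a minimal sketch of the jsonb_path_ops option above, assuming the existing index is a GIN index on the expression (datos #> '{relations}') as the plans' Index Cond suggests; the new index name is hypothetical:]

```sql
-- jsonb_path_ops indexes hashes of complete paths rather than individual
-- keys and values, which is much more selective for @> containment
-- (it supports only the @> operator).
CREATE INDEX CONCURRENTLY matrix_relations_pathops_idx ON matrix
    USING GIN ((datos #> '{relations}') jsonb_path_ops);
-- Repeat for matrix_activities, matrix_hierarchy and matrix_list, then
-- drop the old default-opclass indexes once the new ones are in place.
```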

Also, if any of the terminology there doesn't make sense, read

https://www.postgresql.org/docs/current/indexes.html

regards, tom lane




Postgres12 looking for possible HashAggregate issue workarounds?

2022-12-16 Thread João Paulo Luís
Hi! Sorry to post to this mailing list, but I could not find many tips for
working around HashAggregate issues.

In a research project involving text repetition analysis (on top of public
documents), I have a virtual machine (AMD EPYC 7502P CPU, 128GB RAM, 12TB HDD,
2TB SSD) running Postgres 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)
and some tables with many rows:

nsoamt=> ANALYSE VERBOSE SentenceSource;
INFO:  analyzing "public.sentencesource"
INFO:  "sentencesource": scanned 3 of 9028500 pages, containing 3811990 
live rows and 268323 dead rows; 3 rows in sample, 1147218391 estimated 
total rows
ANALYZE
nsoamt=> ANALYSE VERBOSE SentenceToolCheck;
INFO:  analyzing "public.sentencetoolcheck"
INFO:  "sentencetoolcheck": scanned 3 of 33536425 pages, containing 498508 
live rows and 25143 dead rows; 3 rows in sample, 557272538 estimated total 
rows
ANALYZE
nsoamt=> ANALYZE VERBOSE Document;
INFO:  analyzing "public.document"
INFO:  "document": scanned 3 of 34570 pages, containing 1371662 live rows 
and 30366 dead rows; 3 rows in sample, 1580612 estimated total rows
ANALYZE

The estimates for the number of rows above are accurate.

I am running this query:

SELECT COUNT(*), COUNT(NULLIF(Stchk.haserrors,'f'))
FROM SentenceToolCheck Stchk
WHERE EXISTS (SELECT SSrc.sentence
              FROM SentenceSource SSrc, Document Doc
              WHERE SSrc.sentence = Stchk.id
                AND Doc.id = SSrc.document
                AND Doc.source ILIKE '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%');

and I have two (related?) problems:


1 - the query is making a postgresql process use 76.7 GB of resident RAM.
With a work_mem setting of 2GB (and "simple" COUNT() results),
that was not expected.
(I risk the oom-killer killing my postgres as soon as I run another concurrent
query.)

The memory settings are:

work_mem = 2GB
shared_buffers = 16GB
maintenance_work_mem = 1GB



2 - the query never finishes... (it has been running for over 3x24 hours by now,
and I have no idea how far from finishing it is).

The EXPLAIN plan is:

QUERY PLAN

Aggregate  (cost=28630195.79..28630195.80 rows=1 width=16)
  ->  Nested Loop  (cost=26397220.49..28628236.23 rows=261275 width=1)
        ->  HashAggregate  (cost=26397219.92..26399832.67 rows=261275 width=8)
              Group Key: ssrc.sentence
              ->  Hash Join  (cost=73253.21..23635527.52 rows=1104676957 width=8)
                    Hash Cond: (ssrc.document = doc.id)
                    ->  Seq Scan on sentencesource ssrc  (cost=0.00..20540394.02 rows=1151189402 width=16)
                    ->  Hash  (cost=54310.40..54310.40 rows=1515425 width=4)
                          ->  Seq Scan on document doc  (cost=0.00..54310.40 rows=1515425 width=4)
                                Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
        ->  Index Scan using pk_sentencetoolcheck on sentencetoolcheck stchk  (cost=0.57..8.53 rows=1 width=9)
              Index Cond: (id = ssrc.sentence)
JIT:
  Functions: 20
  Options: Inlining true, Optimization true, Expressions true, Deforming true
(15 rows)

The rows=1515425 estimate on the Seq Scan on document doc seems right.

The rows=1104676957 estimate on the Hash Join also seems right.

The rows=261275 estimate on the HashAggregate seems VERY WRONG!
I was expecting something like rows=1.0E+09 instead.
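[Editor's note: the group-count estimate on the HashAggregate comes from the planner's distinct-value statistics for ssrc.sentence, which can be inspected directly. A diagnostic sketch added for clarity:]

```sql
-- n_distinct > 0 is an absolute count of distinct values; n_distinct < 0
-- is minus a fraction of the row count. A small absolute value here would
-- explain the rows=261275 estimate on the HashAggregate.
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'sentencesource' AND attname = 'sentence';
```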


On a laptop (with just 80% of the rows, 32GB RAM, but all-SSD disks),
the query finishes in a few hours (about 2 hours).

The EXPLAIN plan is different on the laptop:

QUERY PLAN
-
Aggregate  (cost=216688374.89..216688374.90 rows=1 width=16)
  ->  Nested Loop  (cost=211388557.47..216686210.27 rows=288616 width=1)
        ->  Unique  (cost=211388556.90..215889838.75 rows=288616 width=8)
              ->  Sort  (cost=211388556.90..213639197.82 rows=900256370 width=8)
                    Sort Key: ssrc.sentence
                    ->  Hash Join  (cost=56351.51..28261726.31 rows=900256370 width=8)
                          Hash Cond: (ssrc.document = doc.id)
                          ->  Seq Scan on sentencesource ssrc  (cost=0.00..16453055.44 rows=948142144 width=16)
                          ->  Hash  (cost=38565.65..38565.65 rows=1084069 width=4)
                                ->  Seq Scan on document doc  (cost=0.00..38565.65 rows=1084069 width=4)
                                      Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
 -

Re: Postgres12 looking for possible HashAggregate issue workarounds?

2022-12-16 Thread Justin Pryzby
On Fri, Dec 16, 2022 at 03:24:17PM +, João Paulo Luís wrote:
> Hi! Sorry to post to this mailing list, but I could not find many tips 
> working around HashAggregate issues.
> 
> In a research project involving text repetition analysis (on top of public 
> documents)
> I have a VirtualMachine (CPU AMD Epyc 7502P, 128GB RAM, 12TB HDD, 2TB SSD),
> running postgres 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)
> and some tables with many rows:

> 1 - the query is making a postgresql process use 76.7 GB of resident RAM.
> Having a WORK_MEM setting of 2GB (and "simple" COUNT() results),
> that was not expected.
> (I risk oom-killer killing my postgres as soon as I run another concurrent
> query.)

> The rows=261275 on HashAggregate  (cost=26397219.92..26399832.67 rows=261275 
> width=8) seems VERY WRONG!
> I was expecting something like rows=1.0E+09 instead.

> I would guess that HashAggregate is behaving very badly (using too much RAM 
> beyond work_mem, and also badly estimating the #rows and taking forever...)

Huge memory use sounds like what was fixed in postgres 13.

https://www.postgresql.org/docs/13/release-13.html

Allow hash aggregation to use disk storage for large aggregation result
sets (Jeff Davis)

Previously, hash aggregation was avoided if it was expected to use more
than work_mem memory. Now, a hash aggregation plan can be chosen despite
that. The hash table will be spilled to disk if it exceeds work_mem
times hash_mem_multiplier.

This behavior is normally preferable to the old behavior, in which once
hash aggregation had been chosen, the hash table would be kept in memory
no matter how large it got — which could be very large if the planner
had misestimated. If necessary, behavior similar to that can be obtained
by increasing hash_mem_multiplier.
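[Editor's note: a sketch of the knob mentioned in the release note above, for PostgreSQL 13+; the value is illustrative:]

```sql
-- Since PG 13, hash aggregation may use up to
-- work_mem * hash_mem_multiplier before spilling to disk.
-- With work_mem = 2GB, this session would allow a ~4GB hash table;
-- lowering work_mem instead bounds the memory more tightly.
SET hash_mem_multiplier = 2.0;
SHOW hash_mem_multiplier;
```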

-- 
Justin