jsonb and where clause?

2017-11-28 Thread Bjorn T Johansen
Hi.

Just starting to look at how to use jsonb columns and I have a question. I have 
found out that I can use the following to search for a value inside
the jsonb column:

select * from orders where info ->> 'customer' = 'John Doe'   (where info is the jsonb column)


But what if the jsonb column contains a JSON array, how can I search then?

info -> [ { "customer" : "John Doe" } ]


btw, using PostgreSQL 9.6 but will be moving to 10 soon.


Regards,

BTJ



Re: jsonb and where clause?

2017-11-28 Thread Ivan E. Panchenko

Hi Bjorn,

28.11.2017 11:18, Bjorn T Johansen wrote:

Hi.

Just starting to look at how to use jsonb columns and I have a question. I have 
found out that I can use the following to search for a value inside
the jsonb column:

select * from orders where info ->> 'customer' = 'John Doe'   (where info is the jsonb column)


But what if the jsonb column contains a JSON array, how can I search then?

info -> [ { "customer" : "John Doe" } ]


If you know the index in the array, you can search like
info->0->>'customer'
If you want to search in any array element, you need to use the JSQUERY
extension,

see https://github.com/postgrespro/jsquery
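
For example, a rough sketch based on the jsquery README (untested here; the
@@ operator, the jsquery type and the "#" any-array-element placeholder all
come from the extension, and the orders/info names are just the ones from
your mail):

CREATE EXTENSION jsquery;

SELECT * FROM orders WHERE info @@ '#.customer = "John Doe"'::jsquery;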




btw, using PostgreSQL 9.6 but will be moving to 10 soon.


Regards,

BTJ


Regards,
Ivan



Plan for update ... where a is not distinct from b

2017-11-28 Thread Peter J. Holzer
[PostgreSQL 9.5.10 on x86_64-pc-linux-gnu, compiled by gcc (Debian
4.9.2-10) 4.9.2, 64-bit]

I noticed that an update was taking a long time and found this:

UPDATE public.facttable_imf_ifs p
SET [...lots of columns...]
FROM cleansing.cls_imf_ifs_facttable_imf_ifs c, cleansing.cleansing_change_type 
ct
WHERE
(p.macrobondtimeseries is not distinct from c.macrobondtimeseries) AND 
(p.date is not distinct from c.date)
AND c.cleansing_change_type_id = ct.cleansing_change_type_id
AND ct.cleansing_change_type_desc_short IN ('UPDATED_NEW')
;

                                  QUERY PLAN
 Update on facttable_imf_ifs p  (cost=1.09..1978353972070.75 rows=7969398 width=1498)
   ->  Nested Loop  (cost=1.09..1978353972070.75 rows=7969398 width=1498)
         Join Filter: ((NOT ((p.macrobondtimeseries)::text IS DISTINCT FROM (c.macrobondtimeseries)::text)) AND (NOT (p.date IS DISTINCT FROM c.date)))
         ->  Seq Scan on facttable_imf_ifs p  (cost=0.00..1071317.45 rows=20727045 width=51)
         ->  Materialize  (cost=1.09..2028790.72 rows=5454160 width=1472)
               ->  Hash Join  (cost=1.09..2001519.92 rows=5454160 width=1472)
                     Hash Cond: (c.cleansing_change_type_id = ct.cleansing_change_type_id)
                     ->  Seq Scan on cls_imf_ifs_facttable_imf_ifs c  (cost=0.00..1824258.62 rows=32724962 width=1470)
                     ->  Hash  (cost=1.07..1.07 rows=1 width=8)
                           ->  Seq Scan on cleansing_change_type ct  (cost=0.00..1.07 rows=1 width=8)
                                 Filter: ((cleansing_change_type_desc_short)::text = 'UPDATED_NEW'::text)

Am I correct to assume that the result of Materialize is a flat table
(in memory) without any indexes and that the nested loop has to scan
that for each of the 20 million rows of the target table?

That's going to take a long time ...

Replacing the "is not distinct from" with "=" (which is possible in this
case because both columns are not null (and indeed the primary key), gives me 
this plan:

                                  QUERY PLAN
 Update on facttable_imf_ifs p  (cost=2611816.89..4435860.27 rows=7969425 width=1498)
   ->  Merge Join  (cost=2611816.89..4435860.27 rows=7969425 width=1498)
         Merge Cond: (((p.macrobondtimeseries)::text = (c.macrobondtimeseries)::text) AND (p.date = c.date))
         ->  Index Scan using facttable_imf_ifs_pkey on facttable_imf_ifs p  (cost=0.56..1541107.94 rows=20727117 width=51)
         ->  Sort  (cost=2611811.12..2625446.52 rows=5454160 width=1472)
               Sort Key: c.macrobondtimeseries, c.date
               ->  Hash Join  (cost=1.09..2001519.92 rows=5454160 width=1472)
                     Hash Cond: (c.cleansing_change_type_id = ct.cleansing_change_type_id)
                     ->  Seq Scan on cls_imf_ifs_facttable_imf_ifs c  (cost=0.00..1824258.62 rows=32724962 width=1470)
                     ->  Hash  (cost=1.07..1.07 rows=1 width=8)
                           ->  Seq Scan

Re: jsonb and where clause?

2017-11-28 Thread Bjorn T Johansen
On Tue, 28 Nov 2017 11:28:55 +0300
"Ivan E. Panchenko"  wrote:

> Hi Bjorn,
> 
> 28.11.2017 11:18, Bjorn T Johansen wrote:
> > Hi.
> >
> > Just starting to look at how to use jsonb columns and I have a question. I 
> > have found out that I can use the following to search for a value
> > inside the jsonb column:
> >
> > select * from orders where info ->> 'customer' = 'John Doe'(where info 
> > is the jsonb column)
> >
> >
> > But what if the jsonb column contains an json array, how can I search then?
> >
> > info -> [ { "customer" : "John Doe" } ]  
> 
> If you know the index in array, you can search like
> info->0->>'customer'
> If you want to search in any array element, you need to use JSQUERY 
> extension,
> see https://github.com/postgrespro/jsquery
> 
> >
> >
> > btw, using PostgreSQL 9.6 but will be moving to 10 soon.
> >
> >
> > Regards,
> >
> > BTJ
> >  
> Regards,
> Ivan
> 

Thx... :)

btw, just managed to use the following sql:

select * from orders where info @> '{"recs": [{ "customer":"John Doe"}]}' 

(changed my JSON string to info -> { "recs": [ { "customer" : "John Doe" } ] })

And this seems to work, but is this the "wrong" way of doing it?


BTJ



Re: jsonb and where clause?

2017-11-28 Thread Ivan E. Panchenko


28.11.2017 13:25, Bjorn T Johansen wrote:

On Tue, 28 Nov 2017 11:28:55 +0300
"Ivan E. Panchenko"  wrote:


Hi Bjorn,

28.11.2017 11:18, Bjorn T Johansen wrote:

Hi.

Just starting to look at how to use jsonb columns and I have a question. I have 
found out that I can use the following to search for a value
inside the jsonb column:

select * from orders where info ->> 'customer' = 'John Doe'(where info is 
the jsonb column)


But what if the jsonb column contains an json array, how can I search then?

info -> [ { "customer" : "John Doe" } ]

If you know the index in array, you can search like
info->0->>'customer'
If you want to search in any array element, you need to use JSQUERY
extension,
see https://github.com/postgrespro/jsquery



btw, using PostgreSQL 9.6 but will be moving to 10 soon.


Regards,

BTJ
  

Regards,
Ivan


Thx... :)

btw, just managed to use the following sql:

select * from orders where info @> '{"recs": [{ "customer":"John Doe"}]}'

(changed my JSON string to info -> { "recs": [ { "customer" : "John Doe" } ] })

And this seems to work, but is this the "wrong" way of doing it?

Yes, it definitely works, and it is supported by a GIN index.

Nevertheless, I recommend you have a look at JSQUERY, which allows
more complex queries, also with index support.
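
For example, a minimal sketch with the names used in this thread:

CREATE INDEX orders_info_gin ON orders USING gin (info);
-- or, if only containment (@>) searches are needed, the smaller opclass:
-- CREATE INDEX orders_info_gin ON orders USING gin (info jsonb_path_ops);

SELECT * FROM orders WHERE info @> '{"recs": [{"customer": "John Doe"}]}';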



BTJ


Regards,
Ivan



Re: jsonb and where clause?

2017-11-28 Thread Bjorn T Johansen
On Tue, 28 Nov 2017 13:52:59 +0300
"Ivan E. Panchenko"  wrote:

> > 28.11.2017 13:25, Bjorn T Johansen wrote:
> > On Tue, 28 Nov 2017 11:28:55 +0300
> > "Ivan E. Panchenko"  wrote:
> >  
> >> Hi Bjorn,
> >>
> >> 28.11.2017 11:18, Bjorn T Johansen wrote:
> >>> Hi.
> >>>
> >>> Just starting to look at how to use jsonb columns and I have a question. 
> >>> I have found out that I can use the following to search for a value
> >>> inside the jsonb column:
> >>>
> >>> select * from orders where info ->> 'customer' = 'John Doe'(where 
> >>> info is the jsonb column)
> >>>
> >>>
> >>> But what if the jsonb column contains an json array, how can I search 
> >>> then?
> >>>
> >>> info -> [ { "customer" : "John Doe" } ]  
> >> If you know the index in array, you can search like
> >> info->0->>'customer'
> >> If you want to search in any array element, you need to use JSQUERY
> >> extension,
> >> see https://github.com/postgrespro/jsquery
> >>  
> >>>
> >>> btw, using PostgreSQL 9.6 but will be moving to 10 soon.
> >>>
> >>>
> >>> Regards,
> >>>
> >>> BTJ
> >>> 
> >> Regards,
> >> Ivan
> >>  
> > Thx... :)
> >
> > btw, just managed to use the following sql:
> >
> > select * from orders where info @> '{"recs": [{ "customer":"John Doe"}]}'
> >
> > (changed my json string to info -> { "recs: [ { "customer" : "John Doe" } ] 
> > }
> >
> > And this seems to work but is this the "wrong" way of doing it or?  
> Yes, definitely it works, and is be supported by GIN index.
> 
> Nevertheless, I recommend you to have a look at JSQUERY which allows 
> more complex queries, also with index support.
> >
> >
> > BTJ
> >  
> Regards,
> Ivan
> 

Ok, will have a look at JSQUERY also... :)

BTJ



Re: [GENERAL] postgres_fdw & could not receive data from client: Connection reset by peer

2017-11-28 Thread Adam Brusselback
Just bumping this because I posted it right before Thanksgiving and it was
very easy to overlook.

Sorry if this is bad etiquette for the list... Just let me know if it is
and I won't do it in the future.


[GENERAL] - Regarding Schema ROLLBACK

2017-11-28 Thread VENKTESH GUTTEDAR
Hello All,


I have run CREATE SCHEMA xyz accidentally, and it replaced my existing
schema (xyz) with tables, and I have lost all my data.

Is there any way to roll back or get the schema back with the old tables
and data?

Any help would be appreciated.

-- 
Regards :
Venktesh Guttedar.


vacuumdb fails with error pg_statistic_relid_att_inh_index constraint violation after upgrade to 9.6

2017-11-28 Thread Swapnil Vaze
Hello,

We have upgraded our database from version 9.5 to 9.6.
After the upgrade we ran the vacuumdb command and we are getting the following error:

vacuumdb: vacuuming of database "advdfat" failed: ERROR:  duplicate key
value violates unique constraint "pg_statistic_relid_att_inh_index"
DETAIL:  Key (starelid, staattnum, stainherit)=(1259, 1, f) already exists.

Can anyone help here?

-- 
Thanks & Regards,
Swapnil Vaze


Re: ERROR: too many dynamic shared memory segments

2017-11-28 Thread Robert Haas
On Tue, Nov 28, 2017 at 2:32 AM, Dilip Kumar  wrote:
>  I think BitmapHeapScan check whether dsa is valid or not if DSA is not
> valid then it should assume it's non-parallel plan.
>
> Attached patch should fix the issue.

So, create the pstate and then pretend we didn't?  Why not just avoid
creating it in the first place, like this?

I haven't checked whether this fixes the bug, but if it does, we can
avoid introducing an extra branch in BitmapHeapNext.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


no-pstate.patch
Description: Binary data


Refreshing materialized views

2017-11-28 Thread Henrik Uggla
Hi

I created some materialized views and set a group as owner. My problem is that 
I can't refresh the views. I get "permission denied" even when using the 
postgres super user. If I change the owner to an ordinary user I still can't 
refresh the view as the owner or postgres. Only if I change owner to postgres 
am I able to refresh the view, but only as the postgres user. I've mainly tried 
with pgAdmin 4 but also a few times with psql, and the database is installed as 
a Docker container. Please help.

cheers
HU


Re: vacuumdb fails with error pg_statistic_relid_att_inh_index constraint violation after upgrade to 9.6

2017-11-28 Thread Tom Lane
Swapnil Vaze  writes:
> We have upgraded our database from 9.5 to 9.6 version.
> After upgrade we ran vacuumdb command and we are getting following error:
> vacuumdb: vacuuming of database "advdfat" failed: ERROR:  duplicate key
> value violates unique constraint "pg_statistic_relid_att_inh_index"
> DETAIL:  Key (starelid, staattnum, stainherit)=(1259, 1, f) already exists.

Hmm, odd.

> Can anyone help here?

In order of increasing invasiveness:

1. REINDEX pg_statistic_relid_att_inh_index, on the theory that that index
has become corrupt.

2. If #1 fails with a similar message, there must actually be more than
one pg_statistic row with that key.  Manually DELETE those rows, then
REINDEX the index, then ANALYZE pg_class to regenerate the deleted
stats.  (I suggest reindexing because this case makes it even more likely
that that index is corrupt.)

3. If #2 fails, TRUNCATE pg_statistic, then re-analyze everything to
rebuild the stats.
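
In SQL, those steps would look roughly like this (a sketch, to be run as a
superuser; the DELETE predicate just spells out the key from the error
message):

-- 1.
REINDEX INDEX pg_catalog.pg_statistic_relid_att_inh_index;

-- 2.
DELETE FROM pg_catalog.pg_statistic
 WHERE starelid = 1259 AND staattnum = 1 AND stainherit = false;
REINDEX INDEX pg_catalog.pg_statistic_relid_att_inh_index;
ANALYZE pg_catalog.pg_class;

-- 3.
TRUNCATE pg_catalog.pg_statistic;  -- may also need allow_system_table_mods
-- then re-ANALYZE the database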

regards, tom lane



Re: Refreshing materialized views

2017-11-28 Thread Tom Lane
Henrik Uggla  writes:
> I created some materialized views and set a group as owner. My problem is 
> that I can't refresh the views. I get "permission denied" even when using the 
> postgres super user. If I change the owner to an ordinary user I still can't 
> refresh the view as the owner or postgres. Only if I change owner to postgres 
> am I able to refresh the view, but only as the postgres user. I've mainly 
> tried with Pgadmin4 but also a few times with psql, and the database is 
> installed as a docker container. Please help.

Apparently, the view owner lacks select permission on some underlying
table.  Pay attention to what is being denied access to.

regards, tom lane



Re: Plan for update ... where a is not distinct from b

2017-11-28 Thread Tom Lane
"Peter J. Holzer"  writes:
> I noticed that an update was taking a long time and found this:
> [ crappy plan for join on IS NOT DISTINCT ]

Yeah, there's no optimization smarts at all for IS [NOT] DISTINCT.
It can't be converted into a merge qual, nor a hash qual, nor an
indexscan qual.

In principle this could be improved, but given how much work it'd be
and how seldom anyone complains, it's not likely to happen anytime soon.

regards, tom lane



Re: ERROR: too many dynamic shared memory segments

2017-11-28 Thread Dilip Kumar
On Tue, Nov 28, 2017 at 7:13 PM, Robert Haas  wrote:

> On Tue, Nov 28, 2017 at 2:32 AM, Dilip Kumar 
> wrote:
> >  I think BitmapHeapScan check whether dsa is valid or not if DSA is not
> > valid then it should assume it's non-parallel plan.
> >
> > Attached patch should fix the issue.
>
> So, create the pstate and then pretend we didn't?  Why not just avoid
> creating it in the first place, like this?
>

This is a better way to fix it.

>
> I haven't checked whether this fixes the bug, but if it does, we can
> avoid introducing an extra branch in BitmapHeapNext.


With my test it's fixing the problem.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


SV: Refreshing materialized views

2017-11-28 Thread Henrik Uggla
The underlying tables are foreign tables. The user has been mapped to a foreign 
user with select permission. I have no problem selecting from the foreign 
tables or the materialized views.

This is the error in the log (not very useful):
2017-11-28 13:58:03.207 UTC [347] STATEMENT:  REFRESH MATERIALIZED VIEW 
kust_havsplan.fornlamning_linje WITH DATA;
2017-11-28 13:58:47.412 UTC [296] ERROR:  permission denied for relation 
fornlamning_linje

Regards
Henrik

From: Tom Lane 
Sent: 28 November 2017 15:25:47
To: Henrik Uggla
Cc: pgsql-general@lists.postgresql.org
Subject: Re: Refreshing materialized views

Henrik Uggla  writes:
> I created some materialized views and set a group as owner. My problem is 
> that I can't refresh the views. I get "permission denied" even when using the 
> postgres super user. If I change the owner to an ordinary user I still can't 
> refresh the view as the owner or postgres. Only if I change owner to postgres 
> am I able to refresh the view, but only as the postgres user. I've mainly 
> tried with Pgadmin4 but also a few times with psql, and the database is 
> installed as a docker container. Please help.

Apparently, the view owner lacks select permission on some underlying
table.  Pay attention to what is being denied access to.

regards, tom lane




Re: SV: Refreshing materialized views

2017-11-28 Thread Tom Lane
Henrik Uggla  writes:
> The underlying tables are foreign tables. The user has been mapped to a 
> foreign user with select permission. I have no problem selecting from the 
> foreign tables or the materialized views.

[ shrug... ] WFM; if I can select from the foreign table then I can make
a materialized view that selects from it, and that refreshes without
complaint.  Admittedly, getting things set up to select from the foreign
table is trickier than it sounds: your local user needs SELECT on the
foreign table plus a mapping to some remote userid, and *on the remote
server* that remote userid needs SELECT on whatever the foreign table
is referencing.  I'm guessing you messed up one of these components.

regards, tom lane



Re: [GENERAL] - Regarding Schema ROLLBACK

2017-11-28 Thread Melvin Davidson
On Tue, Nov 28, 2017 at 7:23 AM, VENKTESH GUTTEDAR <
venkteshgutte...@gmail.com> wrote:

> Hello All,
>
>
> I have run CREATE SCHEMA xyz accidentally, and it replaced my existing
> schema (xyz) with tables, and i have lost all my data.
>
> Is there any way to rollback or get the schema back with old tables
> and data.
>
> Any help would be appreciated.
>
> --
> Regards :
> Venktesh Guttedar.
>
>
>Is there any way to rollback or get the schema back with old tables and
data.

Hopefully you have a pg_dump of the database or schema.
If it is in plain format, you can edit the dump and recover the tables
that way.

If it is in custom format, you can use the -l option to list and edit,
then restore with the -L option.

https://www.postgresql.org/docs/9.6/static/app-pgrestore.html

Otherwise, your next option would be trying to use PITR recovery,
providing you have available WAL files.

https://www.postgresql.org/docs/9.6/static/continuous-archiving.html

-- 
Melvin Davidson
I reserve the right to fantasize.  Whether or not you
wish to share my fantasy is entirely up to you.


Re: [GENERAL] - Regarding Schema ROLLBACK

2017-11-28 Thread David G. Johnston
On Tue, Nov 28, 2017 at 5:23 AM, VENKTESH GUTTEDAR <
venkteshgutte...@gmail.com> wrote:

> I have run CREATE SCHEMA xyz accidentally, and it replaced my existing
> schema (xyz) with tables, and i have lost all my data.
>

As far as I know "CREATE SCHEMA" by itself cannot "drop" data, nor would it
successfully execute if a schema of the same name already exists...

As Melvin says, if you've truly dropped the data you will need to resort to
backups of some form.

David J.


Re: ERROR: too many dynamic shared memory segments

2017-11-28 Thread Robert Haas
On Tue, Nov 28, 2017 at 9:45 AM, Dilip Kumar  wrote:
>> I haven't checked whether this fixes the bug, but if it does, we can
>> avoid introducing an extra branch in BitmapHeapNext.
>
> With my test it's fixing the problem.

I tested it some more and found that, for me, it PARTIALLY fixes the
problem.  I tested like this:

--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -279,7 +279,7 @@ InitializeParallelDSM(ParallelContext *pcxt)
  * parallelism than to fail outright.
  */
 segsize = shm_toc_estimate(&pcxt->estimator);
-if (pcxt->nworkers > 0)
+if (pcxt->nworkers > 0 && false)
 pcxt->seg = dsm_create(segsize, DSM_CREATE_NULL_IF_MAXSEGMENTS);
 if (pcxt->seg != NULL)
 pcxt->toc = shm_toc_create(PARALLEL_MAGIC,

That turned out to produce more than one problem.  I find that the
select_parallel test then fails like this:

ERROR:  could not find key 18446744073709486082 in shm TOC at 0x10be98040

The fix for that problem seems to be:

--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -430,7 +430,8 @@ ReinitializeParallelDSM(ParallelContext *pcxt)

 /* Recreate error queues. */
 error_queue_space =
-shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, false);
+shm_toc_lookup(pcxt->toc, PARALLEL_KEY_ERROR_QUEUE, true);
+Assert(pcxt->nworkers == 0 || error_queue_space != NULL);
 for (i = 0; i < pcxt->nworkers; ++i)
 {
 char   *start;

With that fix in place, I then hit a crash in parallel bitmap heap
scan.  After applying no-pstate.patch, which I just committed and
back-patched to v10, then things look OK.  I'm going to apply the fix
for the error_queue_space problem also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



large numbers of inserts out of memory strategy

2017-11-28 Thread Ted Toth
I'm writing a migration utility to move data from non-rdbms data
source to a postgres db. Currently I'm generating SQL INSERT
statements involving 6 related tables for each 'thing'. With 100k or
more 'things' to migrate I'm generating a lot of statements and when I
try to import using psql postgres fails with 'out of memory' when
running on a Linux VM with 4G of memory. If I break into smaller
chunks, say ~50K statements, then the import succeeds. I can change my
migration utility to generate multiple files each with a limited
number of INSERTs to get around this issue but maybe there's
another/better way?

Ted



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Rob Sargent

> On Nov 28, 2017, at 10:17 AM, Ted Toth  wrote:
> 
> I'm writing a migration utility to move data from non-rdbms data
> source to a postgres db. Currently I'm generating SQL INSERT
> statements involving 6 related tables for each 'thing'. With 100k or
> more 'things' to migrate I'm generating a lot of statements and when I
> try to import using psql postgres fails with 'out of memory' when
> running on a Linux VM with 4G of memory. If I break into smaller
> chunks say ~50K statements then thde import succeeds. I can change my
> migration utility to generate multiple files each with a limited
> number of INSERTs to get around this issue but maybe there's
> another/better way?
> 
> Ted
> 
What tools / languages are you using?



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Tomas Vondra
Hi,

On 11/28/2017 06:17 PM, Ted Toth wrote:
> I'm writing a migration utility to move data from non-rdbms data
> source to a postgres db. Currently I'm generating SQL INSERT
> statements involving 6 related tables for each 'thing'. With 100k or
> more 'things' to migrate I'm generating a lot of statements and when I
> try to import using psql postgres fails with 'out of memory' when
> running on a Linux VM with 4G of memory. If I break into smaller
> chunks say ~50K statements then thde import succeeds. I can change my
> migration utility to generate multiple files each with a limited
> number of INSERTs to get around this issue but maybe there's
> another/better way?
> 

The question is what exactly runs out of memory, and how did you modify
the configuration (particularly related to memory).

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Ted Toth
On Tue, Nov 28, 2017 at 11:19 AM, Rob Sargent  wrote:
>
>> On Nov 28, 2017, at 10:17 AM, Ted Toth  wrote:
>>
>> I'm writing a migration utility to move data from non-rdbms data
>> source to a postgres db. Currently I'm generating SQL INSERT
>> statements involving 6 related tables for each 'thing'. With 100k or
>> more 'things' to migrate I'm generating a lot of statements and when I
>> try to import using psql postgres fails with 'out of memory' when
>> running on a Linux VM with 4G of memory. If I break into smaller
>> chunks say ~50K statements then thde import succeeds. I can change my
>> migration utility to generate multiple files each with a limited
>> number of INSERTs to get around this issue but maybe there's
>> another/better way?
>>
>> Ted
>>
> what tools / languages ate you using?

I'm using Python to read binary source files and create the text files
containing the SQL. Then I'm running psql -f .



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Ted Toth
On Tue, Nov 28, 2017 at 11:22 AM, Tomas Vondra
 wrote:
> Hi,
>
> On 11/28/2017 06:17 PM, Ted Toth wrote:
>> I'm writing a migration utility to move data from non-rdbms data
>> source to a postgres db. Currently I'm generating SQL INSERT
>> statements involving 6 related tables for each 'thing'. With 100k or
>> more 'things' to migrate I'm generating a lot of statements and when I
>> try to import using psql postgres fails with 'out of memory' when
>> running on a Linux VM with 4G of memory. If I break into smaller
>> chunks say ~50K statements then thde import succeeds. I can change my
>> migration utility to generate multiple files each with a limited
>> number of INSERTs to get around this issue but maybe there's
>> another/better way?
>>
>
> The question is what exactly runs out of memory, and how did you modify
> the configuration (particularly related to memory).
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I'm pretty new to postgres so I haven't changed any configuration
settings, and the log is a bit hard for me to make sense of :(


psql.outofmem.log
Description: Binary data


Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Rob Sargent


On 11/28/2017 10:50 AM, Ted Toth wrote:

On Tue, Nov 28, 2017 at 11:19 AM, Rob Sargent  wrote:

On Nov 28, 2017, at 10:17 AM, Ted Toth  wrote:

I'm writing a migration utility to move data from non-rdbms data
source to a postgres db. Currently I'm generating SQL INSERT
statements involving 6 related tables for each 'thing'. With 100k or
more 'things' to migrate I'm generating a lot of statements and when I
try to import using psql postgres fails with 'out of memory' when
running on a Linux VM with 4G of memory. If I break into smaller
chunks say ~50K statements then thde import succeeds. I can change my
migration utility to generate multiple files each with a limited
number of INSERTs to get around this issue but maybe there's
another/better way?

Ted


what tools / languages ate you using?

I'm using python to read binary source files and create the text files
contains the SQL. Them I'm running psql -f .
If you're going out to the file system, I would use COPY of CSV files
(if the number of records per table is non-trivial). Is any bulk-loading
Python available?
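
For example, a sketch of COPY with hypothetical table and file names:

COPY my_table (col_a, col_b, col_c)
    FROM '/path/on/server/my_table.csv' WITH (FORMAT csv);

-- or from psql on the client, so the file need not live on the server:
\copy my_table (col_a, col_b, col_c) from 'my_table.csv' with (format csv)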





Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Steven Lembark
On Tue, 28 Nov 2017 11:17:07 -0600
Ted Toth  wrote:

> I'm writing a migration utility to move data from non-rdbms data
> source to a postgres db. Currently I'm generating SQL INSERT
> statements involving 6 related tables for each 'thing'. With 100k or
> more 'things' to migrate I'm generating a lot of statements and when I
> try to import using psql postgres fails with 'out of memory' when
> running on a Linux VM with 4G of memory. If I break into smaller
> chunks say ~50K statements then thde import succeeds. I can change my
> migration utility to generate multiple files each with a limited
> number of INSERTs to get around this issue but maybe there's
> another/better way?

Chunking the loads has a number of advantages other than avoiding
OOM errors, not the least of which are possibly parallel loading
and being able to restart after other failures without re-processing
all of the input data. 

Note that 4GiB of core is not all that much by today's standards.
You might want to run top/vmstat and ask if the PG daemons are using
all/most of the available memory. If PG is sucking up all of the core
as it is then tuning the database may not have much of an effect; if
there is lots of spare memory then it'll be worth looking at ways to
tune PG.

Note also that "out of memory" frequently means virutal memory.

Q: Does the VM have swap configured?

If not then add 8GiB and see if that solves your problem; if so then 
how much swap is in use when you get the OOM error?


-- 
Steven Lembark   1505 National Ave
Workhorse Computing Rockford, IL 61103
lemb...@wrkhors.com   +1 888 359 3508



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Tomas Vondra


On 11/28/2017 06:54 PM, Ted Toth wrote:
> On Tue, Nov 28, 2017 at 11:22 AM, Tomas Vondra
>  wrote:
>> Hi,
>>
>> On 11/28/2017 06:17 PM, Ted Toth wrote:
>>> I'm writing a migration utility to move data from non-rdbms data
>>> source to a postgres db. Currently I'm generating SQL INSERT
>>> statements involving 6 related tables for each 'thing'. With 100k or
>>> more 'things' to migrate I'm generating a lot of statements and when I
>>> try to import using psql postgres fails with 'out of memory' when
>>> running on a Linux VM with 4G of memory. If I break into smaller
>>> chunks say ~50K statements then thde import succeeds. I can change my
>>> migration utility to generate multiple files each with a limited
>>> number of INSERTs to get around this issue but maybe there's
>>> another/better way?
>>>
>>
>> The question is what exactly runs out of memory, and how did you modify
>> the configuration (particularly related to memory).
>>
>> regards
>>
>> --
>> Tomas Vondra  http://www.2ndQuadrant.com
>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> 
> I'm pretty new to postgres so I haven't changed any configuration
> setting and the log is a bit hard for me to make sense of :(
> 

The most interesting part of the log is this:

SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks);
2464406352 used
  PL/pgSQL function context: 537911352 total in 74 blocks; 2387536
free (4 chunks); 535523816 used


That is, most of the memory is allocated for SPI (2.4GB) and PL/pgSQL
procedure (500MB). How do you do the load? What libraries/drivers?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Ted Toth
On Tue, Nov 28, 2017 at 12:04 PM, Steven Lembark  wrote:
> On Tue, 28 Nov 2017 11:17:07 -0600
> Ted Toth  wrote:
>
>> I'm writing a migration utility to move data from non-rdbms data
>> source to a postgres db. Currently I'm generating SQL INSERT
>> statements involving 6 related tables for each 'thing'. With 100k or
>> more 'things' to migrate I'm generating a lot of statements and when I
>> try to import using psql postgres fails with 'out of memory' when
>> running on a Linux VM with 4G of memory. If I break into smaller
>> chunks say ~50K statements then thde import succeeds. I can change my
>> migration utility to generate multiple files each with a limited
>> number of INSERTs to get around this issue but maybe there's
>> another/better way?
>
> Chunking the loads has a number of advantages other than avoiding
> OOM errors, not the least of which are possibly parallel loading
> and being able to restart after other failures without re-processing
> all of the input data.
>
> Note that 4GiB of core is not all that much by today's standards.
> You might want to run top/vmstat and ask if the PG daemons are using
> all/most of the available memory. If PG is sucking up all of the core
> as it is then tuning the database may not have much of an effect; if
> there is lots of spare memory then it'll be worth looking at ways to
> tune PG.
>
> Note also that "out of memory" frequently means virutal memory.
>
> Q: Does the VM have swap configured?
>
> If not then add 8GiB and see if that solves your problem; if so then
> how much swap is in use when you get the OOM error?
>
>
> --
> Steven Lembark   1505 National Ave
> Workhorse Computing Rockford, IL 61103
> lemb...@wrkhors.com+1 888 359 3508
>

I understand that 4G is not much ... Yeah, in top I see the postmaster
process RES grow until it fails. The VM is basically a CentOS 6 box
with 4G of swap.



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Ted Toth
On Tue, Nov 28, 2017 at 12:01 PM, Tomas Vondra
 wrote:
>
>
> On 11/28/2017 06:54 PM, Ted Toth wrote:
>> On Tue, Nov 28, 2017 at 11:22 AM, Tomas Vondra
>>  wrote:
>>> Hi,
>>>
>>> On 11/28/2017 06:17 PM, Ted Toth wrote:
 I'm writing a migration utility to move data from non-rdbms data
 source to a postgres db. Currently I'm generating SQL INSERT
 statements involving 6 related tables for each 'thing'. With 100k or
 more 'things' to migrate I'm generating a lot of statements and when I
 try to import using psql postgres fails with 'out of memory' when
 running on a Linux VM with 4G of memory. If I break into smaller
 chunks say ~50K statements then thde import succeeds. I can change my
 migration utility to generate multiple files each with a limited
 number of INSERTs to get around this issue but maybe there's
 another/better way?

>>>
>>> The question is what exactly runs out of memory, and how did you modify
>>> the configuration (particularly related to memory).
>>>
>>> regards
>>>
>>> --
>>> Tomas Vondra  http://www.2ndQuadrant.com
>>> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>>
>> I'm pretty new to postgres so I haven't changed any configuration
>> setting and the log is a bit hard for me to make sense of :(
>>
>
> The most interesting part of the log is this:
>
> SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks);
> 2464406352 used
>   PL/pgSQL function context: 537911352 total in 74 blocks; 2387536
> free (4 chunks); 535523816 used
>
>
> That is, most of the memory is allocated for SPI (2.4GB) and PL/pgSQL
> procedure (500MB). How do you do the load? What libraries/drivers?
>
> regards
>
> --
> Tomas Vondra  http://www.2ndQuadrant.com
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

I'm doing the load with 'psql -f'. I'm using the 9.6 el6 RPMs on a CentOS VM
I downloaded from the postgres repo.



Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Tomas Vondra

On 11/28/2017 07:26 PM, Ted Toth wrote:
> On Tue, Nov 28, 2017 at 12:01 PM, Tomas Vondra
>  wrote:
>>
>> ...
>>
>> That is, most of the memory is allocated for SPI (2.4GB) and PL/pgSQL
>> procedure (500MB). How do you do the load? What libraries/drivers?
>>
> 
> I'm doing the load with 'psql -f'. I using 9.6 el6 rpms on a Centos VM
> I downloaded from the postgres repo.
> 

So what does the script actually do? Because psql certainly is not
running pl/pgsql procedures on its own. We need to understand why
you're getting OOM in the first place - just inserts alone should not
cause failures like that. Please show us more detailed explanation of
what the load actually does, so that we can try reproducing it.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [GENERAL] - Regarding Schema ROLLBACK

2017-11-28 Thread Alban Hertroys

> On 28 Nov 2017, at 16:43, David G. Johnston  
> wrote:
> 
> On Tue, Nov 28, 2017 at 5:23 AM, VENKTESH GUTTEDAR 
>  wrote:
> I have run CREATE SCHEMA xyz accidentally, and it replaced my existing 
> schema (xyz) with tables, and i have lost all my data.
> 
> ​As far as I know "CREATE SCHEMA" by itself cannot "drop" data nor would it 
> successfully execute if a schema of the same name already exist...
> 
> As Melvin says if you've truly dropped the data you will need to resort to 
> backups of some form.​
> 
> David J.

Just what I was thinking. Are you looking in the correct database?

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.




Re: [GENERAL] - Regarding Schema ROLLBACK

2017-11-28 Thread Rakesh Kumar

> Just what I was thinking. Are you looking in the correct database?

All:  He has already replied to me earlier that he had indeed dropped the 
schema before creating it.



Re: Plan for update ... where a is not distinct from b

2017-11-28 Thread Laurenz Albe
Peter J. Holzer wrote:
> I noticed that an update was taking a long time and found this:
> 
> UPDATE public.facttable_imf_ifs p
> SET [...lots of columns...]
> FROM cleansing.cls_imf_ifs_facttable_imf_ifs c, 
> cleansing.cleansing_change_type ct
> WHERE
> (p.macrobondtimeseries is not distinct from c.macrobondtimeseries) AND 
> (p.date is not distinct from c.date)
> AND c.cleansing_change_type_id = ct.cleansing_change_type_id
> AND ct.cleansing_change_type_desc_short IN ('UPDATED_NEW')

It is kind of ugly, and I didn't test it, but here is an idea:

Suppose we know a value that cannot occur in both p.date and c.date.

Then you could write

   WHERE coalesce(p.date, '0044-03-15 BC') = coalesce(c.date, '0044-03-15 BC')

and create an index on the coalesce expressions to facilitate
a merge join.
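
Something like this untested sketch, using the same sentinel value:

CREATE INDEX ON public.facttable_imf_ifs
    ((coalesce(date, DATE '0044-03-15 BC')));
CREATE INDEX ON cleansing.cls_imf_ifs_facttable_imf_ifs
    ((coalesce(date, DATE '0044-03-15 BC')));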

Yours,
Laurenz Albe



Where to troubleshoot phpPgAdmin login issues?

2017-11-28 Thread Robert Gordon
I am getting "Login failed" errors when attempting to log into phpPgAdmin as 
user 'postgres'.   In what file path would I find the relevant log files, to 
try and identify why the login is being rejected?

OS: CentOS 7
PostgreSQL: 9.6.6
pg_hba.conf settings for access:
# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     peer
# IPv4 local connections:
#host   all             all             127.0.0.1/32            ident
#host   all             all             127.0.0.1/32            md5
host    all             all             192.168.101.1/32        md5
# IPv6 local connections:
#host   all             all             ::1/128                 ident
host    all             all             ::1/128                 md5
# Allow replication connections from localhost, by a user with the
# replication privilege.
#local   replication     postgres                                peer
#host    replication     postgres        127.0.0.1/32            ident
#host    replication     postgres        ::1/128                 ident


Re: Where to troubleshoot phpPgAdmin login issues?

2017-11-28 Thread Steve Atkins

> On Nov 28, 2017, at 12:06 PM, Robert Gordon  wrote:
> 
> I am getting “Login failed” errors when attempting to log into phpPgAdmin as 
> user ‘postgres’.   In what file path would I find the relevant log files, to 
> try and identify why the login is being rejected?

Probably /var/lib/pgsql/data/pg_log or somewhere like that, if there's nothing 
obvious in /var/log.

You're not allowing connections from IPv4 localhost, though, and that's where 
I'd expect a sensibly configured phpPgAdmin to be trying to connect from 
(assuming it's on the same machine as the database).
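
For instance (assuming phpPgAdmin really does connect over IPv4 localhost),
that would mean uncommenting the md5 line that is already in your pg_hba.conf:

host    all             all             127.0.0.1/32            md5

and then reloading the configuration, e.g. with SELECT pg_reload_conf();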

Cheers,
  Steve

>  
> OS: CentOS 7
> PostgreSQL: 9.6.6
> pg_hba.conf settings for access:
> # TYPE  DATABASEUSERADDRESS METHOD
>  
> # "local" is for Unix domain socket connections only
> local   all all peer
> # IPv4 local connections:
> #hostall all 127.0.0.1/32ident
> #host   all all 127.0.0.1/32md5
> hostall all 192.168.101.1/32   md5
> # IPv6 local connections:
> #hostall all ::1/128 ident
> hostall all ::1/128 md5
> # Allow replication connections from localhost, by a user with the
> # replication privilege.
> #local   replication postgrespeer
> #hostreplication postgres127.0.0.1/32ident
> #hostreplication postgres::1/128 ident




Re: Plan for update ... where a is not distinct from b

2017-11-28 Thread Peter J. Holzer
On 2017-11-28 09:35:33 -0500, Tom Lane wrote:
> "Peter J. Holzer"  writes:
> > I noticed that an update was taking a long time and found this:
> > [ crappy plan for join on IS NOT DISTINCT ]
> 
> Yeah, there's no optimization smarts at all for IS [NOT] DISTINCT.
> It can't be converted into a merge qual, nor a hash qual, nor an
> indexscan qual.

Pity. I expected IS NOT DISTINCT to be treated pretty much like =, given
that it is just a more naive equality test. In particular, since
PostgreSQL stores NULL values in indexes (unlike Oracle) I expected it
to be able to use an index scan.

> In principle this could be improved, but given how much work it'd be
> and how seldom anyone complains, it's not likely to happen anytime soon.

Yeah, IS [NOT] DISTINCT is pretty obscure. I guess not many people use
it.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 




Re: Plan for update ... where a is not distinct from b

2017-11-28 Thread Peter J. Holzer
On 2017-11-28 20:48:24 +0100, Laurenz Albe wrote:
> Peter J. Holzer wrote:
> > I noticed that an update was taking a long time and found this:
> > 
> > UPDATE public.facttable_imf_ifs p
> > SET [...lots of columns...]
> > FROM cleansing.cls_imf_ifs_facttable_imf_ifs c, 
> > cleansing.cleansing_change_type ct
> > WHERE
> > (p.macrobondtimeseries is not distinct from c.macrobondtimeseries) AND 
> > (p.date is not distinct from c.date)
> > AND c.cleansing_change_type_id = ct.cleansing_change_type_id
> > AND ct.cleansing_change_type_desc_short IN ('UPDATED_NEW')
> 
> It is kind of ugly, and I didn't test it, but here is an idea:
> 
> Suppose we know a value that cannot occur in both p.date and c.date.
> 
> Then you could write
> 
>WHERE coalesce(p.date, '0044-03-15 BC') = coalesce(c.date, '0044-03-15 BC')
> 
> and create an index on the coalesce expressions to facilitate
> a merge join.

Yes, that would work. However in most cases these queries are
automatically generated from the tables so determining "impossible"
values is not generally possible. I don't think it is necessary, though.
Replacing (A is not distinct from B) with (A = B or A is null and B is
null) usually produces an acceptable plan (indeed, I had that before, I
replaced it with is not distinct to clean it up) and when a column has a
not null constraint I can simply use (A = B). (I should probably do this
automatically - currently I need to pass a flag to the query generator,
and that's a bit error prone)
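
I.e. something like this for the query above (a sketch; for the NOT NULL
primary-key columns here, plain "=" on both is of course enough, as in the
second plan):

WHERE (p.macrobondtimeseries = c.macrobondtimeseries
       OR (p.macrobondtimeseries IS NULL AND c.macrobondtimeseries IS NULL))
  AND p.date = c.date
  AND c.cleansing_change_type_id = ct.cleansing_change_type_id
  AND ct.cleansing_change_type_desc_short IN ('UPDATED_NEW')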

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 




User Connecting to Remote Database

2017-11-28 Thread Susan Hurst


I would welcome your comments and suggestions for connecting a user (not 
a superuser) to a foreign server.


I have a database, named geo, in which I have geospatial and 
geopolitical data.  I want to be able to select data from geo from other 
databases.


The database that I want to connect up to geo is named stp.  I have a 
foreign data wrapper in stp that defines geo as the data source for the 
foreign server named geoserver.


User stp is defined in both geo and stp as superusers, so I am able to 
select geo data just fine from stp.  However, when I try to select geo 
data as user geo_user, I get this error:


ERROR: permission denied for relation geoadm_l0
SQL state: 42501

What am I missing?  Here are the relevant grants etc that I set up in 
both geo and stp.


-- user and user mapping in stp database
create user geo_user with login nosuperuser inherit nocreatedb 
nocreaterole noreplication password '**';
CREATE USER MAPPING FOR geo_user SERVER geoserver OPTIONS (password 
'**', "user" 'geo_user');

grant usage on foreign data wrapper postgres_fdw to geo_user;
GRANT USAGE ON FOREIGN SERVER geoserver TO geo_user;

-- user in geo database
create user geo_user with login nosuperuser inherit nocreatedb 
nocreaterole noreplication password '**';


-- grants in geo database
GRANT ALL ON TABLE public.geoadm_l0 TO susan;
GRANT SELECT ON TABLE public.geoadm_l0 TO geo_user;
GRANT SELECT ON TABLE public.geoadm_l0 TO read;
GRANT ALL ON TABLE public.geoadm_l0 TO geo;
GRANT INSERT, UPDATE, DELETE ON TABLE public.geoadm_l0 TO write;

Thanks for your help!

Sue

--

Susan E Hurst
Principal Consultant
Brookhurst Data LLC
Email: susan.hu...@brookhurstdata.com
Mobile: 314-486-3261



Re: User Connecting to Remote Database

2017-11-28 Thread Stephen Frost
Greetings Susan,

* Susan Hurst (susan.hu...@brookhurstdata.com) wrote:
> I would welcome your comments and suggestions for connecting a user
> (not a superuser) to a foreign server.
> 
> I have a database, named geo, in which I have geospatial and
> geopolitical data.  I want to be able to select data from geo from
> other databases.
> 
> The database that I want to connect up to geo is named stp.  I have
> a foreign data wrapper in stp that defines geo as the data source
> for the foreign server named geoserver.
> 
> User stp is defined in both geo and stp as superusers, so I am able
> to select geo data just fine from stp.  However, when I try to
> select geo data as user geo_user, I get this error:
> 
> ERROR: permission denied for relation geoadm_l0
> SQL state: 42501
> 
> What am I missing?  Here are the relevant grants etc that I set up
> in both geo and stp.
> 
> -- user and user mapping in stp database
> create user geo_user with login nosuperuser inherit nocreatedb
> nocreaterole noreplication password '**';
> CREATE USER MAPPING FOR geo_user SERVER geoserver OPTIONS (password
> '**', "user" 'geo_user');
> grant usage on foreign data wrapper postgres_fdw to geo_user;
> GRANT USAGE ON FOREIGN SERVER geoserver TO geo_user;
> 
> -- user in geo database
> create user geo_user with login nosuperuser inherit nocreatedb
> nocreaterole noreplication password '**';
> 
> -- grants in geo database
> GRANT ALL ON TABLE public.geoadm_l0 TO susan;
> GRANT SELECT ON TABLE public.geoadm_l0 TO geo_user;
> GRANT SELECT ON TABLE public.geoadm_l0 TO read;
> GRANT ALL ON TABLE public.geoadm_l0 TO geo;
> GRANT INSERT, UPDATE, DELETE ON TABLE public.geoadm_l0 TO write;

You don't appear to have done any GRANTs to the geo_user in the stp
database..?  Note that we check the privileges on the FOREIGN TABLE
defined in the source database too.
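
For example, something along these lines in the stp database (a sketch,
assuming the foreign table there is also named geoadm_l0 and lives in the
public schema):

GRANT USAGE ON SCHEMA public TO geo_user;   -- if not already granted
GRANT SELECT ON public.geoadm_l0 TO geo_user;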

Thanks!

Stephen




Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Brian Crowell
On Tue, Nov 28, 2017 at 12:38 PM, Tomas Vondra  wrote:

> So what does the script actually do? Because psql certainly is not
> running pl/pgsql procedures on it's own. We need to understand why
> you're getting OOM in the first place - just inserts alone should not
> cause failures like that. Please show us more detailed explanation of
> what the load actually does, so that we can try reproducing it.


Perhaps the script is one giant insert statement?

—Brian


Re: SV: Refreshing materialized views

2017-11-28 Thread Daevor The Devoted
On 28 Nov 2017 5:18 pm, "Tom Lane"  wrote:

Henrik Uggla  writes:
> The underlying tables are foreign tables. The user has been mapped to a
foreign user with select permission. I have no problem selecting from the
foreign tables or the materialized views.

[ shrug... ] WFM; if I can select from the foreign table then I can make
a materialized view that selects from it, and that refreshes without
complaint.  Admittedly, getting things set up to select from the foreign
table is trickier than it sounds: your local user needs SELECT on the
foreign table plus a mapping to some remote userid, and *on the remote
server* that remote userid needs SELECT on whatever the foreign table
is referencing.  I'm guessing you messed up one of these components.

regards, tom lane

Hendrik, perhaps an easy way to check out Tom's suggestion is to create a
very simple materialized view that selects just from one of the foreign
tables, then attempt the REFRESH. If that works, then keep adding more
tables from your original materialized view until you have found the
problem.
Basically, reduce the problem to the simplest case, and if that works, then
keep adding to it until you hit the problem. You may still not know why the
problem is happening, but you'll at least know where to focus any further
investigation.
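
Something like this sketch, where some_foreign_table stands in for one of
your foreign tables:

CREATE MATERIALIZED VIEW test_refresh_mv AS
    SELECT * FROM some_foreign_table LIMIT 10;
REFRESH MATERIALIZED VIEW test_refresh_mv;
DROP MATERIALIZED VIEW test_refresh_mv;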

Kind regards,
Daevor, The Devoted


Re: large numbers of inserts out of memory strategy

2017-11-28 Thread Tom Lane
Brian Crowell  writes:
> On Tue, Nov 28, 2017 at 12:38 PM, Tomas Vondra > wrote:
>> So what does the script actually do? Because psql certainly is not
>> running pl/pgsql procedures on it's own. We need to understand why
>> you're getting OOM in the first place - just inserts alone should not
>> cause failures like that. Please show us more detailed explanation of
>> what the load actually does, so that we can try reproducing it.

> Perhaps the script is one giant insert statement?

It's pretty clear from the memory map that the big space consumption
is inside a single invocation of a plpgsql function:

SPI Proc: 2464408024 total in 279 blocks; 1672 free (1 chunks); 2464406352 
used
  PL/pgSQL function context: 537911352 total in 74 blocks; 2387536 free (4 
chunks); 535523816 used

So whatever's going on here, there's more to it than a giant client-issued
INSERT (or COPY), or for that matter a large number of small ones.  What
would seem to be required is a many-megabyte-sized plpgsql function body
or DO block.
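
For example, if the generator were to wrap everything in one huge DO block,
the entire body is parsed and kept in a single PL/pgSQL function context (a
hypothetical sketch of that failure mode, with made-up table names, not the
poster's actual script):

DO $$
BEGIN
    INSERT INTO things (id, name) VALUES (1, 'thing 1');
    INSERT INTO thing_details (thing_id, detail) VALUES (1, 'detail 1');
    -- ... hundreds of thousands more generated statements ...
END
$$;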

Actually, the truly weird thing about that map is that the "PL/pgSQL
function context" seems to be a child of a "SPI Proc" context, whereas
it's entirely clear from the code that it ought to be a direct child of
TopMemoryContext.  I have no idea how this state of affairs came to be,
and am interested to find out.

regards, tom lane



seq vs index scan in join query

2017-11-28 Thread Emanuel Alvarez
hi all,

we're in the process of optimizing some queries and we've noted a case
where the planner prefers a sequential scan instead of using an index,
while the index scan is actually much faster. to give you some
context: we have two main tables, keywords and results. keywords has
approximately 700,000 rows, while results holds approximately one row
per keyword per day (roughly 70m at the moment; not all keywords are
active on any given day). results is currently partitioned by
(creation) time. it's also worth noting that we use SSDs in our
servers, and have random_page_cost set to 1.


the problematic query looks like this:

SELECT keywords.strategy_id, results.position, results.created_at FROM results
  JOIN  keywords ON results.keyword_id = keywords.id
  WHERE results.account_id = 1
 AND results.created_at >= '2017-10-25 00:00:00.00'
 AND results.created_at <= '2017-11-10 23:59:59.99';


as you can see in the query plan [1] a sequential scan is preferred.
as we understand it, this happens because the number of rows returned
from results is too large. if we reduce this number by either
selecting a smaller created_at range, or another account_id with fewer
keywords, the planner falls back to an index scan, confirming that the
number of rows returned from results has a direct influence in this
choice.

on the other hand, if we disable sequential scans (SET enable_seqscan
= 0), we see that not only does the query run faster, but the cost also
seems to be lower, as seen in the query plan [2].
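
a sketch of one way to make that comparison for a single session, without
affecting other sessions:

BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at FROM results
  JOIN  keywords ON results.keyword_id = keywords.id
  WHERE results.account_id = 1
 AND results.created_at >= '2017-10-25 00:00:00.00'
 AND results.created_at <= '2017-11-10 23:59:59.99';
ROLLBACK;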

in this example the gain is not much: ~0.5s. but when we add a
second join table with additional keyword data, the planner still
prefers a sequential scan on a table that has +6m rows. the query looks
like this:

SELECT keywords.strategy_id, results.position, results.created_at,
keyword_data.volume FROM results
  JOIN  keywords ON results.keyword_id = keywords.id
  JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
  WHERE results.account_id = 1
 AND results.created_at >= '2017-10-25 00:00:00.00'
 AND results.created_at <= '2017-11-19 23:59:59.99';


in this case the query takes up to 8s; the query plan can be found in [3].
obviously the dataset has to be large for the planner to prefer a sequential
scan on a 6m-row table. similarly, reducing the created_at range or using an
account_id with fewer keywords makes the planner prefer an index scan,
accelerating the query considerably.

currently we're exploring the option of fetching the keyword data within
a subquery and feeding that into the main query, which works as expected,
but also complicates the design a bit.

we'd like to know:
 1. why does the planner prefer a sequential scan in these cases?
 2. is there a way we can make the planner choose a better strategy
using indexes?

thank you for your time.

[1] seq scan plan:
https://gist.github.com/emnlvrz/5e53235c82260be011d84cf264e597e7
[2] indexed plan:
https://gist.github.com/emnlvrz/8aa85edbdedcdb90d8d4f38863abc134
[3] seq scan additional join plan:
https://gist.github.com/emnlvrz/b3f13518f863f829c65f91a514f407d9