Querying PostgreSQL / PostGIS Databases in Python

2020-07-31 Thread Shaozhong SHI
Hi,

What is the advantage of querying in Python?

Has anyone got much experience?

Why not just use a standard query?

What is the rationale for querying in Python?

Would the performance be better?

Regards,

Shao


Re: Questions about Logical Replication - Issue ???

2020-07-31 Thread FOUTE K . Jaurès
Hello Kyotaro,

Thanks for your feedback and clarification.



On Fri, Jul 31, 2020 at 02:13, Kyotaro Horiguchi 
wrote:

> Hi,
>
> At Thu, 30 Jul 2020 14:54:08 +0100, FOUTE K. Jaurès 
> wrote in
> > Hi everyone,
> >
> > Situation:
> >
> >- A Master Database on the HQ
> >- I make a dump of the master database to the Subdivision Server
> >- I create a Publication like: CREATE PUBLICATION
> >iNOV_MasterData_Table_Pub FOR TABLE M_Product; On the Master Database
> >- On the Subdivision Server, I create a Subscription like: CREATE
> >SUBSCRIPTION iNOV_MasterData_Table_XXX_Sub CONNECTION ''
> >PUBLICATION  iNOV_MasterData_Table_Pub;
> >- On the log, I have this error:
> >   - 2020-07-30 14:32:59.366 WAT [8022] ERROR:  duplicate key value
> >   violates unique constraint "m_product_pkey"
> >   2020-07-30 14:32:59.366 WAT [8022] DETAIL:  Key
> >   (m_product_id)=(1001426) already exists.
> >   2020-07-30 14:32:59.366 WAT [8022] CONTEXT:  COPY m_product, line 1
> >   2020-07-30 14:32:59.369 WAT [1536] LOG:  background worker "logical
> >   replication worker" (PID 8022) exited with exit code 1
> >
> > What can I do to solve this? Is this normal?
> > BTW: When I create Subscription With the option  (copy_data = false), I
> am
> > able to replicate the new record.
>
> As you know, an initial table copy happens by default at subscription
> creation (the COPY command in the above log lines was doing that). If
> you are sure that the publisher table is in-sync with the subscriber
> one, you can use copy_data=false safely and it's the proper operation.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
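For reference, disabling the initial copy at subscription creation looks like this (the subscription and publication names are taken from the messages above; the connection string is elided and must be filled in for your setup):

```sql
-- Create the subscription without the initial COPY of existing rows;
-- only changes made after this point are replicated.
CREATE SUBSCRIPTION iNOV_MasterData_Table_XXX_Sub
    CONNECTION 'host=... dbname=... user=...'
    PUBLICATION iNOV_MasterData_Table_Pub
    WITH (copy_data = false);
```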


-- 
Jaurès FOUTE


Re: Querying PostgreSQL / PostGIS Databases in Python

2020-07-31 Thread Allan Kamau
You may write stored procedures using PL/pgSQL, or alternatively write
your queries in Python; psycopg2 lets you query the DB from Python.
You may also mix the two, depending on your preference.
Ideally you do not want your users running queries against the data by
connecting to the database directly with tools such as psql, pgAdmin 3,
or pgAdmin 4.
This means that having a database access application written in Python to
restrict and encapsulate data access may be advisable.
In that case you may place all the DML statements in Python and execute
them from there, or write much of the data access logic as several
PL/pgSQL functions and call those functions from Python.


Below is Python code illustrating the use of psycopg2. It has not been
run against a live database, so adapt the connection settings, schema,
and table names to your environment.
Here I am executing an SQL query on a table; you may modify this code to
execute a PL/pgSQL function instead.


import psycopg2
import psycopg2.extras
from psycopg2 import sql


def query_table(dataset_name, db__pg_conn):
    """Query my_schema.my_table for rows matching dataset_name."""
    schema_name, table_name = "my_schema.my_table".split(".")
    # Identifiers (schema, table) are composed safely with psycopg2.sql;
    # values are passed separately as query parameters, never interpolated.
    query = sql.SQL("""
        SELECT %(some_value_1)s::TEXT AS some_string,
               a.id AS table__id,
               a.*,
               clock_timestamp()
          FROM {schema}.{table} a
         WHERE a.dataset_name = %(some_value_2)s::TEXT
    """).format(
        schema=sql.Identifier(schema_name),
        table=sql.Identifier(table_name),
    )
    pg_cursor = db__pg_conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    pg_cursor.execute(query, {"some_value_1": "ABC",
                              "some_value_2": dataset_name})
    rows = pg_cursor.fetchall()
    for row in rows:
        table__id = row["table__id"]  # do something with table__id
    # do something with rows
    db__pg_conn.commit()
    pg_cursor.close()
    return rows


db__pg_conn__str = ("host='host_name_of_computer_hosting_pg_db_service' "
                    "port=5432 dbname='your_pg_db_name' "
                    "user='your_username' password='user_password'")
db__pg_conn = psycopg2.connect(db__pg_conn__str)
rows = query_table("some_dataset_name", db__pg_conn)
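The PL/pgSQL alternative mentioned above could look something like the
following sketch; my_schema.my_table, its columns, and the column types
are assumptions carried over from the example, not a real schema:

```sql
-- Hypothetical function wrapping the same query; callable from Python with:
--   cursor.execute("SELECT * FROM my_schema.get_dataset_rows(%s)", (name,))
CREATE OR REPLACE FUNCTION my_schema.get_dataset_rows(p_dataset_name TEXT)
RETURNS TABLE (table__id BIGINT, ts TIMESTAMPTZ)
LANGUAGE plpgsql AS
$$
BEGIN
    RETURN QUERY
        SELECT a.id, clock_timestamp()
          FROM my_schema.my_table a
         WHERE a.dataset_name = p_dataset_name;
END;
$$;
```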



On Fri, Jul 31, 2020 at 12:30 PM Shaozhong SHI 
wrote:

> Hi,
>
> What is the advantage of querying in Python?
>
> Has anyone got much experience?
>
> Why not just use a standard query?
>
> What is the rationale for querying in Python?
>
> Would the performance be better?
>
> Regards,
>
> Shao
>


Apparent missed query optimization with self-join and inner grouping

2020-07-31 Thread Zack Weinberg
I have a table recording the results of a web crawl.  (Table
definition at the end of this message.)  The relevant part of the data
stored in it looks like this:

  id  | url_id | full_url_id | experiment_id | redirect_num
--++-+---+--
 2617 |   1312 |1312 |16 |0
 2618 |   1312 |2311 |16 |1
 2619 |   1312 |2312 |16 |2
 2620 |   1312 |2313 |16 |3
 2631 |   1320 |1320 |43 |0
 2633 |   1320 |2312 |43 |2
 2632 |   1320 |2317 |43 |1
 2634 |   1320 |2318 |43 |3

For each (experiment_id, url_id) pair for some small subset of the
experiment_ids, I need to query the full_url_id corresponding to the
*largest* value of redirect_num.  The query planner does something
reasonable with this SELECT:

=> explain (analyze, verbose)
   select b.experiment_id, b.url_id, b.full_url_id
 from blockpage b,
  (select experiment_id, url_id, max(redirect_num) as redirect_num
 from blockpage group by experiment_id, url_id) bm
where b.experiment_id = bm.experiment_id
  and b.url_id = bm.url_id
  and b.redirect_num = bm.redirect_num
  and bm.experiment_id in (16, 43);

 Nested Loop  (cost=1.14..88505.96 rows=20 width=12) (actual
time=0.041..1.723 rows=501 loops=1)
   Output: b.experiment_id, b.url_id, b.full_url_id
   ->  GroupAggregate  (cost=0.57..15442.73 rows=8543 width=12)
(actual time=0.033..0.501 rows=501 loops=1)
 Output: blockpage.experiment_id, blockpage.url_id,
max(blockpage.redirect_num)
 Group Key: blockpage.experiment_id, blockpage.url_id
 ->  Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage  (cost=0.57..15293.19 rows=8547 width=12) (actual
time=0.026..0.283 rows=803 loops=1)
   Output: blockpage.experiment_id, blockpage.url_id,
blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id
   Index Cond: (blockpage.experiment_id = ANY
('{16,43}'::integer[]))
   Heap Fetches: 803
   ->  Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage b  (cost=0.57..8.53 rows=1 width=16) (actual
time=0.002..0.002 rows=1 loops=501)
 Output: b.experiment_id, b.url_id, b.full_url_id,
b.redirect_num, b.html_tag_id
 Index Cond: ((b.experiment_id = blockpage.experiment_id) AND
(b.url_id = blockpage.url_id) AND (b.redirect_num =
(max(blockpage.redirect_num
 Heap Fetches: 501
 Planning Time: 0.331 ms
 Execution Time: 1.784 ms
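Incidentally, the same "largest redirect_num per (experiment_id, url_id)"
result can often be expressed without the self-join at all using DISTINCT
ON, which may sidestep the planner issue entirely (a sketch, untested
against this table):

```sql
-- One row per (experiment_id, url_id), keeping the row with the
-- largest redirect_num thanks to the ORDER BY.
SELECT DISTINCT ON (experiment_id, url_id)
       experiment_id, url_id, full_url_id
  FROM blockpage
 WHERE experiment_id IN (16, 43)
 ORDER BY experiment_id, url_id, redirect_num DESC;
```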


But if I change the final part of the WHERE to reference
b.experiment_id instead of bm.experiment_id, I get this much more
expensive query plan:

=> explain (analyze, verbose)
   select b.experiment_id, b.url_id, b.full_url_id
 from blockpage b,
  (select experiment_id, url_id, max(redirect_num) as redirect_num
 from blockpage group by experiment_id, url_id) bm
where b.experiment_id = bm.experiment_id
  and b.url_id = bm.url_id
  and b.redirect_num = bm.redirect_num
  and b.experiment_id in (16, 43);

 Hash Join  (cost=2749504.19..2764864.13 rows=2 width=12) (actual
time=144028.343..144029.545 rows=501 loops=1)
   Output: b.experiment_id, b.url_id, b.full_url_id
   Inner Unique: true
   Hash Cond: ((b.experiment_id = blockpage.experiment_id) AND
(b.url_id = blockpage.url_id) AND (b.redirect_num =
(max(blockpage.redirect_num
   ->  Index Only Scan using
blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on
iclab.blockpage b  (cost=0.57..15293.19 rows=8547 width=16) (actual
time=0.039..0.387 rows=803 loops=1)
 Output: b.experiment_id, b.url_id, b.full_url_id,
b.redirect_num, b.html_tag_id
 Index Cond: (b.experiment_id = ANY ('{16,43}'::integer[]))
 Heap Fetches: 803
   ->  Hash  (cost=2595219.62..2595219.62 rows=8816229 width=12)
(actual time=143941.931..143941.931 rows=57061228 loops=1)
 Output: blockpage.experiment_id, blockpage.url_id,
(max(blockpage.redirect_num))
 Buckets: 67108864 (originally 16777216)  Batches: 1
(originally 1)  Memory Usage: 2976138kB
 ->  HashAggregate  (cost=2418895.04..2507057.33 rows=8816229
width=12) (actual time=90020.851..122656.924 rows=57061228 loops=1)
   Output: blockpage.experiment_id, blockpage.url_id,
max(blockpage.redirect_num)
   Group Key: blockpage.experiment_id, blockpage.url_id
   ->  Seq Scan on iclab.blockpage  (cost=0.00..1757677.88
rows=88162288 width=12) (actual time=0.020..32910.709 rows=88164599
loops=1)
 Output: blockpage.id, blockpage.url_id,
blockpage.full_url_id, blockpage.experiment_id,
blockpage.blockpage_reason_id, blockpage.html_tag_id,
blockpage.body_len, blockpage.block