Querying PostgreSQL / PostGIS Databases in Python
Hi,

What is the advantage of querying in Python? Has anyone got much experience? Why not just use a standard query? What is the rationale for querying in Python? Would the performance be better?

Regards,

Shao
Re: Questions about Logical Replication - Issue ???
Hello Kyotaro, thanks for your feedback and clarification.

On Fri, Jul 31, 2020 at 02:13, Kyotaro Horiguchi wrote:
> Hi,
>
> At Thu, 30 Jul 2020 14:54:08 +0100, FOUTE K. Jaurès wrote in
> > Hi everyone,
> >
> > Situation:
> >
> > - A master database at the HQ.
> > - I make a dump of the master database to the subdivision server.
> > - I create a publication on the master database, like:
> >   CREATE PUBLICATION iNOV_MasterData_Table_Pub FOR TABLE M_Product;
> > - On the subdivision server, I create a subscription, like:
> >   CREATE SUBSCRIPTION iNOV_MasterData_Table_XXX_Sub CONNECTION ''
> >   PUBLICATION iNOV_MasterData_Table_Pub;
> > - In the log, I have this error:
> >
> >   2020-07-30 14:32:59.366 WAT [8022] ERROR: duplicate key value
> >   violates unique constraint "m_product_pkey"
> >   2020-07-30 14:32:59.366 WAT [8022] DETAIL: Key
> >   (m_product_id)=(1001426) already exists.
> >   2020-07-30 14:32:59.366 WAT [8022] CONTEXT: COPY m_product, line 1
> >   2020-07-30 14:32:59.369 WAT [1536] LOG: background worker "logical
> >   replication worker" (PID 8022) exited with exit code 1
> >
> > What can I do to solve this? Is it normal?
> >
> > BTW: when I create the subscription with the option (copy_data = false),
> > I am able to replicate the new records.
>
> As you know, the initial table copy happens by default at subscription
> creation (the COPY command in the above log lines was doing that). If
> you are sure that the publisher table is in sync with the subscriber
> one, you can use copy_data=false safely, and that is the proper
> operation.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

--
Jaurès FOUTE
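For reference, Kyotaro's suggestion in SQL form. This is only a sketch: the connection string is elided exactly as in the original post, and it assumes the table contents on the two servers are already identical (e.g. via the dump mentioned above):

-- On the subdivision server: drop the failed subscription, then recreate
-- it without the initial table copy, since the dump already populated
-- the table.
DROP SUBSCRIPTION IF EXISTS iNOV_MasterData_Table_XXX_Sub;
CREATE SUBSCRIPTION iNOV_MasterData_Table_XXX_Sub
    CONNECTION ''  -- connection string elided, as in the original post
    PUBLICATION iNOV_MasterData_Table_Pub
    WITH (copy_data = false);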
Re: Querying PostgreSQL / PostGIS Databases in Python
You may write stored procedures using PL/pgSQL, or alternatively you may write your queries in Python, using psycopg2 to query the DB from Python. You may also have a mix of the two; it will depend on your preference.

Ideally you may not want your users running queries against the data by connecting to the database directly with database tools such as psql, pgadmin3, or pgadmin4. This means that having a database access application written in Python to restrict and encapsulate data access may be advisable. In that case you may place all the DML statements in Python and execute them there, or you may have much of the data access logic written as PL/pgSQL functions and call those functions via Python.

Below is Python code illustrating the use of psycopg2. This code has not been run, so expect some errors. Here I am executing an SQL query on a table; you may modify this code to execute a PL/pgSQL function.

import psycopg2
import psycopg2.extras
from psycopg2 import sql


def query_table(dataset_name, db__pg_conn):
    """Fetch rows from my_schema.my_table whose dataset_name matches."""
    table__id = -1
    _sql_query1a = """
        SELECT {}::TEXT AS some_string, a.id AS table__id, a.*, clock_timestamp()
        FROM {}.{} a
        WHERE a.dataset_name = {}::TEXT;
    """
    # Compose the query safely: sql.Identifier for schema/table names and
    # sql.Placeholder for values, instead of interpolating strings by hand.
    sqlSQL1a = sql.SQL(_sql_query1a).format(
        sql.Placeholder('some_value_1'),
        sql.Identifier("my_schema.my_table".split(".")[0]),
        sql.Identifier("my_schema.my_table".split(".")[1]),
        sql.Placeholder('some_value_2'),
    )
    pg_cursor1a = db__pg_conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
    params = {'some_value_1': 'ABC', 'some_value_2': dataset_name}
    # mogrify() renders the query exactly as it will be sent to the server,
    # which is handy for logging:
    # LOGGER.info(pg_cursor1a.mogrify(sqlSQL1a, params).decode())
    pg_cursor1a.execute(sqlSQL1a, params)
    rowcount1a = pg_cursor1a.rowcount  # number of rows returned
    rows = pg_cursor1a.fetchall()
    for row in rows:
        table__id = row["table__id"]  # do something with table__id
    # do something with rows
    db__pg_conn.commit()
    pg_cursor1a.close()
    return table__id


db__pg_conn__str = ("host='host_name_of_computer_hosting_pg_db_service' "
                    "port=5432 dbname='your_pg_db_name' "
                    "user='your_username' password='user_password'")
db__pg_conn = psycopg2.connect(db__pg_conn__str)
query_table('some_dataset_name', db__pg_conn)

On Fri, Jul 31, 2020 at 12:30 PM Shaozhong SHI wrote:

> Hi,
>
> What is the advantage of querying in Python?
>
> Has anyone got much experience?
>
> Why not just use a standard query?
>
> What is the rationale for querying in Python?
>
> Would the performance be better?
>
> Regards,
>
> Shao
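Following up on the point about moving data access logic into the database: below is a minimal, untested sketch of a PL/pgSQL function the Python code could call instead of the inline SELECT. The schema, table, and column names (my_schema.my_table, dataset_name, id) are the placeholder names from the example above, and it assumes id is a BIGINT:

CREATE OR REPLACE FUNCTION my_schema.get_table_ids(p_dataset_name TEXT)
RETURNS SETOF BIGINT  -- assumes my_schema.my_table.id is BIGINT
LANGUAGE plpgsql STABLE
AS $$
BEGIN
    -- Return the id of every row whose dataset_name matches.
    RETURN QUERY
        SELECT a.id
        FROM my_schema.my_table a
        WHERE a.dataset_name = p_dataset_name;
END;
$$;

From Python it is then called like any other query, e.g.
pg_cursor1a.execute("SELECT * FROM my_schema.get_table_ids(%s);", (dataset_name,))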
Apparent missed query optimization with self-join and inner grouping
I have a table recording the results of a web crawl. (Table definition at the end of this message.) The relevant part of the data stored in it looks like this:

  id  | url_id | full_url_id | experiment_id | redirect_num
------+--------+-------------+---------------+--------------
 2617 |   1312 |        1312 |            16 |            0
 2618 |   1312 |        2311 |            16 |            1
 2619 |   1312 |        2312 |            16 |            2
 2620 |   1312 |        2313 |            16 |            3
 2631 |   1320 |        1320 |            43 |            0
 2633 |   1320 |        2312 |            43 |            2
 2632 |   1320 |        2317 |            43 |            1
 2634 |   1320 |        2318 |            43 |            3

For each (experiment_id, url_id) pair for some small subset of the experiment_ids, I need to query the full_url_id corresponding to the *largest* value of redirect_num. The query planner does something reasonable with this SELECT:

=> explain (analyze, verbose)
   select b.experiment_id, b.url_id, b.full_url_id
   from blockpage b,
        (select experiment_id, url_id, max(redirect_num) as redirect_num
         from blockpage group by experiment_id, url_id) bm
   where b.experiment_id = bm.experiment_id
     and b.url_id = bm.url_id
     and b.redirect_num = bm.redirect_num
     and bm.experiment_id in (16, 43);

 Nested Loop  (cost=1.14..88505.96 rows=20 width=12) (actual time=0.041..1.723 rows=501 loops=1)
   Output: b.experiment_id, b.url_id, b.full_url_id
   ->  GroupAggregate  (cost=0.57..15442.73 rows=8543 width=12) (actual time=0.033..0.501 rows=501 loops=1)
         Output: blockpage.experiment_id, blockpage.url_id, max(blockpage.redirect_num)
         Group Key: blockpage.experiment_id, blockpage.url_id
         ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage  (cost=0.57..15293.19 rows=8547 width=12) (actual time=0.026..0.283 rows=803 loops=1)
               Output: blockpage.experiment_id, blockpage.url_id, blockpage.full_url_id, blockpage.redirect_num, blockpage.html_tag_id
               Index Cond: (blockpage.experiment_id = ANY ('{16,43}'::integer[]))
               Heap Fetches: 803
   ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b  (cost=0.57..8.53 rows=1 width=16) (actual time=0.002..0.002 rows=1 loops=501)
         Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id
         Index Cond: ((b.experiment_id = blockpage.experiment_id) AND (b.url_id = blockpage.url_id) AND (b.redirect_num = (max(blockpage.redirect_num))))
         Heap Fetches: 501
 Planning Time: 0.331 ms
 Execution Time: 1.784 ms

But if I change the final part of the WHERE to reference b.experiment_id instead of bm.experiment_id, I get this much more expensive query plan:

=> explain (analyze, verbose)
   select b.experiment_id, b.url_id, b.full_url_id
   from blockpage b,
        (select experiment_id, url_id, max(redirect_num) as redirect_num
         from blockpage group by experiment_id, url_id) bm
   where b.experiment_id = bm.experiment_id
     and b.url_id = bm.url_id
     and b.redirect_num = bm.redirect_num
     and b.experiment_id in (16, 43);

 Hash Join  (cost=2749504.19..2764864.13 rows=2 width=12) (actual time=144028.343..144029.545 rows=501 loops=1)
   Output: b.experiment_id, b.url_id, b.full_url_id
   Inner Unique: true
   Hash Cond: ((b.experiment_id = blockpage.experiment_id) AND (b.url_id = blockpage.url_id) AND (b.redirect_num = (max(blockpage.redirect_num))))
   ->  Index Only Scan using blockpage_experiment_id_url_id_redirect_num_blockpage_reason__ on iclab.blockpage b  (cost=0.57..15293.19 rows=8547 width=16) (actual time=0.039..0.387 rows=803 loops=1)
         Output: b.experiment_id, b.url_id, b.full_url_id, b.redirect_num, b.html_tag_id
         Index Cond: (b.experiment_id = ANY ('{16,43}'::integer[]))
         Heap Fetches: 803
   ->  Hash  (cost=2595219.62..2595219.62 rows=8816229 width=12) (actual time=143941.931..143941.931 rows=57061228 loops=1)
         Output: blockpage.experiment_id, blockpage.url_id, (max(blockpage.redirect_num))
         Buckets: 67108864 (originally 16777216)  Batches: 1 (originally 1)  Memory Usage: 2976138kB
         ->  HashAggregate  (cost=2418895.04..2507057.33 rows=8816229 width=12) (actual time=90020.851..122656.924 rows=57061228 loops=1)
               Output: blockpage.experiment_id, blockpage.url_id, max(blockpage.redirect_num)
               Group Key: blockpage.experiment_id, blockpage.url_id
               ->  Seq Scan on iclab.blockpage  (cost=0.00..1757677.88 rows=88162288 width=12) (actual time=0.020..32910.709 rows=88164599 loops=1)
                     Output: blockpage.id, blockpage.url_id, blockpage.full_url_id, blockpage.experiment_id, blockpage.blockpage_reason_id, blockpage.html_tag_id, blockpage.body_len, blockpage.block
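For reference, the requirement stated above (per (experiment_id, url_id), the full_url_id with the largest redirect_num) can also be expressed directly with DISTINCT ON, which keeps the experiment_id filter inside a single scan. A sketch, equivalent to the join formulation modulo ties on redirect_num:

-- One row per (experiment_id, url_id): the ORDER BY makes DISTINCT ON
-- keep the row with the largest redirect_num in each group.
SELECT DISTINCT ON (experiment_id, url_id)
       experiment_id, url_id, full_url_id
FROM blockpage
WHERE experiment_id IN (16, 43)
ORDER BY experiment_id, url_id, redirect_num DESC;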