Re: Regarding Postgres - Insertion Time Getting Increased As Data Volume is getting increased

2021-02-10 Thread cen



On 10. 02. 21 09:14, Rajnish Vishwakarma wrote:

Hi Postgres Team,

Below are the scenarios we are dealing with.

1) There are 20 tables, each with about 150 columns on average.

2) There are 20 threads handled by a thread pool executor (we are using 
Python's psycopg2 module/library to fetch the data).


3) I am using the statement below to insert the data with Python's 
psycopg2 module, via the execute(...) command:


sql_stmt = "INSERT INTO " + name_Table + final_col_string + "VALUES" + 
str(tuple(array_of_curly_values))

print('Sql statement', sql_stmt)col_cursor_db = db_conn.cursor()
v = col_cursor_db.execute(sql_stmt);

Earlier, the same 22 threads were running and the insertion time 
gradually increased from 1 second to 30-35 seconds.


Requesting and urging the postgres general support team to help me out 
on this.


How can I increase the insertion speed to minimize the insertion time 
taken by each thread in the thread pool?


Are there any different Python libraries other than psycopg2?

Are there any different functions in Python's psycopg2?

Or what performance tuning has to be done to increase the insertion 
speed?





Is a single insert taking 30 seconds or do you have such a large number 
of inserts that your thread pool can't handle it and you are waiting for 
a free connection?


For a single insert, one reason for slowness on large databases could be 
indexes, which need to be updated on every insert. For the latter, you 
should increase the thread pool size.
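
Another common approach, independent of the diagnosis above, is to batch 
many rows into one INSERT instead of issuing a statement per row; 
psycopg2's psycopg2.extras.execute_values helper builds statements of 
exactly this shape while keeping the values parameterized. A minimal 
sketch in plain SQL, with hypothetical table and column names:

-- Hypothetical table and columns; many rows per statement instead of one INSERT per row.
INSERT INTO sensor_readings (device_id, reading_value, recorded_at)
VALUES
    (1, 20.5, now()),
    (2, 19.8, now()),
    (3, 21.1, now());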






Postgres undeterministically uses a bad plan, how to convince it otherwise?

2023-02-16 Thread cen

Hi,

I am running the same application (identical codebase) as two separate 
instances to index (save) different sets of data. Both run PostgreSQL 13.


The queries are the same but the content in actual databases is 
different. One database is around 1TB and the other around 300GB.



There is a problem with a paginated SELECT query with a join and an 
order. Depending on which column you order by (primary key or FK), the 
query is either instant or takes minutes.


So on one database, ordering by the primary key is instant but on the 
other it is slow, and vice versa. Switching the columns around in the 
slow case fixes the issue.


All relevant columns are indexed.


Simplified:

Slow: SELECT * FROM table1 AS t1 LEFT JOIN table2 AS t2 ON 
t2.t1_id = t1.id ORDER BY t1.id ASC LIMIT 10 OFFSET 0


Fast: SELECT * FROM table1 AS t1 LEFT JOIN table2 AS t2 ON 
t2.t1_id = t1.id ORDER BY t2.t1_id ASC LIMIT 10 OFFSET 0


(and the opposite, on the other instance the first one is fast and 
second one is slow).



I have run all the statistics recalculations but that doesn't help. As 
far as I can tell from the docs, there is no real way to affect the plan


other than reworking the query (I've read about fencing?) which can't be 
done because it seems to be unpredictable and depends on actual data and 
data quantity.


I haven't tried reindexing.


Before providing and diving into the specific query plans:

- does the planner take previous runs of the same query and its 
execution time into account? If not, why not?


- assuming the query to be immutable, would it be possible for the 
planner to microbenchmark a few different plans instead of trying to 
estimate the cost?
As in, actually executing the query with different plans and caching the 
best one.




PS: good job to the FOSDEM Devroom organisers, one of the best rooms, 
was a blast.


Best regards, Cen





Re: Postgres undeterministically uses a bad plan, how to convince it otherwise?

2023-02-17 Thread cen

On 16/02/2023 17:07, David G. Johnston wrote:


No, the planner may not cause execution.  While I could imagine 
extending EXPLAIN to somehow retrieve and maybe even try alternative 
plans that have been fully constructed, today I'm not holding my breath.


There is little reason for the project to give any real weight to 
"assuming the query to be immutable".  We do want to fix the planner 
to behave better if it is mis-behaving, otherwise you do have access 
to cost parameters, and potentially other planner toggles if you've 
truly run into an intractable problem.


David J.

Fair on both points. I didn't know planner toggles existed, I'll play 
with that.
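
For anyone finding this in the archives: the toggles in question are the 
session-level enable_* planner settings. A minimal sketch using the 
simplified query from the first message, meant for experimenting within 
a session rather than as a permanent fix:

SET enable_mergejoin = off;   -- discourage merge joins for this session
SET enable_sort = off;        -- discourage explicit sorts for this session

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.t1_id = t1.id
ORDER BY t1.id ASC
LIMIT 10 OFFSET 0;

RESET enable_mergejoin;
RESET enable_sort;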

Re: Postgres undeterministically uses a bad plan, how to convince it otherwise?

2023-03-02 Thread cen



On 16/02/2023 17:15, Ron wrote:

On 2/16/23 09:47, cen wrote:

Hi,

I am running the same application (identical codebase) as two 
separate instances to index (save) different sets of data. Both run 
PostgreSQL 13.


The queries are the same but the content in actual databases is 
different. One database is around 1TB and the other around 300GB.



There is a problem with a paginated select query with a join and an 
order. Depending on which column you order by (primary or FK) the 
query is either instant or takes minutes.


So on one database, ordering by primary is instant but on the other 
it is slow and vice-versa. Switching the columns around on the slow 
case fixes the issue.


All relevant columns are indexed.


What does EXPLAIN ANALYZE say?


I finally managed to get back to this and have a similar sample query. 
In this case, changing the WHERE clause to use the joined table column 
slows the query down.


The initial case was on the ORDER BY column but the symptoms are the 
same, I think.


I understand that even though both columns are indexed, the indexes are 
completely different. But the point is, how would one know in advance 
which one will be faster when designing the query?


And as I mentioned in my initial text, the fast case of columns can 
switch around as the database grows.



Fast case:

SELECT
  t0."status",
  b1."timestamp"
FROM
  "transactions" AS t0
INNER JOIN
  "blocks" AS b1
ON
  b1."number" = t0."block_number"
WHERE
  (((t0."to_address_hash" = '\x3012c'))
    OR (t0."from_address_hash" = '\x3012c')
    OR (t0."created_contract_address_hash" = '\x3012c'))
  AND (t0."block_number" >= 30926000)
  AND (t0."block_number" <= 31957494)
ORDER BY
  t0."block_number" DESC
LIMIT
  150
OFFSET
  300;

Plan:

Limit  (cost=15911.73..23367.09 rows=150 width=16) (actual time=205.093..305.423 rows=150 loops=1)
  ->  Gather Merge  (cost=1001.03..812143.50 rows=16320 width=16) (actual time=36.140..305.333 rows=450 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Nested Loop  (cost=1.00..809259.75 rows=6800 width=16) (actual time=2.662..155.779 rows=153 loops=3)
              ->  Parallel Index Scan Backward using transactions_block_number_index on transactions t0  (cost=0.56..753566.08 rows=6800 width=8) (actual time=0.224..145.998 rows=153 loops=3)
                    Index Cond: ((block_number >= 30926000) AND (block_number <= 31957494))
                    Filter: ((to_address_hash = '\x3012c'::bytea) OR (from_address_hash = '\x3012c'::bytea) OR (created_contract_address_hash = '\x3012c'::bytea))
                    Rows Removed by Filter: 22471
              ->  Index Scan using blocks_number_index on blocks b1  (cost=0.44..8.18 rows=1 width=16) (actual time=0.059..0.060 rows=1 loops=460)
                    Index Cond: (number = t0.block_number)
Planning Time: 0.513 ms
Execution Time: 305.541 ms

--

Slow case:

SELECT
  t0."status",
  b1."timestamp"
FROM
  "transactions" AS t0
INNER JOIN
  "blocks" AS b1
ON
  b1."number" = t0."block_number"
WHERE
  (((t0."to_address_hash" = '\x3012c'))
    OR (t0."from_address_hash" = '\x3012c')
    OR (t0."created_contract_address_hash" = '\x3012c'))
  AND (b1."number" >= 30926000) -- using col from joined table instead
  AND (b1."number" <= 31957494) -- using col from joined table instead
ORDER BY
  t0."block_number" DESC
LIMIT
  150
OFFSET
  300;

Plan:

Limit  (cost=1867319.63..1877754.02 rows=150 width=16) (actual time=95830.704..95962.116 rows=150 loops=1)
  ->  Gather Merge  (cost=1846450.83..2015348.94 rows=2428 width=16) (actual time=95805.872..95962.075 rows=450 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Merge Join  (cost=1845450.81..2014068.67 rows=1012 width=16) (actual time=95791.362..95824.633 rows=159 loops=3)
              Merge Cond: (t0.block_number = b1.number)
              ->  Sort  (cost=1845402.63..1845823.81 rows=168474 width=8) (actual time=95790.194..95790.270 rows=186 loops=3)
                    Sort Key: t0.block_number DESC
                    Sort Method: external merge  Disk: 2496kB
                    Worker 0:  Sort Method: external merge  Disk: 2408kB
                    Worker 1:  Sort Method: external merge  Disk: 2424kB
                    ->  Parallel Bitmap Heap Scan on transactions t0  (cost=39601.64..1828470.76 rows=168474 width=8) (actual time=7274.149..95431.494 rows=137658 loops=3)
                          Recheck Cond: ((to_address_hash = '\x3012c'::bytea) OR (from_address_hash = '\x3012c&

Re: Postgres undeterministically uses a bad plan, how to convince it otherwise?

2023-03-03 Thread cen

Likely to be safe, you'd just include both. The problem is that the
query planner makes use of equivalence classes to deduce equivalence
in quals.

If you have a query such as:

select * from t1 inner join t2 on t1.x = t2.y where t1.x = 3;

then the planner can deduce that t2.y must also be 3 and that qual can
be pushed down to the scan level. If t2.y = 3 is quite selective and
there's an index on that column, then this deduction is likely going
to be a very good win, as the alternative of not using it requires
looking at all rows in t2.


Does the equivalence only work for constants, as in the sample you 
provided, or will it also be detected between b1."number" and 
t0."block_number" in my sample query?


Meaning the columns could be used interchangeably in all the WHERE 
clauses and in the ORDER BY clause, and then it is a matter of figuring 
out which costs less.
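
For reference, the "just include both" suggestion applied to the query 
posted earlier in the thread would look roughly like this (the same 
range condition spelled out against both sides of the join, so either 
index path is available to the planner):

SELECT
  t0."status",
  b1."timestamp"
FROM
  "transactions" AS t0
INNER JOIN
  "blocks" AS b1
ON
  b1."number" = t0."block_number"
WHERE
  (((t0."to_address_hash" = '\x3012c'))
    OR (t0."from_address_hash" = '\x3012c')
    OR (t0."created_contract_address_hash" = '\x3012c'))
  AND (t0."block_number" >= 30926000)
  AND (t0."block_number" <= 31957494)
  AND (b1."number" >= 30926000)
  AND (b1."number" <= 31957494)
ORDER BY
  t0."block_number" DESC
LIMIT
  150
OFFSET
  300;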








Re: Patroni vs pgpool II

2023-04-04 Thread cen


Can someone please suggest which one (Patroni vs PGPool II) is best 
for achieving HA/auto failover and load balancing for DB servers? Along 
with this, can you please share the company/client names using these 
tools for large PG databases?


Having used pgpool in multiple production deployments I swore to never 
use it again, ever.


The first reason is that you need a doctorate degree to try to 
understand how it actually works, what the pcp commands do in each 
scenario and how to correctly write the failover scripts.


It is basically a daemon glued together with scripts for which you are 
entirely responsible. Any small mistake in the failover scripts and the 
cluster enters a broken state.


Even once you have it set up as it should be, yes, it will fail over 
correctly, but it won't auto-heal without manual intervention.


You also often end up in weird situations where the backends are up but 
pgpool reports them as down (and similar scenarios), and then you need 
to run the precise sequence of pcp commands to recover, or destroy your 
whole cluster in the process if you mistype.


I haven't used patroni yet but it surely can't be worse.


Best regards, cen


Re: Copyright vs Licence

2021-05-10 Thread cen


If MSFT is the sole holder of the copyright, then they can relicense 
it as they see fit. *I think* that they can only change the license on 
*newer* versions, so you'd be able to keep using the latest OSS version.


That is correct. If I get version 1 of your program under license A, 
you can't come back a year later and tell me the same code is now 
licensed under B, hindering the original freedoms of the license.


What can and does happen is that a new version is released under a 
different license while the old version is made obsolete. In the real 
world that means you are stuck with the old version, so you either need 
to upgrade to a newer version with a different license or use something 
else. Real-world cases are the MongoDB and Redis modules license changes.




Re: Timestamp with vs without time zone.

2021-09-21 Thread cen
From my experience, and some might disagree, I prefer to store 
timestamps purely in UTC in the database and handle time zones in the 
ORM or on the client side.


The only time I actually needed to store time zone information in a 
dedicated column was when I needed to convey that information to the end 
user, for example "your plane will arrive at this date and time in this 
destination time zone". The majority of other cases are just a 
localization issue and don't require you to store the time zone info.


Having to rely on the database to muck around with time zones, or doing 
it in session settings (which some advocate), is just asking for trouble 
in my opinion.
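
A minimal sketch of that "destination time zone" case, using a 
hypothetical flights table: the instant is stored as timestamptz (held 
as UTC internally) and the zone name sits in its own column, used only 
when presenting the value:

-- Hypothetical table: store the instant plus the zone name needed for display.
CREATE TABLE flights (
    id           bigint PRIMARY KEY,
    arrival_utc  timestamptz NOT NULL,   -- absolute instant, stored as UTC internally
    arrival_zone text        NOT NULL    -- e.g. 'Pacific/Auckland', only needed for display
);

-- Render the arrival in the destination's local wall-clock time:
SELECT id,
       arrival_utc AT TIME ZONE arrival_zone AS arrival_local
FROM flights;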



On 9/21/21 9:35 AM, Tim Uckun wrote:

It seems like it would be so much more useful if the timestamp with
time zone type actually stored the time zone in the record.


On Tue, Sep 21, 2021 at 7:25 PM Laurenz Albe  wrote:

On Tue, 2021-09-21 at 18:00 +1200, Tim Uckun wrote:

I am hoping to get some clarification on timestamp with time zone.

My understanding is that timestamp with time zone stores data in UTC
but displays it in your time zone.

That is correct.
When a timestamp is rendered as a string, it is shown in the time zone
specified by the current setting of the "timezone" parameter in your
database session.


Does this also work on queries? If
I query between noon and 2:00 PM on some date in time zone XYZ does pg
translate the query to UTC before sending it to the server?

Yes.


To provide context I have the following situation.

I have a data file to import. All the dates in the time zone
pacific/auckland. My app reads the data , does some processing and
cleaning up and then saves it to the database.

The language I am using creates the time data type with the right time
zone. The processing is being done on a server which is on UTC, the
database server is also on UTC.  I am pretty sure the ORM isn't
appending "at time zone pacific/Auckland" to the data when it appends
it to the database.

So does the database know the timestamp is in auckland time when the
client is connecting from a server on UTC?

It does, but only if you set "timezone" appropriately in the database
session.  You could use ALTER ROLE to change the default setting for a
database user, but it might be best to set that from the application.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
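
To spell out the two options Laurenz describes (the role name below is 
hypothetical):

-- Per session: what the application would issue right after connecting.
SET timezone = 'Pacific/Auckland';

-- Per role: a default applied to every new session of that user.
ALTER ROLE app_user SET timezone = 'Pacific/Auckland';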








Re: Timestamp with vs without time zone.

2021-09-21 Thread cen


On 21. 09. 21 23:34, Michael Lewis wrote:
Related to this current discussion and exchange of ideas... is there a 
best practice for retrieving data in such a way as the rows are 
localized to a timezone for where/group by purposes. That is, if I 
have a table which has events, but those events belong to a tenant or 
some entity that has a location which implies a timezone (or at least 
an offset), is there a best way to write a query similar to the below? 
Please forgive and overlook if there is some obvious syntax error, as 
this is just a quick and dirty example. Might it make sense to store a 
"localized" version of the timestamp *without* timezone on the event 
record such that an index can be used for fast retrieval and even 
grouping?


select
    date_trunc('month', e.event_datetime AT TIME ZONE t.time_zone_name) AS event_date,
    count(e.id)
from events AS e
join tenants AS t ON t.id = e.tenant_id
where e.event_datetime AT TIME ZONE t.time_zone_name >= '01/01/2021'::DATE
  AND e.event_datetime AT TIME ZONE t.time_zone_name < '09/01/2021'::DATE
group by 1;



This is an interesting case. A simplified query example would be to 
"give me all events for this year".


I am not sure what the cost of shifting from UTC is, probably not much, 
but depending on the use case it might make sense to deconstruct the 
timestamp into separate date and time columns for query optimization.
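
A minimal sketch of the "store a localized copy" idea from the question 
above, against the hypothetical events/tenants schema; the extra column 
has to be maintained at write time (or backfilled) because the 
conversion depends on the tenant's zone:

-- Localized copy of the event time, kept as timestamp WITHOUT time zone.
ALTER TABLE events ADD COLUMN event_datetime_local timestamp;

-- Backfill using the tenant's zone; AT TIME ZONE turns the timestamptz into a local timestamp.
UPDATE events AS e
SET event_datetime_local = e.event_datetime AT TIME ZONE t.time_zone_name
FROM tenants AS t
WHERE t.id = e.tenant_id;

CREATE INDEX events_event_datetime_local_idx ON events (event_datetime_local);

-- Range filters and grouping can then use the index directly:
SELECT date_trunc('month', event_datetime_local) AS event_month,
       count(*)
FROM events
WHERE event_datetime_local >= DATE '2021-01-01'
  AND event_datetime_local <  DATE '2021-09-01'
GROUP BY 1;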