Hi all

I need ideas regarding a way to design an insert, truncate/delete, and reinsert 
process on data that is in UTC and database range partitions that are in the 
Europe/Brussels time zone. Therefore, any input is welcome. 

The issue at hand is this. We receive daily file deliveries of time-based 
measurements. The files are zipped csv's. A file covers a UTC-day (e.g. 
2021-01-15 00:00:00+00 to 2021-01-15 23:59:59+00). About 100 million rows a 
day. 

Loaded by Python into a Postgres 12 database table, range-partitioned on 
Europe/Brussels days. Database session in UTC, since timestamps in data has no 
time zone indicator. Most rows hit the same partition, but the last hour (this 
time of year, in the summer it is 2 hours) hits the next partition. Streamed 
directly from zip by psycopg2.cursor.copy_expert. Works perfect. 

Then analysis on the data can proceed. Aggregating stuff on daily or monthly 
basis. This works fine since data is partitioned on our local time zone and we 
want aggregates on the same time zone. 

Then, two weeks later a better delivery comes in covering the same UTC-period. 
Data quality is better, so we want to replace the old data with the new data. 
If data were partitioned on the same time zone boundary as the file covers, 
then a truncate partition, insert new data would be the obvious solution. That 
is no good, since the truncate will wipe an hour too much on one end and an 
hour too little on the other end. 

So what is a better solution? Delete from t where t.timecolumn between a and b, 
and reinsert the new data. My instinct says no, but I cannot really think of a 
good alternative. It is 100 million rows with 5 or 6 numeric columns.

Regards Niels Jespersen





Reply via email to