Hi Tim, thanks for your answer! The columns were just examples, but let me explain the database structure; the fields in *bold* are the keys:
*customer_id integer*
*date_time timestamp*
*indicator_id integer*
*element_id integer*
indicator_value double precision

The table is partitioned per day and customer_id (it works great); the problem is just the duplicated-key situation that I'm really worried about. I populate the database via a Java application with JDBC.

Maybe this info helps shed some light! Thanks again!

Leandro Guimarães

On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophil...@gmail.com> wrote:

>
> Leandro Guimarães <leo.guimar...@gmail.com> writes:
>
> > Hi,
> >
> > I have a scenario with a large table and I'm trying to insert it via a
> > COPY command with a csv file.
> >
> > Everything works, but sometimes my source .csv file has data that is
> > duplicated in the previously filled table. If I add a check constraint
> > and try to run the COPY command, I get an error that stops the whole
> > insertion.
> >
> > I've tried to put the data in a tmp table and fill the main one using
> > distinct this way (the fields and names are just examples):
> >
> > INSERT INTO final_table (name, document)
> > SELECT DISTINCT name, document
> > FROM tmp_TABLE t1
> > WHERE NOT EXISTS (
> >   SELECT 1 FROM final_table t2
> >   WHERE (t2.name, t2.document)
> >   IS NOT DISTINCT FROM (t1.name, t1.document))
> >
> > The problem is that my final_table is a large (and partitioned) table
> > and this query is taking a long time to execute.
> >
> > Does anyone have any idea (really, guys, anything would be great) how to
> > solve this situation? I need to ignore duplicates instead of getting an
> > error.
> >
> > I'm using *PostgreSQL 9.4*, so I can't use "ON CONFLICT", and an upgrade
> > is not an option.
> >
>
> An explain plan would probably shed some light, but I suspect your
> performance is being heavily hit by the subquery. Distinct is an
> expensive operation, and you are performing it once for every distinct
> row in your temp table.
>
> It isn't clear what the primary key is for your final table - name +
> document seems suspicious given these seem to be the only two columns
> you're inserting as well. You don't indicate what the data types are
> either - if document is something like 'text', then using it in a
> distinct clause is likely to have a huge performance impact.
>
> The first thing I'd do is eliminate duplicates from your temp table
> as a separate statement, or by pre-filtering the CSV before import. I
> would then try something like an outer join to identify rows in your
> temp table which don't exist in your final table, and select from there
> to insert into the final table. You don't really need the distinct in
> the subquery, as all you really need to know is whether (name, document)
> exists - it doesn't matter if more than one exists (for this test).
>
> If you really don't have something more specific for a primary key,
> depending on what data type 'document' is and how large it is, you may
> find it useful to add a column which is a checksum of your 'document'
> field. I have done this in the past in an application where name was
> not unique and we only wanted distinct instances of 'document'
> (document was a fairly large XML document in that case).
>
> --
> Tim Cross
>
>
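Just to check I've understood the suggestion, here is a rough sketch of what I plan to try with the real columns (tmp_table / final_table are just placeholder names for my staging and target tables, and I'm assuming a duplicate means the whole four-column key matches):

-- 1. Remove duplicates inside the staging table itself (keep one row per key).
DELETE FROM tmp_table a
USING tmp_table b
WHERE a.ctid < b.ctid
  AND a.customer_id  = b.customer_id
  AND a.date_time    = b.date_time
  AND a.indicator_id = b.indicator_id
  AND a.element_id   = b.element_id;

-- 2. Anti-join: insert only the rows whose key is not already in the final table.
INSERT INTO final_table (customer_id, date_time, indicator_id, element_id, indicator_value)
SELECT t1.customer_id, t1.date_time, t1.indicator_id, t1.element_id, t1.indicator_value
FROM tmp_table t1
LEFT JOIN final_table t2
  ON  t2.customer_id  = t1.customer_id
  AND t2.date_time    = t1.date_time
  AND t2.indicator_id = t1.indicator_id
  AND t2.element_id   = t1.element_id
WHERE t2.customer_id IS NULL;

Since the table is partitioned per day and customer_id, I imagine adding a date_time range predicate covering only the days present in the CSV should let constraint exclusion skip the untouched partitions in step 2 - does that sound right?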