Hi Adrian,

   You are right, these fields are in CHECK CONSTRAiNTS and they are not
formally defined as Primary Keys.

Thanks!
Leandro Guimarães



On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver <adrian.kla...@aklaver.com>
wrote:

> On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> > Hi Tim, thanks for you answer!
> >
> > The columns were just examples, but let me explain the database
> > structure, the fields in *bold are the keys*:
> >
> > *customer_id integer*
> > *date_time timestamp*
> > *indicator_id integer*
> > *element_id integer*
> > indicator_value double precision
>
> Huh, earlier you said you had a check constraint that was causing issues.
>
> Does that also exist or where you referring to the keys above?
>
> Are the keys above formally defined as the PRIMARY KEY?
> -
> >
> > The table is partitioned per day and customer_id (it works great) the
> > problem is just the duplicated key situation that I'm really worried
> about.
> >
> > I populate the database via a Java Application with JDBC.
> >
> > Maybe this info could help to provide some light!
> >
> > Thanks Again!
> >
> > Leandro Guimarães
> >
> >
> >
> > On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophil...@gmail.com
> > <mailto:theophil...@gmail.com>> wrote:
> >
> >
> >     Leandro Guimarães <leo.guimar...@gmail.com
> >     <mailto:leo.guimar...@gmail.com>> writes:
> >
> >      > Hi,
> >      >
> >      >    I have a scenario with a large table and I'm trying to insert
> >     it via a
> >      > COPY command with a csv file.
> >      >
> >      >    Everything works, but sometimes my source .csv file has
> >     duplicated data
> >      > in the previously fulfilled table. If I add a check constraint
> >     and try to
> >      > run the COPY command I have an error that stops the whole
> insertion.
> >      >
> >      >   I've tried to put the data in a tmp table and fill the main
> using
> >      > distinct this way (the fields and names are just examples):
> >      >
> >      > INSERT INTO final_table values (name, document)
> >      >    SELECT DISTINCT name, document
> >      >    FROM tmp_TABLE t1
> >      >    WHERE NOT EXISTS (
> >      >    SELECT 1 FROM final_table t2
> >      >    WHERE (t2.name <http://t2.name>, t2.document)
> >      >    IS NOT DISTINCT FROM (t1.name <http://t1.name>, t1.document))
> >      >
> >      > The problem is that my final_table is a large (and partitioned)
> >     table and
> >      > this query is taking a long time to execute.
> >      >
> >      > Someone have any idea (really guys anything would be great) how
> >     to solve
> >      > this situation? I need to ignore duplicates instead to have some
> >     error.
> >      >
> >      > I'm using* PostgreSQL 9.4* so I can't use "ON CONFLICT" and
> >     upgrade is not
> >      > an option.
> >      >
> >
> >     Explain plan would probably shed some light, but I suspect your
> >     performance is being heavily hit by the sub query. Distinct is an
> >     expensive operation and you are performing it once for every
> >     distinct row
> >     in your temp table.
> >
> >     It isn't clear what the primary key is for your final table - name +
> >     document seems suspicious given these seem to be the only two columns
> >     your inserting as well. You don't indicate what the data types are
> >     either - it document is something like 'text' then using it in a
> >     distinct clause is likely to have huge performance impact.
> >
> >     The first thing I'd do is to eliminate duplicates from your temp
> table
> >     as a separate statement or by pre-filtering the CSV before import. I
> >     would then try something like an outer join to identify rows in your
> >     temp table which don't exist in your final table and select from
> there
> >     to insert into the final table. You don't really need the distinct in
> >     the sub query as all you really need to know is if (name, document)
> >     exists - it doesn't matter if more than one exists (for this test).
> >
> >     If you really don't have something more specific for a primary key,
> >     depending on what data type 'document' is and how large it is, you
> may
> >     find adding a column which is a checksum of your 'document' field a
> >     useful addition. I have done this in the past where I had an
> application
> >     where name was not unique and we only wanted distinct instances of
> >     'document' (document was a fairly large XML document in this case).
> >
> >     --
> >     Tim Cross
> >
> >
>
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
>

Reply via email to