Re: Parallel copy

Alastair Turner Wed, 26 Feb 2020 07:16:52 -0800

On Wed, 26 Feb 2020 at 10:54, Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
> >
...
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that.  I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after trigeers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>
> Among all these tests, we can check how much time did we spend in
> reading, parsing the csv files vs. rest of execution?


That's a good set of tests of what happens after the parse. Two
simpler test runs may provide useful baselines - no
constraints/indexes with all columns varchar and no
constraints/indexes with columns correctly typed.

For testing the impact of various parts of the parse process, my idea would be:
 - A base dataset with 10 columns including int, date and text. One
text field quoted and containing both delimiters and line terminators
 - A derivative to measure just line parsing - strip the quotes around
the text field and quote the whole row as one text field
 - A derivative to measure the impact of quoted fields - clean up the
text field so it doesn't require quoting
 - A derivative to measure the impact of row length - run ten rows
together to make 100 column rows, but only a tenth as many rows

If that sounds reasonable, I'll try to knock up a generator.

--
Alastair

Re: Parallel copy

Reply via email to