On Wed, 26 Feb 2020 at 10:54, Amit Kapila <amit.kapil...@gmail.com> wrote: > > On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra > <tomas.von...@2ndquadrant.com> wrote: > > ... > > > > Perhaps. I guess it'll depend on the CSV file (number of fields, ...), > > so I still think we need to do some measurements first. > > > > Agreed. > > > I'm willing to > > do that, but (a) I doubt I'll have time for that until after 2020-03, > > and (b) it'd be good to agree on some set of typical CSV files. > > > > Right, I don't know what is the best way to define that. I can think > of the below tests. > > 1. A table with 10 columns (with datatypes as integers, date, text). > It has one index (unique/primary). Load with 1 million rows (basically > the data should be probably 5-10 GB). > 2. A table with 10 columns (with datatypes as integers, date, text). > It has three indexes, one index can be (unique/primary). Load with 1 > million rows (basically the data should be probably 5-10 GB). > 3. A table with 10 columns (with datatypes as integers, date, text). > It has three indexes, one index can be (unique/primary). It has before > and after trigeers. Load with 1 million rows (basically the data > should be probably 5-10 GB). > 4. A table with 10 columns (with datatypes as integers, date, text). > It has five or six indexes, one index can be (unique/primary). Load > with 1 million rows (basically the data should be probably 5-10 GB). > > Among all these tests, we can check how much time did we spend in > reading, parsing the csv files vs. rest of execution?
That's a good set of tests of what happens after the parse. Two simpler test runs may provide useful baselines - no constraints/indexes with all columns varchar and no constraints/indexes with columns correctly typed. For testing the impact of various parts of the parse process, my idea would be: - A base dataset with 10 columns including int, date and text. One text field quoted and containing both delimiters and line terminators - A derivative to measure just line parsing - strip the quotes around the text field and quote the whole row as one text field - A derivative to measure the impact of quoted fields - clean up the text field so it doesn't require quoting - A derivative to measure the impact of row length - run ten rows together to make 100 column rows, but only a tenth as many rows If that sounds reasonable, I'll try to knock up a generator. -- Alastair