On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <a...@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.von...@2ndquadrant.com>
> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object       Symbol
> # ........  ........  ..................  ....................................
> #
>    18.24%  postgres  postgres            [.] CopyReadLine
>     9.23%  postgres  postgres            [.] NextCopyFrom
>     8.87%  postgres  postgres            [.] NextCopyFromRawFields
>     5.82%  postgres  postgres            [.] pg_verify_mbstr_len
>     5.45%  postgres  postgres            [.] pg_strtoint32
>     4.16%  postgres  postgres            [.] heap_fill_tuple
>     4.03%  postgres  postgres            [.] heap_compute_data_size
>     3.83%  postgres  postgres            [.] CopyFrom
>     3.78%  postgres  postgres            [.] AllocSetAlloc
>     3.53%  postgres  postgres            [.] heap_form_tuple
>     2.96%  postgres  postgres            [.] InputFunctionCall
>     2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
>     1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
>     1.72%  postgres  postgres            [.] AllocSetReset
>     1.72%  postgres  postgres            [.] RelationPutHeapTuple
>     1.47%  postgres  postgres            [.] heap_prepare_insert
>     1.31%  postgres  postgres            [.] heap_multi_insert
>     1.25%  postgres  postgres            [.] textin
>     1.24%  postgres  postgres            [.] int4in
>     1.05%  postgres  postgres            [.] tts_buffer_heap_clear
>     0.85%  postgres  postgres            [.] pg_any_to_server
>     0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
>     0.77%  postgres  postgres            [.] cstring_to_text_with_len
>     0.69%  postgres  postgres            [.] AllocSetFree
>     0.60%  postgres  postgres            [.] appendBinaryStringInfo
>     0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
>     0.54%  postgres  postgres            [.] palloc
>     0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
>     0.51%  postgres  postgres            [.] palloc0
>     0.51%  postgres  postgres            [.] pg_encoding_max_length
>     0.48%  postgres  postgres            [.] enlargeStringInfo
>     0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
>     0.45%  postgres  postgres            [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>
I ran a very simple case on a table with 2 indexes, and I can see that a
lot of time is spent in index insertion. I agree that a good amount of
time is spent in tokenizing, but it is not very large compared to index
insertion. I have expanded the time spent in the CopyFrom function from
my perf report and pasted it here. We can see that a lot of time is
spent in ExecInsertIndexTuples (77%). I agree that we need to further
evaluate how much of that is I/O vs. CPU. But the point I want to make
is that it is not true for all cases that parsing takes the maximum
amount of time. (A sketch of how such a call-graph profile can be
captured is at the end of this mail.)

- 99.50% CopyFrom
   - 82.90% CopyMultiInsertInfoFlush
      - 82.85% CopyMultiInsertBufferFlush
         + 77.68% ExecInsertIndexTuples
         +  3.74% table_multi_insert
         +  0.89% ExecClearTuple
   - 12.54% NextCopyFrom
      - 7.70% NextCopyFromRawFields
         - 5.72% CopyReadLine
              3.96% CopyReadLineText
            + 1.49% pg_any_to_server
           1.86% CopyReadAttributesCSV
      + 3.68% InputFunctionCall
   + 2.11% ExecMaterializeSlot
   + 0.94% MemoryContextReset

My test:
-- Prepare:
CREATE TABLE t (a int, b int, c varchar);
insert into t select i,i, 'aaaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1,10000000) as i;
copy t to '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);
truncate table t;
create index idx on t(a);
create index idx1 on t(c);

-- Test CopyFrom and measure with perf:
copy t from '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
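
PS: For anyone wanting to reproduce the call-graph profile above, here
is a minimal sketch of one way to capture it; the exact perf options
are illustrative assumptions, not necessarily what was used for the
report above:

-- In the psql session that will run the COPY, get the backend PID:
SELECT pg_backend_pid();

# In a shell, attach perf to that backend while the COPY is running
# (sample user-space cycles with call graphs):
perf record -g -e cycles:u -p <backend_pid>

# Then browse the report and expand CopyFrom to see the breakdown:
perf report -g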