2017-06-30 14:23 GMT+02:00 Alex K <kondratov.alek...@gmail.com>:

> Greetings pgsql-hackers,
>
> I am a GSOC student this year, my initial proposal has been discussed
> in the following thread
> https://www.postgresql.org/message-id/flat/7179F2FD-49CE-4093-AE14-1B26C5DFB0DA%40gmail.com
>
> Patch with COPY FROM errors handling seems to be quite finished, so
> I have started thinking about parallelism in COPY FROM, which is the next
> point in my proposal.
>
> In order to understand are there any expensive calls in COPY, which
> can be executed in parallel, I did a small research. First, please, find
> flame graph of the most expensive copy.c calls during the 'COPY FROM file'
> attached (copy_from.svg). It reveals, that inevitably serial operations
> like CopyReadLine (<15%), heap_multi_insert (~15%) take less than 50% of
> time in summary, while remaining operations like heap_form_tuple and
> multiple checks inside NextCopyFrom probably can be executed well in
> parallel.
>
> Second, I have compared an execution time of 'COPY FROM a single large
> file (~300 MB, 50000000 lines)' vs. 'COPY FROM four equal parts of the
> original file executed in the four parallel processes'. Though it is a
> very rough test, it helps to obtain an overall estimation:
>
> Serial:
> real 0m56.571s
> user 0m0.005s
> sys 0m0.006s
>
> Parallel (x4):
> real 0m22.542s
> user 0m0.015s
> sys 0m0.018s
>
> Thus, it results in a ~60% performance boost per each x2 multiplication of
> parallel processes, which is consistent with the initial estimation.
>

An important use case is a big table with a lot of indexes. Did you test a similar case?
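
As a side note, the "~60% per doubling" figure in the quoted numbers can be checked directly from the two "real" timings. The following is only a rough sketch of that arithmetic in Python; it assumes the gain compounds evenly across each doubling of processes, which is a simplification and not something the test itself establishes. The variable names are mine.

```python
import math

# Wall-clock ("real") timings taken from the quoted message.
serial_s = 56.571     # one COPY FROM over the whole file
parallel_s = 22.542   # four COPY FROM processes, one per file part
workers = 4

# Overall speedup of the parallel run over the serial run.
speedup = serial_s / parallel_s

# Number of x2 steps needed to go from 1 process to `workers` processes.
doublings = math.log2(workers)

# Assumption: the gain is the same for every doubling, so the per-doubling
# factor is the geometric mean of the overall speedup.
per_doubling = speedup ** (1 / doublings)

print(f"total speedup: {speedup:.2f}x")
print(f"gain per doubling: {(per_doubling - 1) * 100:.0f}%")
```

This gives a gain a little under 60% per doubling, in line with the quoted estimate. It says nothing about tables with many indexes, where the index updates may dominate.
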
Regards

Pavel