Hi pgsql-hackers,

I'm planning to apply to GSoC'17, and my proposal currently consists of two 
parts:

(1) Add error handling to COPY as the minimum program

Motivation: Having used PG on a daily basis for years, I have run into cases 
where you need to load (e.g. for further analytics) a bunch of not entirely 
consistent records with occasional type/column mismatches. Since PG throws an 
exception on the first error, currently the only solution is to preformat your 
data with some other tool and then load it into PG. However, it is frequently 
easier to drop the few bad records than to do such preprocessing for every 
data source you have.
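
To make the current workaround concrete, here is a minimal preprocessing 
sketch in Python (file names and the column count are placeholders): it drops 
malformed rows so that a subsequent COPY ... WITH (FORMAT csv) no longer 
aborts.

    import csv

    EXPECTED_COLS = 5  # placeholder: column count of the target table

    # Keep only the rows with the expected number of columns; the
    # cleaned file can then be loaded with a plain COPY.
    with open("raw.csv", newline="") as src, \
         open("clean.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if len(row) == EXPECTED_COLS:
                writer.writerow(row)

This only catches column-count mismatches; type errors would still need 
either a per-column check here or the proposed error handling inside COPY 
itself.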

I have done a little research and found the corresponding item in PG's TODO 
(https://wiki.postgresql.org/wiki/Todo#COPY) and a previous attempt to push a 
similar patch: 
https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com
There were no negative responses to that patch; it seems it was simply 
forgotten and never finalized.

As an example of the general idea, consider the read_csv function of the 
Python package pandas 
(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html). 
It uses a C parser which throws an error on the first column mismatch. 
However, it also has two flags, error_bad_lines and warn_bad_lines: setting 
error_bad_lines to False makes it drop bad lines instead of failing, and 
setting warn_bad_lines to False additionally hides the warnings about them.
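
For concreteness, a minimal sketch of that behaviour (the file name is a 
placeholder, and it assumes a pandas version that still provides these 
flags):

    import pandas as pd

    # "data.csv" stands for a file in which some rows have the wrong
    # number of columns. With error_bad_lines=False pandas drops those
    # rows instead of raising; warn_bad_lines=False would additionally
    # silence the per-row warnings it prints.
    df = pd.read_csv("data.csv", error_bad_lines=False, warn_bad_lines=True)

Something along these lines for COPY is exactly what I have in mind.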


(2) Parallel COPY execution as the maximum program

I guess there is not much to say about motivation: COPY should simply be 
faster on multicore CPUs.

There is also a record about parallel COPY in PG's wiki: 
https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some 
third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but it 
is always better to have well-performing core functionality out of the box.
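
To illustrate where the speedup would come from, here is a rough client-side 
approximation in Python (the connection string, table name and chunk size 
are placeholders; it assumes psycopg2): independent connections load 
disjoint chunks of the input in parallel, which is roughly the effect an 
in-core implementation could achieve with background workers instead.

    import io
    from multiprocessing import Pool

    import psycopg2

    DSN = "dbname=test"      # placeholder connection string
    TABLE = "target_table"   # placeholder table name

    def copy_chunk(lines):
        # Each worker loads its batch over a separate connection.
        with psycopg2.connect(DSN) as conn:
            with conn.cursor() as cur:
                cur.copy_expert("COPY %s FROM STDIN" % TABLE,
                                io.StringIO("".join(lines)))

    def parallel_copy(path, workers=4, chunk_size=100000):
        with open(path) as f, Pool(workers) as pool:
            chunk, jobs = [], []
            for line in f:
                chunk.append(line)
                if len(chunk) >= chunk_size:
                    jobs.append(pool.apply_async(copy_chunk, (chunk,)))
                    chunk = []
            if chunk:
                jobs.append(pool.apply_async(copy_chunk, (chunk,)))
            for job in jobs:
                job.get()  # re-raise any worker failure

Note that this sketch gives up atomicity (each chunk commits on its own), 
which is one of the problems a proper in-core implementation would have to 
solve.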


My main concerns here are:

1) Is there anyone in the PG community who would be interested in such a 
project and could be a mentor?
2) These two points share a general idea – simplifying work with large 
amounts of data from different sources – but maybe it would be better to 
focus on a single task?
3) Is it realistic to mostly finish both parts during the 3+ months of almost 
full-time work, or am I being too presumptuous?

I would very much appreciate any comments and criticism.


P.S. I know about the very interesting ready-made projects proposed by the 
PG community (https://wiki.postgresql.org/wiki/GSoC_2017), but it is always 
more interesting to solve your own problems, issues and questions, which are 
the product of your experience with the software. That's why I dare to 
propose my own project.

P.P.S. A few words about me: I'm a PhD student in theoretical physics from 
Moscow, Russia, and have been heavily involved in software development since 
2010. I believe I have good skills in Python, Ruby, JavaScript, MATLAB, C 
and Fortran development and a basic understanding of algorithm design and 
analysis.


Best regards,

Alexey
