Hi

2017-03-23 12:33 GMT+01:00 Alexey Kondratov <kondratov.alek...@gmail.com>:
> Hi pgsql-hackers,
>
> I'm planning to apply to GSoC'17, and my proposal currently consists of
> two parts:
>
> (1) Add error handling to COPY as a minimum program
>
> Motivation: Having used PG on a daily basis for years, I have found that
> there are cases when you need to load (e.g. for further analytics) a
> bunch of not entirely consistent records with occasional type/column
> mismatches. Since PG throws an exception on the first error, currently
> the only solution is to preformat your data with some other tool and then
> load it into PG. However, it is frequently easier to drop the few bad
> records than to do such preprocessing for every data source you have.
>
> I have done a little research and found the item in PG's TODO
> https://wiki.postgresql.org/wiki/Todo#COPY, as well as a previous attempt
> to push a similar patch:
> https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a...@mail.gmail.com.
> There were no negative responses to that patch, and it seems it was
> simply forgotten and never finalized.
>
> As an example of the general idea I can point to the *read_csv* method of
> the Python package *pandas*
> (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).
> It uses a C parser which throws an error on the first column mismatch.
> However, it has two flags, *error_bad_lines* and *warn_bad_lines*, which,
> when set to False, let you drop bad lines or even suppress the warning
> messages about them.
>
> (2) Parallel COPY execution as a maximum program
>
> I guess there is not much to say about the motivation: it simply should
> be faster on multicore CPUs.
>
> There is also an entry about parallel COPY in PG's wiki:
> https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some
> third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but
> it is always better to have well-performing core functionality out of the
> box.
>
> My main concerns here are:
>
> 1) Is there anyone in the PG community who would be interested in such a
> project and could be a mentor?
>
> 2) These two points share a general idea - to simplify working with large
> amounts of data from different sources - but maybe it would be better to
> focus on a single task?
>

I spent a lot of time on the implementation of (1) - maybe I still have a
patch somewhere. The two tasks have something in common - you have to
divide the import into batches (see the sketches at the end of this
message).

> 3) Is it realistic to mostly finish both parts during the 3+ months of
> almost full-time work, or am I too presumptuous?
>

It is possible, I think - I am not sure about all the details, but a basic
implementation can be done in 3 months.

> I would greatly appreciate any comments and criticism.
>
> P.S. I know about the very interesting ready-made project ideas from the
> PG community (https://wiki.postgresql.org/wiki/GSoC_2017), but it is
> always more interesting to solve your own problems, issues and questions,
> which are the product of your experience with the software. That's why I
> dare to propose my own project.
>
> P.P.S. A few words about me: I'm a PhD student in theoretical physics
> from Moscow, Russia, and have been heavily involved in software
> development since 2010. I believe I have good skills in Python, Ruby,
> JavaScript, MATLAB, C and Fortran development, and a basic understanding
> of algorithm design and analysis.
>
> Best regards,
>
> Alexey
>
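For reference, the pandas behaviour described above looks roughly like
this (a minimal sketch; the file name is just an example):

import pandas as pd

# error_bad_lines=False drops rows with the wrong number of columns
# instead of raising; warn_bad_lines=True still reports each dropped
# row, set it to False to silence those warnings as well.
df = pd.read_csv('dirty_data.csv',
                 error_bad_lines=False,
                 warn_bad_lines=True)

Something with this shape ("skip bad rows, optionally warn") is what I
would expect a COPY option to look like as well.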
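And here is a rough client-side illustration of the batching idea behind
(2) - split the input and run several COPY streams over separate
connections. This is only a sketch, not how in-core parallel COPY would
work; the DSN, table name and file name are assumptions, it uses psycopg2,
and it presumes one record per line (no quoted newlines in the CSV):

import multiprocessing
from io import StringIO
import psycopg2

DSN = 'dbname=test'   # assumed connection string
NUM_WORKERS = 4

def copy_chunk(lines):
    # Each worker opens its own connection and streams one batch of
    # raw CSV lines through COPY ... FROM STDIN.
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        buf = StringIO(''.join(lines))
        cur.copy_expert(
            "COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
    conn.close()

if __name__ == '__main__':
    with open('dirty_data.csv') as f:
        lines = f.readlines()
    # Divide the import into roughly equal batches, one per worker.
    chunks = [lines[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    with multiprocessing.Pool(NUM_WORKERS) as pool:
        pool.map(copy_chunk, chunks)

The interesting part of the project is doing this division inside the
server, where the input parsing and the per-batch error handling from (1)
could share infrastructure.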