Hi

2017-03-23 12:33 GMT+01:00 Alexey Kondratov <kondratov.alek...@gmail.com>:
> Hi pgsql-hackers,
>
> I'm planning to apply to GSoC'17, and my proposal currently consists of
> two parts:
>
> (1) Add error handling to COPY as a minimum program
>
> Motivation: Having used PG on a daily basis for years, I have found that
> there are cases when you need to load (e.g. for further analytics) a
> bunch of not entirely consistent records with occasional type/column
> mismatches. Since PG throws an exception on the first error, currently
> the only solution is to preformat your data with some other tool and then
> load it into PG. However, it is frequently easier to drop the few bad
> records than to do such preprocessing for every data source you have.
>
> I have done a little research and found the item in PG's TODO
> https://wiki.postgresql.org/wiki/Todo#COPY, as well as a previous attempt
> to push a similar patch:
> https://www.postgresql.org/message-id/flat/603c8f070909141218i291bc983t501507ebc996a531%40mail.gmail.com#603c8f070909141218i291bc983t501507ebc996a...@mail.gmail.com.
> There were no negative responses to that patch, and it seems it was
> simply forgotten and never finalized.
>
> As an example of the general idea I can point to the *read_csv* method of
> the Python package *pandas*
> (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html).
> It uses a C parser which throws an error on the first column mismatch.
> However, it has two flags, *error_bad_lines* and *warn_bad_lines*, which,
> when set to False, let you drop bad lines or even suppress the warning
> messages about them.
>
> (2) Parallel COPY execution as a maximum program
>
> I guess there is not much to say about the motivation: it simply should
> be faster on multicore CPUs.
>
> There is also an entry about parallel COPY in PG's wiki:
> https://wiki.postgresql.org/wiki/Parallel_Query_Execution. There are some
> third-party extensions, e.g. https://github.com/ossc-db/pg_bulkload, but
> it is always better to have well-performing core functionality out of the
> box.
>
> My main concerns here are:
>
> 1) Is there anyone in the PG community who would be interested in such a
> project and could be a mentor?
>
> 2) These two points share a general idea - to simplify working with large
> amounts of data from different sources - but maybe it would be better to
> focus on a single task?
>

I spent a lot of time on the implementation of (1) - maybe I still have a
patch somewhere. The two tasks have something in common - you have to
divide the import into batches (see the sketches at the end of this
message).

> 3) Is it realistic to mostly finish both parts during the 3+ months of
> almost full-time work, or am I too presumptuous?
>

It is possible, I think - I am not sure about all the details, but a basic
implementation can be done in 3 months.

> I would greatly appreciate any comments and criticism.
>
> P.S. I know about the very interesting ready-made project ideas from the
> PG community (https://wiki.postgresql.org/wiki/GSoC_2017), but it is
> always more interesting to solve your own problems, issues and questions,
> which are the product of your experience with the software. That's why I
> dare to propose my own project.
>
> P.P.S. A few words about me: I'm a PhD student in theoretical physics
> from Moscow, Russia, and have been heavily involved in software
> development since 2010. I believe I have good skills in Python, Ruby,
> JavaScript, MATLAB, C and Fortran development, and a basic understanding
> of algorithm design and analysis.
>
> Best regards,
>
> Alexey
>
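For reference, the pandas behaviour described above looks roughly like
this (a minimal sketch; the file name is just an example):

import pandas as pd

# error_bad_lines=False drops rows with the wrong number of columns
# instead of raising; warn_bad_lines=True still reports each dropped
# row, set it to False to silence those warnings as well.
df = pd.read_csv('dirty_data.csv',
                 error_bad_lines=False,
                 warn_bad_lines=True)

Something with this shape ("skip bad rows, optionally warn") is what I
would expect a COPY option to look like as well.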
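And here is a rough client-side illustration of the batching idea behind
(2) - split the input and run several COPY streams over separate
connections. This is only a sketch, not how in-core parallel COPY would
work; the DSN, table name and file name are assumptions, it uses psycopg2,
and it presumes one record per line (no quoted newlines in the CSV):

import multiprocessing
from io import StringIO
import psycopg2

DSN = 'dbname=test'   # assumed connection string
NUM_WORKERS = 4

def copy_chunk(lines):
    # Each worker opens its own connection and streams one batch of
    # raw CSV lines through COPY ... FROM STDIN.
    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        buf = StringIO(''.join(lines))
        cur.copy_expert(
            "COPY target_table FROM STDIN WITH (FORMAT csv)", buf)
    conn.close()

if __name__ == '__main__':
    with open('dirty_data.csv') as f:
        lines = f.readlines()
    # Divide the import into roughly equal batches, one per worker.
    chunks = [lines[i::NUM_WORKERS] for i in range(NUM_WORKERS)]
    with multiprocessing.Pool(NUM_WORKERS) as pool:
        pool.map(copy_chunk, chunks)

The interesting part of the project is doing this division inside the
server, where the input parsing and the per-batch error handling from (1)
could share infrastructure.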