Re: Parallel copy

Amit Kapila Tue, 18 Feb 2020 02:30:13 -0800

On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi
<horikyota....@gmail.com> wrote:
>
> At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapil...@gmail.com> 
> wrote in
> > On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
> > <andrew.duns...@2ndquadrant.com> wrote:
> > > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <min...@decodable.me>
> > > > wrote:
> > > So why not just forbid parallel copy in CSV
> > > mode, at least for now? I guess it depends on the actual use case. If we
> > > expect to be parallel loading humungous CSVs then that won't fly.
> > >
> >
> > I am not sure about this part.  However, I guess we should at the very
> > least have some extendable solution that can deal with csv, otherwise,
> > we might end up re-designing everything if someday we want to deal
> > with CSV.  One naive idea is that in csv mode, we can set things up
> > slightly differently: a worker won't start processing its chunk
> > until the previous chunk is completely parsed.  So each
> > worker would first parse and tokenize the entire chunk and then start
> > writing it.  So, this will make the reading/parsing part serialized,
> > but writes can still be parallel.  Now, I don't know if it is a good
> > idea to process in a different way for csv mode.
>
> In an extreme case, if we don't see a QUOTE in a chunk, we cannot
> know whether the chunk is in a quoted section or not until all the
> previous chunks are parsed.  After all, we are forced to parse fully
> sequentially as long as we allow QUOTE.
>


Right, I think the benefits of this as compared to the single-reader
idea would be (a) we avoid accessing shared memory for most of the
chunk, and (b) for non-csv mode, even the tokenization (finding line
boundaries) would be parallel.  OTOH, processing csv and non-csv
mode differently might not be good.
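
To make the dependency concrete, here is a minimal sketch (in Python,
not PostgreSQL code) of a per-chunk scan.  It assumes the default CSV
setup where ESCAPE is the same character as QUOTE, so a doubled quote
just toggles the state twice.  scan_chunk and scan_all are made-up
names for illustration only.  The same chunk yields different line
boundaries depending on the quote state handed over from the previous
chunk, which is why the scan has to go in chunk order, while writing
the rows found could still be done by the workers in parallel.

```python
def scan_chunk(chunk, in_quote, quote='"'):
    """Scan one chunk, given the quote state at its first byte.

    Returns (row_ends, in_quote): offsets just past each newline that
    really ends a row, and the quote state at the end of the chunk.
    Assumption: ESCAPE == QUOTE, so '""' toggles twice (no net change).
    """
    row_ends = []
    for i, ch in enumerate(chunk):
        if ch == quote:
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            row_ends.append(i + 1)
    return row_ends, in_quote


def scan_all(chunks):
    """Serialized scan: each chunk needs the end state of the previous one."""
    in_quote = False
    base = 0
    all_ends = []
    for chunk in chunks:
        ends, in_quote = scan_chunk(chunk, in_quote)
        all_ends.extend(base + e for e in ends)  # absolute offsets
        base += len(chunk)
    return all_ends


# The second chunk starts inside a quoted field that contains a newline.
chunks = ['a,"x', '\ny",b\nc,d\n']
guessed = scan_chunk(chunks[1], False)  # wrong guess: "not in a quote"
actual = scan_chunk(chunks[1], True)    # state handed over from chunk 0
print(guessed, actual, scan_all(chunks))
```

With the wrong starting state, the newline inside the quoted field is
taken as a row end and the real row ends are missed.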

> On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
> QUOTE '')" in order to signal that there's no quoted section in the
> file then all chunks would be fully concurrently parsable.
>

Yeah, if we can provide such an option, we can probably make parallel
csv processing equivalent to non-csv.  However, users might not like
this, as I think in some cases it won't be easy for them to tell
whether the file has quoted fields or not.  I am not very sure about
this point.
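
For comparison, a rough sketch (again Python, purely illustrative) of
why the no-quote case parallelizes.  It assumes that with the QUOTE ''
option proposed above no field can contain a newline, so every newline
ends a row.  Each worker can then find its own first row start by
looking only at its chunk and the byte just before it.  The names
rows_starting_in and parallel_split are hypothetical, and threads are
used only to show that the chunks are independent, not to claim any
speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def rows_starting_in(data, start, end):
    """Return the rows whose first byte lies in [start, end).

    Needs no state from other chunks: a row starts at offset 0 or
    right after a newline (assumption: no quoted newlines exist).
    """
    if start > 0 and data[start - 1] != "\n":
        # We landed mid-row; that row belongs to an earlier chunk.
        nl = data.find("\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    else:
        pos = start
    rows = []
    while pos < end and pos < len(data):
        nl = data.find("\n", pos)
        stop = len(data) if nl == -1 else nl
        rows.append(data[pos:stop])  # a row may run past 'end'
        pos = stop + 1
    return rows


def parallel_split(data, chunk_size):
    """Split data into rows, one independent task per fixed-size chunk."""
    ranges = [(s, min(s + chunk_size, len(data)))
              for s in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor() as pool:
        parts = pool.map(lambda r: rows_starting_in(data, r[0], r[1]), ranges)
        return [row for part in parts for row in part]


data = "a,b\nc,d\ne,f\n"
print(parallel_split(data, 5))
```

Whatever the chunk size, each row is picked up by exactly one chunk,
the one in which it starts.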

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
