On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi <horikyota....@gmail.com> wrote:
>
> At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapil...@gmail.com>
> wrote in
> > On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan
> > <andrew.duns...@2ndquadrant.com> wrote:
> > > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <min...@decodable.me>
> > > > wrote:
> > > So why not just forbid parallel copy in CSV
> > > mode, at least for now? I guess it depends on the actual use case. If we
> > > expect to be parallel loading humungous CSVs then that won't fly.
> > >
> >
> > I am not sure about this part. However, I guess we should at the very
> > least have some extendable solution that can deal with csv, otherwise,
> > we might end up re-designing everything if someday we want to deal
> > with CSV. One naive idea is that in csv mode, we can set up the
> > things slightly differently like the worker, won't start processing
> > the chunk unless the previous chunk is completely parsed. So each
> > worker would first parse and tokenize the entire chunk and then start
> > writing it. So, this will make the reading/parsing part serialized,
> > but writes can still be parallel. Now, I don't know if it is a good
> > idea to process in a different way for csv mode.
>
> In an extreme case, if we didn't see a QUOTE in a chunk, we cannot
> know the chunk is in a quoted section or not, until all the past
> chunks are parsed. After all we are forced to parse fully
> sequentially as far as we allow QUOTE.
>

Right, I think the benefits of this as compared to the single-reader
idea would be (a) we can avoid accessing shared memory for most of the
chunk, and (b) for non-csv mode, even the tokenization (finding line
boundaries) would be parallel. OTOH, processing csv and non-csv modes
differently might not be good.

> On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV,
> QUOTE '')" in order to signal that there's no quoted section in the
> file then all chunks would be fully concurrently parsable.
>

Yeah, if we can provide such an option, we can probably make parallel
csv processing equivalent to non-csv. However, users might not like
this, as I think in some cases it won't be easy for them to tell
whether the file has quoted fields or not. I am not very sure of this
point.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
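
To make the quoting problem discussed in this thread concrete, here is a
minimal Python sketch. It is not taken from any patch: the names
line_breaks, record_breaks, DATA and CHUNKS are hypothetical, the chunk
boundaries are arbitrary, and it assumes the escape character is the same
as the quote character, so a doubled quote toggles the state twice and
leaves it unchanged. It shows that a chunk containing no quote character
yields different line boundaries depending on the state inherited from
earlier chunks, and that carrying the state forward chunk by chunk resolves
the ambiguity at the cost of a sequential scan.

```python
def line_breaks(chunk: str, in_quote: bool, quote: str = '"'):
    """Return (offsets of record-ending newlines, quote state at chunk end).

    Assumes ESCAPE == QUOTE, so an escaped quote ("") toggles the state
    twice and has no net effect.
    """
    breaks = []
    for i, ch in enumerate(chunk):
        if ch == quote:
            in_quote = not in_quote
        elif ch == "\n" and not in_quote:
            breaks.append(i)
    return breaks, in_quote


def record_breaks(chunks):
    """Scan chunks in order, handing each one the quote state left by the last."""
    offsets, base, in_quote = [], 0, False
    for chunk in chunks:
        local, in_quote = line_breaks(chunk, in_quote)
        offsets.extend(base + i for i in local)
        base += len(chunk)
    return offsets


# Two records; the first has a quoted field spanning three physical lines.
DATA = '1,"first\nsecond\nthird"\n2,plain\n'
# Arbitrary split points; the middle chunk contains no quote character.
CHUNKS = [DATA[:12], DATA[12:19], DATA[19:]]

# Looked at in isolation, the middle chunk is ambiguous:
print(line_breaks(CHUNKS[1], in_quote=False))  # treats its newline as a record end
print(line_breaks(CHUNKS[1], in_quote=True))   # treats its newline as field data
# Only the sequential scan finds the real record boundaries:
print(record_breaks(CHUNKS))
```

In this sketch only the cheap quote-state scan is serialized. Whether that
is an acceptable trade-off compared with a single reader is exactly the
open question above.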