On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <min...@decodable.me> wrote: > > On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapil...@gmail.com> wrote: > > > > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <min...@decodable.me> wrote: > > > > ... > > > > > > Parsing rows from the raw input (the work done by CopyReadLine()) in a > > > single process would accommodate line returns in quoted fields. I don't > > > think there's a way of getting parallel workers to manage the > > > in-quote/out-of-quote state required. > > > > > > > AFAIU, the whole of this in-quote/out-of-quote state is manged inside > > CopyReadLineText which will be done by each of the parallel workers, > > something on the lines of what Thomas did in his patch [1]. > > Basically, we need to invent a mechanism to allocate chunks to > > individual workers and then the whole processing will be done as we > > are doing now except for special handling for partial tuples which I > > have explained in my previous email. Am, I missing something here? > > > The problem case that I see is the chunk boundary falling in the > middle of a quoted field where > - The quote opens in chunk 1 > - The quote closes in chunk 2 > - There is an EoL character between the start of chunk 2 and the closing > quote > > When the worker processing chunk 2 starts, it believes itself to be in > out-of-quote state, so only data between the start of the chunk and > the EoL is regarded as belonging to the partial line. From that point > on the parsing of the rest of the chunk goes off track. > > Some of the resulting errors can be avoided by, for instance, > requiring a quote to be preceded by a delimiter or EoL. That answer > fails when fields end with EoL characters, which happens often enough > in the wild. > > Recovering from an incorrect in-quote/out-of-quote state assumption at > the start of parsing a chunk just seems like a hole with no bottom. So > it looks to me like it's best done in a single process which can keep > track of that state reliably. >
Good point and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it and if you/anyone else gets some idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process the chunk) then feel free to share your thoughts. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com