On Mon, Dec 18, 2023 at 3:36 PM Daniel Verite <dan...@manitou-mail.org> wrote: > PFA a patch that attempts to fix the bug that \. on a line > by itself is handled incorrectly by COPY FROM ... CSV. > This issue has been discussed several times previously, > for instance in [1] and [2], and mentioned in the > doc for \copy in commit 42d3125.
Those links unfortunately seem not to be entirely specific to this issue. Other, related things seem to be discussed there, and it's not obvious that everyone agrees on what to do, or really that anyone agrees on what to do. The best link that I found for this exact issue is https://www.postgresql.org/message-id/e1tdnvq-0001ju...@wrigleys.postgresql.org but the thread isn't very conclusive and is so old that any conclusions reached then might no longer be considered valid today. And I guess the reason I mention is that, supposing your patch were technically perfect in every way, would everyone agree that it ought to be committed? If Alice is a user with a CSV file that might contain \. on a line by itself within a quoted CSV field, then Alice is currently sad because she can't necessarily load all of her CSV files. The patch would fix that, and make her happy. On the other hand, if Bob is a user with a CSV-ish file that definitely doesn't contain \. on a line by itself within a quoted CSV field but might have been truncated in the middle of a quoted field, maybe Bob will be sad if this patch gets committed, because he will no longer be able to append \n\.\n to whatever junk he's got in the file and let the server sort out whether to throw an error. I have to admit that it seems more likely that there are people in the world with Alice's problem rather than people in the world with Bob's problem. We'd probably make more people happy with the change than sad. But it is a definitional change, I think, and that's a little scary, and maybe somebody will think that's a reason why we should change nothing here. Part of my hesitancy, I suppose, is that I don't understand why we even have this strange convention of making \. terminate the input in the first place -- I mean, why wouldn't that be done in some kind of out-of-band way, rather than including a special marker in the data? I can't help feeling a bit nervous about this first documentation hunk: --- a/doc/src/sgml/ref/copy.sgml +++ b/doc/src/sgml/ref/copy.sgml @@ -761,11 +761,7 @@ COPY <replaceable class="parameter">count</replaceable> format, <literal>\.</literal>, the end-of-data marker, could also appear as a data value. To avoid any misinterpretation, a <literal>\.</literal> data value appearing as a lone entry on a line is automatically - quoted on output, and on input, if quoted, is not interpreted as the - end-of-data marker. If you are loading a file created by another - application that has a single unquoted column and might have a - value of <literal>\.</literal>, you might need to quote that value in the - input file. + quoted on output. </para> <note> It doesn't feel right to me to just replace all of this text with nothing. That leaves us documenting only the behavior on output, whereas the previous text documents both the output behavior (we quote it) and the input behavior (it has to be quoted to avoid being taken as the EOF marker). /* - * In CSV mode, we only recognize \. alone on a line. This is because - * \. is a valid CSV data value. + * In CSV mode, backslash is a normal character. */ - if (c == '\\' && (!cstate->opts.csv_mode || first_char_in_line)) + if (c == '\\' && !cstate->opts.csv_mode) I don't think that the update comment accurately describes the behavior. If I understand correctly what you're trying to fix, I'd expect the new behavior to be that we only recognize \. alone on a line and even then only if we're not inside a quoting string, but that's not what the revised comment says. Instead, it claims that backslash is just a normal character, but if that were true, the whole if-statement wouldn't exist at all, since its purpose is to provide special-handling for backslashes -- and indeed the patch does not change that, since the if-statement is still looking for a backslash and doing something special if one is found. Hmm. Looking at the rest of the patch, it seems like you're removing the logic that prevents us from interpreting \. lksdghksdhgjskdghjs as an end-of-file while in CSV mode. But I would have thought based on what problem you're trying to fix that you would have wanted to keep that logic and further restrict it so that it only applies when not within a quoted string. Maybe I'm misunderstanding what bug you're trying to fix? -- Robert Haas EDB: http://www.enterprisedb.com