On Thu, 11 May 2023 at 10:04, Joel Jacobson <j...@compiler.org> wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead
> to data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).

I think you're thinking about it differently than the parser. I think the
parser is treating this the way, say, the shell treats quotes. That is, it
sees a quoted "I bought this for my 6" followed by an unquoted " laptop but
it didn't fit my 8" followed by a quoted " tablet".

So for example, in that world you might only quote commas and newlines, so
you might print something like

1,2,I bought this for my "6"" laptop " but it "didn't" fit my "8""" laptop

The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only
allows fully quoted or fully unquoted fields. Doublequote characters can
only appear, escaped as double-doublequotes, in quoted fields; no
doublequote characters are allowed in unquoted fields.

But it also says

    Due to lack of a single specification, there are considerable
    differences among implementations. Implementors should "be
    conservative in what you do, be liberal in what you accept from
    others" (RFC 793 [8]) when processing CSV files. An attempt at a
    common definition can be found in Section 2.

So the real question is: are there tools out there that generate entries
like this, and what are their intentions?

> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't
> outweigh the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error, turn
this option on to not get an error". I get what you're saying too; which is
more of a risk depends on whether turning off the error is really the right
thing most of the time or is just causing data to be read incorrectly.

--
greg
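
For illustration, here is a minimal Python sketch of the shell-like reading described above, where every unescaped doublequote just toggles the quoting state. It is not the actual COPY code; the function name split_fields and the sample line (reconstructed from the quoted/unquoted segments listed above) are assumptions made for this example.

```python
def split_fields(line, delim=",", quote='"'):
    """Split one CSV line, toggling quote state at every quote character.

    Hypothetical helper for illustration only; it mimics the shell-like
    interpretation, not any real parser's implementation.
    """
    fields, buf, in_quote = [], [], False
    i = 0
    while i < len(line):
        c = line[i]
        if in_quote:
            if c == quote and line[i + 1:i + 2] == quote:
                buf.append(quote)   # doubled quote inside quotes: literal quote
                i += 1              # skip the second quote of the pair
            elif c == quote:
                in_quote = False    # closing quote: back to unquoted text
            else:
                buf.append(c)
        elif c == quote:
            in_quote = True         # a quote anywhere in the field opens quoting
        elif c == delim:
            fields.append("".join(buf))
            buf = []
        else:
            buf.append(c)
        i += 1
    fields.append("".join(buf))
    return fields


# Sample line reconstructed from the segments described above (an assumption).
row = split_fields("2,3,\"I bought this for my 6\" laptop but it didn't fit my 8\" tablet\"")
print(row[2])
```

Under this reading the inch marks are consumed as quoting characters and silently disappear from the stored value, which is the data-integrity concern raised in the quoted text.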