Re: Should CSV parsing be stricter about mid-field quotes?

Greg Stark Fri, 12 May 2023 11:59:19 -0700

On Thu, 11 May 2023 at 10:04, Joel Jacobson <j...@compiler.org> wrote:
>
> The parser currently accepts quoting within an unquoted field. This can lead
> to data misinterpretation when the quote is part of the field data (e.g.,
> for inches, like in the example).


I think you're thinking about it differently than the parser. I think
the parser is treating this the way, say, the shell treats quotes.
That is, it sees a quoted "I bought this for my 6" followed by an
unquoted "a laptop but it didn't fit my 8" followed by a quoted "
tablet".

So for example, in that world you might only quote commas and newlines
so you might print something like

1,2,I bought this for my "6"" laptop
" but it "didn't" fit my "8""" laptop
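
To make the shell-like reading concrete, here is a minimal sketch in Python. It is not the actual COPY parser code; the function name split_shell_style and its parameters are made up for illustration, and it assumes a single record is passed in as one string.

```python
def split_shell_style(record, delimiter=",", quote='"'):
    """Split one record, letting quoted and unquoted runs mix within a field.

    Hypothetical illustration only; not how any real parser is implemented.
    """
    fields, current, in_quotes = [], [], False
    i = 0
    while i < len(record):
        ch = record[i]
        if in_quotes:
            if ch == quote:
                if i + 1 < len(record) and record[i + 1] == quote:
                    current.append(quote)  # doubled quote is a literal quote
                    i += 1
                else:
                    in_quotes = False  # quoted run ends, the field continues
            else:
                current.append(ch)  # delimiters and newlines are plain data here
        elif ch == quote:
            in_quotes = True  # a quoted run may start anywhere in the field
        elif ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields
```

Fed the two-line entry above as a single record, this sketch returns three fields, and the third contains the literal inch marks and the embedded newline with all the quoting stripped.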

The actual CSV spec https://datatracker.ietf.org/doc/html/rfc4180 only
allows fully quoted or fully unquoted fields and there can only be
escaped double-doublequote characters in quoted fields and no
doublequote characters in unquoted fields.
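
For comparison, a rough sketch of that stricter rule applied to one raw field. The helper name is_rfc4180_field is hypothetical, and the pattern is my reading of the grammar in Section 2, not anything taken from an existing implementation.

```python
import re

# Either a fully quoted field whose inner quotes are all doubled,
# or an unquoted field with no quote, comma, CR or LF at all.
_RFC4180_FIELD = re.compile(r'"(?:[^"]|"")*"|[^",\r\n]*')


def is_rfc4180_field(raw):
    """Return True if the raw field text fits the strict reading of RFC 4180."""
    return _RFC4180_FIELD.fullmatch(raw) is not None
```

Under this check a field like `"6"" laptop"` passes, while `my "6"" laptop` is rejected because the quotes begin mid-field.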

But it also says

      Due to lack of a single specification, there are considerable
      differences among implementations.  Implementors should "be
      conservative in what you do, be liberal in what you accept from
      others" (RFC 793 [8]) when processing CSV files.  An attempt at a
      common definition can be found in Section 2.


So the real question is: are there tools out there that generate
entries like this, and what are their intentions?

> I think we should throw a parsing error for unescaped mid-field quotes,
> and add a COPY option like ALLOW_MIDFIELD_QUOTES for cases where mid-field
> quotes are necessary. The error message could suggest this option when it
> encounters an unescaped mid-field quote.
>
> I think the convenience of not having to use an extra option doesn't outweigh
> the risk of undetected data integrity issues.

It's also a pretty annoying experience to get a message saying "error,
turn this option on to not get an error". I get what you're saying
too. Which is the bigger risk depends on whether turning off the error
is really the right thing most of the time or is just causing data to
be read incorrectly.



-- 
greg
