[ https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150860#comment-16150860 ]
Luke Hutchison commented on FLINK-6016:
---------------------------------------

Yes, that's what I'm suggesting. The data doesn't have to be read twice; it can be emitted in the first pass, but the efficiency of doing so depends on the bandwidth between the single reading thread and the worker threads for each shard.

A more scalable, though more complex, approach would be to build a state machine for each shard, recording the parser state at each input character, and then "run off the end" of each shard boundary until the state of the parser from the previous shard matches the state of the parser for the next shard at the same character position. The "overrun" parser state overwrites the next shard's recorded parser state until the two states match. Line breaks are then determined by finding the state marker for an unquoted newline.

> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
>
> The RFC for the CSV format specifies that newlines are valid in quoted strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING 
> Expect field types: class java.lang.String, class java.lang.String 
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
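
To illustrate the per-shard state machine idea from the comment above, here is a minimal Python sketch. It is not Flink code. The four-state machine, its lenient treatment of stray quotes, and the names `step`, `scan`, and `find_line_breaks` are all assumptions made for this example. Each shard is first scanned as if it started at a field boundary. The parser state carried over from the previous shard then overruns into the next shard, overwriting recorded states until the two agree. Agreement is not guaranteed, so in the worst case the overrun covers the whole shard.

```python
# Hypothetical parser states for this sketch (not taken from Flink).
FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)


def step(state, ch):
    """Return the parser state after consuming one character."""
    if state == QUOTED:
        # Inside quotes only a quote character changes the state.
        return QUOTE_IN_QUOTED if ch == '"' else QUOTED
    if state == QUOTE_IN_QUOTED and ch == '"':
        return QUOTED  # doubled quote: an escaped quote inside the field
    if ch == "," or ch == "\n":
        return FIELD_START  # unquoted delimiter ends the field
    if state == FIELD_START and ch == '"':
        return QUOTED  # opening quote
    return UNQUOTED  # assumption: stray quotes are treated as literals


def scan(shard, start_state):
    """Record the state after every character of a shard."""
    states = []
    state = start_state
    for ch in shard:
        state = step(state, ch)
        states.append(state)
    return states


def find_line_breaks(text, shard_size):
    """Return the offsets of newlines that are not inside quoted fields."""
    shards = [text[i:i + shard_size] for i in range(0, len(text), shard_size)]

    # Pass 1: speculative scan of every shard, guessing FIELD_START as the
    # start state. The shards are independent, so this is the part that
    # could run in parallel; it is sequential here for simplicity.
    recorded = [scan(shard, FIELD_START) for shard in shards]

    # Pass 2: carry the true end state of each shard into the next one and
    # overwrite recorded states until the overrun agrees with the recording.
    carry = FIELD_START
    for shard, states in zip(shards, recorded):
        state = carry
        for i, ch in enumerate(shard):
            state = step(state, ch)
            if state == states[i]:
                break  # the rest of the recording is already correct
            states[i] = state
        carry = states[-1]

    # A newline is a line break when consuming it leaves the parser at
    # FIELD_START; a newline inside quotes leaves it in QUOTED.
    breaks = []
    offset = 0
    for shard, states in zip(shards, recorded):
        for i, ch in enumerate(shard):
            if ch == "\n" and states[i] == FIELD_START:
                breaks.append(offset + i)
        offset += len(shard)
    return breaks


# The example from the issue, followed by a second record.
print(find_line_breaks('"3\n4",5\n6,7\n', 3))  # [7, 11]
```

The newline at offset 2 sits inside the quoted field and is not reported. With a shard size of 3, the second shard is first scanned with the wrong start state, and the overrun from the first shard corrects two recorded states before both parsers agree at the comma.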