[ 
https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150860#comment-16150860
 ] 

Luke Hutchison commented on FLINK-6016:
---------------------------------------

Yes, that's what I'm suggesting. The data doesn't have to be read twice: it can 
be emitted in the first pass. How efficient that is depends on the bandwidth 
between the single reading thread and the worker threads for each shard.

A more scalable, though more complex, approach would be to run a parser state 
machine over each shard independently (each starting from an assumed initial 
state), recording the parser state at each input character. The parser for 
each shard then "runs off the end" of its shard boundary into the next shard, 
overwriting the next shard's recorded states with its own, until its state 
matches the state recorded by the next shard's parser at the same character 
position. Once the states agree, the line breaks are the positions where the 
recorded state marks an unquoted newline.
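
To make the overrun idea concrete, here is a minimal single-threaded sketch. 
It is not Flink code: the class name ShardedCsvSplitter, the method 
findRecordBreaks and the fixed shardSize parameter are hypothetical, and the 
parser state is reduced to one flag ("inside quotes") that flips on every 
double quote, which also handles the RFC 4180 "" escape. Each shard is 
assumed to start outside quotes; the first loop is the part that could run on 
separate workers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Flink's CSV input format.
class ShardedCsvSplitter {

    /** Returns the offsets of newlines that lie outside quoted fields. */
    static List<Integer> findRecordBreaks(String data, int shardSize) {
        int n = data.length();
        // inQuotes[i] is the parser state after consuming character i.
        boolean[] inQuotes = new boolean[n];

        // Pass 1 (parallelizable per shard): every shard guesses that it
        // starts outside quotes and records its state at each character.
        for (int start = 0; start < n; start += shardSize) {
            boolean state = false;
            int end = Math.min(start + shardSize, n);
            for (int i = start; i < end; i++) {
                if (data.charAt(i) == '"') {
                    state = !state;
                }
                inQuotes[i] = state;
            }
        }

        // Pass 2: the parser of the previous shard runs past the boundary,
        // overwriting the recorded states until both parsers agree.
        for (int start = shardSize; start < n; start += shardSize) {
            boolean state = inQuotes[start - 1];
            for (int i = start; i < n; i++) {
                if (data.charAt(i) == '"') {
                    state = !state;
                }
                if (inQuotes[i] == state) {
                    break; // states match: the rest is already correct
                }
                inQuotes[i] = state;
            }
        }

        // Pass 3: a newline ends a record only if the state is "unquoted".
        List<Integer> breaks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (data.charAt(i) == '\n' && !inQuotes[i]) {
                breaks.add(i);
            }
        }
        return breaks;
    }
}
```

For the input from the issue description followed by a second record, 
findRecordBreaks("\"3\n4\",5\na,b\n", 3) skips the newline inside "3\n4" and 
reports only the two record-ending newlines. One caveat of this simplified 
two-state parser: a wrong guess cannot correct itself inside a shard, so the 
overrun continues at least to the next shard boundary.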

> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
>
> The RFC for the CSV format specifies that newlines are valid in quoted 
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING 
> Expect field types: class java.lang.String, class java.lang.String 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)