[ 
https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487151#comment-16487151
 ] 

Luke Hutchison commented on FLINK-6016:
---------------------------------------

[~fhueske] reading a file in parallel is not faster on most filesystems, 
storage devices, and operating systems. In fact, on a device with high seek 
latency, such as an HDD, reading from several threads in parallel will 
actually increase the total read time, potentially dramatically, because the 
competing reads force extra seeks. Reading a file from multiple threads can 
only be truly fast if the entire file is already cached in RAM.

I suggest simply reading the file serially, and emitting lines to a collection 
that can then be read in parallel by multiple mappers.
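
To make the suggestion concrete, here is a minimal sketch in plain Java (not 
Flink code). One thread reads the input serially and hands lines to a bounded 
queue, and several worker threads apply a map function to them. The class 
name SerialReadParallelMap, the method run, the queue size, and the sentinel 
value are all placeholders chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

// Hypothetical helper: a single serial reader feeding several mapper threads.
public class SerialReadParallelMap {
    // Unique sentinel object telling a worker that no more lines will arrive.
    private static final String POISON = new String("<EOF>");

    public static <R> List<R> run(Reader source, int mappers, Function<String, R> mapFn)
            throws IOException, InterruptedException {
        // Bounded queue: the reader blocks when the mappers fall behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ConcurrentLinkedQueue<R> results = new ConcurrentLinkedQueue<>();

        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < mappers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        String line = queue.take();
                        if (line == POISON) {
                            return; // identity comparison against the sentinel
                        }
                        results.add(mapFn.apply(line));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }

        // The only code that touches the source: one sequential pass.
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                queue.put(line);
            }
        } finally {
            // One sentinel per worker so that every worker terminates.
            for (int i = 0; i < mappers; i++) {
                queue.put(POISON);
            }
        }

        for (Thread t : workers) {
            t.join();
        }
        return new ArrayList<>(results);
    }
}
```

For example, run(new FileReader(path), 4, String::length) would read the file 
once, front to back, and compute line lengths on four threads. The results 
come back in no particular order, and the sketch assumes that mapFn does not 
throw. For CSV with quoted newlines, the serial reader would have to emit 
whole records rather than raw lines.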

> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
>            Priority: Major
>
> The RFC for the CSV format specifies that newlines are valid in quoted 
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING 
> Expect field types: class java.lang.String, class java.lang.String 
> {noformat}
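
For illustration only, here is a minimal sketch of record splitting that 
treats line breaks inside double quotes as data, as RFC 4180 allows. It is 
plain Java and not Flink's actual parser. The class name QuotedRecordSplitter 
and the method readRecords are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits CSV text into records, not into physical lines.
public class QuotedRecordSplitter {

    public static List<String> readRecords(Reader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (ch == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
                current.append(ch);
            } else if (ch == '\n' && !inQuotes) {
                // Record boundary: drop the CR of a CRLF line ending, if present.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a line break inside a quoted field.
                current.append(ch);
            }
        }
        // Final record when the input does not end with a line break.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

With the input from the description above, this returns a single record 
containing the embedded newline instead of failing on the first physical 
line. The sketch reads one character at a time, so a real caller would wrap 
the source in a BufferedReader. Splitting each record into fields is left 
out.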



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
