[ https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487151#comment-16487151 ]
Luke Hutchison commented on FLINK-6016:
---------------------------------------

[~fhueske] reading a file in parallel is not faster for most filesystems and most storage devices on most operating systems. In fact, for a large-latency seek device, such as an HDD, reading from several threads in parallel will actually increase the total read time, potentially dramatically. The only way reading a file in parallel can be truly fast from multiple threads is if the entire file is already cached in RAM.

I suggest simply reading the file serially, and emitting lines to a collection that can then be read in parallel by multiple mappers.

> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
>            Priority: Major
>
> The RFC for the CSV format specifies that newlines are valid in quoted
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING
> Expect field types: class java.lang.String, class java.lang.String
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
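
For illustration only, here is a minimal sketch of the approach suggested in the comment: read the input serially, split it into records while treating newlines inside double quotes as part of the field (as RFC 4180 allows), and only then hand the records to parallel workers. This is plain Java, not Flink's CSV input format; the class name `QuotedCsvRecords` and the method `readRecords` are hypothetical, and a Java parallel stream stands in for "multiple mappers".

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Flink.
class QuotedCsvRecords {

    /**
     * Reads the input serially and splits it into records.
     * A newline ends a record only when it occurs outside double quotes.
     */
    static List<String> readRecords(Reader in) throws IOException {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        int c;
        while ((c = in.read()) != -1) {
            char ch = (char) c;
            if (ch == '"') {
                // An escaped quote ("") toggles the flag twice,
                // so the quoted state is unchanged afterwards.
                inQuotes = !inQuotes;
                current.append(ch);
            } else if (ch == '\n' && !inQuotes) {
                // Record boundary: drop a trailing '\r' from CRLF line endings.
                int len = current.length();
                if (len > 0 && current.charAt(len - 1) == '\r') {
                    current.setLength(len - 1);
                }
                records.add(current.toString());
                current.setLength(0);
            } else {
                // Ordinary character, or a newline inside a quoted field.
                current.append(ch);
            }
        }
        // Keep a final record that has no terminating newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // The example from the issue, followed by an ordinary record.
        String csv = "\"3\n4\",5\n\"a\",\"b\"\n";
        List<String> records = readRecords(new StringReader(csv));

        // Serial read is done; per-record work can now run in parallel.
        List<Integer> lengths = records.parallelStream()
                .map(String::length)
                .collect(Collectors.toList());
        System.out.println(records.size() + " records, lengths " + lengths);
    }
}
```

The sketch only finds record boundaries; splitting each record into fields and unescaping quotes would happen in the parallel stage. It also assumes the whole record list fits in memory, which a real input format would avoid by streaming.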