[ https://issues.apache.org/jira/browse/FLINK-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949378#comment-16949378 ]
Jingsong Lee edited comment on FLINK-14266 at 10/11/19 11:27 AM:
-----------------------------------------------------------------

Thanks [~fhueske], I think there are two choices:
 # Extend DelimitedInputFormat and use CsvRowDeserializationSchema to deserialize bytes with offset and numBytes; we need to deal with selectedFields too. DelimitedInputFormat already has the split logic to deal with half-lines. But as Fabian said, we do not know whether the next new-line character is a record delimiter or is contained in a string field.
 # Use Jackson's ObjectReader.readValues(InputStream). The difficulties are:
 ## ObjectReader does not know the current read offset, because it has a buffer that caches more bytes. But we need to stop in the right place when reading a FileSplit. One solution is to use a BoundedInputStream, but we still need to read the unfinished line, so we need to modify splitLength first to find the correct end position based on the line delimiter and escapeChar.
 ## We also need to correctly determine the line separator when we start reading a FileSplit whose start offset is in the middle of the file. If the first char is a line separator, the character before it may be an escape character. We need to deal with these things carefully.

was (Author: lzljs3620320):
Thanks [~fhueske], I think there are two choices:
 # Extend DelimitedInputFormat and use CsvRowDeserializationSchema to deserialize bytes with offset and numBytes; we need to deal with selectedFields too. DelimitedInputFormat already has the split logic to deal with half-lines. But as Fabian said, we do not know whether the next new-line character is a record delimiter or is contained in a string field.
 # Use Jackson's ObjectReader.readValues(InputStream). The difficulties are:
 ## ObjectReader does not know the current read offset, because it has a buffer that caches more bytes. One solution is to use a BoundedInputStream, but we still need to read the unfinished line, so we need to modify splitLength first to find the correct end position based on the line delimiter and escapeChar.
 ## We also need to correctly determine the line separator when we start reading. If the first char is a line separator, the character before it may be an escape character. We need to deal with these things carefully.

> Introduce RowCsvInputFormat to new CSV module
> ---------------------------------------------
>
>                 Key: FLINK-14266
>                 URL: https://issues.apache.org/jira/browse/FLINK-14266
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Connectors / FileSystem
>            Reporter: Jingsong Lee
>            Assignee: Jingsong Lee
>            Priority: Major
>             Fix For: 1.10.0
>
>
> Currently we have an old CSV format, but it does not provide standard CSV support. We should
> support the RFC-compliant CSV format for table/sql.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
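
As a rough illustration of the split-end problem raised in the comment above (finding the correct end position from the line delimiter and escapeChar), here is a minimal Java sketch. It scans from a known record start and tracks quote and escape state, so a new-line inside a quoted field is not mistaken for a record delimiter. The class CsvSplitSketch and its methods findRecordEnd and adjustSplitEnd are hypothetical names for illustration only; they are not part of Flink or Jackson. The sketch assumes a single-byte encoding, '\n' as the record delimiter, an escape character that differs from the quote character, and the convention that a record belongs to the split in which it starts.

```java
/** Hypothetical helper for illustration; not part of Flink or Jackson. */
final class CsvSplitSketch {

    /**
     * Returns the index just past the new-line that ends the record starting at
     * recordStart, or data.length if the record runs to the end of the data.
     * Assumes recordStart is a true record boundary, so the scan starts outside quotes.
     */
    static int findRecordEnd(byte[] data, int recordStart, char quote, char escape) {
        boolean inQuotes = false;
        for (int i = recordStart; i < data.length; i++) {
            char c = (char) (data[i] & 0xFF);
            if (c == escape && i + 1 < data.length) {
                i++; // skip the escaped character, whatever it is
            } else if (c == quote) {
                inQuotes = !inQuotes; // entering or leaving a quoted field
            } else if (c == '\n' && !inQuotes) {
                return i + 1; // unquoted, unescaped new-line: record delimiter
            }
        }
        return data.length;
    }

    /**
     * Extends a nominal split end to the end of the last record that starts
     * before it, walking record by record from a known record start.
     */
    static int adjustSplitEnd(byte[] data, int firstRecordStart, int splitEnd,
                              char quote, char escape) {
        int pos = firstRecordStart;
        while (pos < splitEnd && pos < data.length) {
            pos = findRecordEnd(data, pos, quote, escape);
        }
        return pos;
    }
}
```

This only covers the easy direction, where the scan begins at a position known to be a record boundary. It does not solve the second difficulty in the comment: a split whose start offset falls in the middle of the file, where the quote state at the first byte is unknown.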