[GitHub] flink issue #2060: [FLINK-3921] StringParser encoding

greghogan Tue, 20 Sep 2016 06:06:41 -0700

Github user greghogan commented on the issue:

    https://github.com/apache/flink/pull/2060
  
    Apologies for the long delay. I'd like to attempt to summarize this ticket 
and pull request to validate my understanding.
    
    Previously `StringParser` was using the system encoding and 
`GenericCsvInputFormat` was using UTF-8 for the delimiter and a UTF-8 default 
for the comment prefix that could be overridden through an overloaded setter.
    
    StringParser's quoteCharacter remains a `byte` with no encoding.
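    
    To illustrate why a single-byte quote character matters once the charset 
is configurable, here is a minimal sketch. It is not Flink's parser logic; 
`QuoteByteSketch` and `startsWithQuoteByte` are hypothetical names used only 
for this example. It checks whether the first encoded byte of a quoted field 
equals the quote byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuoteByteSketch {

    /** Returns true if the first encoded byte of the text equals the quote byte. */
    static boolean startsWithQuoteByte(String text, Charset charset, byte quote) {
        byte[] bytes = text.getBytes(charset);
        return bytes.length > 0 && bytes[0] == quote;
    }

    public static void main(String[] args) {
        byte quote = (byte) '"'; // 0x22, compared byte-wise with no charset applied
        String field = "\"a\"";

        // UTF-8 encodes '"' as the single byte 0x22, so the comparison matches.
        System.out.println(startsWithQuoteByte(field, StandardCharsets.UTF_8, quote));    // true
        // UTF-16BE encodes '"' as 0x00 0x22, so the first byte is 0x00.
        System.out.println(startsWithQuoteByte(field, StandardCharsets.UTF_16BE, quote)); // false
    }
}
```

    With a multi-byte encoding, a byte-wise quote comparison can miss the 
quote entirely or match only part of a character.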
    
    Now `GenericCsvInputFormat` can be configured with a charset which is used 
for the delimiter, comment prefix, and field parsers (of which only 
`StringParser` makes use of it).
    
    Should `setCommentPrefix(String commentPrefix, Charset charset)` and 
`setCommentPrefix(String commentPrefix, String charsetName)` be removed from 
`GenericCsvInputFormat`? Would different encodings be used on the same file?
    
    Should the user also be able to set the character encoding in `CsvReader`, 
which would then be applied in `CsvReader.configureInputFormat`?
    
    Are the new tests checking the encoding? The test strings use only 
characters that encode identically in UTF-8 and ASCII. We could instead use one 
of the UTF-16 encodings from 
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
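    
    As a rough sketch of the concern about the test data (the class and 
method names here are hypothetical, not part of the pull request), the helper 
below reports whether a string encodes to different bytes under two charsets. 
A test string only exercises the configured encoding if this returns true for 
the charsets being compared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingTestData {

    /** Returns true if the string encodes to different bytes under the two charsets. */
    static boolean distinguishes(String s, Charset a, Charset b) {
        return !Arrays.equals(s.getBytes(a), s.getBytes(b));
    }

    public static void main(String[] args) {
        String ascii = "abc|def";

        // ASCII-only text has the same bytes in US-ASCII and UTF-8,
        // so it cannot reveal which of the two the parser used.
        System.out.println(distinguishes(ascii, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));  // false

        // The same text is two bytes per character in UTF-16LE,
        // so a parser that ignores the configured charset would fail.
        System.out.println(distinguishes(ascii, StandardCharsets.UTF_8, StandardCharsets.UTF_16LE)); // true
    }
}
```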



