Please move this discussion to either the PR or the Jira issue. There's little value in
spreading discussions over several channels; any insight raised here
should also be visible in the PR.
On 28.09.2018 07:07, x1q1j1 wrote:
Dear everyone,
I would like to discuss this Jira issue with everyone so that we can handle it better. Here are some of my thoughts:
1. Where should the BOM be read? I think that when a split starts at the
beginning of the file, we still need to add logic for processing the BOM,
and add a variable to that logic which records the file's BOM encoding.
For example, this could go in the function createinputsplit.
2. We can use the variable recorded above to determine which encoding the
file has (UTF-8 with BOM, UTF-16 with BOM, or UTF-32 with BOM), and choose
the byte step size according to the encoding type when handling the end of
each line. I found that the previous bug is actually an encoding problem:
the end of each line record is handled improperly. To address this problem,
I did the following work:
    String utf8 = "UTF-8";
    String utf16 = "UTF-16";
    String utf32 = "UTF-32";
    int stepSize = 0;
    String charsetName = this.getCharsetName();
    if (charsetName.contains(utf8)) {
        stepSize = 1;
    } else if (charsetName.contains(utf16)) {
        stepSize = 2;
    } else if (charsetName.contains(utf32)) {
        stepSize = 4;
    }
    // Check if \n is used as delimiter and the end of this line is a \r,
    // then remove \r from the line
    if (this.getDelimiter() != null && this.getDelimiter().length == 1
            && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
            && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
        numBytes -= stepSize;
    }
    numBytes = numBytes - stepSize + 1;
    return new String(bytes, offset, numBytes, this.getCharsetName());
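To make the two points above easier to discuss, here is a rough standalone sketch of the idea. It is not the code from the PR. The names BomLineDecoder, detectBom and decodeRecord are hypothetical and only used for illustration. The sketch assumes that the record bytes passed in contain whole code units and no delimiter, and that the JVM provides the UTF-32BE and UTF-32LE charsets (common in OpenJDK, but not guaranteed by the Java specification). Instead of a fixed step size, it compares the tail of the record with the encoded form of \r, which also covers little-endian encodings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Flink or the PR.
final class BomLineDecoder {

    /** Charset implied by the BOM, plus the number of BOM bytes to skip. */
    static final class Bom {
        final Charset charset;
        final int length;

        Bom(Charset charset, int length) {
            this.charset = charset;
            this.length = length;
        }
    }

    /** Inspects the first bytes of a file; returns the fallback charset if no BOM is found. */
    static Bom detectBom(byte[] head, Charset fallback) {
        // UTF-32 is checked before UTF-16 because FF FE 00 00 also starts with
        // the UTF-16LE BOM (FF FE). A UTF-16LE file that begins with a NUL
        // character would be misdetected; this sketch accepts that ambiguity.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return new Bom(Charset.forName("UTF-32BE"), 4);
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return new Bom(Charset.forName("UTF-32LE"), 4);
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return new Bom(StandardCharsets.UTF_8, 3);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return new Bom(StandardCharsets.UTF_16BE, 2);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return new Bom(StandardCharsets.UTF_16LE, 2);
        }
        return new Bom(fallback, 0);
    }

    /**
     * Decodes one record and drops a trailing carriage return, if present.
     * The charset must be endianness-specific (e.g. UTF-16LE, not UTF-16),
     * otherwise encoding "\r" would itself emit a BOM.
     */
    static String decodeRecord(byte[] bytes, int offset, int numBytes, Charset charset) {
        byte[] cr = "\r".getBytes(charset);
        if (numBytes >= cr.length && endsWith(bytes, offset + numBytes, cr)) {
            numBytes -= cr.length;
        }
        return new String(bytes, offset, numBytes, charset);
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Checks whether the bytes ending (exclusively) at index 'end' match 'suffix'.
    private static boolean endsWith(byte[] b, int end, byte[] suffix) {
        for (int i = 0; i < suffix.length; i++) {
            if (b[end - suffix.length + i] != suffix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // UTF-16LE BOM followed by "ab\r"
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 'a', 0, 'b', 0, '\r', 0};
        Bom bom = detectBom(file, StandardCharsets.UTF_8);
        String line = decodeRecord(file, bom.length, file.length - bom.length, bom.charset);
        System.out.println(bom.charset + " -> [" + line + "]");
    }
}
```

In this sketch the BOM is detected once per file and its length is skipped before the first record, so later records can be decoded with the recorded charset without looking at the BOM again.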
If my description is still unclear, you can see the detailed code
implementation in the PR I submitted.
Here is the link to the PR: https://github.com/apache/flink/pull/6710
Here is the link to the Jira issue: https://issues.apache.org/jira/browse/FLINK-10134
Looking forward to your reply.
Best wishes,
qianjinxu