Dear everyone, I would like to discuss this Jira with you so we can handle this matter better. Here are some of my thoughts:
1. Where should the BOM be read? I think that when a split starts at the beginning of the file, we still need logic to process the BOM. We can read the BOM there and record the detected encoding in a variable, for example in the createInputSplit function.

2. We can then use that variable to determine which encoding the file uses (UTF-8 with BOM, UTF-16 with BOM, or UTF-32 with BOM), and use the encoding type to control how many bytes make up the delimiter at the end of each line. I found that the previous bug is actually an encoding problem: the end of each line of records was handled improperly. To address this, I did the following:

    String utf8 = "UTF-8";
    String utf16 = "UTF-16";
    String utf32 = "UTF-32";
    int stepSize = 0;
    String charsetName = this.getCharsetName();
    if (charsetName.contains(utf8)) {
        stepSize = 1;
    } else if (charsetName.contains(utf16)) {
        stepSize = 2;
    } else if (charsetName.contains(utf32)) {
        stepSize = 4;
    }
    // If \n is used as the delimiter and this line ends with \r, remove the \r from the line
    if (this.getDelimiter() != null && this.getDelimiter().length == 1
            && this.getDelimiter()[0] == NEW_LINE
            && offset + numBytes >= stepSize
            && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
        numBytes -= stepSize;
    }
    numBytes = numBytes - stepSize + 1;
    return new String(bytes, offset, numBytes, this.getCharsetName());

If it is still unclear what I am describing, you can see the detailed implementation in the PR I submitted.

Here is the link to the PR: https://github.com/apache/flink/pull/6710
Here is the link to the Jira: https://issues.apache.org/jira/browse/FLINK-10134

Looking forward to your reply.

Best wishes,
qianjinxu
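P.S. To make point 1 more concrete, here is a minimal sketch of the kind of BOM detection I have in mind. This is not the exact code from the PR; the class and method names (`BomDetector`, `detectBomCharset`) are hypothetical, and it only shows the standard Unicode BOM byte sequences mapped to a charset name:

```java
public class BomDetector {
    /**
     * Inspects the leading bytes of a file and returns the charset name
     * implied by the byte order mark, or null if no BOM is present.
     * UTF-32 must be checked before UTF-16, because the UTF-32LE BOM
     * (FF FE 00 00) starts with the UTF-16LE BOM (FF FE).
     */
    public static String detectBomCharset(byte[] b) {
        if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
            return "UTF-32BE"; // 00 00 FE FF
        }
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0x00 && b[3] == 0x00) {
            return "UTF-32LE"; // FF FE 00 00
        }
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8"; // EF BB BF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE"; // FE FF
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // FF FE
        }
        return null; // no BOM detected
    }
}
```

The returned name could then be stored in the variable described in point 1 and used to pick the stepSize in point 2.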