Dear everyone,

        I'd like to discuss this JIRA issue with everyone so that we can handle it better. Here are some of my thoughts:


        1. Where should the BOM be read? I think that when reading starts at the
beginning of the file, we still need to add logic for processing the BOM:
read the BOM, and store the detected encoding in a variable so that the
file's BOM encoding is recorded. For example, this could go in the function
createInputSplit.
        2. We can then use that variable to determine which case we are in
(UTF-8 with BOM, UTF-16 with BOM, or UTF-32 with BOM), and use the encoding
type to control the byte step size when handling the end of each line. I
found that the previous bug is actually an encoding problem: the end of each
line of the record was handled improperly. To address this, I did the
following work:



String utf8 = "UTF-8";
String utf16 = "UTF-16";
String utf32 = "UTF-32";
int stepSize = 0;

String charsetName = this.getCharsetName();
if (charsetName.contains(utf8)) {
    stepSize = 1;
} else if (charsetName.contains(utf16)) {
    stepSize = 2;
} else if (charsetName.contains(utf32)) {
    stepSize = 4;
}

// Check if \n is used as the delimiter and the end of this line is a \r;
// if so, remove the \r from the line.
if (this.getDelimiter() != null && this.getDelimiter().length == 1
        && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
        && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
    numBytes -= stepSize;
}

numBytes = numBytes - stepSize + 1;
return new String(bytes, offset, numBytes, this.getCharsetName());
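To illustrate why the step size matters, here is a small self-contained sketch of the same trimming idea (the class and method names are my own illustration, not code from the PR, and the multi-byte case assumes little-endian byte order, where 0x0D is the first byte of '\r'):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class StepSizeTrimDemo {
    static final byte CARRIAGE_RETURN = (byte) '\r';

    // Bytes per code unit for the configured charset (same idea as stepSize above).
    static int stepSize(String charsetName) {
        if (charsetName.contains("UTF-32")) return 4;
        if (charsetName.contains("UTF-16")) return 2;
        return 1; // UTF-8 and single-byte charsets
    }

    // Trim a trailing '\r' from a record, stepping by whole code units so a
    // multi-byte character is never split.
    static String trimmedRecord(byte[] bytes, int offset, int numBytes, String charsetName) {
        int step = stepSize(charsetName);
        if (numBytes >= step && bytes[offset + numBytes - step] == CARRIAGE_RETURN) {
            numBytes -= step;
        }
        return new String(bytes, offset, numBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // In UTF-16LE, '\r' occupies two bytes (0x0D 0x00); trimming a single
        // byte would split the last code unit, so we trim by stepSize instead.
        byte[] line = "hello\r".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(trimmedRecord(line, 0, line.length, "UTF-16LE")); // hello
    }
}
```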




        


   If it is still unclear what I am describing, you can see the detailed
code implementation in the PR I submitted.
Here is the link to the PR: https://github.com/apache/flink/pull/6710
Here is the link to the JIRA issue: https://issues.apache.org/jira/browse/FLINK-10134
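
For point 1, here is a minimal sketch of what detecting the BOM up front could look like (the class and method are my own illustration, not the code in the PR):

```java
public class BomDetector {
    // Returns the charset indicated by a leading byte order mark, or null if
    // the buffer does not start with a known BOM. UTF-32 must be checked
    // before UTF-16, because the UTF-32LE BOM (FF FE 00 00) begins with the
    // UTF-16LE BOM (FF FE).
    static String detectBomCharset(byte[] b) {
        if (b.length >= 4 && b[0] == 0x00 && b[1] == 0x00
                && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
            return "UTF-32BE";
        }
        if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
                && b[2] == 0x00 && b[3] == 0x00) {
            return "UTF-32LE";
        }
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8Bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, (byte) 'a'};
        System.out.println(detectBomCharset(utf8Bom)); // UTF-8
    }
}
```

The detected name could then be recorded in a variable when the file is opened and consulted later when deciding the per-line byte step size.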



    Looking forward to your reply.


        
Best wishes,
qianjinxu
