On Wed, 6 Aug 2025 10:52:00 GMT, Volkan Yazici <vyaz...@openjdk.org> wrote:
>> I would assume your "double char" actually means the "surrogate pair"?
>>
>> I believe for the first pass of scanning you might want to skip the
>> "surrogate", as a single dangling surrogate char should trigger a
>> "malformed" error, instead of "unmappable", if the charset is implemented to
>> handle supplementary character.
>>
>>     for (char c = 0xFF; c < 0xFFFF; c++) {
>>         if (Character.isSurrogate(c))
>>             continue;
>>         if (!encoder.canEncode(c))
>>             return new char[]{c};
>>     }
>>
>> And for the second pass for the "surrogates", I think we can just pick any
>> non-bmp panel, which should always be translated into a surrogate pair and
>> check if the charset can map/encode it, if not, it's our candidate.
>>
>>     for (int i = 0x10000; i < 0x1FFFF; i++) {
>>         char[] cc = Character.toChars(i);
>>         if (!encoder.canEncode(new String(cc)))
>>             return cc;
>>     }
>
>> for (char c = 0xFF; c < 0xFFFF; c++)
>
> Doesn't this exclude `0xFFFF`, which is a valid (single-`char`,
> non-surrogate) BMP character?
>
>> ... we can just pick any non-bmp panel ...
>> ```
>> for (int i = 0x10000; i < 0x1FFFF; i++) { ...
>> ```
>
> Doesn't the non-BMP range rather end with 0x10FFFF?

(1) We might want to include 0xFFFF in the first pass.
(2) We just need to pick any unmappable non-BMP character. I would assume it is pretty safe that we will find one in the first non-BMP plane that is not encoded by a specific charset.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257902674
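
For illustration, the two passes discussed above, with 0xFFFF included in the first pass, could be sketched roughly as below. This is only a sketch and not the code in the PR: the class name `UnmappableFinder` and the method `findUnmappable` are hypothetical, and returning `null` when nothing is found is an assumption. Note that the first loop uses an `int` counter, since a `char` counter with `c <= 0xFFFF` would wrap around and never terminate.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
final class UnmappableFinder {

    /**
     * Returns a char sequence the given charset cannot encode,
     * or null if none is found in the scanned ranges (assumed convention).
     */
    static char[] findUnmappable(Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();

        // First pass: BMP characters from 0xFF up to and including 0xFFFF.
        // An int counter avoids char overflow at the upper bound.
        for (int i = 0xFF; i <= 0xFFFF; i++) {
            char c = (char) i;
            // A dangling surrogate is "malformed", not "unmappable"; skip it.
            if (Character.isSurrogate(c))
                continue;
            if (!encoder.canEncode(c))
                return new char[]{c};
        }

        // Second pass: the first supplementary plane only (0x10000..0x1FFFF).
        // Each code point here becomes a surrogate pair.
        for (int cp = 0x10000; cp <= 0x1FFFF; cp++) {
            char[] cc = Character.toChars(cp);
            if (!encoder.canEncode(new String(cc)))
                return cc;
        }

        return null;
    }
}
```

Scanning only the first supplementary plane rather than everything up to 0x10FFFF follows point (2): any one unmappable non-BMP character is enough, so a full scan should not be needed for charsets that do not cover that plane.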