On Wed, 6 Aug 2025 09:58:16 GMT, Volkan Yazici <vyaz...@openjdk.org> wrote:

>> we definitely want to exclude 'some' charsets here. yes, all unicode 
>> variants probably should be excluded, as they are expected to have a 
>> 'mapping' for every unicode character. Additionally, many charsets have an 
>> "internal status", meaning they might shift in and shift out its status 
>> based on input. See https://www.rfc-editor.org/rfc/rfc1468.html for an 
>> example. The encoder might/should add the shift-in/out escape sequence 
>> characters on top of the 'replacement', if the replacement character's 
>> target sub-charset does not match the 'existing' sub-charset. i would assume 
>> this is really out of the scope of this pr though :-)
>
>> if the replacement character's target sub-charset does not match the 
>> 'existing' sub-charset
> 
> In such a `replacement`, does `CharsetEncoder::isLegalReplacement` still 
> return `true`?

it's 'tricky' :-)  some charsets have a default initial status, ascii-charset 
for example, this might trigger false return if the replacement is set without 
the appropriate shift-in/out esc-seq when target-sub-charset and existing 
charset are not matched.  I'm not confident that all our implementations really 
handle it correctly :-)  it might be interested (not really :-) given these 
charsets might not be that important these days and it's rare people try to 
change the default replacement bytes) to do full-scan-check, but again probably 
is not in-scope of this change.  

iso2022-jp is one such charset. we attempt to shift-in to the correct 
sub-charset by keeping the requested mode in **_implReplaceWith_**

        protected void implReplaceWith(byte[] newReplacement) {
            /* It's almost impossible to decide which charset they belong
               to. The best thing we can do here is to "guess" based on
               the length of newReplacement.
             */
            if (newReplacement.length == 1) {
                replaceMode = ASCII;
            } else if (newReplacement.length == 2) {
                replaceMode = JISX0208_1983;
            }
        } 

then during encoding

                            if (unmappableCharacterAction()
                                == CodingErrorAction.REPLACE
                                && currentMode != replaceMode) {
                                if (dl - dp < 3)
                                    return CoderResult.OVERFLOW;
                                if (replaceMode == ASCII) {
                                    da[dp++] = (byte)0x1b;
                                    da[dp++] = (byte)0x28;
                                    da[dp++] = (byte)0x42;
                                } else {
                                    da[dp++] = (byte)0x1b;
                                    da[dp++] = (byte)0x24;
                                    da[dp++] = (byte)0x42;
                                }
                                currentMode = replaceMode;
                            }

i believe this might not be really bulletproved.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/26635#discussion_r2257966394

Reply via email to