On Mon, 6 Feb 2023 11:47:42 GMT, Eirik Bjorsnos <[email protected]> wrote:
>> src/java.base/share/classes/java/lang/System.java line 2668:
>>
>>> 2666: @Override
>>> 2667: public int mismatchUTF8(String str, byte[] b, int
>>> fromIndex, int toIndex) {
>>> 2668: byte[] encoded = str.isLatin1() ? str.value() :
>>> str.getBytes(UTF_8.INSTANCE);
>>
>> I think this is incorrect: latin-1 characters above codepoint 127
>> (non-ascii) would be represented by 2 bytes in UTF-8. What you want here is
>> probably `str.isAscii() ? ...`. The ASCII check will have to look at the
>> bytes, so will incur a minor penalty.
>>
>> Good news is that you should already be able to do this with what's already
>> exposed via `JLA.getBytesNoRepl(str, StandardCharsets.UTF_8)`, so no need
>> for more shared secrets.
>
> Nice, I have updated the PR such that the new shared secret is replaced with
> using getBytesNoRepl instead. If there is a performance difference, it seems
> to hide in the noise.
>
> I had expected such a regression to be caught by existing tests, which seems
> not to be the case. I added TestZipFileEncodings.latin1NotAscii to address
> this.
`getBytesNoRepl` throws `CharacterCodingException` "for malformed input or
unmappable characters".
This should never happen, since `initCEN` should already have rejected such
input. If it happens anyway, I return `NO_MATCH`, which ignores the match just
like the catch in `getEntryPos` currently does.
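
For illustration only, here is a minimal sketch of the two points above using
public API. It is not the code in the PR: `encodeStrict` is a hypothetical
stand-in for the internal `getBytesNoRepl` (built on a `CharsetEncoder` with
`CodingErrorAction.REPORT`), and the class name and the value of `NO_MATCH`
are assumptions chosen for the example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class MismatchSketch {
    // Hypothetical sentinel; the real NO_MATCH value is not shown in this thread.
    static final int NO_MATCH = -2;

    // Strict UTF-8 encoding: throws instead of substituting a replacement byte.
    // Assumed stand-in for the internal getBytesNoRepl, not the same method.
    static byte[] encodeStrict(String s) throws CharacterCodingException {
        ByteBuffer bb = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // Compares the UTF-8 form of str against b[fromIndex, toIndex).
    // Returns -1 when the ranges are equal, the index of the first
    // difference otherwise, or NO_MATCH if str cannot be encoded.
    static int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) {
        try {
            byte[] encoded = encodeStrict(str);
            return Arrays.mismatch(encoded, 0, encoded.length, b, fromIndex, toIndex);
        } catch (CharacterCodingException e) {
            // Malformed input such as a lone surrogate: treat as "no match".
            return NO_MATCH;
        }
    }

    public static void main(String[] args) {
        String s = "\u00E9"; // U+00E9 is Latin-1 but not ASCII
        // One byte in ISO-8859-1, two bytes (0xC3 0xA9) in UTF-8.
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9};
        System.out.println(mismatchUTF8(s, utf8, 0, utf8.length));
        System.out.println(mismatchUTF8("\uD800", utf8, 0, utf8.length) == NO_MATCH);
    }
}
```

The first two lines printed show why reusing the Latin-1 `value()` bytes is
only safe for pure ASCII strings: the same character has a different length,
and different bytes, once encoded as UTF-8.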
-------------
PR: https://git.openjdk.org/jdk/pull/12290