On Mon, 6 Feb 2023 11:47:42 GMT, Eirik Bjorsnos <d...@openjdk.org> wrote:

>> src/java.base/share/classes/java/lang/System.java line 2668:
>>
>>> 2666:             @Override
>>> 2667:             public int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) {
>>> 2668:                 byte[] encoded = str.isLatin1() ? str.value() : str.getBytes(UTF_8.INSTANCE);
>>
>> I think this is incorrect: latin-1 characters above codepoint 127
>> (non-ascii) would be represented by 2 bytes in UTF-8. What you want here is
>> probably `str.isAscii() ? ...`. The ASCII check will have to look at the
>> bytes, so will incur a minor penalty.
>>
>> Good news is that you should already be able to do this with what's already
>> exposed via `JLA.getBytesNoRepl(str, StandardCharsets.UTF_8)`, so no need
>> for more shared secrets.
>
> Nice, I have updated the PR such that the new shared secret is replaced with
> using getBytesNoRepl instead. If there is a performance difference, it seems
> to hide in the noise.
>
> I had expected such a regression to be caught by existing tests, which seems
> not to be the case. I added TestZipFileEncodings.latin1NotAscii to address
> this.

getBytesNoRepl throws CharacterCodingException "for malformed input or
unmappable characters". This should never happen, since initCEN should already
have rejected such names. If it does happen anyway, I return NO_MATCH, which
ignores the match just like the catch in getEntryPos currently does.

-------------

PR: https://git.openjdk.org/jdk/pull/12290
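
For readers following along, here is a small standalone sketch of the encoding issue raised in the review above. It is not the JDK code under discussion: the class name `Latin1VsUtf8` and the helper `isAscii` are made up for illustration, and `isAscii` is a plain loop rather than the internal `String` check mentioned in the review. It only shows that a name containing a Latin-1 character above 127 has different Latin-1 and UTF-8 byte sequences, so the Latin-1 bytes cannot stand in for the UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Latin1VsUtf8 {

    // Hypothetical helper: true if every char is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // True if the Latin-1 and UTF-8 encodings of s are byte-for-byte identical.
    static boolean sameBytesInBothEncodings(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(latin1, utf8);
    }

    public static void main(String[] args) {
        String ascii = "cafe.txt";
        String latin1NotAscii = "caf\u00e9.txt"; // U+00E9 is Latin-1 but not ASCII

        // Pure ASCII: one byte per char in both encodings.
        System.out.println(isAscii(ascii) + " " + sameBytesInBothEncodings(ascii));

        // Latin-1 but not ASCII: U+00E9 is 0xE9 in Latin-1, 0xC3 0xA9 in UTF-8.
        System.out.println(isAscii(latin1NotAscii) + " "
                + sameBytesInBothEncodings(latin1NotAscii));
        System.out.println(latin1NotAscii.getBytes(StandardCharsets.ISO_8859_1).length);
        System.out.println(latin1NotAscii.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For the second name, the Latin-1 encoding is 8 bytes and the UTF-8 encoding is 9, which is why a check on "is Latin-1" is too weak and a check on "is ASCII" (or simply encoding to UTF-8) is needed before comparing against UTF-8 bytes.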