Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos

Eirik Bjorsnos Mon, 06 Feb 2023 07:04:31 -0800

On Mon, 6 Feb 2023 11:47:42 GMT, Eirik Bjorsnos <[email protected]> wrote:


>> src/java.base/share/classes/java/lang/System.java line 2668:
>> 
>>> 2666:             @Override
>>> 2667:             public int mismatchUTF8(String str, byte[] b, int fromIndex, int toIndex) {
>>> 2668:                 byte[] encoded = str.isLatin1() ? str.value() : str.getBytes(UTF_8.INSTANCE);
>> 
>> I think this is incorrect: latin-1 characters above codepoint 127 
>> (non-ascii) would be represented by 2 bytes in UTF-8. What you want here is 
>> probably `str.isAscii() ? ...`. The ASCII check will have to look at the 
>> bytes, so will incur a minor penalty.
>> 
>> Good news is that you should already be able to do this with what's already 
>> exposed via `JLA.getBytesNoRepl(str, StandardCharsets.UTF_8)`, so no need 
>> for more shared secrets.
>
> Nice, I have updated the PR such that the new shared secret is replaced with 
> using getBytesNoRepl instead. If there is a performance difference, it seems 
> to hide in the noise.
> 
> I had expected such a regression to be caught by existing tests, but that 
> seems not to be the case. I added TestZipFileEncodings.latin1NotAscii to 
> address this.
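
To illustrate the Latin-1 versus UTF-8 point quoted above, here is a minimal 
sketch using only public APIs. The class name Latin1VsUtf8 and the sample 
names are hypothetical and not part of the PR; the sketch only shows that a 
Latin-1 character above 127 occupies one byte in Latin-1 but two in UTF-8, 
so the two byte arrays are only interchangeable for pure ASCII strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK or of the PR under review.
public class Latin1VsUtf8 {

    // Number of bytes the string occupies when encoded as ISO-8859-1 (Latin-1).
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "entry.txt";       // pure ASCII: both encodings agree
        String latin1 = "caf\u00e9.txt";  // U+00E9 is Latin-1 but not ASCII

        System.out.println(latin1Length(ascii) + " vs " + utf8Length(ascii));   // 9 vs 9
        System.out.println(latin1Length(latin1) + " vs " + utf8Length(latin1)); // 8 vs 9
    }
}
```
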

getBytesNoRepl throws CharacterCodingException "for malformed input or 
unmappable characters".

This should never happen, since initCEN should already have rejected such 
input. If it happens anyway, I return NO_MATCH, which ignores the match just 
like the catch in getEntryPos currently does.

-------------

PR: https://git.openjdk.org/jdk/pull/12290
