On Mon, 30 Jan 2023 10:32:38 GMT, Eirik Bjorsnos <d...@openjdk.org> wrote:
> After finding a hash match, getEntryPos needs to compare the lookup name up > to the encoded entry name in the CEN. This comparison is done by decoding the > entry name into a String. The names can then be compared using the String > API. This decoding step adds a significat cost to this method. > > This PR suggest to update the string comparison such that in the common case > where both the lookup name and the entry name are encoded in > ASCII-compatible UTF-8, decoding can be avoided and the byte arrays can > instead be compared direcly. > > ZipCoder is updated with a new method to compare a string with an encoded > byte array range. The default implementation decodes to string (like the > current code), while the UTF-8 implementation uses > JavaLangAccess.getBytesNoRepl to get the bytes. Both methods thes uses > Arrays.mismatch for comparison with or without matching trailing slashes. > > Additionally, this PR suggest to make the following updates to getEntryPos: > > - The try/catch for IAE is redundant and can be safely removed. (initCEN > already checks this and will throws IAE for invalid UTF-8). This seems to > give a 3-4% speedup on micros) > - A new test InvalidBytesInEntryNameOrComment is a added to verify that > initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I > found no existing test coverage for this) > - The recursion when looking for "name/" matches is replaced with iteration. > We keep track of any "name/" match and return it at the end of the search. (I > feel this is easier to follow and it also gives a ~30% speedup for addSlash > lookups with no regression on regular lookups) > > (My though is that including these additional updates in this PR might reduce > reviewer overhead given that it touches the exact same code. I might be wrong > on this, please advise :) > > I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified > to use xalan.jar): > > Baseline: > > Benchmark (size) Mode Cnt Score Error > Units > ZipFileGetEntry.getEntryHit 512 avgt 15 74.941 ± 1.004 > ns/op > ZipFileGetEntry.getEntryHit 1024 avgt 15 84.943 ± 1.320 > ns/op > ZipFileGetEntry.getEntryHitUncached 512 avgt 15 120.371 ± 2.386 > ns/op > ZipFileGetEntry.getEntryHitUncached 1024 avgt 15 126.128 ± 1.075 > ns/op > ZipFileGetEntry.getEntryMiss 512 avgt 15 23.818 ± 0.838 > ns/op > ZipFileGetEntry.getEntryMiss 1024 avgt 15 29.762 ± 5.998 > ns/op > ZipFileGetEntry.getEntryMissUncached 512 avgt 15 59.405 ± 0.545 > ns/op > ZipFileGetEntry.getEntryMissUncached 1024 avgt 15 71.840 ± 2.455 > ns/op > ZipFileGetEntry.getEntrySlash 512 avgt 15 135.621 ± 4.341 > ns/op > ZipFileGetEntry.getEntrySlash 1024 avgt 15 134.190 ± 2.141 > ns/op > > > > PR: > > > Benchmark (size) Mode Cnt Score Error > Units > ZipFileGetEntry.getEntryHit 512 avgt 15 62.267 ± 1.329 > ns/op > ZipFileGetEntry.getEntryHit 1024 avgt 15 72.916 ± 2.428 > ns/op > ZipFileGetEntry.getEntryHitUncached 512 avgt 15 101.630 ± 1.154 > ns/op > ZipFileGetEntry.getEntryHitUncached 1024 avgt 15 113.161 ± 0.502 > ns/op > ZipFileGetEntry.getEntryMiss 512 avgt 15 23.003 ± 1.191 > ns/op > ZipFileGetEntry.getEntryMiss 1024 avgt 15 23.236 ± 1.114 > ns/op > ZipFileGetEntry.getEntryMissUncached 512 avgt 15 56.781 ± 1.505 > ns/op > ZipFileGetEntry.getEntryMissUncached 1024 avgt 15 67.767 ± 1.963 > ns/op > ZipFileGetEntry.getEntrySlash 512 avgt 15 73.745 ± 2.717 > ns/op > ZipFileGetEntry.getEntrySlash 1024 avgt 15 75.784 ± 1.051 > ns/op > > > To assess the impact on startup/warmup, I made a main method class which > measures the total time of calling ZipFile.getEntry for all entries in the > 109 JAR file dependenies of spring-petclinic. The shows a nice improvement > (time in micros): > > > Percentile Baseline Patch > 50 % 23155 21149 > 75 % 23598 21454 > 90 % 23989 21691 > 95 % 24238 21973 > 99 % 25270 22446 > STDEV 792 549 > Count 500 500 As suggested by @cl4es, I've replaced the use of ArraySupport.mismatch with Arrays.mismatch. The added range checks seems to give a regression of ~3% on the getEntryHit micro. I realized that encoding to bytes and then comparing to CEN bytes might not be safe for encodings were multiple representations is possible for the same code points. So I moved string/byte array comparison into ZipCoder, which can now decode from CEN and compare as in the current code. Micros indicate this has no performance impact. src/java.base/share/classes/java/lang/System.java line 2678: > 2676: } > 2677: return Arrays.mismatch(encoded, 0, encoded.length, b, > fromIndex, toIndex); > 2678: } Leaving the ArraySupport.mismatch code here for now if anyone wants to investigate the ~3% regression introduced by the range checks in Arrays.mismatch ------------- PR: https://git.openjdk.org/jdk/pull/12290