Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v14]

Eirik Bjorsnos Wed, 15 Feb 2023 02:14:26 -0800

> After finding a hash match, getEntryPos needs to compare the lookup name up 
> to the encoded entry name in the CEN. This comparison is done by decoding the 
> entry name into a String. The names can then be compared using the String 
> API. This decoding step adds a significat cost to this method.
> 
> This PR suggest to update the string comparison such that in the common case 
> where  both the lookup name and the entry name are encoded in 
> ASCII-compatible UTF-8,  decoding can be avoided and the byte arrays can 
> instead be compared direcly. 
> 
> ZipCoder is updated with a new method to compare a string with an encoded 
> byte array range.  The default implementation decodes to string (like the 
> current code), while the UTF-8 implementation uses 
> JavaLangAccess.getBytesNoRepl to get the  bytes. Both methods thes uses 
> Arrays.mismatch for comparison with or without matching trailing slashes. 
> 
> Additionally, this PR suggest to make the following updates to getEntryPos:
> 
> - The try/catch for IAE is redundant and can be safely removed. (initCEN 
> already checks this and will throws IAE for invalid UTF-8). This seems to 
> give a 3-4% speedup on micros)
> - A new test InvalidBytesInEntryNameOrComment is a added to verify that 
> initCEN does in fact reject invalid UTF-8 in CEN file names and comments. (I 
> found no existing test coverage for this)
> - The recursion when looking for "name/" matches is replaced with iteration. 
> We keep track of any "name/" match and return it at the end of the search. (I 
> feel this is easier to follow and it also gives a ~30% speedup for addSlash 
> lookups with no regression on regular lookups)
> 
> (My though is that including these additional updates in this PR might reduce 
> reviewer overhead given that it touches the exact same code. I might be wrong 
> on this, please advise :)
> 
> I'm seeing a ~17% saving on the micro ZipFileGetEntry.getEntryHit (modified 
> to use xalan.jar):
> 
> Baseline:
> 
> Benchmark                             (size)  Mode  Cnt    Score   Error  
> Units
> ZipFileGetEntry.getEntryHit              512  avgt   15   74.941 ± 1.004  
> ns/op
> ZipFileGetEntry.getEntryHit             1024  avgt   15   84.943 ± 1.320  
> ns/op
> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  120.371 ± 2.386  
> ns/op
> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  126.128 ± 1.075  
> ns/op
> ZipFileGetEntry.getEntryMiss             512  avgt   15   23.818 ± 0.838  
> ns/op
> ZipFileGetEntry.getEntryMiss            1024  avgt   15   29.762 ± 5.998  
> ns/op
> ZipFileGetEntry.getEntryMissUncached     512  avgt   15   59.405 ± 0.545  
> ns/op
> ZipFileGetEntry.getEntryMissUncached    1024  avgt   15   71.840 ± 2.455  
> ns/op
> ZipFileGetEntry.getEntrySlash            512  avgt   15  135.621 ± 4.341  
> ns/op
> ZipFileGetEntry.getEntrySlash           1024  avgt   15  134.190 ± 2.141  
> ns/op
> 
> 
> 
> PR:
> 
> 
> Benchmark                             (size)  Mode  Cnt    Score   Error  
> Units
> ZipFileGetEntry.getEntryHit              512  avgt   15   62.267 ± 1.329  
> ns/op
> ZipFileGetEntry.getEntryHit             1024  avgt   15   72.916 ± 2.428  
> ns/op
> ZipFileGetEntry.getEntryHitUncached      512  avgt   15  101.630 ± 1.154  
> ns/op
> ZipFileGetEntry.getEntryHitUncached     1024  avgt   15  113.161 ± 0.502  
> ns/op
> ZipFileGetEntry.getEntryMiss             512  avgt   15   23.003 ± 1.191  
> ns/op
> ZipFileGetEntry.getEntryMiss            1024  avgt   15   23.236 ± 1.114  
> ns/op
> ZipFileGetEntry.getEntryMissUncached     512  avgt   15   56.781 ± 1.505  
> ns/op
> ZipFileGetEntry.getEntryMissUncached    1024  avgt   15   67.767 ± 1.963  
> ns/op
> ZipFileGetEntry.getEntrySlash            512  avgt   15   73.745 ± 2.717  
> ns/op
> ZipFileGetEntry.getEntrySlash           1024  avgt   15   75.784 ± 1.051  
> ns/op
> 
> 
> To assess the impact on startup/warmup, I made a main method class which 
> measures the total time of calling ZipFile.getEntry for all entries in the 
> 109 JAR file dependenies of spring-petclinic. The shows a nice improvement 
> (time in micros):
> 
> 
> Percentile Baseline Patch
> 50 %          23155 21149
> 75 %          23598 21454
> 90 %          23989 21691
> 95 %          24238 21973
> 99 %          25270 22446
> STDEV           792   549
> Count           500   500


Eirik Bjorsnos has updated the pull request incrementally with one additional 
commit since the last revision:

  Update javadocs for ZipCoder.Comparison to first clarify that the "encoded 
string" refers to those found in the CEN byte array and to use the term "the 
encoded string" instead of the more convoluted "the string encoded in the byte 
array".

-------------

Changes:
  - all: https://git.openjdk.org/jdk/pull/12290/files
  - new: https://git.openjdk.org/jdk/pull/12290/files/49ad8d35..9b83002a

Webrevs:
 - full: https://webrevs.openjdk.org/?repo=jdk&pr=12290&range=13
 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=12290&range=12-13

  Stats: 10 lines in 1 file changed: 0 ins; 2 del; 8 mod
  Patch: https://git.openjdk.org/jdk/pull/12290.diff
  Fetch: git fetch https://git.openjdk.org/jdk pull/12290/head:pull/12290

PR: https://git.openjdk.org/jdk/pull/12290

Re: RFR: 8301873: Avoid string decoding in ZipFile.Source.getEntryPos [v14]

Reply via email to