Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

Eirik Bjorsnos Tue, 21 Feb 2023 12:37:02 -0800

On Tue, 21 Feb 2023 20:23:11 GMT, Eirik Bjorsnos <[email protected]> wrote:


>> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 
>> 175:
>> 
>>> 173:          }
>>> 174:          // uppercase b1 using 'the oldest ASCII trick in the book'
>>> 175:          int U = b1 & 0xDF;
>> 
>> I'm sure some people reading this comment will wonder which book :-) It 
>> might be better to drop that bit and if possible, find a better name for "U" 
>> as normally variables start with a lower case.
>
> Hi Alan,
> 
> I thought I was clever by encoding the 'uppercaseness' in the variable name, 
> but yeah I'll find a better name :)
> 
> There is some precedent for using the 'ASCII trick' comment in the JDK.  I 
> found it in ZipFile.isMetaName, which is also where I first learned about 
> this interesting relationship between ASCII (and also latin1) letters.
> 
> The comment was first added by Martin Buchholz back in 2016 as part of 
> JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating 
> this code and Lance had some input about clarifying the comment. Martin then 
> [chimed 
> in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) to 
> defend his comment:
> 
>> I still like my ancient "ASCII trick" comment.
> 
> I think this 'trick', whatever we call it, is sufficiently intricate that it 
> deserves to be called out somehow and that we should not just casually 
> bitmask with these magic constants without any discussion at all. 
> 
> An earlier iteration of this PR included a small essay in the javadoc of this 
> method describing the layout and relationship of letters in latin1 and how we 
> can apply that knowledge of the layout to implement the method.
> 
> How would you feel about adding that description back to the Javadocs? This 
> would then live close to the similarly implemented toUpperCase and 
> toLowerCase methods currently under review in #12623. 
> 
> Here's the updated discussion included in the Javadoc:
> 
> 
>     /**
>      * Compares two latin1 code points, ignoring case considerations.
>      *
>      * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
>      * letters are found in the following code point ranges:
>      *
>      * 0x41-0x5A: Uppercase ASCII letters: A-Z
>      * 0x61-0x7A: Lowercase ASCII letters: a-z
>      * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
>      * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
>      * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
>      * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
>      *
>      * While both ASCII letter ranges are contiguous, the latin1 ranges are
>      * not:
>      *
>      * The 'multiplication sign' 0xD7 splits the uppercase range in two.
>      * The 'division sign' 0xF7 splits the lowercase range in two.
>      *
>      * Lowercase letters are found 32 positions (0x20) after their
>      * corresponding uppercase letter.
>      * The 'division sign' and 'multiplication sign' have the same relative
>      * distance.
>      *
>      * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in
>      * the book' to lowercase any letter by setting the bit:
>      *
>      * ('C' | 0x20) == 'c'
>      *
>      * By removing the bit, we can perform the uppercase operation:
>      *
>      * ('c' & 0xDF) == 'C'
>      *
>      * Applying this knowledge of the latin1 layout, we can test for equality
>      * ignoring case by checking that the code points are either equal, or
>      * that one of the code points is a letter whose uppercase form is the
>      * same as the uppercase form of the other code point.
>      *
>      * @param b1 byte representing a latin1 code point
>      * @param b2 another byte representing a latin1 code point
>      * @return true if the two bytes are considered equal ignoring case in
>      *         latin1
>      */
>     static boolean equalsIgnoreCase(byte b1, byte b2) {
>         if (b1 == b2) {
>             return true;
>         }
>         int upper = b1 & 0xDF;
>         if (upper < 'A') {
>             return false;  // Low ASCII
>         }
>         // In range A-Z, or A-grave - Thorn excluding the multiplication sign
>         return (upper <= 'Z'
>                 || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7))
>                 && upper == (b2 & 0xDF); // b2 has same uppercase
>     }

Perhaps @Martin-Buchholz could chime in and also tell us which book he found 
his ASCII trick in :)
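
For anyone who wants to poke at the trick outside the JDK, here is a minimal standalone sketch. It copies the method quoted above into a hypothetical class `Latin1CaseDemo` and compares it, for all 256 x 256 byte pairs, against a reference built on `Character.toUpperCase(int)`. The class name and the helpers `referenceEquals` and `countMismatches` are my own inventions for illustration and are not part of the PR. The sketch assumes that comparing uppercase mappings is an adequate reference for latin1 input. It also shows why the range check is needed: the bit mask alone happily "uppercases" non-letters.

```java
final class Latin1CaseDemo {

    // Copy of the method quoted above, so that this sketch is self-contained.
    static boolean equalsIgnoreCase(byte b1, byte b2) {
        if (b1 == b2) {
            return true;
        }
        int upper = b1 & 0xDF;          // clear bit 0x20
        if (upper < 'A') {
            return false;               // low ASCII
        }
        return (upper <= 'Z'
                || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7))
                && upper == (b2 & 0xDF);
    }

    // Hypothetical reference: two latin1 code points match if they are
    // equal or if their Character.toUpperCase mappings are equal.
    static boolean referenceEquals(int c1, int c2) {
        return c1 == c2
                || Character.toUpperCase(c1) == Character.toUpperCase(c2);
    }

    // Counts the byte pairs where the bit trick and the reference disagree.
    static int countMismatches() {
        int mismatches = 0;
        for (int c1 = 0; c1 < 256; c1++) {
            for (int c2 = 0; c2 < 256; c2++) {
                if (equalsIgnoreCase((byte) c1, (byte) c2)
                        != referenceEquals(c1, c2)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // The trick itself: one bit separates the two cases of a letter.
        System.out.println((char) ('C' | 0x20));   // sets the bit
        System.out.println((char) ('c' & 0xDF));   // clears the bit
        // Without a range check, the mask also maps non-letters onto each
        // other: '`' (0x60) becomes '@' (0x40), and the division sign
        // (0xF7) becomes the multiplication sign (0xD7).
        System.out.println(('`' & 0xDF) == '@');
        System.out.println((0xF7 & 0xDF) == 0xD7);
        System.out.println("mismatches: " + countMismatches());
    }
}
```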

-------------

PR: https://git.openjdk.org/jdk/pull/12632
