Re: RFR: 8302871: Speed up StringLatin1.regionMatchesCI [v7]

Martin Buchholz Tue, 21 Feb 2023 12:50:56 -0800

On Tue, 21 Feb 2023 20:33:41 GMT, Eirik Bjorsnos <d...@openjdk.org> wrote:


>> Hi Alan,
>> 
>> I thought I was clever by encoding the 'uppercaseness' in the variable name, 
>> but yeah I'll find a better name :)
>> 
>> There is some precedent for using the 'ASCII trick' comment in the JDK.  I 
>> found it in ZipFile.isMetaName, which is also where I first learned about 
>> this interesting relationship between ASCII (and also latin1) letters.
>> 
>> The comment was first added by Martin Buchholz back in 2016 as part of 
>> JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating 
>> this code and Lance had some input about clarifying the comment. Martin then 
>> [chimed 
>> in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html) 
>> to defend his comment:
>> 
>>> I still like my ancient "ASCII trick" comment.
>> 
>> I think this 'trick', whatever we call it, is sufficiently intricate that it 
>> deserves to be called out somehow and that we should not just casually 
>> bitmask with these magic constants without any discussion at all. 
>> 
>> An earlier iteration of this PR included a small essay in the javadoc of 
>> this method describing the layout and relationship of letters in latin1 and 
>> how we can apply that knowledge of the layout to implement the method.
>> 
>> How would you feel about adding that description back to the Javadocs? This 
>> would then live close to the similarly implemented toUpperCase and 
>> toLowerCase methods currently under review in #12623. 
>> 
>> Here's the updated discussion included in the Javadoc:
>> 
>> 
>>     /**
>>      * Compares two latin1 code points, ignoring case considerations.
>>      *
>>      * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
>>      * letters are found in the following code point ranges:
>>      *
>>      * 0x41-0x5A: Uppercase ASCII letters: A-Z
>>      * 0x61-0x7A: Lowercase ASCII letters: a-z
>>      * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
>>      * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
>>      * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
>>      * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
>>      *
>>      * While both ASCII letter ranges are contiguous, the latin1 ranges are not:
>>      *
>>      * The 'multiplication sign' 0xD7 splits the uppercase range in two.
>>      * The 'division sign' 0xF7 splits the lowercase range in two.
>>      *
>>      * Lowercase letters are found 32 positions (0x20) after their corresponding uppercase letter.
>>      * The 'division sign' and 'multiplication sign' have the same relative distance.
>>      *
>>      * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in the book' to
>>      * lowercase any letter by setting the bit:
>>      *
>>      * ('C' | 0x20) == 'c'
>>      *
>>      * By removing the bit, we can perform the uppercase operation:
>>      *
>>      * ('c' & 0xDF) == 'C'
>>      *
>>      * Applying this knowledge of the latin1 layout, we can test for equality ignoring case by
>>      * checking that the code points are either equal, or that one of the code points is a letter
>>      * whose uppercase is the same as the uppercase of the other code point.
>>      *
>>      * @param b1 byte representing a latin1 code point
>>      * @param b2 another byte representing a latin1 code point
>>      * @return true if the two bytes are considered equal ignoring case in latin1
>>      */
>>      static boolean equalsIgnoreCase(byte b1, byte b2) {
>>          if (b1 == b2) {
>>              return true;
>>          }
>>          int upper = b1 & 0xDF;
>>          if (upper < 'A') {
>>              return false;  // Low ASCII
>>          }
>>          return (upper <= 'Z' // In range A-Z
>>                  || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7)) // ..or A-grave-Thorn, excl. multiplication
>>                  && upper == (b2 & 0xDF); // b2 has same uppercase
>>     }
>
> Perhaps @Martin-Buchholz could chime in and also tell us which book he found 
> his ASCII trick in :)

"oldest trick in the book" is a phrase that does not necessarily imply 
existence of an actual book!

Let this evoke an image of a **personal** book of tricks that programmers in 
the 1960s might have recorded such techniques in.  And the tricks were passed 
down across generations of programmers!
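
For anyone who wants to poke at the trick itself, here is a minimal sketch that exercises the bit relationship described in the quoted Javadoc. The class name `Latin1CaseDemo` and the chosen sample bytes are my own assumptions for illustration; the method body follows the logic quoted above but is not claimed to be what ends up in the JDK.

```java
public class Latin1CaseDemo {

    // Follows the logic of the quoted equalsIgnoreCase; illustrative only.
    static boolean equalsIgnoreCase(byte b1, byte b2) {
        if (b1 == b2) {
            return true;
        }
        // Clear bit 0x20; the mask also discards the sign-extended high bits.
        int upper = b1 & 0xDF;
        if (upper < 'A') {
            return false; // below the ASCII letters
        }
        return (upper <= 'Z' // A-Z
                || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7)) // A-grave..Thorn, excl. multiplication sign
                && upper == (b2 & 0xDF); // b2 uppercases to the same value
    }

    public static void main(String[] args) {
        // Setting or clearing the single bit 0x20 switches case.
        System.out.println(('C' | 0x20) == 'c'); // true
        System.out.println(('c' & 0xDF) == 'C'); // true

        // Letters match regardless of case, in ASCII and in the latin1 ranges.
        System.out.println(equalsIgnoreCase((byte) 'a', (byte) 'A'));   // true
        System.out.println(equalsIgnoreCase((byte) 0xE0, (byte) 0xC0)); // true (a-grave, A-grave)

        // Non-letters 0x20 apart must not match.
        System.out.println(equalsIgnoreCase((byte) 0xF7, (byte) 0xD7)); // false (division, multiplication)
        System.out.println(equalsIgnoreCase((byte) '[', (byte) '{'));   // false
    }
}
```

The division/multiplication pair is the interesting case: both signs sit exactly 0x20 apart, so the bitmask alone would treat them as a case pair, which is why the range check excludes 0xD7.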

-------------

PR: https://git.openjdk.org/jdk/pull/12632
