On Tue, 21 Feb 2023 20:23:11 GMT, Eirik Bjorsnos <d...@openjdk.org> wrote:
>> src/java.base/share/classes/java/lang/CharacterDataLatin1.java.template line 175:
>>
>>> 173:         }
>>> 174:         // uppercase b1 using 'the oldest ASCII trick in the book'
>>> 175:         int U = b1 & 0xDF;
>>
>> I'm sure some people reading this comment will wonder which book :-) It
>> might be better to drop that bit and, if possible, find a better name for
>> "U", as variables normally start with a lower case letter.
>
> Hi Alan,
>
> I thought I was clever by encoding the 'uppercaseness' in the variable name,
> but yeah, I'll find a better name :)
>
> There is some precedent for using the 'ASCII trick' comment in the JDK. I
> found it in ZipFile.isMetaName, which is also where I first learned about
> this interesting relationship between ASCII (and also latin1) letters.
>
> The comment was first added by Martin Buchholz back in 2016 as part of
> JDK-8157069, 'Assorted ZipFile improvements'. In 2020, Claes was updating
> this code and Lance had some input about clarifying the comment. Martin then
> [chimed in](https://mail.openjdk.org/pipermail/core-libs-dev/2020-May/066363.html)
> to defend his comment:
>
>> I still like my ancient "ASCII trick" comment.
>
> I think this 'trick', whatever we call it, is sufficiently intricate that it
> deserves to be called out somehow, and that we should not just casually
> bitmask with these magic constants without any discussion at all.
>
> An earlier iteration of this PR included a small essay in the Javadoc of this
> method describing the layout and relationship of letters in latin1 and how we
> can apply that knowledge of the layout to implement the method.
>
> How would you feel about adding that description back to the Javadoc? It
> would then live close to the similarly implemented toUpperCase and
> toLowerCase methods currently under review in #12623.
>
> Here's the updated discussion included in the Javadoc:
>
>     /**
>      * Compares two latin1 code points, ignoring case considerations.
>      *
>      * Implementation note: In ISO/IEC 8859-1, the uppercase and lowercase
>      * letters are found in the following code point ranges:
>      *
>      * 0x41-0x5A: Uppercase ASCII letters: A-Z
>      * 0x61-0x7A: Lowercase ASCII letters: a-z
>      * 0xC0-0xD6: Uppercase latin1 letters: A-GRAVE - O with Diaeresis
>      * 0xD8-0xDE: Uppercase latin1 letters: O with slash - Thorn
>      * 0xE0-0xF6: Lowercase latin1 letters: a-grave - o with Diaeresis
>      * 0xF8-0xFE: Lowercase latin1 letters: o with slash - thorn
>      *
>      * While both ASCII letter ranges are contiguous, the latin1 ranges are not:
>      *
>      * The 'multiplication sign' 0xD7 splits the uppercase range in two.
>      * The 'division sign' 0xF7 splits the lowercase range in two.
>      *
>      * Lowercase letters are found 32 positions (0x20) after their
>      * corresponding uppercase letter.
>      * The 'division sign' and 'multiplication sign' have the same relative
>      * distance.
>      *
>      * Since 0x20 is a single bit, we can apply the 'oldest ASCII trick in
>      * the book' to lowercase any letter by setting the bit:
>      *
>      * ('C' | 0x20) == 'c'
>      *
>      * By removing the bit, we can perform the uppercase operation:
>      *
>      * ('c' & 0xDF) == 'C'
>      *
>      * Applying this knowledge of the latin1 layout, we can test for equality
>      * ignoring case by checking that the code points are either equal, or
>      * that one of the code points is a letter whose uppercase is the same as
>      * the uppercase of the other code point.
>      *
>      * @param b1 byte representing a latin1 code point
>      * @param b2 another byte representing a latin1 code point
>      * @return true if the two bytes are considered equal ignoring case in
>      *         latin1
>      */
>     static boolean equalsIgnoreCase(byte b1, byte b2) {
>         if (b1 == b2) {
>             return true;
>         }
>         int upper = b1 & 0xDF;
>         if (upper < 'A') {
>             return false;  // Low ASCII
>         }
>         return (upper <= 'Z' // In range A-Z
>                 || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7)) // ..or A-grave-Thorn, excl. multiplication
>                 && upper == (b2 & 0xDF); // b2 has same uppercase
>     }

Perhaps @Martin-Buchholz could chime in and also tell us which book he found
his ASCII trick in :)

-------------

PR: https://git.openjdk.org/jdk/pull/12632
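
For anyone who wants to convince themselves that the bit-masking version quoted in this message behaves as described, here is a minimal sketch that compares it against `Character.toUpperCase(int)` for every pair of latin1 code points. The class name `Latin1CaseSketch` and the helpers `reference` and `countMismatches` are hypothetical names made up for this illustration; they are not part of the JDK or of this PR. The reference check assumes that comparing simple uppercase mappings is sufficient within the 0x00-0xFF range.

```java
// Hypothetical demo class; not part of the JDK or of the PR under review.
final class Latin1CaseSketch {

    /** Same logic as the method quoted in the message above. */
    static boolean equalsIgnoreCase(byte b1, byte b2) {
        if (b1 == b2) {
            return true;
        }
        int upper = b1 & 0xDF; // clear bit 0x20; the mask also drops sign extension
        if (upper < 'A') {
            return false;
        }
        return (upper <= 'Z'
                || (upper >= 0xC0 && upper <= 0xDE && upper != 0xD7))
                && upper == (b2 & 0xDF);
    }

    /** Assumed reference: equal code points, or equal simple uppercase mappings. */
    static boolean reference(int c1, int c2) {
        return c1 == c2 || Character.toUpperCase(c1) == Character.toUpperCase(c2);
    }

    /** Counts the latin1 pairs on which the two implementations disagree. */
    static int countMismatches() {
        int mismatches = 0;
        for (int c1 = 0; c1 < 256; c1++) {
            for (int c2 = 0; c2 < 256; c2++) {
                if (equalsIgnoreCase((byte) c1, (byte) c2) != reference(c1, c2)) {
                    mismatches++;
                }
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("('C' | 0x20) == 'c': " + (('C' | 0x20) == 'c'));
        System.out.println("('c' & 0xDF) == 'C': " + (('c' & 0xDF) == 'C'));
        System.out.println("mismatches: " + countMismatches());
    }
}
```

The explicit exclusion of 0xD7 is what keeps the multiplication and division signs from comparing equal, and the upper bound of 0xDE keeps sharp s (0xDF) and y with diaeresis (0xFF) out of the letter range, since neither has an uppercase form in latin1.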