I've discovered some errors in Java's case-insensitive methods
for its String class. Its equalsIgnoreCase() is the most
obvious one that gets things wrong, but there are several
others as well.
There is inarguably at least one significant bug, and quite plausibly
several more. I've looked at the JDK7 source, and these remain
buggy. I enclose a testing program to illustrate these bugs. It
runs the tests against both the JDK and the equivalent ICU function.
The JDK gets many of them wrong, while ICU gets them all correct.
--tom
==TECHNICAL DETAILS FOLLOW==
The source of these bugs is the many, many ASCII assumptions regarding
casing, assumptions that do not hold for Unicode. There are also holdover
bugs due to ignorance about Unicode outside the BMP that come from Unicode
1's 16-bitness, which no longer applies.
The easiest bug to illustrate is that
"𐐔𐐇𐐝𐐀𐐡𐐇𐐓".equalsIgnoreCase("𐐼𐐯𐑅𐐨𐑉𐐯𐐻")
erroneously returns false when Unicode demands that it return true.
The bug here is that regionMatches() thinks that a String comprises
a sequence of 16-bit Unicode characters. It does not. Those are char
units, not Unicode characters, which are 21-bit quantities normally rounded
up to 32 bits, not down to 16. The example strings I just used are
in Deseret, which is a case-changing script in the SMP, not in the BMP.
Therefore, the char-based monkey work is broken. You can see how broken
this is right here, from regionMatches:
    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion. So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }
    return true;
That is the source of this bug, and several others I shall describe. It
needs to be reworked so that it actually works with all Unicode strings.
You cannot store Unicode characters in Java char variables, and you must
not call the char versions of the Character casing methods, since they
cannot see anything outside the BMP, which is most of Unicode's range.
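As a rough sketch of what a code-point-aware version of that loop could
look like, here is a standalone comparison that walks both strings by code
point and uses the int overloads of the Character casing methods. The class
and method names are made up for illustration; this is not the JDK's code.
It still does only simple one-to-one case mapping, so it addresses the
non-BMP problem and nothing else.

```java
// Hypothetical sketch: names are invented for illustration, not taken from the JDK.
final class CodePointCompare {

    // Case-insensitive equality that iterates by code point instead of by char.
    // Uses only simple (one-to-one) case mappings, like the original loop.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int c1 = a.codePointAt(i);
            int c2 = b.codePointAt(j);
            i += Character.charCount(c1);   // 1 char in the BMP, 2 outside it
            j += Character.charCount(c2);
            if (c1 == c2) {
                continue;
            }
            int u1 = Character.toUpperCase(c1);
            int u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
            return false;
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    public static void main(String[] args) {
        String upper = new String(Character.toChars(0x10414)); // DESERET CAPITAL LETTER DEE
        String lower = new String(Character.toChars(0x1043C)); // DESERET SMALL LETTER DEE
        System.out.println("char units:  " + upper.length());
        System.out.println("code points: " + upper.codePointCount(0, upper.length()));
        System.out.println("by code point: " + equalsIgnoreCaseByCodePoint(upper, lower));
    }
}
```

One Deseret letter occupies two char units but is a single code point,
which is exactly what the char-based loop cannot see.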
Next, if you are comparing strings, you should not be using simple case
maps, and you must not assume that strings don't change length when
casemapped or folded, because they quite obviously do. You can't let
equalsIgnoreCase() short-circuit to failure because the strings have
different lengths. That is an ASCII mindset for characters. It
is contrary to a Unicode mindset for strings.
Plus you are supposed to be using (full) case*folds* not casemaps, which
are quite different from one another. Here is an example:
    Original:               weiß    WEIẞ
    Simple casemaps for chars:
        lowercase           weiß    weiß
        uppercase           WEIß    WEIẞ
    Full casemaps for strings:
        lowercase           weiß    weiß
        uppercase           WEISS   WEIẞ
    Casefolds:
        fold simple         weiß    weiß
        fold full           weiss   weiss
The final line is the most important: it shows that the two strings are
the same because they have the same full casefold. Q.E.D.
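The casemap rows can be reproduced with the JDK alone, which may make the
simple/full distinction more concrete. In this sketch the class and method
names are invented for illustration. "Simple" is taken to mean mapping each
code point on its own with the Character methods, and "full" to mean the
String methods under Locale.ROOT.

```java
import java.util.Locale;

// Hypothetical sketch: names are invented for illustration.
final class CaseMapDemo {

    // Simple uppercase: one code point in, one code point out.
    static String simpleUpper(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toUpperCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Simple lowercase: one code point in, one code point out.
    static String simpleLower(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    // Full uppercase: the String method may change the length (sharp s becomes SS).
    static String fullUpper(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String small = "wei\u00DF";   // LATIN SMALL LETTER SHARP S
        String big = "WEI\u1E9E";     // LATIN CAPITAL LETTER SHARP S
        System.out.println(simpleUpper(small).length()); // same length as the input
        System.out.println(fullUpper(small).length());   // one longer than the input
        System.out.println(simpleLower(big).equals(small));
    }
}
```

Note that neither kind of casemap makes the two original strings equal to
each other in both directions; only the fold does that.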
To compare whether two strings are the same without respect to case, you
must first calculate their respective Unicode casefolds (not casemaps!)
and then compare those. You must not compare either the original strings
or casemaps generated from those, as both of those can give wrong answers.
You must use casefolds.
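To show the shape of "fold first, then compare" without any library, here
is a sketch that uses only the JDK. Be warned that approxFold() below is
NOT a Unicode casefold: it is a stand-in I am assuming for illustration,
built by chaining the full String casemaps under Locale.ROOT, and it will
disagree with the real fold for some characters. With ICU on the classpath,
UCharacter.foldCase(s, true) is the real thing and should be used instead.
The class and method names are invented.

```java
import java.util.Locale;

// Hypothetical sketch: names are invented, and approxFold() is only an approximation.
final class FoldCompare {

    // ASSUMPTION: lower, then upper, then lower again, as a rough substitute
    // for full case folding. The first lowering turns capital sharp s into
    // small sharp s, the uppercasing expands that to SS, and the final
    // lowering brings everything to one case.
    static String approxFold(String s) {
        return s.toLowerCase(Locale.ROOT)
                .toUpperCase(Locale.ROOT)
                .toLowerCase(Locale.ROOT);
    }

    // Compare the folds, never the original strings, and never their lengths.
    static boolean equalsFolded(String a, String b) {
        return approxFold(a).equals(approxFold(b));
    }

    public static void main(String[] args) {
        String a = "wei\u00DF";   // LATIN SMALL LETTER SHARP S
        String b = "WEISS";
        String c = "WEI\u1E9E";   // LATIN CAPITAL LETTER SHARP S
        System.out.println(approxFold(a));
        System.out.println(equalsFolded(a, b));
        System.out.println(equalsFolded(a, c));
    }
}
```

The strings being compared have different lengths before folding, so any
implementation that bails out early on a length mismatch can never get
this right.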
But there is nothing in String or Character that gives you the casefold.
There needs to be. Character should provide the simple casefold and
String should provide the full casefold. (I don't know what to do about
locales and Turkish casefolds.)
Demo program enclosed. I compare results from Java's
String.equalsIgnoreCase() with results from ICU's
CaseInsensitiveString.equals().
Make sure your classpath has the (current) ICU library in it,
and make sure to compile with "javac -encoding UTF-8".
Hope this helps.
--tom
import java.io.*;
import com.ibm.icu.util.CaseInsensitiveString;

public class weiss {
    private static BufferedReader stdin;
    private static PrintStream stdout, stderr;

    // Compare s1 and s2 without regard to case, once with the JDK and once with ICU.
    public static void eqtest(String s1, String s2) {
        CaseInsensitiveString si1 = new CaseInsensitiveString(s1);
        CaseInsensitiveString si2 = new CaseInsensitiveString(s2);
        stdout.printf("%s: Java %s equals %s\n",
            s1.equalsIgnoreCase(s2) ? "pass" : "FAIL", s1, s2);
        // Compare the two ICU wrappers with each other, not a wrapper with a raw String.
        stdout.printf("%s: ICU %s equals %s\n\n",
            si1.equals(si2) ? "pass" : "FAIL", s1, s2);
    }

    public static void main(String argv[]) {
        try {
            stdin = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            stdout = new PrintStream(System.out, true, "UTF-8");
            stderr = new PrintStream(System.err, true, "UTF-8");
        } catch (IOException hosed) {
            System.err.printf("weiss: error setting std streams to UTF-8: %s.\n",
                hosed.getMessage());
            System.exit(1);
        }
        eqtest("ẛ", "Ṡ");
        //   "\N{LATIN SMALL LETTER LONG S WITH DOT ABOVE}",
        //   "\N{LATIN CAPITAL LETTER S WITH DOT ABOVE}"
        eqtest("µ", "Μ");
        //   "\N{MICRO SIGN}", "\N{GREEK CAPITAL LETTER MU}"
        eqtest("µ", "μ");
        //   "\N{MICRO SIGN}", "\N{GREEK SMALL LETTER MU}"
        eqtest("ƦᴀƦᴇ", "ʀᴀʀᴇ");
        //   "\N{LATIN LETTER YR}\N{LATIN LETTER SMALL CAPITAL A}
        //    \N{LATIN LETTER YR}\N{LATIN LETTER SMALL CAPITAL E}",
        //   "\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL A}
        //    \N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL E}"
        eqtest("e\uFB03cient", "EFFICIENT");
        //   "e\N{LATIN SMALL LIGATURE FFI}cient", "EFFICIENT"
        eqtest("flour and water", "FLOUR AND WATER");
        //   "flour and water", "FLOUR AND WATER"
        eqtest("I WORK AT Ⓚ", "i work at ⓚ");
        //   "I WORK AT \N{CIRCLED LATIN CAPITAL LETTER K}",
        //   "i work at \N{CIRCLED LATIN SMALL LETTER K}"
        eqtest("HENRY Ⅷ", "henry ⅷ");
        //   "HENRY \N{ROMAN NUMERAL EIGHT}", "henry \N{SMALL ROMAN NUMERAL EIGHT}"
        // Classic German ligature issues
        eqtest("tschüß", "TSCHÜSS");
        //   "tsch\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LETTER SHARP S}",
        //   "TSCH\N{LATIN CAPITAL LETTER U WITH DIAERESIS}SS"
        // The capital version is from Unicode 5.1, which is very old now
        eqtest("weiß", "WEIẞ");
        //   "wei\N{LATIN SMALL LETTER SHARP S}", "WEI\N{LATIN CAPITAL LETTER SHARP S}"
        eqtest("weiß", "WEISS");
        //   "wei\N{LATIN SMALL LETTER SHARP S}", "WEISS"
        eqtest("weiss", "WEIẞ");
        //   "weiss", "WEI\N{LATIN CAPITAL LETTER SHARP S}"
        // English ligature issues
        eqtest("poſt", "post");
        //   "po\N{LATIN SMALL LETTER LONG S}t", "post"
        eqtest("po\uFB05", "post");
        //   "po\N{LATIN SMALL LIGATURE LONG S T}", "post"
        // Deseret is the only non-BMP case-changing script (as of Unicode 6.0)
        eqtest("𐐔𐐇𐐝𐐀𐐡𐐇𐐓", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻");
        //   "\N{DESERET CAPITAL LETTER DEE}\N{DESERET CAPITAL LETTER SHORT E}
        //    \N{DESERET CAPITAL LETTER ES}\N{DESERET CAPITAL LETTER LONG I}
        //    \N{DESERET CAPITAL LETTER ER}\N{DESERET CAPITAL LETTER SHORT E}
        //    \N{DESERET CAPITAL LETTER TEE}",
        //   "\N{DESERET SMALL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}
        //    \N{DESERET SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}
        //    \N{DESERET SMALL LETTER ER}\N{DESERET SMALL LETTER SHORT E}
        //    \N{DESERET SMALL LETTER TEE}"
        // Greek simple casefolding tests
        eqtest("στιγμας", "στιγμασ");
        //   "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}
        //    \N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}
        //    \N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}
        //    \N{GREEK SMALL LETTER FINAL SIGMA}",
        //   "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}
        //    \N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}
        //    \N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}
        //    \N{GREEK SMALL LETTER SIGMA}"
        eqtest("στιγμας", "ΣΤΙΓΜΑΣ");
        //   "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}
        //    \N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}
        //    \N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}
        //    \N{GREEK SMALL LETTER FINAL SIGMA}",
        //   "\N{GREEK CAPITAL LETTER SIGMA}\N{GREEK CAPITAL LETTER TAU}
        //    \N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER GAMMA}
        //    \N{GREEK CAPITAL LETTER MU}\N{GREEK CAPITAL LETTER ALPHA}
        //    \N{GREEK CAPITAL LETTER SIGMA}"
        // Greek full casefolding tests
        eqtest("ᾲ", "ᾺΙ");
        //   "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}",
        //   "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
        eqtest("ᾲ στο διάολο", "Ὰͅ Στο Διάολο");
        //   "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}
        //    \N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}
        //    \N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}
        //    \N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}",
        //   "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}
        //    \N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}
        //    \N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}
        //    \N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}"
        eqtest("ᾲ στο διάολο", "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ");
        //   "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}
        //    \N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}
        //    \N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}
        //    \N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}",
        //   "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA} \N{GREEK CAPITAL LETTER SIGMA}
        //    \N{GREEK CAPITAL LETTER TAU}\N{GREEK CAPITAL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}
        //    \N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER ALPHA WITH TONOS}
        //    \N{GREEK CAPITAL LETTER OMICRON}\N{GREEK CAPITAL LETTER LAMDA}\N{GREEK CAPITAL LETTER OMICRON}"
        // Unicode 6.0.0 case-changing code point
        eqtest("ԦԦ", "ԧԧ");
        //   "\N{CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER}
        //    \N{CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER}",
        //   "\N{CYRILLIC SMALL LETTER SHHA WITH DESCENDER}
        //    \N{CYRILLIC SMALL LETTER SHHA WITH DESCENDER}"
    }
}