I've discovered some errors in Java's case-insensitive methods for its String class. Its equalsIgnoreCase() is the most obvious one that gets things wrong, but there are several others as well.
There is inarguably at least one significant bug, and quite plausibly several others as well. I've looked at the JDK7 source, and these remain buggy.

I enclose a testing program to illustrate these bugs. It runs the tests against both the JDK and the equivalent ICU function. The JDK gets many of them wrong, while ICU gets them all correct.

--tom

==TECHNICAL DETAILS FOLLOW==

The source of these bugs is the many, many ASCII assumptions regarding casing, assumptions that do not hold for Unicode. There are also holdover bugs due to ignorance about Unicode outside the BMP that come from Unicode 1's 16-bitness, which no longer applies.

The easiest bug to illustrate is that

    "𐐔𐐇𐐝𐐀𐐡𐐇𐐓".equalsIgnoreCase("𐐼𐐯𐑅𐐨𐑉𐐯𐐻")

erroneously returns false when Unicode demands that it return true.

The bug here is that regionMatches() thinks that a String comprises a sequence of 16-bit Unicode characters. It does not. Those are 16-bit char units (UTF-16 code units), not Unicode characters; a Unicode character is a 21-bit quantity normally rounded up to 32 bits, not down to 16. The example strings I just used are in Deseret, which is a case-changing script in the SMP, not in the BMP. Each Deseret letter is therefore stored as a surrogate pair of two chars, and Character.toUpperCase(char) leaves a lone surrogate unchanged. Therefore, the char-based monkey work is broken.

You can see how broken this is right here, from regionMatches:

    while (len-- > 0) {
        char c1 = ta[to++];
        char c2 = pa[po++];
        if (c1 == c2) {
            continue;
        }
        if (ignoreCase) {
            // If characters don't match but case may be ignored,
            // try converting both characters to uppercase.
            // If the results match, then the comparison scan should
            // continue.
            char u1 = Character.toUpperCase(c1);
            char u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            // Unfortunately, conversion to uppercase does not work properly
            // for the Georgian alphabet, which has strange rules about case
            // conversion. So we need to make one last check before
            // exiting.
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
        }
        return false;
    }

That is the source of this bug, and several others I shall describe.
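To make the char-versus-code-point distinction concrete, here is a minimal sketch of the same loop walking code points instead of chars. The class name CodePointCompare and the helpers equalsIgnoreCaseByCodePoint() and fromCodePoints() are hypothetical names made up for this illustration, not JDK methods. It relies only on the int overloads of Character.toUpperCase() and Character.toLowerCase(), and it still uses simple one-to-one case mappings, so it addresses only the non-BMP problem and not the length-changing cases.

```java
final class CodePointCompare {

    // Hypothetical helper (not a JDK method): the regionMatches() loop rewritten
    // to step through code points rather than 16-bit chars. It still applies
    // simple one-to-one case mappings, so it cannot match strings whose
    // casefolds differ in length.
    static boolean equalsIgnoreCaseByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int c1 = a.codePointAt(i);
            int c2 = b.codePointAt(j);
            i += Character.charCount(c1);   // 1 char in the BMP, 2 for a surrogate pair
            j += Character.charCount(c2);
            if (c1 == c2) {
                continue;
            }
            int u1 = Character.toUpperCase(c1);
            int u2 = Character.toUpperCase(c2);
            if (u1 == u2) {
                continue;
            }
            if (Character.toLowerCase(u1) == Character.toLowerCase(u2)) {
                continue;
            }
            return false;
        }
        // Equal only if both strings were consumed completely.
        return i == a.length() && j == b.length();
    }

    // Hypothetical helper: build a String from code points so that this file
    // stays ASCII-only.
    static String fromCodePoints(int... cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        // The Deseret capitals DEE, SHORT E, ES, LONG I, ER, SHORT E, TEE and
        // their small counterparts (each small letter is the capital + 0x28).
        String upper = fromCodePoints(0x10414, 0x10407, 0x1041D, 0x10400, 0x10421, 0x10407, 0x10413);
        String lower = fromCodePoints(0x1043C, 0x1042F, 0x10445, 0x10428, 0x10449, 0x1042F, 0x1043B);
        System.out.println(upper.length());                            // 14 chars
        System.out.println(upper.codePointCount(0, upper.length()));   // 7 code points
        System.out.println(equalsIgnoreCaseByCodePoint(upper, lower)); // true
    }
}
```

The seven Deseret letters occupy fourteen chars, which is why a loop that uppercases one char at a time never sees a cased letter at all.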
regionMatches() needs to be reworked so that it actually works with all Unicode strings. You cannot store Unicode characters in Java char variables, and you must not call the char-based Character casing methods, since they don't work for most of Unicode's range.

Next, if you are comparing strings, you should not be using simple case maps, and you must not assume that strings don't change length when casemapped or folded, because they quite obviously do. You can't let equalsIgnoreCase() short-circuit to failure because the strings have different lengths. That is an ASCII mindset for characters. It is contrary to a Unicode mindset for strings.

Plus you are supposed to be using (full) case*folds*, not casemaps, which are quite different from one another. Here is an example:

    Original:                     weiß    WEIẞ

    Simple casemaps for chars:
        lowercase                 weiß    weiß
        uppercase                 WEIß    WEIẞ

    Full casemaps for strings:
        lowercase                 weiß    weiß
        uppercase                 WEISS   WEIẞ

    Casefolds:
        fold simple               weiß    weiß
        fold full                 weiss   weiss

The final line is the most important: it shows that the two strings are the same because they have the same full casefold. Q.E.D.

To compare whether two strings are the same without respect to case, you must first calculate their respective Unicode casefolds (not casemaps!) and then compare those. You must not compare either the original strings or casemaps generated from those, as both of those can give wrong answers. You must use casefolds.

But there is nothing in String or Character that gives you the casefold. There needs to be. Character should provide the simple casefold and String should provide the full casefold. (I don't know what to do about locales and Turkish casefolds.)

Demo program enclosed. I compare results from Java's String.equalsIgnoreCase() with results from ICU's CaseInsensitiveString.equals(). Make sure your classpath has the (current) ICU library in it, and make sure to compile with "javac -encoding UTF-8".

Hope this helps.

--tom
import java.io.*;

import com.ibm.icu.util.CaseInsensitiveString;

public class weiss {

    private static BufferedReader stdin;
    private static PrintStream stdout, stderr;

    // Compare s1 and s2 without regard to case, first with the JDK and then with ICU.
    public static void eqtest(String s1, String s2) {
        CaseInsensitiveString si1 = new CaseInsensitiveString(s1);
        CaseInsensitiveString si2 = new CaseInsensitiveString(s2);
        stdout.printf("%s: Java %s equals %s\n", s1.equalsIgnoreCase(s2) ? "pass" : "FAIL", s1, s2);
        // Compare wrapper to wrapper so that ICU casefolds both sides.
        stdout.printf("%s: ICU %s equals %s\n\n", si1.equals(si2) ? "pass" : "FAIL", s1, s2);
    }

    public static void main(String[] argv) {
        try {
            stdin = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            stdout = new PrintStream(System.out, true, "UTF-8");
            stderr = new PrintStream(System.err, true, "UTF-8");
        } catch (IOException hosed) {
            // The format has two %s slots: the program name and the error message.
            System.err.printf("%s: error setting std streams to UTF-8: %s.\n", "weiss", hosed.getMessage());
            System.exit(1);
        }

        eqtest("ẛ", "Ṡ"); // "\N{LATIN SMALL LETTER LONG S WITH DOT ABOVE}", "\N{LATIN CAPITAL LETTER S WITH DOT ABOVE}"
        eqtest("µ", "Μ"); // "\N{MICRO SIGN}", "\N{GREEK CAPITAL LETTER MU}"
        eqtest("µ", "μ"); // "\N{MICRO SIGN}", "\N{GREEK SMALL LETTER MU}"
        eqtest("ƦᴀƦᴇ", "ʀᴀʀᴇ"); // "\N{LATIN LETTER YR}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER YR}\N{LATIN LETTER SMALL CAPITAL E}", "\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL A}\N{LATIN LETTER SMALL CAPITAL R}\N{LATIN LETTER SMALL CAPITAL E}"
        eqtest("eﬃcient", "EFFICIENT"); // "e\N{LATIN SMALL LIGATURE FFI}cient", "EFFICIENT"
        eqtest("flour and water", "FLOUR AND WATER"); // "flour and water", "FLOUR AND WATER"
        eqtest("I WORK AT Ⓚ", "i work at ⓚ"); // "I WORK AT \N{CIRCLED LATIN CAPITAL LETTER K}", "i work at \N{CIRCLED LATIN SMALL LETTER K}"
        eqtest("HENRY Ⅷ", "henry ⅷ"); // "HENRY \N{ROMAN NUMERAL EIGHT}", "henry \N{SMALL ROMAN NUMERAL EIGHT}"

        // Classic German ligature issues
        eqtest("tschüß", "TSCHÜSS"); // "tsch\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LETTER SHARP S}", "TSCH\N{LATIN CAPITAL LETTER U WITH DIAERESIS}SS"
        // the capital version is from Unicode 5.1, which is very old now
        eqtest("weiß", "WEIẞ"); // "wei\N{LATIN SMALL LETTER SHARP S}", "WEI\N{LATIN CAPITAL LETTER SHARP S}"
        eqtest("weiß", "WEISS"); // "wei\N{LATIN SMALL LETTER SHARP S}", "WEISS"
        eqtest("weiss", "WEIẞ"); // "weiss", "WEI\N{LATIN CAPITAL LETTER SHARP S}"

        // English ligature issues
        eqtest("poſt", "post"); // "po\N{LATIN SMALL LETTER LONG S}t", "post"
        eqtest("poﬅ", "post"); // "po\N{LATIN SMALL LIGATURE LONG S T}", "post"

        // Deseret is only non-BMP case changing script
        eqtest("𐐔𐐇𐐝𐐀𐐡𐐇𐐓", "𐐼𐐯𐑅𐐨𐑉𐐯𐐻"); // "\N{DESERET CAPITAL LETTER DEE}\N{DESERET CAPITAL LETTER SHORT E}\N{DESERET CAPITAL LETTER ES}\N{DESERET CAPITAL LETTER LONG I}\N{DESERET CAPITAL LETTER ER}\N{DESERET CAPITAL LETTER SHORT E}\N{DESERET CAPITAL LETTER TEE}", "\N{DESERET SMALL LETTER DEE}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER ES}\N{DESERET SMALL LETTER LONG I}\N{DESERET SMALL LETTER ER}\N{DESERET SMALL LETTER SHORT E}\N{DESERET SMALL LETTER TEE}"

        // Greek simple casefolding tests
        eqtest("στιγμας", "στιγμασ"); // "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER FINAL SIGMA}", "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER SIGMA}"
        eqtest("στιγμας", "ΣΤΙΓΜΑΣ"); // "\N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER GAMMA}\N{GREEK SMALL LETTER MU}\N{GREEK SMALL LETTER ALPHA}\N{GREEK SMALL LETTER FINAL SIGMA}", "\N{GREEK CAPITAL LETTER SIGMA}\N{GREEK CAPITAL LETTER TAU}\N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER GAMMA}\N{GREEK CAPITAL LETTER MU}\N{GREEK CAPITAL LETTER ALPHA}\N{GREEK CAPITAL LETTER SIGMA}"

        // Greek full casefolding tests
        eqtest("ᾲ", "ᾺΙ"); // "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}", "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
        eqtest("ᾲ στο διάολο", "Ὰͅ Στο Διάολο"); // "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}", "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}"
        eqtest("ᾲ στο διάολο", "ᾺΙ ΣΤΟ ΔΙΆΟΛΟ"); // "\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI} \N{GREEK SMALL LETTER SIGMA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER OMICRON} \N{GREEK SMALL LETTER DELTA}\N{GREEK SMALL LETTER IOTA}\N{GREEK SMALL LETTER ALPHA WITH TONOS}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER LAMDA}\N{GREEK SMALL LETTER OMICRON}", "\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA} \N{GREEK CAPITAL LETTER SIGMA}\N{GREEK CAPITAL LETTER TAU}\N{GREEK CAPITAL LETTER OMICRON} \N{GREEK CAPITAL LETTER DELTA}\N{GREEK CAPITAL LETTER IOTA}\N{GREEK CAPITAL LETTER ALPHA WITH TONOS}\N{GREEK CAPITAL LETTER OMICRON}\N{GREEK CAPITAL LETTER LAMDA}\N{GREEK CAPITAL LETTER OMICRON}"

        // Unicode 6.0.0 case-changing code point
        eqtest("ԦԦ", "ԧԧ"); // "\N{CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER}\N{CYRILLIC CAPITAL LETTER SHHA WITH DESCENDER}", "\N{CYRILLIC SMALL LETTER SHHA WITH DESCENDER}\N{CYRILLIC SMALL LETTER SHHA WITH DESCENDER}"
    }
}
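
For readers without ICU on hand, the fold-then-compare rule argued for in the technical details can be roughly sketched with the JDK alone. The class ApproxFold and its methods approxFold() and foldedEquals() are hypothetical names invented for this illustration. Chaining lowercase, uppercase, lowercase through String's full case mappings is only an approximation of a full casefold and is not guaranteed to agree with Unicode's CaseFolding.txt for every character; with ICU, UCharacter.foldCase(String, boolean) is the real thing. Locale.ROOT is used here to sidestep locale-specific mappings such as the Turkish ones.

```java
import java.util.Locale;

final class ApproxFold {

    // Hypothetical helper: approximate a full casefold with the JDK's full
    // String case mappings. The leading toLowerCase() matters: it turns
    // U+1E9E (capital sharp s) into U+00DF, which toUpperCase() then expands to "SS".
    static String approxFold(String s) {
        return s.toLowerCase(Locale.ROOT)
                .toUpperCase(Locale.ROOT)
                .toLowerCase(Locale.ROOT);
    }

    // Hypothetical helper: compare the folds, never the original strings.
    static boolean foldedEquals(String a, String b) {
        return approxFold(a).equals(approxFold(b));
    }

    public static void main(String[] args) {
        String lower = "wei\u00DF";     // "weiss" spelled with U+00DF, small sharp s
        String capSharp = "WEI\u1E9E";  // "WEISS" spelled with U+1E9E, capital sharp s
        String upper = "WEISS";

        // The originals differ in length (4 vs 5 chars), so a length check must
        // not short-circuit the comparison.
        System.out.println(lower.length() + " vs " + upper.length());

        System.out.println(approxFold(lower));               // weiss
        System.out.println(foldedEquals(lower, upper));      // true
        System.out.println(foldedEquals(lower, capSharp));   // true
        System.out.println(foldedEquals("weiss", capSharp)); // true
    }
}
```

All three spellings reach the same five-letter fold even though the originals have different lengths.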