In preparation for a short talk I'm to give in Portland next week about how different operating systems, filesystems, and languages (including but not limited to regexes) handle Unicode, I got to thinking about normalization issues. And I think I've found a Java compiler bug, or at best, an infelicity in a grey area.
It's no consolation, but Perl has exactly the same problem (well, pair of problems) as Java has here. We do the same thing as Java, which I think is the Wrong Thing, and we are also at the mercy of our filesystem for the mapping of classnames to filesystem objects, which is even worse. I would like someone to tell me why Java shouldn't be fixed to cope with these matters, both for internal identifiers and for those that exist outside Java proper, in the filesystem (classnames).

After reading about the differences between how Apple and Sun did normalization in the filesystem:

    http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf

I wondered what impact this might/must have on Java. After all, classnames must map to filesystem entries, and therefore if the system is doing any kind of normalization, you're going to have Issues.

Apple runs everything through NFD (well, nearly), whereas the Sun paper cited, from about five years ago, says that they plan to do something analogous to how "case-preserving but case-insensitive" filesystems behave: that is, they'll let you create anything you want, but won't let you create a new entry in a directory if its name is canonically equivalent to that of an existing entry.

Before I went so far as to test this on Apple and Sun machines, let alone others, I thought I would just try my test on local variables instead. I have now tested it on Sun, Apple, and Linux, including various versions of the compiler, and they all report the same thing. And the thing they report, I feel, is wrong, because I know that it will not work this way for class names the way it does for local variables.

I will include the source code twice, once as plain text so you can read it, and once as an octet stream lest a "helpful" mailer decide it should be normalizing things that pass through it, an evil that the Apple mouse will do to you, believe it or not.
If you put this wicked file in a file called "nftest.java" and run this command:

    $ javac -encoding UTF-8 nftest.java && java nftest

Then you will get this output:

    élève is 1.
    élève is 2.
    élève is 3.
    élève is 4.

Those probably look the same. Running them through `uniquote -x` shows though that they are not:

    \x{E9}l\x{E8}ve is 1.
    e\x{301}le\x{300}ve is 2.
    \x{E9}le\x{300}ve is 3.
    e\x{301}l\x{E8}ve is 4.

See the difference? Those are variable names, and I do not think Java should permit duplicate variable names that differ only in normalization, since it obviously cannot be permitted to do so for classnames, and it feels hackish to have different identifier rules for classnames than for other identifiers.

Is this a bug? If so, are there plans to address it? And what about the filesystem?

I am unaware of any document in The Unicode Standard that references either or both of these issues; if any such exist, kindly point me at them. My hunch is that these two problems, even though they are completely consequential to Unicode, exist beyond the proper purview of The Unicode Standard itself. But that doesn't absolve us from solving them.

Has this been previously discussed, and if so, what if any decision was made regarding these two interrelated problems?

Thank you very much.

--tom

PS: The MIME contents of this message are as follows:

     msg part  type/subtype              size description
       1       multipart/mixed           8904
         1     text/plain                4071 a letter from tchrist
         2     application/octet-stream  1560 the nftest(-v1).java program as octets
                                              name="nftest-v1.java" filename="nftest-v1.java"
         3     text/plain                1560 the nftest(-v2).java program as plain text
nftest-v1.java
Description: the nftest(-v1).java program as octets
/*
 * nftest.java
 * Tom Christiansen <tchr...@perl.com>
 * Tue Jul 19 08:13:29 MDT 2011
 *
 * This tests whether Java normalizes its variable names.
 * We will use four different canonically equivalent strings,
 * and see if we can get four different answers, or a compilation
 * error.
 *
 *  N  String  As a literal            Graphemes  Chars  Norm?
 *  =============================================================
 *  1  élève   "\x{E9}l\x{E8}ve"       5          5      NFC
 *  2  élève   "e\x{301}le\x{300}ve"   5          7      NFD
 *  3  élève   "\x{E9}le\x{300}ve"     5          6      mixed
 *  4  élève   "e\x{301}l\x{E8}ve"     5          6      mixed
 */

import java.io.*;

public class nftest {
    static PrintStream stdout;

    public static void main(String argv[]) throws IOException {
        // Four declarations whose names differ only in normalization form.
        int élève = 1;  // "\x{E9}l\x{E8}ve"     NFC
        int élève = 2;  // "e\x{301}le\x{300}ve" NFD
        int élève = 3;  // "\x{E9}le\x{300}ve"   mixed
        int élève = 4;  // "e\x{301}l\x{E8}ve"   mixed

        // Print as UTF-8 regardless of the platform default encoding.
        stdout = new PrintStream(System.out, true, "UTF-8");

        stdout.printf("%s is %d.\n", "élève", élève);  // "\x{E9}l\x{E8}ve"     NFC
        stdout.printf("%s is %d.\n", "élève", élève);  // "e\x{301}le\x{300}ve" NFD
        stdout.printf("%s is %d.\n", "élève", élève);  // "\x{E9}le\x{300}ve"   mixed
        stdout.printf("%s is %d.\n", "élève", élève);  // "e\x{301}l\x{E8}ve"   mixed
    }
}
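
For anyone who wants to check the table in that header comment without trusting their mailer, here is a minimal sketch using java.text.Normalizer from the standard library. It is not part of nftest.java. The class name NormalizationCheck and its helper methods are made up for illustration. The four spellings are written with Unicode escapes inside string literals, so nothing in transit can quietly renormalize them. It only shows that the four names are canonically equivalent as strings; it says nothing about what javac itself does with them.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class for illustration; not part of nftest.java.
class NormalizationCheck {
    // The four spellings from the nftest.java table, given as escapes
    // in string literals so they survive any renormalizing transport.
    static final String[] SPELLINGS = {
        "\u00E9l\u00E8ve",      // 1: NFC
        "e\u0301le\u0300ve",    // 2: NFD
        "\u00E9le\u0300ve",     // 3: mixed
        "e\u0301l\u00E8ve",     // 4: mixed
    };

    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Two strings are canonically equivalent if their NFD forms are identical.
    static boolean canonicallyEquivalent(String a, String b) {
        return nfd(a).equals(nfd(b));
    }

    // Counts how many distinct names remain once everything is put into NFC.
    static int distinctAfterNfc(String[] names) {
        Set<String> seen = new LinkedHashSet<>();
        for (String name : names) {
            seen.add(nfc(name));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        for (int i = 0; i < SPELLINGS.length; i++) {
            String s = SPELLINGS[i];
            System.out.printf("%d: chars=%d isNFC=%b isNFD=%b%n",
                i + 1,
                s.length(),
                Normalizer.isNormalized(s, Normalizer.Form.NFC),
                Normalizer.isNormalized(s, Normalizer.Form.NFD));
        }
        System.out.printf("distinct as written: %d%n",
            new LinkedHashSet<>(java.util.Arrays.asList(SPELLINGS)).size());
        System.out.printf("distinct after NFC:  %d%n",
            distinctAfterNfc(SPELLINGS));
    }
}
```

A compiler or filesystem that wanted to refuse canonically equivalent duplicates could, in principle, apply the same kind of comparison to names before accepting a new one; whether it should use NFC, NFD, or something else is exactly the open question.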