I’m implementing a Unicode names library. I’m confused about loose 
character-name matching, even after rereading The Unicode Standard § 4.8, UAX 
#34 § 4, and UAX #44 § 5.9.2, as well as 
[L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), 
[L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and the 
[meeting in which those two items were 
resolved](https://www.unicode.org/L2/L2014/14026.htm).

In particular, I’m confused by the following claim in The Unicode Standard § 4.8: 
“Because Unicode character names do not contain any underscore (“_”) 
characters, a common strategy is to replace any hyphen-minus or space in a 
character name by a single “_” when constructing a formal identifier from a 
character name. This strategy automatically results in a syntactically correct 
identifier in most formal languages. Furthermore, such identifiers are 
guaranteed to be unique, because of the special rules for character name 
matching.”
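
To make the quoted strategy concrete, here is a minimal Python sketch of it. The function name `name_to_identifier` is hypothetical, and the character names are taken from Python's standard `unicodedata` module rather than from any particular names library.

```python
import unicodedata


def name_to_identifier(name: str) -> str:
    """Replace every hyphen-minus or space in a character name with one "_".

    This is only the replacement strategy quoted from section 4.8; it does
    no loose matching of any kind.
    """
    return name.replace("-", "_").replace(" ", "_")


# Two pairs of characters whose names differ only by a hyphen-minus.
for code_point in (0x0F68, 0x0F60, 0x116C, 0x1180):
    name = unicodedata.name(chr(code_point))
    print(f"U+{code_point:04X}  {name}  ->  {name_to_identifier(name)}")
```

With this sketch, TIBETAN LETTER A becomes TIBETAN_LETTER_A, while TIBETAN LETTER -A becomes TIBETAN_LETTER__A, so the two identifiers differ only in the number of underscores. Likewise, HANGUL JUNGSEONG OE becomes HANGUL_JUNGSEONG_OE and HANGUL JUNGSEONG O-E becomes HANGUL_JUNGSEONG_O_E.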

I’m also confused by the relationship between UAX34-R3 and UAX44-LM2.

To make these issues concrete, let’s say that my library provides a function 
called getCharacter that takes a name argument, tries to find a loosely 
matching character, and returns it (or a null value if no character currently 
matches loosely). What should the following expressions return?

getCharacter("HANGUL-JUNGSEONG-O-E")

getCharacter("HANGUL_JUNGSEONG_O_E")

getCharacter("HANGUL_JUNGSEONG_O_E_")

getCharacter("HANGUL_JUNGSEONG_O__E")

getCharacter("HANGUL_JUNGSEONG_O_-E")

getCharacter("HANGUL JUNGSEONGCHARACTERO E")

getCharacter("HANGUL JUNGSEONG CHARACTER OE")

getCharacter("TIBETAN_LETTER_A")

getCharacter("TIBETAN_LETTER__A")

getCharacter("TIBETAN_LETTER _A")

getCharacter("TIBETAN_LETTER_-A")
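
For reference, the following Python sketch shows what one literal reading of UAX44-LM2 (ignore case, whitespace, underscore, and all medial hyphens except the hyphen in U+1180) would return for these expressions. Everything in it is an assumption made for illustration: the names `loose_key` and `get_character` are hypothetical, "medial" is taken to mean a hyphen-minus with a letter or digit on both sides, the folding order (hyphens, then whitespace and underscores, then case) is one possible choice, and the index covers only four sample characters instead of the full repertoire. It is not meant as the answer to the question, only as a way to see where the readings diverge.

```python
import re
import unicodedata
from typing import Optional

# Assumption: a "medial" hyphen has a letter or digit directly on both sides.
MEDIAL_HYPHEN = re.compile(r"(?<=[0-9A-Za-z])-(?=[0-9A-Za-z])")
SPACE_OR_UNDERSCORE = re.compile(r"[\s_]")

# Key reserved for U+1180 HANGUL JUNGSEONG O-E, whose hyphen is kept.
U1180_KEY = "hanguljungseongo-e"


def loose_key(name: str) -> str:
    """Fold a name into a comparison key under one reading of UAX44-LM2."""
    key = MEDIAL_HYPHEN.sub("", name)        # 1. drop medial hyphens
    key = SPACE_OR_UNDERSCORE.sub("", key)   # 2. drop whitespace and "_"
    key = key.lower()                        # 3. fold case
    # Exception: if the input had a hyphen between the final O and E, keep it,
    # so that U+1180 stays distinct from U+116C HANGUL JUNGSEONG OE.
    squeezed = SPACE_OR_UNDERSCORE.sub("", name).lower()
    if key == "hanguljungseongoe" and squeezed.endswith("o-e"):
        return U1180_KEY
    return key


# Illustrative index over four sample characters only.
SAMPLE_CODE_POINTS = (0x0F60, 0x0F68, 0x116C, 0x1180)
INDEX = {
    loose_key(unicodedata.name(chr(cp))): chr(cp) for cp in SAMPLE_CODE_POINTS
}


def get_character(name: str) -> Optional[str]:
    """Return the loosely matching sample character, or None."""
    return INDEX.get(loose_key(name))


QUERIES = [
    "HANGUL-JUNGSEONG-O-E",
    "HANGUL_JUNGSEONG_O_E",
    "HANGUL_JUNGSEONG_O_E_",
    "HANGUL_JUNGSEONG_O__E",
    "HANGUL_JUNGSEONG_O_-E",
    "HANGUL JUNGSEONGCHARACTERO E",
    "HANGUL JUNGSEONG CHARACTER OE",
    "TIBETAN_LETTER_A",
    "TIBETAN_LETTER__A",
    "TIBETAN_LETTER _A",
    "TIBETAN_LETTER_-A",
]

for query in QUERIES:
    match = get_character(query)
    result = "None" if match is None else f"U+{ord(match):04X}"
    print(f"{query!r} -> {result}")
```

Under this particular reading, HANGUL_JUNGSEONG_O_E (the identifier that the § 4.8 replacement strategy would derive from the name of U+1180) resolves to U+116C HANGUL JUNGSEONG OE, and TIBETAN_LETTER__A (derived the same way from U+0F60 TIBETAN LETTER -A) resolves to U+0F68 TIBETAN LETTER A. A different order of folding steps, or a different definition of "medial", would change some of these results.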

Thanks,
J. S. Choi
