Hi James,

This is a supplement to what Morimoto-san wrote. I'm Japanese too.
You must use readings for ordering. Japanese dictionary order is based on
pronunciation, not on the written characters.
In English, 'food' is placed near 'foot'.
In Japanese, 'quick' (kwik) would be placed near 'cuisine' (kwizi:n).
(Hmm, this is not a good example...)

The most important point is something else, though.
Even Japanese people can NOT work out the correct pronunciation from a
string written in kanji alone. It may sound funny, but it is true.
(It is the same problem James mentioned with 大きい and 大学.)

For example, '角田 純子' (a woman's name) has many readings.
The last name (角田) can be read Kakuta/Kakuda/Sumita/Tsunoda (and maybe more!!).
The first name (純子) can be read Junko/Sumiko.
And the combinations of these last/first names are .......

The standard Morimoto-san mentioned is very complicated.
Many Japanese people don't know about it, and some dictionary publishers
use a different style. :-(

If you can get the reading as a string in hiragana (or katakana), sorting
it by Unicode order gives a result that is not so far off.
And if your OS has a correct locale, simply use locale.strcoll() on the
reading strings.

I tested on Windows XP (Japanese codepage 932 console) and FreeBSD 7
(Japanese UTF-8 locale console).
For Japanese strings, WinXP is OK and FreeBSD is NG.
For German strings with umlauts, WinXP is NG and FreeBSD is OK.
The results are attached at the end of this mail.

FYI, I made a loose translation of the method in the Wikipedia entry
Morimoto-san mentioned.

1. convert kanji characters and alphabetical words (loan words) to their
   kana reading
2. make character replacements
   i. replace small characters by base characters
      (ex. 「ぁ」 to 「あ」, 「ゃ」 to 「や」, 「っ」 to 「つ」)
   ii. replace voiced and half-voiced consonants (some plosives and
       fricatives) by base characters
       (ex. 「が」[ga] to 「か」[ka], 「ば」[ba] and 「ぱ」[pa] to 「は」[ha])
3. replace the long sound mark 「ー」 by the character it is pronounced as,
   according to the preceding character
   if the preceding character is
     「あ」「か」「さ」「た」「な」「は」「ま」「や」「ら」「わ」 then 「あ」
     「い」「き」「し」「ち」「に」「ひ」「み」「り」「ゐ」 then 「い」
     「う」「く」「す」「つ」「ぬ」「ふ」「む」「ゆ」「る」 then 「う」
     「え」「け」「せ」「て」「ね」「へ」「め」「れ」「ゑ」 then 「え」
     「お」「こ」「そ」「と」「の」「ほ」「も」「よ」「ろ」「を」 then 「お」
     「ん」 then 「ん」
     otherwise keep 「ー」
   (ex. 「あーるぬーぼー」[a-runu-bo-] (aka art nouveau) -> 「ああるぬうぼお」)
4. replace the repeat character 「ゝ」 with the preceding character,
   if a preceding character exists and it is not 「ー」 (long sound)
5. sort the strings in the following order
   「あ」「い」「う」「え」「お」「か」「き」「く」「け」「こ」「さ」「し」「す」「せ」「そ」
   「た」「ち」「つ」「て」「と」「な」「に」「ぬ」「ね」「の」「は」「ひ」「ふ」「へ」「ほ」
   「ま」「み」「む」「め」「も」「や」「ゆ」「よ」「ら」「り」「る」「れ」「ろ」
   「わ」「ゐ」「ゑ」「を」「ん」「ゝ」「ー」
6. if strings have the same priority in step 5, use the rules below
   i. consonant order: unvoiced(ksth) > voiced(gzdb) > half voiced(p)
      (ex. 「は」>「ば」>「ぱ」, similar to diacritical marks: 'Kloster' and 'Klöster')
   ii. long sound > small character > repeat character > other character
   iii. hiragana > katakana

The locale test results are below.

On FreeBSD 7 (Japanese UTF-8 locale console)

> python
Python 2.5.4 (r254:67916, Oct 8 2009, 15:59:07)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd7
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'de_DE.ISO8859-1')
'de_DE.ISO8859-1'
>>> locale.strcoll(u'Kloster', u'Klöster')
-4
>>> locale.strcoll(u'Klosteranlage', u'Klöster')
97
(OK: Kloster->Klöster->Klosteranlage)
>>> locale.setlocale(locale.LC_ALL, 'ja_JP.UTF-8')
'ja_JP.UTF-8'
>>> locale.strcoll(u'ウエット',u'ウェット')
1
>>> locale.strcoll(u'ウエット',u'ウェッド')
1
>>> locale.strcoll(u'ウエット',u'ウェッン')
1
>>> locale.strcoll(u'ウエット',u'ウェッタ')
1
>>> locale.strcoll(u'ウエット',u'うえっと')
96
(NG....)

On Windows XP (Japanese Shift_JIS locale command line)

Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'German_Germany.1252')
'German_Germany.1252'
>>> locale.strcoll(u'Kloster', u'Klöster')
1
>>> locale.strcoll(u'Klosteranlage', u'Klöster')
1
>>> locale.strcoll(u'Klosteranlage', u'Kloster')
1
>>> locale.strcoll(u'Klost', u'Kloster')
-1
>>> locale.strcoll(u'Klost', u'Klöster')
1
(NG on ''Japanese(ShiftJIS) codepage'' Windows:
 Klöster->Klost->Kloster->Klosteranlage)
>>> locale.setlocale(locale.LC_ALL, 'Japanese_Japan.932')
'Japanese_Japan.932'
>>> locale.strcoll(u'ウエット',u'ウェット')
1
>>> locale.strcoll(u'ウエット',u'ウェッド')
-1
>>> locale.strcoll(u'ウエッド',u'ウェット')
1
>>> locale.strcoll(u'ウエッド',u'ウェッド')
1
(OK: ウェット->ウエット->ウェッド->ウエッド)
>>> locale.strcoll(u'ウエット',u'ウェッン')
-1
>>> locale.strcoll(u'ウエット',u'ウェッタ')
1
(OK: ウェッタ->ウエット->ウェッン)
>>> locale.strcoll(u'ウエット',u'うえっと')
-1
>>> locale.strcoll(u'ウエッド',u'うえっと')
1
>>> locale.strcoll(u'ウエット',u'うえっど')
-1
>>> locale.strcoll(u'うえっと',u'うえっど')
-1
>>> locale.strcoll(u'ウエッド',u'うえっど')
-1
(almost OK: ウエット->うえっと->ウエッド->うえっど
 hiragana/katakana in reverse from standard?)

HTH

Keishi Katoux

On Jan 6, 10:39 pm, Tetsuya Morimoto <tetsuya.morim...@gmail.com> wrote:
> Hi, James
>
> I'm Japanese.
>
> > Any other suggestions or ways to correctly sort Japanese Words?
>
> Though I'm not well versed in ordering Japanese words,
> my friend told me the standards named "JIS X 4061:1996".
>
> That's the standards using for Japanese dictionary(not python) or
> Japanese book index.
> The ja.wikipedia has an overview of that if you can read Japanese.
> http://ja.wikipedia.org/wiki/日本語文字列照合順番
>
> Also, you can buy the specification of the standards, but it seems
> Japanese only.
> http://www.webstore.jsa.or.jp/webstore/Com/FlowControl.jsp?lang=en&bu...
>
> I found the implementation with Perl. Maybe, there is no Python
> implementation.
> http://search.cpan.org/~sadahiro/Lingua-JA-Sort-JIS-0.05/JIS.pod
>
> I hope this helps.
>
> thanks,
> Tetsuya
>
> On Thu, Jan 6, 2011 at 7:31 PM, James Hancock <jlhanc...@gmail.com> wrote:
> > Wow, I have no idea what you just said... but, I think I agree.
>
> > If I am right, what you are saying is, because the same kanji has multiple
> > readings and the words are sorted by reading and not the kanji itself, that
> > dictionary is going to be impossibly difficult to make.
>
> > quick example, skip if you want to
> > ------------------------------------------------->
> > 大学 is read "daigaku" and means college
> > 大き is read "Ōki" and means big.
>
> > The 大 is read "dai" in one and "O" in the other. Giving the character a
> > priority would not make it sort the right way because the order desired is
> > by the reading, and the reading is all about the context of the characters.
>
> > giving the character a value would make it be:
>
> > 大き (Oki)
> > 大学 (Daigaku)
> > 最高 (Saiko - means the best, because Django is the best!)
>
> > When really, alphabetically it is:
>
> > 大き (Oki)
> > 最高 (Saiko)
> > 大学 (Daigaku)
>
> > <----------------------------------
> > End of Example:
>
> > Man, the Japanese were not thinking about programming when they made their
> > language.
>
> > Any other suggestions or ways to correctly sort Japanese Words?
>
> > Cheers,
> > James Hancock
>
> > On Thu, Jan 6, 2011 at 5:38 PM, Masklinn <maskl...@masklinn.net> wrote:
>
> >> On 2011-01-06, at 07:23 , Sam Walters wrote:
> >> > Hi,
> >> > Personally I would map the priority of every character in a dict and
> >> > pass this to sorted
>
> >> Given Japanese is not an alphabetical language and mixes syllabic and
> >> logographic scripts (the logographic system having a few thousand
> >> graphemes), I doubt this kind of trivial ideas is going to work correctly
> >> (it only works correctly for simple alphabetical engines, even diacritics
> >> are going to cause an explosion in the number of cases)
>
> >> --
> >> You received this message because you are subscribed to the Google Groups
> >> "Django users" group.
> >> To post to this group, send email to django-us...@googlegroups.com.
> >> To unsubscribe from this group, send email to
> >> django-users+unsubscr...@googlegroups.com.
> >> For more options, visit this group at
> >> http://groups.google.com/group/django-users?hl=en.
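
As a rough illustration, steps 2 to 5 of the loose translation in the
message above could be sketched in Python 3 as follows. This is only a
sketch under several assumptions, not the JIS X 4061 algorithm itself:
the names to_hiragana, normalize and sort_key are made up for this
example; katakana is folded to hiragana first (an assumption, since
step 6-iii uses the hiragana/katakana difference only as a tie-breaker);
step 1 (getting the reading of a kanji word) is assumed to be done
already; and the tie-breaking rules of step 6 are not implemented.

```python
# Sketch of a primary sort key for kana readings (steps 2-5 only).
# All names here are hypothetical; step 6 tie-breaking is NOT implemented.

# Step 5: base order of kana, then the repeat mark and the long-sound mark.
GOJUON = ("あいうえお" "かきくけこ" "さしすせそ" "たちつてと" "なにぬねの"
          "はひふへほ" "まみむめも" "やゆよ" "らりるれろ" "わゐゑをん" "ゝー")
RANK = {ch: i for i, ch in enumerate(GOJUON)}

# Step 2-i: small kana -> base kana.
SMALL = dict(zip("ぁぃぅぇぉっゃゅょゎ", "あいうえおつやゆよわ"))

# Step 2-ii: voiced / half-voiced kana -> unvoiced base kana.
VOICED = dict(zip("がぎぐげござじずぜぞだぢづでどばびぶべぼぱぴぷぺぽ",
                  "かきくけこさしすせそたちつてとはひふへほはひふへほ"))

# Step 3: which character the long-sound mark becomes after each kana.
LONG_VOWEL = {}
for vowel, row in [("あ", "あかさたなはまやらわ"),
                   ("い", "いきしちにひみりゐ"),
                   ("う", "うくすつぬふむゆる"),
                   ("え", "えけせてねへめれゑ"),
                   ("お", "おこそとのほもよろを"),
                   ("ん", "ん")]:
    for kana in row:
        LONG_VOWEL[kana] = vowel


def to_hiragana(text):
    """Fold katakana (U+30A1..U+30F6) to hiragana (assumption, see lead-in)."""
    return "".join(chr(ord(c) - 0x60) if "ァ" <= c <= "ヶ" else c
                   for c in text)


def normalize(reading):
    """Apply steps 2, 3 and 4 in one left-to-right pass."""
    out = []
    for ch in to_hiragana(reading):
        ch = SMALL.get(ch, ch)      # step 2-i
        ch = VOICED.get(ch, ch)     # step 2-ii
        prev = out[-1] if out else None
        if ch == "ー" and prev in LONG_VOWEL:
            ch = LONG_VOWEL[prev]   # step 3
        elif ch == "ゝ" and prev is not None and prev != "ー":
            ch = prev               # step 4
        out.append(ch)
    return "".join(out)


def sort_key(reading):
    """Step 5: rank each character; unknown characters sort after all kana."""
    return [(RANK.get(ch, len(RANK)), ch) for ch in normalize(reading)]


if __name__ == "__main__":
    # (written form, reading) pairs; the readings must be supplied by hand.
    words = [("大学", "だいがく"), ("最高", "さいこう"), ("大きい", "おおきい")]
    for written, reading in sorted(words, key=lambda w: sort_key(w[1])):
        print(written, reading)
```

Because voiced marks are folded in step 2-ii, normalize() turns
「あーるぬーぼー」 into 「ああるぬうほお」, while the example in step 3
shows the result of step 3 alone. Readings that are equal at this level
(for example 「ウェット」 and 「うえっと」) keep their input order, because
Python's sort is stable.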