On 6/15/2011 11:44 AM, Gerrit wrote:
Hello again, everyone,

I am currently writing an article, in which I also have some romanization of Japanese. Until now, I have to define the hyphenation manually, which I think is a little bit of a nuisance.

[snip]

What do you think about that?

Since phonetic guide texts for CJKV are tied to characters, I would consider the most logical split one where the guide text is dictated by the character boundaries and the language used. Hyphenation for guide text would be strongly tied to the original text's split points, as pronunciation guide text does not significantly run past the character boundary. (More creative uses of top text, such as the common Japanese practice of treating it as a 'thinking space', where the real text expresses what is said and the guide text what is thought, wouldn't be covered by this, of course. Nor should they be, probably.)

To my knowledge, this is already automatically the case for (Mandarin) Chinese, as every character has only a single-syllable pronunciation, so hyphenation is unlikely even to matter; whether it's romanised or bopomofo, the guide text won't run past the character.

For Japanese this is also true for the most part, with a very small number of special words consisting of multiple characters that share a single-syllable pronunciation (like 所為, romanised as "sei", which cannot be decomposed as [se]-[i]; in Japanese the furigana for this is never split over multiple lines either). Aside from these, there are "ateji" readings, where an originally character-less word has been assigned a set of characters that do not normally "spell" that word; these would also need special hyphenation rules. However, the vast majority of Japanese words follow the rules of compositional reading, so 天国 (tengoku) would split up as 天(ten-)//国(-goku) and 腹切り (harakiri) would split up as 腹(hara-)//切り(-kiri), with optional guide text over the syllable り(ri) depending on the target audience.
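To make the idea concrete, here is a minimal Python sketch of splitting romanised guide text at character boundaries, using a hypothetical per-word reading table plus an exception list for indivisible readings like 所為. All names and data are illustrative, not a real dataset or API:

```python
# word -> list of (character, reading-in-this-word) pairs,
# following compositional reading
READINGS = {
    "天国": [("天", "ten"), ("国", "goku")],
    "腹切り": [("腹", "hara"), ("切", "ki"), ("り", "ri")],
}

# words whose single-syllable reading cannot be decomposed
# per character, so their guide text must never be split
INDIVISIBLE = {
    "所為": "sei",
}

def guide_text_breaks(word):
    """Return the guide text split at its legal break points:
    one chunk per character, or one chunk for indivisible words."""
    if word in INDIVISIBLE:
        return [INDIVISIBLE[word]]
    return [reading for _, reading in READINGS[word]]

print(guide_text_breaks("天国"))    # ['ten', 'goku']
print(guide_text_breaks("所為"))    # ['sei']
```

A real implementation would of course have to source these tables from a dictionary rather than hard-code them; the point is only that the break points fall out of the character-to-reading alignment, not out of grammar rules.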

I do not know about character guide texts in other Asian languages that borrowed Chinese characters.

The main challenge would be to build the "which character maps to which reading in which word" dataset, which will be quite vast. For Western languages, grammars can be constructed that fairly accurately describe when a word may be split, based on its written form. For CJK languages that approach goes straight out the window, because you can break anywhere in a sentence; there is no concept of "hyphenation" for the base text, so it only applies to Western guide text, which for Chinese-character words requires knowing the pronunciation of those words (or taking a really good guess and allowing the author to override guesses).

Particularly for Chinese and Japanese this leads to huge datasets. For Chinese, even though most characters are complete words and typically have only one pronunciation, there are easily ten thousand characters in daily use (though of course not all equally frequent). For Japanese, even though there are fewer characters to contend with (some 3,500), the actual pronunciations depend on the words the characters are used in, and unlike Chinese most Japanese words are compound-character words, still leaving you with over ten thousand distinct combinations for which you can't really abstract pronunciation rules, because most characters in Japanese have at least three or four readings.

To get automatic hyphenation right, you first need to tackle automatic guessing of pronunciation (even lexical analysers for Japanese like MeCab, ChaSen or YamCha can't get around this), and you'll end up with quite a few MB of data just to hyphenate guide text, and then only when it's Western guide text.
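A tiny Python illustration of why per-character pronunciation rules don't suffice in Japanese, and why the lookup has to be word-level: the same character takes different readings depending on the word it appears in. The data here is a hand-picked three-word sample, nothing like a real dataset:

```python
# word -> per-character readings; note how 日 reads differently
# in each word (ni, jitsu, nichi, bi)
WORD_READINGS = {
    "日本":   [("日", "ni"),   ("本", "hon")],   # nihon
    "本日":   [("本", "hon"),  ("日", "jitsu")], # honjitsu
    "日曜日": [("日", "nichi"), ("曜", "you"), ("日", "bi")],  # nichiyoubi
}

def char_readings(char):
    """Collect every reading a character takes across the sample words."""
    readings = set()
    for pairs in WORD_READINGS.values():
        for c, r in pairs:
            if c == char:
                readings.add(r)
    return sorted(readings)

print(char_readings("日"))  # ['bi', 'jitsu', 'ni', 'nichi']
```

Four distinct readings from just three common words: multiply that across some 3,500 characters and tens of thousands of compounds and you get a sense of the dataset's size.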

That's not to discourage anyone from taking a stab at it, it's just quite a mountain of work.

- Mike "Pomax" Kamermans
nihongoresources.com

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
  http://tug.org/mailman/listinfo/xetex