Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: search.debian.org is online
Date: Mon, 30 Dec 2002 11:07:32 +1100

> > Note that, if this problem is fixed, Korean people will benefit very
> > much even if the word-separation problem is not fixed.
> I don't understand. Are you saying that Korean uses two-byte characters
> but doesn't have spaces in words and should be ok now?

The current version of the Debian search site has two problems with
East Asian languages:

 1. handling of two-byte characters
 2. extraction of words from sentences written without whitespace

Problem 1 affects Chinese, Japanese, and Korean. Problem 2, however,
affects only Chinese and Japanese, because modern Korean puts
whitespace between words.

You said that problem 2 will be solved in a future version of
mnogosearch (version 3.2.8) by using "chasen". That is good news,
though we have to wait for that version to be released.

http://lists.debian.org/debian-www/2002/debian-www-200212/msg00268.html

So I think you are aware of problem 2. However, I am not sure that you
are aware of problem 1. I think problem 1 exists *in addition to*
problem 2. My reasoning is given in the following mail.

http://lists.debian.org/debian-www/2002/debian-www-200212/msg00267.html

Since Korean is a two-byte language and uses whitespace between words,
solving problem 1 will immediately benefit Koreans.

The following is the detail of problem 1 as reported at the above URL.
If you already understand the problem, you don't need to read it.

The word "news", translated into each language, clearly appears in
http://www.debian.org/index.<language>.html . Since it appears as a
section title, the word stands alone (i.e., it isn't affected by the
"word separation without whitespace" problem) and should be
searchable. However, the search fails for Chinese, Japanese, and
Korean. This means that the search fails even when a Japanese
(Chinese, Korean) word is separated from its neighbors by whitespace.
Thus, there is another problem, distinct from problem 2.

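As a rough illustration of how the two problems differ, here is a small Python sketch. The sample phrases and the helper names (whitespace_tokens, encoded_bytes) are my own assumptions for illustration; they are not taken from the Debian pages or from mnogosearch, and the sketch does not claim to show what mnogosearch actually does internally.

```python
# Illustration only: the sample phrases are assumptions, not text copied
# from www.debian.org, and this is not mnogosearch's real logic.

def whitespace_tokens(text):
    """Split on whitespace, the only word-extraction step assumed here."""
    return text.split()

def encoded_bytes(word, encoding):
    """Return the bytes an indexer would see for `word` in a legacy encoding."""
    return word.encode(encoding)

korean = "최근 뉴스"        # "recent news": words separated by a space
japanese = "最新のニュース"  # "latest news": no whitespace between words

# Problem 2: whitespace splitting finds Korean words but not Japanese ones.
print(whitespace_tokens(korean))    # two tokens
print(whitespace_tokens(japanese))  # one token: the whole phrase

# Problem 1: even a standalone word is a run of two-byte characters,
# and every byte has the high bit set.
for word, encoding in [("뉴스", "euc_kr"), ("ニュース", "euc_jp"), ("新闻", "gb2312")]:
    raw = encoded_bytes(word, encoding)
    print(encoding, len(word), "chars ->", len(raw), "bytes", raw.hex())
```

The Korean phrase already splits into words, so only the byte handling stands between it and a working search; the Japanese phrase needs both the byte handling and a word segmenter.
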
However, two-byte search doesn't always fail. For example, I reported
in

http://lists.debian.org/debian-www/2002/debian-www-200212/msg00256.html

that I can search for my name. I guess that whether a search succeeds
or fails depends on whether the Japanese word is written in plain
EUC-JP encoding or as HTML "&#xxxx;" expressions, where xxxx is the
Unicode code point. When the word is written as "&#xxxx;" expressions,
the search succeeds; when it is written in plain EUC-JP encoding, the
search fails.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/