Hi,

From: [EMAIL PROTECTED] (Craig Small)
Subject: Re: Status of new search engine
Date: Tue, 17 Dec 2002 22:16:51 +1100

> > brokenly. I found the search page http://search.debian.org/new/search.cgi
> > have the following line:
> > <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
> I've fixed that now. I'm not sure how to permanently do this as it
> comes from the generic webwml templates.

OK, I confirmed it.

Next, could you please modify the program that assembles the result
page so that it converts all results into UTF-8?  If we are lucky,
we can use the output of the search engine directly: you said the
core part of the search engine works in UTF-8, which means the web
pages are *already* converted into UTF-8 in order to be searched.
Otherwise, if you are using Perl, you can use the libtext-iconv-perl
package.

The default encoding for each language (i.e., the "from" encoding
for the conversion) is available in the webwml/<language>/.wmlrc file.
I think you can keep the language/encoding pairs as constants or
hard-code them, because a new language is rarely added to the Debian
web pages.

> Now, if I pick it up from the search page, I get
> http://search.debian.org/new/search.en.cgi?q=%E4%B9%85%E4%BF%9D%E7%94%B0+%E6%99%BA%E5%BA%83
> and results look sensible. This works fine.

Ah, now I can enter my name in the web form and the search works.
I think that, since the web page is now UTF-8, Internet Explorer
submits the query in UTF-8.

> I then searched ???????? which is something to do with security
> and got
> http://search.debian.org/new/search.en.cgi?q=%E3%82%BB%E3%82%AD%E3%83%A5%E3%83%AA%E3%83%86%E3%82%A3%E6%83%85%E5%A0%B1&ps=10&o=0&m=and&lang=
> with no results

I imagine this is a fault of the search engine.  Since Japanese
sentences do not separate words with whitespace, the search engine
cannot extract words from them.  I confirmed this by searching for
%E4%B9%85%E4%BF%9D , which is the first two characters of my
three-character name %E4%B9%85%E4%BF%9D%E7%94%B0 .  It returned zero
results.
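
The whitespace problem described above can be reproduced with a
minimal Python sketch.  This is not the search engine's code, only
an illustration under stated assumptions: the "page" is just the
two-word query from the URL quoted earlier, and the function names
whitespace_terms and bigram_terms are hypothetical.  Character-bigram
indexing is shown as one simple workaround; it is not a claim about
how Google or Namazu actually analyze Japanese text.

```python
from urllib.parse import unquote_plus


def whitespace_terms(text):
    """Index terms as a purely whitespace-based engine would see them."""
    return set(text.split())


def bigram_terms(text):
    """Overlapping two-character terms from each whitespace-separated chunk.

    Hypothetical workaround for text without word separators.
    """
    terms = set()
    for chunk in text.split():
        if len(chunk) < 2:
            terms.add(chunk)
        for i in range(len(chunk) - 1):
            terms.add(chunk[i:i + 2])
    return terms


# Stand-in for an indexed page: the full name from the quoted URL
# ("+" decodes to a space, the percent escapes decode as UTF-8).
page_text = unquote_plus("%E4%B9%85%E4%BF%9D%E7%94%B0+%E6%99%BA%E5%BA%83")

# The query that returned nothing: the first two characters of the
# three-character family name.
query = unquote_plus("%E4%B9%85%E4%BF%9D")

found_by_whitespace = query in whitespace_terms(page_text)
found_by_bigram = query in bigram_terms(page_text)

print(found_by_whitespace, found_by_bigram)  # prints: False True
```

With whitespace splitting, the three-character name is a single
indexed term, so the two-character query matches nothing.  With
bigrams, the query is itself one of the indexed terms.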

> and
> http://www.google.com/search?hl=en&ie=UTF-8&oe=UTF-8&q=%E3%82%BB%E3%82%AD%E3%83%A5%E3%83%AA%E3%83%86%E3%82%A3%E6%83%85%E5%A0%B1&btnG=Google+Search
> with lots of results.

Google seems to have a better sentence analyzer.

I have heard that "namazu" can be used for this purpose, i.e.,
building a full-text search engine for Japanese.  It is free
software and is available as a Debian package.  Namazu is very
popular not only in the Japanese free software community but also
in commercial use.  For example, the Debian JP Project uses Namazu
for full-text search of its mailing list archives.

http://www.debian.or.jp/search/

However, please don't ask me about Namazu, because I have never
used it.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/