Hi,

From: [EMAIL PROTECTED] (Denis Barbier)
Subject: Re: automatically-generated ISO-8859-1 characters in mulbibyte webpages
Date: Thu, 2 Jan 2003 16:24:59 +0100

> I find only 18 names in people.names containing non-ASCII letters,
> so /org/www.debian.org/cron/people_scripts/people.pl could contain
> some extra elsif in its canonical_names function to replace
> non-ASCII letters by HTML entities.  Most names seem to be ISO-8859-1
> encoded.
> When done, this script could also skip maintainers with non-ASCII
> letters which are not processed in order to prevent future trouble.

I think the simplest filter (assuming ISO-8859-1) would be something
like the following:

    s/\xa0/&nbsp;/g;
    s/\xa1/&iexcl;/g;
    s/\xa2/&cent;/g;
      :
      :
    s/\xff/&yuml;/g;

However, as you said, it may cause trouble in the future.  I think it
is also a good idea to simply skip (remove) non-ASCII characters, as
you suggested, because that is very simple to implement.  Once this
has stopped the tags from breaking, we will have enough time to think
about a UTF-8 filter.

BTW, I found a similar problem in the lists.debian.org pages.
Thread-list and date-list pages such as
http://lists.debian.org/debian-devel/2002/debian-devel-200212/threads.html
have no charset specification.  In such cases, web browsers guess the
encoding of the page according to the user's preference.  Naturally,
Japanese people configure their browsers to "assume Japanese encoding
for pages without charset specification".  On the other hand, the
thread-list pages show senders' names in <em> format, and therefore a
</em> tag follows each name.  If the last letter of the name is an
8-bit character, the browser treats that byte as the first byte of a
two-byte Japanese character and swallows the "<" that follows, so the
tag is broken.  The result is that everything after it is shown in
<em> (italic) format.

The test is easy: please configure your browser to "assume Japanese
encoding for pages without charset specification" and load the page
above.

However, in this case the solution is a bit more complicated.  All
mails should carry encoding information in MIME format, so the best
solution would be to parse MIME.  The simplest makeshift solution, on
the other hand, is to add "charset=iso-8859-1" to all pages, but there
are mailing lists where most of the 8-bit characters are Cyrillic, and
so on.
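
To make the three approaches above concrete, here is a minimal sketch.
It is written in Python rather than the Perl of people.pl, and it is
only an illustration: the function names (latin1_to_entities,
strip_non_ascii, sender_to_html) are hypothetical and do not come from
any Debian script.  The first function assumes the input really is
ISO-8859-1 and, like the Perl filter, touches only 8-bit bytes.  The
third assumes the sender name arrives as an RFC 2047 encoded header
value and does no error handling for unknown charsets.

```python
from email.header import decode_header, make_header
from html import escape
from html.entities import codepoint2name


def latin1_to_entities(raw: bytes) -> str:
    """Assume ISO-8859-1 and replace bytes 0xA0-0xFF by named HTML entities."""
    out = []
    for byte in raw:
        if byte < 0x80:
            out.append(chr(byte))                    # plain ASCII, keep as-is
        elif byte in codepoint2name:                 # 0xA0-0xFF have HTML 4 names
            out.append(f"&{codepoint2name[byte]};")
        # 0x80-0x9F are control codes in ISO-8859-1, so they are dropped
    return "".join(out)


def strip_non_ascii(raw: bytes) -> str:
    """Makeshift fallback: remove every 8-bit byte so no tag can be broken."""
    return bytes(b for b in raw if b < 0x80).decode("ascii")


def sender_to_html(header_value: str) -> str:
    """Decode MIME encoded-words, then emit ASCII-only HTML.

    Non-ASCII letters become numeric character references, so the result
    does not depend on the charset the browser assumes for the page.
    """
    text = str(make_header(decode_header(header_value)))
    return escape(text).encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(latin1_to_entities(b"Jos\xe9"))
    print(strip_non_ascii(b"Jos\xe9"))
    print(sender_to_html("=?iso-8859-1?q?Jos=E9?="))
```

Because every function returns pure ASCII, a closing tag such as </em>
placed after the name can no longer be swallowed, whatever encoding
the browser assumes.  The MIME-based variant also copes with lists
where the names are Cyrillic, since it uses the charset declared in
each header instead of guessing ISO-8859-1.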

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/