Re: "Sitemap" webpage

2001-07-08 Thread peter karlsson
Tomohiro KUBOTA: > Imagine an ISO-2022-JP string has a JIS X 0208 part and following > ASCII part. When the JIS X 0208 part ends with 0x22, it matches "\e > and thus the regexp will fail. Yes, I am aware of that, but since regular expressions are not powerful enough to parse all possible combinat
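A note on the limitation discussed here: one way to avoid teaching the regexp about escape sequences is to decode the line to characters first and only then match. This is not something proposed in the thread, just a minimal sketch in Python (not the Perl used in sitemap.wml), and the sample title is made up for illustration.

```python
import re

# Character-level version of the title pattern (assumed equivalent to the
# byte-level Perl pattern quoted in this thread).
TITLE = re.compile(r'^#use .* title="([^"]+)".*$')


def extract_title(raw: bytes, encoding: str = "iso2022_jp") -> str:
    """Decode first, then match, so a 0x22 byte inside a two-byte
    character can never be mistaken for the closing double quote."""
    return TITLE.sub(r"\1", raw.decode(encoding))


# Hypothetical sample line; the hiragana "あ" is 0x24 0x22 in JIS X 0208.
line = '#use wml::debian::template title="Debian あ"'.encode("iso2022_jp")
print(extract_title(line))  # Debian あ
```

The obvious cost is that the script has to know each page's encoding before it can decode it.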

Re: "Sitemap" webpage

2001-07-07 Thread Tomohiro KUBOTA
Hi, Thank you again for submitting your fix. Now the page is good. At Sat, 7 Jul 2001 09:29:02 +0200 (CEST), peter karlsson <[EMAIL PROTECTED]> wrote: > and those were not matched properly. However, I seem to have missed a > quotation mark missing in the regexp, it should read: > >$title =

Re: "Sitemap" webpage

2001-07-07 Thread peter karlsson
Tomohiro KUBOTA: > I found many items read only "Debian". I've put in a fix for this now. -- \\// peter - http://www.softwolves.pp.se/ Statement concerning unsolicited e-mail according to Swedish law: http://www.softwolves.pp.se/peter/reklampost.html

Re: "Sitemap" webpage

2001-07-07 Thread peter karlsson
Tomohiro KUBOTA:
> $title =~ s/^#use .* title="(.+?)(" .*$|"$|\e.*$)/$1/;
>
> I think it should be modified as:
>
> $title =~ s/^#use .* title="(.+?)("\s.*$|"$)/$1/;
That does not work (that was my first attempt), because there are some Japanese pages that have title="DBCS" and those wer

Re: "Sitemap" webpage

2001-07-06 Thread Tomohiro KUBOTA
Hi, At Fri, 6 Jul 2001 18:53:57 +0200 (CEST), peter karlsson <[EMAIL PROTECTED]> wrote: > I have committed a fix now. It seems to work on my local machine (I > can't read Japanese, but I can see that there is no mis-encoding left). Thanks. I checked. I found many items read only "Debian". The

Re: "Sitemap" webpage

2001-07-06 Thread peter karlsson
Tomohiro KUBOTA: > Could someone CVS committer please implement this to > webwml/english/sitemap.wml ? I have committed a fix now. It seems to work on my local machine (I can't read Japanese, but I can see that there is no mis-encoding left).

Re: "Sitemap" webpage

2001-07-06 Thread Tomohiro KUBOTA
Hi,
At Fri, 6 Jul 2001 17:21:34 +0200, Josip Rodin <[EMAIL PROTECTED]> wrote:
>> my $title = `egrep '^#use .* title=' $page `; chomp $title;
>> $title =~ s/^#use .* title="([^"]+)".*$/$1/;
> I suppose we could just change that regexp to match everything after the
> opening double quote up to

Re: "Sitemap" webpage

2001-07-06 Thread Josip Rodin
On Fri, Jul 06, 2001 at 09:36:50PM +0900, Tomohiro KUBOTA wrote:
> I checked webwml/english/sitemap.wml and found:
>
> my $title = `egrep '^#use .* title=' $page `; chomp $title;
> $title =~ s/^#use .* title="([^"]+)".*$/$1/;
>
> This seems to be the code to extract title for sitemap items.
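To see why the quoted pattern cuts Japanese titles short, here is a rough reproduction. It is a sketch in Python rather than the Perl of sitemap.wml, the pattern is transcribed as a bytes regex on the assumption that the original operates on raw bytes, and both sample lines are invented for illustration.

```python
import re

# The pattern from the quoted code, transcribed to a Python bytes regex.
ORIGINAL = re.compile(rb'^#use .* title="([^"]+)".*$')


def extract_title(line: bytes) -> bytes:
    """Return the bytes captured as the title, as the s/// would leave them."""
    return ORIGINAL.sub(rb"\1", line)


# Hypothetical sample lines; "あ" encodes as the byte pair 0x24 0x22.
ascii_line = b'#use wml::debian::template title="Site map"'
ja_line = '#use wml::debian::template title="Debian あ"'.encode("iso2022_jp")

print(extract_title(ascii_line))  # b'Site map'
print(extract_title(ja_line))     # b'Debian \x1b$B$'
```

In the second case `[^"]+` stops at the 0x22 that is the second byte of the character, so the title ends in the middle of a two-byte character and the switch back to ASCII is lost.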

Re: "Sitemap" webpage

2001-07-06 Thread Tomohiro KUBOTA
Hi, At 06 Jul 2001 08:46:08 +0900, Olaf Meeuwissen <[EMAIL PROTECTED]> wrote: > But some sites screw up the charset :-( Claiming to use one encoding > and using another. Yes. We must not make such a poor mistake! :-) > Hmm, that sounds like it could be an inconsistency in the parser rules > (n

Re: "Sitemap" webpage

2001-07-05 Thread Olaf Meeuwissen
Tomohiro KUBOTA <[EMAIL PROTECTED]> writes: > Note that new web browsers which understand will NOT be confused by > any encodings. But some sites screw up the charset :-( Claiming to use one encoding and using another. > UTF-8 is not popular yet and some browsers may fail to display, > though

Re: "Sitemap" webpage

2001-07-05 Thread Tomohiro KUBOTA
Hi, At Thu, 5 Jul 2001 17:36:39 +0100, David Starner <[EMAIL PROTECTED]> wrote: > Doesn't ISO-2022-JP have a form that invokes JIS X 0208 into the upper half? > Could SJIS be used instead? No. Additional explanations about real state of Japanese encodings: There are three popular encodings fo

Re: "Sitemap" webpage

2001-07-05 Thread Olaf Meeuwissen
"David Starner" <[EMAIL PROTECTED]> writes: > Doesn't ISO-2022-JP have a form that invokes JIS X 0208 into the > upper half? No, but you may have been thrown off by the fact that EUC-JP is a proper ISO-2022 encoding. This is not the same as a ISO-2022-JP encoding. See Ken Lunde's CJKV, Chap. 4

Re: "Sitemap" webpage

2001-07-05 Thread Olaf Meeuwissen
Tomohiro KUBOTA <[EMAIL PROTECTED]> writes: > [encoding story zapped] > > When the corresponding Japanese wml page has a Japanese title > (in #use wml::debian::template title="" line) which includes > a Japanese character which includes 0x22 (DOUBLE QUOTE) > in its pair of bytes, a pro
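For readers who have not looked at ISO-2022-JP bytes before, the following small Python sketch shows how a 0x22 byte can turn up inside a Japanese character. The sample title is made up; the hiragana "あ" is used because its JIS X 0208 code is 0x2422.

```python
# Hypothetical sample title (not taken from the Debian pages).
title = "Debian あ"
raw = title.encode("iso2022_jp")

# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
print(raw)             # b'Debian \x1b$B$"\x1b(B'
print(raw.count(b'"'))  # 1
```

The title contains no double quote as a character, yet the encoded bytes contain one 0x22, which a byte-oriented parser cannot tell apart from a real `"`.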

Re: "Sitemap" webpage

2001-07-05 Thread peter karlsson
David Starner: > Doesn't ISO-2022-JP have a form that invokes JIS X 0208 into the upper half? You have EUC-JP, which encodes the JIS X 0208 at 0xA1-0xFE (it is the same encoding as ISO-2022-JP, but with the high bit set, and no escape sequences). > Could SJIS be used instead? Shift-JIS is a hor
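The relationship described here (same codes, high bit set, no escape sequences) can be checked for a single JIS X 0208 character with a short Python sketch. The choice of "あ" is only an illustration, and the slicing assumes the encoder emits the three-byte ESC $ B prefix and the three-byte ESC ( B suffix around the character.

```python
ch = "あ"  # JIS X 0208 code 0x2422

# ISO-2022-JP: ESC $ B, the two 7-bit bytes, then ESC ( B.
jis = ch.encode("iso2022_jp")[3:5]
euc = ch.encode("euc_jp")

# Setting the high bit on each 7-bit byte gives the EUC-JP bytes.
high_bit = bytes(b | 0x80 for b in jis)

print(jis)              # b'$"'
print(euc)              # b'\xa4\xa2'
print(high_bit == euc)  # True
```

Because both EUC-JP bytes are above 0x7F, neither can collide with the ASCII double quote.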

Re: "Sitemap" webpage

2001-07-05 Thread David Starner
Writes Tomohiro KUBOTA <[EMAIL PROTECTED]>: > Does anyone have any idea to solve this problem? It seems to me you have two options: pick an encoding that doesn't have this problem, or change wml so it deals with ISO-2022-JP. Doesn't ISO-2022-JP have a form that invokes JIS X 0208 into the upper h

"Sitemap" webpage

2001-07-05 Thread Tomohiro KUBOTA
Hi, I found that some items in the Japanese version of the "Sitemap" page are broken. http://www.debian.org/sitemap.ja.html I investigated this problem and found the reason. However, before explaining it, I will have to explain the encoding used for Japanese web pages. Japanese web pages (wml sources