Bug#158015: tidy: reexplanation of numeric entity bug with -big5

Peter Moulder Tue, 13 Apr 2010 02:12:21 -0700

The two URLs that Dan gave are unfortunately no longer valid for this bug
report, in that the current versions use utf-8.


The Wayback Machine doesn't have a copy of either page from before the bug
report was filed, but the versions it does have (e.g.
http://web.archive.org/web/20021025151116/http://jidanni.org/lang/pinyin/19970607tai_ke.html)
are enough that I think I know what bug Dan was trying to explain; I'll give
a smaller example here.

The problem that I see (in at least versions 20091223cvs-1 and 20080116cvs-2)
is that numeric entities get munged in big5 documents: e.g. given an input
document of 

  <html><head><title>IPA ng: &#331;</title><head><body>&#331;</body><html>

(with or without

  <meta http-equiv="Content-Type" content="text/html; charset=big5">

in the head), running it through `tidy -big5' wrongly produces output with
each &#331; converted to &amp;#331;.

The correct behaviour would be to retain them in &#331; form (IPA `ng' symbol).
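
For anyone wanting to re-check this against another tidy version, here is a
rough Python sketch of the test above. It assumes a `tidy' binary on the PATH
that reads the document from stdin when no file is given; the names SAMPLE,
classify and run_tidy are made up for illustration and are not part of tidy.

```python
import shutil
import subprocess

# The sample document from this report, byte for byte (pure ASCII).
SAMPLE = (b"<html><head><title>IPA ng: &#331;</title><head>"
          b"<body>&#331;</body><html>\n")


def classify(output: bytes) -> str:
    """Classify tidy output: 'munged' if the ampersand of the numeric
    reference was escaped, 'kept' if the reference survived, else 'other'."""
    if b"&amp;#331;" in output:
        return "munged"
    if b"&#331;" in output:
        return "kept"
    return "other"


def run_tidy(document: bytes = SAMPLE):
    """Run `tidy -big5` on the document; return None if tidy is missing.

    Assumption: tidy reads stdin when no input file is named. Its exit
    status is ignored because tidy also exits non-zero for mere warnings.
    """
    tidy = shutil.which("tidy")
    if tidy is None:
        return None
    proc = subprocess.run([tidy, "-big5"], input=document,
                          stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
    return classify(proc.stdout)
```

Calling run_tidy() on an affected version should give 'munged'; a fixed
version should give 'kept'.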

(I don't know whether or not the following is the source of the bug, but
 note that numeric entities refer to Unicode code points, not big5 code points.
 The HTML 4 spec is a little misleading on this point: the confusion arises from
 the phrase "the document character set" (in "Numeric character references
 specify the code position of a character in the document character set."),
 which on a casual reading might be taken to mean "the document encoding", but
 section 5.1 (and indeed the sentence immediately following that misleading
 sentence) clarifies that they do refer to Unicode (ISO 10646) code points.)
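
To illustrate the point about code points, here is a small Python sketch
(Python's html module, nothing to do with tidy's own code; the function name
resolve is made up). The reference itself is plain ASCII, so its bytes decode
identically under big5 and utf-8, and the number in it is resolved as a
Unicode code point in both cases.

```python
import html
import unicodedata


def resolve(raw: bytes, encoding: str) -> int:
    """Decode the bytes of a single numeric character reference using the
    given document encoding, then return the code point it refers to."""
    text = raw.decode(encoding)        # "&#331;" is ASCII in either encoding
    return ord(html.unescape(text))    # the number is a Unicode code point


for encoding in ("big5", "utf-8"):
    cp = resolve(b"&#331;", encoding)
    print(encoding, f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

Both iterations report U+014B, LATIN SMALL LETTER ENG; the document encoding
plays no part in interpreting the number.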

pjrm.


