On Thu, Dec 13, 2012 at 9:17 AM, Stephen Bloch <sbl...@adelphi.edu> wrote:
>
> On Dec 12, 2012, at 11:42 PM, Haiwei Zhou wrote:
>
>> In HTML the <img> tag has no end tag.
>
> Not exactly true.  In XHTML (i.e. HTML >= 4.0, IIRC), it SHOULD have an end
> tag -- as should every other tag in XHTML.  You can do this either with
>         <img src="blah.blah"></img>
> or, briefer, with
>         <img src="blah.blah"/>
>
> However, most or all browsers accept Web pages with certain common tags
> unterminated: <img>, <p>, <br>, <li>, etc. and there's a reasonable argument
> that Racket's XML library should be capable of accepting them too.

This is not really an accurate characterization of HTML, XHTML, and
tags.

First, the standard that modern browsers follow is called HTML5 or
just HTML [1,2] and is not an XML dialect.  In HTML, some elements
(the void elements, such as <img>) do *not* have a close tag.
Further, the parser for HTML is required to handle ill-formed HTML
(such as unclosed <p> tags) in a specified way.  However, these two
situations are distinct: the former is valid HTML, while the latter is
an error recovery mechanism for invalid HTML.  You can see the
distinction between them in a validator [3].

XHTML is a syntax for writing HTML which has somewhat different rules
for things like close tags, and is only used by browsers when content
is served with an XML MIME type such as application/xhtml+xml.
There's a discussion of XHTML and HTML here [4].

Because of these issues, building an HTML parser that works with
in-the-wild HTML documents is a complicated endeavor that isn't really
helped by having an XML parser.  That's why Jay suggested using a
dedicated HTML parser for such documents (there are separate issues
with *generating* HTML using the `xml` library, but those aren't
really relevant here).

Sam

[1] http://dev.w3.org/html5/spec/Overview.html
[2] www.whatwg.org/C
[3] http://validator.nu/
[4] http://www.whatwg.org/specs/web-apps/current-work/multipage/introduction.html#html-vs-xhtml

____________________
  Racket Users list:
  http://lists.racket-lang.org/users
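
P.S. As a rough illustration of why an XML parser doesn't help with in-the-wild HTML, here is a minimal sketch in Python, using only its standard library (not Racket's `xml` library). The helper names `xml_accepts` and `html_start_tags` are made up for this example. Note also that Python's `html.parser` is only a lenient tokenizer, not an implementation of the full HTML parsing algorithm, so this shows the general idea rather than the specified error recovery.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def xml_accepts(fragment):
    """Return True if a strict XML parser accepts the fragment.

    The fragment is wrapped in a hypothetical <root> element so that
    it forms a single document.
    """
    try:
        ET.fromstring("<root>" + fragment + "</root>")
        return True
    except ET.ParseError:
        return False


class StartTagCollector(HTMLParser):
    """Record the name of every start tag the HTML tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(fragment):
    """Feed the fragment to a lenient HTML tokenizer and list its start tags."""
    collector = StartTagCollector()
    collector.feed(fragment)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    # XML syntax: a self-closing or explicitly closed <img> is well-formed.
    print(xml_accepts('<img src="blah.blah"/>'))
    print(xml_accepts('<img src="blah.blah"></img>'))
    # HTML syntax: <img> with no close tag is valid HTML but not well-formed XML.
    print(xml_accepts('<img src="blah.blah">'))
    # Unclosed <p> tags are invalid HTML, yet a lenient HTML parser still copes.
    print(html_start_tags('<p>First<p>Second <img src="blah.blah">'))
```

The XML parser rejects the bare <img> outright, while the HTML tokenizer reads the same text without complaint. A real HTML parser additionally has to build the right tree from such input, which is where most of the complexity lives.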