cleaned up the previous post. • 〈HTML6, Your HTML/XML Simplified〉 http://xahlee.org/comp/html6.html
plain text version follows
--------------------------------------------------

HTML6, Your HTML/XML Simplified

Xah Lee, 2010-09-21

Tired of the standards bodies telling us what to do and then changing their attitude? Tired of the SGML/HTML/XML/XHTML/HTML5 changes? Tire no more, here's a new proposal that will make life easier.

Introducing HTML6

HTML6 is based on HTML5, XML, and a rectified LISP syntax. More specifically, it is derived from existing work on this, SXML (http://okmij.org/ftp/Scheme/SXML.html), except that it is completely regular at the syntax level and is not meant to be compatible with lisp readers. The syntax can be specified by 3 short lines of parsing expression grammar.

The aim is a far simpler, leaner, 100% regular syntax, with a stricter format. First of all, no error is accepted, ever. If the source code has incorrect syntax, the page is not displayed.

Example

Here's a standard ATOM webfeed XML file.

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:base="http://xahlee.org/emacs/">
 <title>Xah's Emacs Blog</title>
 <subtitle>Emacs, Emacs, Emacs</subtitle>
 <link rel="self" href="http://xahlee.org/emacs/blog.xml"/>
 <link rel="alternate" href="http://xahlee.org/emacs/blog.html"/>
 <updated>2010-09-19T14:53:08-07:00</updated>
 <author>
  <name>Xah Lee</name>
  <uri>http://xahlee.org/</uri>
 </author>
 <id>http://xahlee.org/emacs/blog.html</id>
 <icon>http://xahlee.org/ics/sum.png</icon>
 <rights>© 2009, 2010 Xah Lee</rights>
 <entry>
  <title>Using Emacs's Abbrev Mode for Abbreviation</title>
  <id>tag:xahlee.org,2010-09-19:215308</id>
  <updated>2010-09-19T14:53:08-07:00</updated>
  <summary>tutorial</summary>
  <link rel="alternate" href="http://xahlee.org/emacs/emacs_abbrev_mode.html"/>
 </entry>
</feed>

Here's how it looks in html6:

〔?xml 「version “1.0” encoding “utf-8”」〕
〔feed 「xmlns “http://www.w3.org/2005/Atom” xml:base “http://xahlee.org/emacs/”」
 〔title Xah's Emacs Blog〕
 〔subtitle Emacs, Emacs, Emacs〕
 〔link 「rel “self” href “http://xahlee.org/emacs/blog.xml”」〕
 〔link 「rel “alternate” href “http://xahlee.org/emacs/blog.html”」〕
 〔updated 2010-09-19T14:53:08-07:00〕
 〔author
  〔name Xah Lee〕
  〔uri http://xahlee.org/〕
 〕
 〔id http://xahlee.org/emacs/blog.html〕
 〔icon http://xahlee.org/ics/sum.png〕
 〔rights © 2009, 2010 Xah Lee〕
 〔entry
  〔title Using Emacs's Abbrev Mode for Abbreviation〕
  〔id tag:xahlee.org,2010-09-19:215308〕
  〔updated 2010-09-19T14:53:08-07:00〕
  〔summary tutorial〕
  〔link 「rel “alternate” href “http://xahlee.org/emacs/emacs_abbrev_mode.html”」〕
 〕
〕

Simple Matching Pairs For Tag Delimiters

The standard xml markup bracket is simplified to a lisp-style matching pair. For example, this code:

<h1>HTML6</h1>

Is written as:

〔h1 HTML6〕

The delimiters used are:

Character   Unicode Code Point   Unicode Name
〔          U+3014               LEFT TORTOISE SHELL BRACKET
〕          U+3015               RIGHT TORTOISE SHELL BRACKET

XML Properties and Attributes Syntax

In xml:

<h1 id="xyz" class="abc">HTML6</h1>

In html6:

〔h1「id “xyz” class “abc”」HTML6〕

Attributes are enclosed in matching corner brackets. The items inside are a sequence of name/value pairs. Each value must be quoted with curly double quotes.

Escape Mechanisms

To include the 〔tortoise shell〕 delimiters in data, use “&#x3014;” and “&#x3015;”, and similarly for the 「corner brackets」.

Unicode; No More CDATA and Entities “&amp;”

There are no entities, except Unicode character references in hexadecimal format “&#x‹unicode code point hexadecimal›”. For example, “&amp;” is not allowed.

Treatment of Whitespace

Basically identical to XML.

Char Encoding; UTF8 and UTF16 Only

Source code must be UTF8 or UTF16, only. Nothing else.

File Name Extension

File name extension is “.xml6” or “.html6”.

Semantics

The semantics should follow xhtml5.

Questions and Answers

What's wrong with xhtml/html5 exactly?

The politics of the standards bodies change, and their attitude about what is correct changes whimsically too.
Around 2000, we were told that XML and XHTML would change society, or at least make the web correct and valid, far easier to develop for, and more flexible. Now it's a decade later. Sure, the web has improved, but as far as html/xhtml and browser rendering go, it's still a syntax soup with extreme complexities. 99.99% of web pages are still not valid, and nobody cares. Major browsers still don't agree on their rendering behavior. Web dev is actually far more complex, involving tens or hundreds of technologies that hardly any one person knows in full (ajax, JSON, lots of xml variations). It's hard to say if it is better at all than the HTML3 days with “font” and “table” tags and a gazillion tricks. The best practical approach is still trial and error with browsers.

And now HTML5 comes along, from a newfangled hip group primarily from the current big corporations Google and Apple, with an attitude that validation is overrated. That is an insult to the XML mantra from the w3c, just when more and more sites are starting to have correct XHTML and Microsoft's Internet Explorer is getting on track about correctness.

XML is a break from SGML, with many justifications for why it needed to be, and with some backward-compatibility trade-offs. Now HTML5 is a break from both SGML and XML.

See also:

• (Google Earth) KML Validation Fuckup
• Google's 「rel="nofollow"」 Rule
• HTML Correctness and Validators

Why not just adopt SXML from the lisp world?

Lisp's SXML is not a stand-alone syntax designed for the needs of the web. Lisp's formats are typically made to follow lisp's traditions, and often have quirks of their own. The syntax is not 100% regular nested parens. SXML is easy for lispers to adopt, but harder for other languages and communities. For lisp's syntax irregularities, see: Fundamental Problems of Lisp.

For example, xml as a textual representation of a tree has a quirk, in that each node can carry this special thing called “attributes” (aka “properties”).
The “attribute” is not a node of the tree, but rather info attached to a node. The standard lisp syntax (aka sexp) to represent attributes is this, e.g.

(h1 :id "xyz" :class "abc" ...)

Syntactically, each of “:id”, “"xyz"”, etc. is not distinguishable from a node/branch in the tree. Only semantically, after the lisp reader has parsed the special character “:” in a node's name, is it considered a property name, with the next element in the expression taken as the value of that property.

Another way to represent xml's attributes is this:

(h1 ((id . "xyz") (class . "abc")) ...)

This, too, has syntactic ambiguity. The whole “((id . "xyz") (class . "abc"))” can be interpreted as a node by itself, where the first element is again a node. It also uses lisp's special “cons” syntax “(id . "xyz")”, which is itself ambiguous at the syntax level. E.g. it can be considered a node named “id” with 2 branches “.” and “"xyz"”, or it can be considered a node named “cons” with 2 branches “id” and “"xyz"”.

Another common lisp syntax for attributes is this:

(h1 (@ (id . "xyz") (class . "abc")) ...)

Again, this whole “(@ ...)” part is, at the syntax level, simply a node named “@”. Only at the semantic level is it taken as the properties of a node, due to the semantics attached to the head string “@”.

So, in conceiving html6, i thought a solution for getting rid of the syntax ambiguity of node vs attributes is to use a special bracket for the properties/attributes of a node. E.g. “〔h1「id “xyz” class “abc”」...〕”.

Why use weird Unicode characters for matching pair?

Unicode has become widely adopted today. (See: Unicode Popularity On Web.) Unicode also has a lot of proper matching pairs. (See: Matching Brackets in Unicode.) It seems today is the right time to adopt the wide range of proper characters instead of continuing to rely on the very limited number of ASCII characters.

The straight quote character " is not a matching pair, and in code it presents several problems.
For example, it is difficult to know which quote matches which. Also, it is difficult to recover from a missing quote. (This problem is especially pronounced in text editors' syntax highlighting.) A proper matching pair lets programs and editors determine the quoted content correctly and more easily, and navigate the tree easily.

The unicode characters 〔〕 and 「」 may be difficult to input. Possibly, they can be replaced by () and {} for html6. Though that also means a lot of ugly escaping will be needed in text, and anything left unescaped is a syntax error.

One thing about this html6 is that it is intentionally separate from being a valid sexp of the lisp world. The core idea is that the syntax of html6 is designed specifically as a 2-dimensional textual representation of a tree, with an attribute bracket that attaches a limited form of info (a sequence of pairs) to any node, to fit the existing structure of XML. The advantage of this is that it should be extremely easy to parse, in perhaps just 3 lines of parsing expression grammar. It can be done easily in perl, python, ruby... without entailing lisp quirks, and can be trivially transformed into legal lisp syntax by lisps as well.

Any thoughts about flaws?

 Xah ∑ xahlee.org ☄

--
http://mail.python.org/mailman/listinfo/python-list
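
To make the "easy to parse" claim above a bit more concrete, here is a rough Python sketch of a strict html6 reader. It is only one possible reading of the proposal, not a reference implementation. The names (parse, parse_element, Html6SyntaxError) and the node shape (name, attrs, children) are made up for illustration. It also assumes that one whitespace character after a tag name is a separator rather than content, and it does not decode “&#x...;” references.

```python
import re

OPEN, CLOSE = "\u3014", "\u3015"            # 〔 〕 element delimiters
ATTR_OPEN, ATTR_CLOSE = "\u300c", "\u300d"  # 「 」 attribute delimiters

WS = re.compile(r"\s*")
# A name is any run of characters that is not whitespace, a bracket, or a curly quote.
NAME = re.compile(r"[^\s\u3014\u3015\u300c\u300d\u201c\u201d]+")
# One attribute pair: name “value” (assumption: values contain no curly quotes).
PAIR = re.compile(
    r"\s*([^\s\u3014\u3015\u300c\u300d\u201c\u201d]+)\s*\u201c([^\u201c\u201d]*)\u201d"
)
# Text content: anything up to the next bracket of either kind.
TEXT = re.compile(r"[^\u3014\u3015\u300c\u300d]+")


class Html6SyntaxError(ValueError):
    """Raised on any syntax error; the proposal says no error is accepted."""


def parse_element(src, pos=0):
    """Parse one 〔...〕 element at src[pos]; return ((name, attrs, children), next_pos)."""
    if not src.startswith(OPEN, pos):
        raise Html6SyntaxError(f"expected {OPEN} at offset {pos}")
    m = NAME.match(src, pos + 1)
    if not m:
        raise Html6SyntaxError(f"expected a tag name at offset {pos + 1}")
    name, pos = m.group(), m.end()
    attrs, children = {}, []

    after_ws = WS.match(src, pos).end()
    if src.startswith(ATTR_OPEN, after_ws):
        # Optional attribute block: 「name “value” name “value” ...」
        pos = after_ws + 1
        while (m := PAIR.match(src, pos)):
            attrs[m.group(1)] = m.group(2)
            pos = m.end()
        pos = WS.match(src, pos).end()
        if not src.startswith(ATTR_CLOSE, pos):
            raise Html6SyntaxError(f"bad attribute block at offset {pos}")
        pos += 1
    elif pos < len(src) and src[pos].isspace():
        pos += 1  # assumption: one separator character after the name is not content

    # Content: a mix of text runs and nested elements, up to the closing 〕.
    while pos < len(src):
        if src.startswith(CLOSE, pos):
            return (name, attrs, children), pos + 1
        if src.startswith(OPEN, pos):
            child, pos = parse_element(src, pos)
            children.append(child)
            continue
        m = TEXT.match(src, pos)
        if not m:
            raise Html6SyntaxError(f"stray attribute bracket at offset {pos}")
        children.append(m.group())
        pos = m.end()
    raise Html6SyntaxError(f"missing {CLOSE} for element {name}")


def parse(src):
    """Parse a whole document: top-level elements separated only by whitespace."""
    nodes, pos = [], WS.match(src, 0).end()
    while pos < len(src):
        node, pos = parse_element(src, pos)
        nodes.append(node)
        pos = WS.match(src, pos).end()
    return nodes
```

With this sketch, parse("〔h1「id “xyz” class “abc”」HTML6〕") gives a single node named “h1” with the two attributes and one text child, and an unclosed element such as “〔h1 HTML6” raises Html6SyntaxError instead of being silently repaired.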