Re: [racket] XML library: representing CDATA

Jay McCarthy Thu, 05 Jan 2012 12:16:55 -0800

I'm happy to take the code below as a patch. I think that would be the
best thing to do in this case, because I don't really have anything
else to add.


Jay

On Thu, 5 Jan 2012 18:07:47 +0000, Norman Gray <nor...@astro.gla.ac.uk> mumbled:

> Jay, hello.

> On 4 Jan 2012, at 20:53, Jay McCarthy wrote:

> >> In the XML module's cdata struct, "[t]he string field is assumed to
> >> be of the form <![CDATA[‹content›]]> with proper quoting of
> >> ‹content›."  It's not clear that this is a very useful design of the
> >> interface.
> > 
> >> Principally, it makes it inconvenient to get at the ‹content›, and
> >> requires calls to substring (or something like that) in order to
> >> extract the ‹content› from cdata-string.
> > 
> > I'm happy including a helper function that does the substring.
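
For reference, a minimal sketch of what such a substring helper might look like. The name cdata-string->content is hypothetical and not part of the xml library; the sketch assumes a Racket recent enough to provide string-prefix? and string-suffix?, and it returns the input unchanged when no wrapper is present.

```racket
#lang racket

;; Hypothetical helper, not part of the xml library: strip the
;; <![CDATA[ ... ]]> wrapper from a string if it is present.
(define cdata-prefix "<![CDATA[")
(define cdata-suffix "]]>")

(define (cdata-string->content s)
  (define min-len (+ (string-length cdata-prefix)
                     (string-length cdata-suffix)))
  (if (and (>= (string-length s) min-len)
           (string-prefix? s cdata-prefix)
           (string-suffix? s cdata-suffix))
      ;; Drop the 9-character prefix and the 3-character suffix.
      (substring s
                 (string-length cdata-prefix)
                 (- (string-length s) (string-length cdata-suffix)))
      ;; No wrapper found: return the string as given.
      s))
```

With the xml library loaded, it would presumably be applied as (cdata-string->content (cdata-string c)).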

> An alternative would be to define, say, the following.

> #lang racket

> (struct source (start stop)) ; dummy definition

> (struct cdata source (chars)
>   #:guard (λ (start stop chars type)
>             (cond ((regexp-match #rx"^<!\\[CDATA\\[(.*)]]>$" chars)
>                    => (λ (m)
>                         (values start stop (list-ref m 1))))
>                   (else (values start stop chars)))))
> (define (cdata-string cdata)
>   (string-append "<![CDATA[" (cdata-chars cdata) "]]>"))

> (define c1 (cdata #f #f "cdata1"))
> (define c2 (cdata #f #f "<![CDATA[cdata2]]>"))

> (printf "c1: ~a & ~a~%" (cdata-chars c1) (cdata-string c1))
> (printf "c2: ~a & ~a~%" (cdata-chars c2) (cdata-string c2))
> =>
> c1: cdata1 & <![CDATA[cdata1]]>
> c2: cdata2 & <![CDATA[cdata2]]>

> This would entail corresponding changes to the XML writer, but would be 
> coherent and backward compatible, in the sense that something that was 
> illegal before would become legal, but nothing hitherto legal would become 
> illegal.

  
> >> Secondly, it represents low-level syntactical information which
> >> should not, I think, be present in the result of a parse of an XML
> >> document.  The fact that the content string originated from within a
> >> CDATA section is, I think, useful to know, but only just.  Note that
> >> the fact that a string or character originated within a CDATA
> >> section is not part of the XML information set
> >> (<http://www.w3.org/TR/xml-infoset/> Sect. 2.6, and Appx D point
> >> 19).  Supposing (which would be sturdily defensible) that xexprs
> >> should represent no more than the content of the XML information
> >> set, then there would be no need for the cdata structure at all
> >> (though this obviously makes escaping characters on output somewhat
> >> more involved).
> > 
> > I'm happy making the backwards compatible change of changing the
> > reader to never produce them.

> Right, so parsing "<p>Foo <![CDATA[b&r<>]]> baz</p>" would produce 

> (list 'p '() "Foo " "b&r<>" " baz")

> or

> (list 'p '() "Foo b&r<> baz")

> The only arguable downside to this is that the presence of a #<cdata> 
> structure gives the caller a hint that there's something that (someone 
> thought) needs escaping here.  However, if they're being as careful as they 
> should be about escaping before outputting, then this won't make any 
> difference.
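
Until the reader changes, the first of the two results above can be approximated by post-processing an xexpr. The following is only a sketch: cdata->text and strip-cdata are hypothetical names, and the xexpr is built by hand rather than read from a document, so nothing here depends on what the reader currently returns.

```racket
#lang racket
(require xml)

;; Hypothetical helper: recover the text inside the CDATA wrapper,
;; falling back to the raw string if the wrapper is missing.
(define (cdata->text c)
  (define s (cdata-string c))
  (match (regexp-match #rx"^<!\\[CDATA\\[(.*)]]>$" s)
    [(list _ content) content]
    [_ s]))

;; Hypothetical helper: walk an xexpr and replace every cdata
;; struct with a plain string; everything else is left alone.
(define (strip-cdata x)
  (cond [(cdata? x) (cdata->text x)]
        [(pair? x) (map strip-cdata x)]
        [else x]))

;; Hand-built equivalent of <p>Foo <![CDATA[b&r<>]]> baz</p>.
(define parsed
  (list 'p '() "Foo " (cdata #f #f "<![CDATA[b&r<>]]>") " baz"))

(define stripped (strip-cdata parsed))
```

Here stripped should be the three-string form; merging adjacent strings to get the second form is left out of the sketch.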

> >> It's also completely counterintuitive: the documentation of this
> >> struct is only three sentences long, and when reading it I _still_
> >> managed to elide the explanation that the CDATA line-noise actually
> >> had to be included in the string, presumaly because it seemed so
> >> obvious that it wouldn't.
> > 
> > The sentence is there because it is non-intuitive. I don't know any
> > other way to say it. The XML collect doesn't insert the wrapper, it
> > assumes it is already there.

> Perhaps a big "NOTE:" at the beginning of the second paragraph would draw 
> attention to it.

> >> Side-issue regarding the wording of the documentation: it's not
> >> completely clear what "proper quoting of content" means.  I presume
> >> it means purely racket-quoting of the string contents, and doesn't
> >> refer to XML quoting at all.  Thus (cdata #f #f "<![CDATA[\"&]]>")
> >> would be acceptable in principle (it is acceptable in fact).
> > 
> > It refers to the fact that "]]>" cannot appear in the content.

> We may be at cross-purposes, then, but it's still not clear what "proper 
> quoting" refers to, since there's no scope for quoting the contents of CDATA 
> sections.  If you want to include "]]>" within/near a CDATA section (perhaps 
> you're writing about CDATA sections, or you have a taste for esoteric 
> smilies: 8]]> "gleeful person with handlebar moustache"), then you'd have to 
> do something like <![CDATA[esoteric smilie: 8]]]]><![CDATA[> "gleeful"]]>

> I think it would be reasonable for write-xexpr and friends to simply throw an 
> error if they find a "]]>" in CDATA content, leaving it up to the creator of 
> the xexpr to handle this corner case themself.
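
Both options discussed above can be sketched in a few lines. The names split-cdata-end, wrap-cdata and check-cdata-content are hypothetical and are not part of the xml library; the sketch assumes a Racket recent enough to provide string-contains?.

```racket
#lang racket

;; Option 1 (hypothetical): split every "]]>" across two CDATA
;; sections, so the terminator never appears inside one section.
(define (split-cdata-end content)
  (string-replace content "]]>" "]]]]><![CDATA[>"))

(define (wrap-cdata content)
  (string-append "<![CDATA[" (split-cdata-end content) "]]>"))

;; Option 2 (hypothetical): refuse content containing "]]>" and
;; leave the corner case to whoever built the xexpr.
(define (check-cdata-content content)
  (when (string-contains? content "]]>")
    (error 'check-cdata-content
           "CDATA content contains \"]]>\": ~e" content))
  content)
```

For example, (wrap-cdata "smilie: 8]]> ok") would yield one CDATA section ending in "8]]" followed by a second one starting with ">".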

> >> Is there any chance of a (admittedly backward-incompatible) change
> >> to this part of the interface?  I doubt that the cdata structure is
> >> very extensively used.
> > 
> > I believe its main use is in including Javascript output where XML
> > quoting will cause stuff like "1 < 2" to fail to compile in most
> > browsers. In that case, it is very important that the CDATA tags not
> > be there (i.e. we WANT invalid XML) because browsers will break on
> > that too.

> That's the broad sort of situation where I'm using it.  Looking at Eli's 
> Javascript example, I think that's a case where the module can properly leave 
> such two-language-at-a-time hacking to the (poor) author, and blithely output 
> <![CDATA[...]]> in all cases.

> Best wishes,

> Norman


> -- 
> Norman Gray  :  http://nxg.me.uk


--
Jay McCarthy <jay.mccar...@gmail.com>
Assistant Professor / Brigham Young University
http://faculty.cs.byu.edu/~jay

"The glory of God is Intelligence" - D&C 93
____________________
  Racket Users list:
  http://lists.racket-lang.org/users
