Hi Herwig,

First, I have to start with an apology, and it's to do with this section:
>> If I compare the a:bar element from both documents with func-deep-equal then they should compare equal, despite the fact that the a:bar qname resolves differently in each context.
>
> Are you saying that deep-equals compares the actual serialization (with prefixes), or that the default equality should do that?
> If so, please read the infoset specification:
> http://www.w3.org/TR/xml-infoset/#infoitem.element
> The relevant quote for this case:

I really don't know how I sent that. I confess to writing it (mea culpa), but it felt wrong as I did so. I went back to the docs, and realized that I had been thinking of how attribute values are not affected by namespace. Stupid error, and I don't know what I was thinking. The weird thing is that I *thought* that I edited the email to remove this error. Maybe I mentioned it twice and only removed it once? Anyway, it was wrong. I apologise for confusing the conversation with this furphy. Subsequent parts of the email were written with the correct understanding.

Incidentally, looking this over did make me think again about storing the fully resolved QNames in the parsed document. I still came down on not doing so, but some of my reasoning is due more to non-XML arguments.

Back to the main thread (I'll remove the part about deep-equals not comparing resolved qnames)...

On Sun, May 25, 2014 at 8:19 AM, Herwig Hochleitner <hhochleit...@gmail.com> wrote:

> 2014-05-23 19:01 GMT+02:00 Paul Gearon <gea...@gmail.com>:
>
>> I still argue for using keywords here. The existing API uses them, and they're a natural fit.
>
> The fact that they have established meaning (for denoting literal xml names + their prefix in a given serialization) in the API is exactly one of my reasons for not wanting to change those semantics. Having a separate tier for representing raw, serialized xml is fine. It's what the library currently does. Adding new behavior, like proper xml namespacing, warrants adding a new tier.
>
>> The one real problem is that elements would need a special hash/equality check to deal with namespace contexts (presuming that fn:deep-equal maps to Object.equals).
>
> I had been thinking along those lines before. Check out the dev thread; I try to argue that at first there, but at some point I realized that it makes no sense to strictly stick to the raw representation and compute other info just on the fly. The key observation is that a tree of raw, prefixed xml doesn't make any sense without xmlns declarations, whereas they are redundant as soon as the tree has been resolved.

My point of view is that processing real-world XML rarely needs the fully resolved URIs. Instead, most applications only care about the element names as they appear in the document. Also, keywords have the nice property of being interned, which matters when parsing 20GB XML files. It's possible to intern a QName implementation, but it will still use more memory per element.

The counter argument is that the standard requires support for handling fully resolved QNames, so these need to be dealt with. However, I don't believe that the use-cases for dealing with fully resolved data come up as often. Also, when they do come up, my experience has been that it is usually with smaller documents (<1MB), where efficiency of comparison is less of a concern. The issue here may be more about the representation tier vs. the model tier. I will address that question more at the bottom of this email.
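Just to make that concrete, here is a rough sketch of the kind of context-aware equality I have in mind. None of this exists in data.xml: the ::ns-context metadata key and both function names are invented for the example, the default namespace is assumed to sit under a nil key, and attributes are compared literally just to keep the sketch short.

(defn resolved-name
  "Resolve a keyword tag such as :a/bar against the element's namespace
  context (inherited context in metadata, merged with any declarations
  made on the element itself), yielding a [uri local-name] pair."
  [elem]
  (let [tag    (:tag elem)
        ctx    (merge (::ns-context (meta elem) {}) (:namespaces elem))
        prefix (some-> (namespace tag) keyword)]
    [(get ctx prefix) (name tag)]))

(defn ns-equal?
  "Compare two trees by resolved name rather than by raw prefix, so :a/bar
  and :b/bar compare equal whenever a and b map to the same URI."
  [e1 e2]
  (if (and (map? e1) (map? e2))
    (and (= (resolved-name e1) (resolved-name e2))
         (= (:attrs e1) (:attrs e2))            ; attributes compared literally here
         (= (count (:content e1)) (count (:content e2)))
         (every? true? (map ns-equal? (:content e1) (:content e2))))
    (= e1 e2)))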
> To your point from below:
>
>> I didn't follow the discussion for putting URIs into keywords, as I could not see why we would want this (am I missing something obvious?)
>
> We need the URIs for xml processing and the XmlNamespace metadata can get lost or not be there in the first place. Also the URI counts for equality, see below. I totally agree that it makes no sense putting them in keywords.

OK, I agree. My difference has been that I don't think that the entire URI needs to exist in the element tag, but rather that it can be built from the element tag plus the namespace context. (That would be the representation-to-model tier mapping. I mention this, along with namespace contexts, at the end.) The keywords would need to be translated according to the current context.

>> However, that approach still works for fragments that can be evaluated in different contexts,
>
> The problem is fragments that are taken out from their (xmlns-declaring) root element and/or that have no XmlNamespace metadata. Apart from actual prefix assignment (which can be done in the emitter), QNames are completely context free in that regard. See the key observation above.

This is why I have advocated attaching the namespace context as metadata.

>> while storing URIs directly needs everything to be rebuilt for another context.
>
> Are you talking about prefix assignments? See my comment about diffing metadata below. I also detailed on this point in the design page.

>> Most possible QNames can be directly expressed as a keyword (for instance, the QName 㑦:㒪 can be represented as the readable keyword :㑦/㒪). The keyword function is just a workaround for exotic values. While I know they can exist (e.g. containing \u000A), I've yet to see any QNames in the wild that cannot be represented as a readable keyword.
>
> Seen xhtml?

Yes. I don't like it (and neither did anyone else, hence HTML 5), but that's not relevant. :) (What I mean is: yes, I know about it, and I know it has to be dealt with. I just don't have to like it.) :)

> What about the QName {http://www.w3.org/1999/xhtml}body? Notice that :http://www.w3.org/1999/xhtml/body would be read like (keyword "http:" "/www.w3.org/1999/xhtml/body"). Another point that's already been made on the dev thread.

Not sure what you're trying to get at with this example. The syntax {http://www.w3.org/1999/xhtml}body is a universal name in Clark's notation, and it's used for describing a resolved QName. As Clark points out, it's not valid to use a universal name in XML: you only use it in the data model. In this case, the QName would presumably be either just "body" with the default namespace, or xhtml:body with an in-context namespace of xmlns:xhtml="http://www.w3.org/1999/xhtml". Consequently, the keyword to be constructed should look like either (keyword "body") or (keyword "xhtml" "body"). Somewhere nearby there will be metadata of {:xhtml "http://www.w3.org/1999/xhtml"}.

As an aside, I was curious, so I tried both full URIs and universal names in a couple of XML validators, and was surprised to see that they validated. I've no idea why. The W3C validators reject them (as they are supposed to).

<snip mistake="stupid"/> <!-- the stupid mistake, with reply, mentioned at the top -->
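Coming back to the (keyword "xhtml" "body") example: the mapping I mean from that keyword-plus-metadata pair to the universal name is roughly the following. The function name and the exact shape of the context map are just assumptions for illustration, not anything in data.xml.

(defn clark-name
  "Build the Clark-notation universal name for a keyword tag, given the
  prefix->URI map carried in the element's metadata."
  [tag ns-context]
  (let [prefix (some-> (namespace tag) keyword)   ; e.g. :xhtml, or nil
        uri    (get ns-context prefix)]           ; a nil key would hold the default namespace
    (if uri
      (str "{" uri "}" (name tag))
      (name tag))))

;; (clark-name :xhtml/body {:xhtml "http://www.w3.org/1999/xhtml"})
;; => "{http://www.w3.org/1999/xhtml}body"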
>> I still don't see why the reverse mapping is needed. Is it because I'm storing the QName in a keyword and can look up the current namespace for the URI, while you are storing the fully qualified name?
>
> First, terminology: In xml the namespace _is_ the uri. The thing that you write before the : in the serialization is a prefix. It is only an artifact of serialization, completely meaningless except when you actually read or write xml. So I want the user to be able to "write" xml without javax.xml, just by transforming the tree back to its context-dependent keyworded prefix-representation. So we need a way to find the (a) current prefix for a namespace.

I agree that the QName format is a serialization artifact. The data model is really all URIs. That said, I've used URIs as the data model, and they get in the way. They are not especially useful to work with, slower, and take more memory (those last two are because they aren't usually interned, they're validated, and they have numerous internal fields). My admittedly limited experience is that elements are almost never accessed as URIs, but by name. The fact that data.xml hasn't supported namespaces before now is an example of how little people use URIs from XML. When processing large amounts of data, I'd much rather deal with a stream of keyword-based elements that provide enough context to build the URI than with the full URI for every element that I have to convert to keywords. I'll get to this when I talk about the tiers below.

>> Sorry, I'm not following what you were getting at with this. In this example D and E both get mapped to the same namespace, meaning that <D:foo/> and <E:foo/> can be made to resolve the same way. But in a different context they could be different values.
>
> Which is the reason we need to lift elements out of their context as soon as possible. We don't want an element to change its namespace just because we transplant it into another xml fragment. Chouser went to great lengths about this point, before he realized that this was exactly my goal as well.

Well, I *am* talking about attaching the context to the elements, so the data is there.

>> If both the explicit declarations of namespaces on elements and the current context are stored with the element (one in the :namespaces field, the other in metadata), then this allows resolution to be handled correctly, while also maintaining where each namespace needs to be emitted.
>
> My plan is to only store the metadata. The set of namespaces is implicitly given by the QNames contained within the fragment, and early introduction of necessary xmlns declarations can be achieved by diffing the metadata. See my design document.
>
> Note: I'm talking about the new representation here. The current one will continue to work unchanged.

Do you mean the "current data.xml" that doesn't support namespaces, or the "current one" meaning the code that you have in your github repo?

>> I guess I was uncomfortable with XmlNamespaceImpl because of the fancy structures with mutation. I was attracted to using a stack, since that's what's going on in the parser/emitter.
>
> Don't be fooled by the transients. XmlNamespaceImpl is an immutable, persistent data structure.

OK, I'll spend more time looking at it. I only browsed over the source briefly and didn't try to grok it.
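If I understand the "diffing the metadata" idea correctly, it amounts to something like the sketch below: compare the parent's in-scope prefix->URI map with the child's, and emit xmlns declarations only for bindings that are new or changed. The function name is invented, and I'm assuming the same prefix->URI maps as in my earlier sketches, so treat it as my reading of your plan rather than your design.

(defn xmlns-decls-needed
  "Given the parent's in-scope prefix->URI map and the child's, return only
  the bindings that are new or changed, i.e. the xmlns declarations the
  emitter would have to introduce on the child."
  [parent-context child-context]
  (into {}
        (remove (fn [[prefix uri]] (= uri (get parent-context prefix)))
                child-context)))

;; (xmlns-decls-needed {:a "urn:one"} {:a "urn:one", :b "urn:two"})
;; => {:b "urn:two"}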
>> If you have time (or inclination) for comparison, you can look at mine at https://github.com/quoll/data.xml on the new_namespaces branch. I haven't yet written the code for equality (fn:deep-equal), nor for resolving the URIs for QNames, but it's parsing and emitting, and I think it's correct. Unfortunately, it's all still in one file, as per the master branch.
>
> I've taken a short look, but stopped reading when I realized that you keep the thing in a dynamic var, in an atom that you mutate from the emitter or parser. It might not say anything about the data structure itself, but it has "wrong approach" written all over it.

I don't like mutation myself. I did it for speed of implementation at the time (and out of laziness, as it borrowed from a SAX parser). I'm not suggesting my code as a complete alternative (e.g. there are still missing parts), but to present a different approach. The reason for that approach was that it was the fastest way (for me) to create a stack that runs in parallel to the parser or emitter and carries the context that the parser/emitter is not showing you. Typically, I would pass the current value of the stack into the code that handles the current element, and the return value from the handler would be a tuple of both the returned data and the new stack value. I'll make that change in the next day or so if it will get the code looked at. While I'm at it, I should break it up into namespaces based on functionality, as you have.

> Also I'd prefer if we could focus the discussion on the proposed specification for now. As soon as we agree there, we can start bikeshedding the data structures.

I guess I've been uncomfortable with the 2-tier model, which has influenced my discussion to date. While the model tier allows for operations like deep-equal, I haven't run into any uses which would require it (there are probably others in the Clojure community who have). So my bias has been to treat that tier as a transformation built from the representational tier on an as-needed basis. Also, my initial reading of the proposal was that the representational tier was a stepping stone on the way to the real data to be returned, which was to be the model tier. Perhaps I was mistaken, and no such emphasis is implied.

One reason for believing that the representation tier is simply a transitional step is that it treats namespace declarations on elements as simple attributes. They may be represented syntactically in this way, but the semantics are already established by the time this tier is constructed (you're still expecting to use the Java StAX parser, right?). Also, since I'm happy with keywords, if I don't want to go to the effort of handling everything in the model tier, then that will put a lot of work on me to deal with namespaces that I shouldn't have to worry about.

The data model I was proposing (and is generated with my version of the parser... regardless of how poorly I implemented the stack) would map to the representational tier, providing all the semantics of the model tier without the full resolution. That is: keywords for tags; a namespace map for the declarations made on the element; and a namespace map in metadata for the namespace context inherited from the parent element.

> I hope to implement the emitter, as well as the tree walkers, soon. Then we may finally have a non-hypothetical design to talk about.

I'm stuck on a project until I have a namespace-aware data.xml, so I might as well try it as well. In my case I'll need to fix up the implementation of the namespace stack, and write a transformation function that converts the tree to a fully resolved one, and back.
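Roughly, I picture that resolution step looking something like the following sketch, given the shape I describe above (keyword tags, a :namespaces map of local declarations, and the inherited context under a hypothetical ::ns-context metadata key). The function name is made up and nothing about it is final.

(defn resolve-tree
  "Walk a keyword-tagged tree, merging the inherited context with the
  declarations made on each element, replacing each tag with a
  [uri local-name] pair and recording the in-scope context in metadata."
  ([elem] (resolve-tree elem {}))
  ([elem inherited]
   (if-not (map? elem)
     elem                                   ; text nodes pass through untouched
     (let [context (merge inherited (:namespaces elem))
           prefix  (some-> (namespace (:tag elem)) keyword)
           uri     (get context prefix)]
       (-> elem
           (assoc :tag [uri (name (:tag elem))]
                  :content (mapv #(resolve-tree % context) (:content elem)))
           (vary-meta assoc ::ns-context context))))))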
You may still hate it, but at least it will show what I'm talking about. :)

Paul