Hi Herwig,

First, I have to start with an apology, and it's to do with this section:
>> If I compare the a:bar element from both documents with func-deep-equal then they should compare equal, despite the fact that the a:bar qname resolves differently in each context.
>
> Are you saying that deep-equals compares the actual serialization (with prefixes), or that the default equality should do that?
> If so, please read the infoset specification:
> http://www.w3.org/TR/xml-infoset/#infoitem.element
> The relevant quote for this case:

I really don't know how I sent that. I confess to writing it (mea culpa), but it felt wrong as I did so. I went back to the docs, and realized that I had been thinking of how attribute values are not affected by namespace. Stupid error, and I don't know what I was thinking. The weird thing is that I *thought* that I edited the email to remove this error. Maybe I mentioned it twice and only removed it once? Anyway, it was wrong. I apologise for confusing the conversation with this furphy. Subsequent parts of the email were written with the correct understanding.

Incidentally, looking this over did make me think again about storing the fully resolved QNames in the parsed document. I still came down on not doing so, but some of my reasoning is due more to non-XML arguments.

Back to the main thread (I'll remove the part about deep-equals not comparing resolved qnames)...

On Sun, May 25, 2014 at 8:19 AM, Herwig Hochleitner <hhochleit...@gmail.com> wrote:

> 2014-05-23 19:01 GMT+02:00 Paul Gearon <gea...@gmail.com>:
>
>> I still argue for using keywords here. The existing API uses them, and they're a natural fit.
>
> The fact that they have established meaning (for denoting literal xml names + their prefix in a given serialization) in the API is exactly one of my reasons for not wanting to change those semantics. Having a separate tier for representing raw, serialized xml is fine. It's what the library currently does. Adding new behavior, like proper xml namespacing, warrants adding a new tier.
>
>> The one real problem is that elements would need a special hash/equality check to deal with namespace contexts (presuming that fn:deep-equal maps to Object.equals).
>
> I had been thinking along those lines before. Check out the dev thread; I try to argue that at first there, but at some point I realized that it makes no sense to strictly stick to the raw representation and compute other info just on the fly. The key observation is that a tree of raw, prefixed xml doesn't make any sense without xmlns declarations, whereas they are redundant as soon as the tree has been resolved.

My point of view is that processing real-world XML rarely needs the fully resolved URIs. Instead, most applications only care about the element names as they appear in the document. Also, keywords have the nice property of being interned, which matters when parsing 20GB XML files. It's possible to intern a QName implementation, but it will still use more memory per element.

The counter argument is that the standard requires support for handling fully resolved QNames, so these need to be dealt with. However, I don't believe that the use-cases for dealing with fully resolved data come up as often. Also, when they do come up, my experience has been that it is usually with smaller documents (<1MB), where efficiency of comparison is less of a concern. The issue here may be more about the representation tier vs. the model tier. I will address that question more at the bottom of this email.
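Just to make that concrete, here is a rough sketch of the kind of context-aware equality I have in mind. None of this exists in data.xml: the ::ns-context metadata key and both function names are invented for the example, the default namespace is assumed to sit under a nil key, and attributes are compared literally just to keep the sketch short.

(defn resolved-name
  "Resolve a keyword tag such as :a/bar against the element's namespace
  context (inherited context in metadata, merged with any declarations
  made on the element itself), yielding a [uri local-name] pair."
  [elem]
  (let [tag    (:tag elem)
        ctx    (merge (::ns-context (meta elem) {}) (:namespaces elem))
        prefix (some-> (namespace tag) keyword)]
    [(get ctx prefix) (name tag)]))

(defn ns-equal?
  "Compare two trees by resolved name rather than by raw prefix, so :a/bar
  and :b/bar compare equal whenever a and b map to the same URI."
  [e1 e2]
  (if (and (map? e1) (map? e2))
    (and (= (resolved-name e1) (resolved-name e2))
         (= (:attrs e1) (:attrs e2))            ; attributes compared literally here
         (= (count (:content e1)) (count (:content e2)))
         (every? true? (map ns-equal? (:content e1) (:content e2))))
    (= e1 e2)))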
> To your point from below:
>
>> I didn't follow the discussion for putting URIs into keywords, as I could not see why we would want this (am I missing something obvious?)
>
> We need the URIs for xml processing and the XmlNamespace metadata can get lost or not be there in the first place. Also the URI counts for equality, see below. I totally agree that it makes no sense putting them in keywords.

OK, I agree. My difference has been that I don't think that the entire URI needs to exist in the element tag, but rather that it can be built from the element tag plus the namespace context. (That would be the representation-to-model tier mapping. I mention this, along with namespace contexts, at the end.) The keywords would need to be translated according to the current context.

>> However, that approach still works for fragments that can be evaluated in different contexts,
>
> The problem is fragments that are taken out from their (xmlns-declaring) root element and/or that have no XmlNamespace metadata. Apart from actual prefix assignment (which can be done in the emitter), QNames are completely context free in that regard. See the key observation above.

This is why I have advocated attaching the namespace context as metadata.

>> while storing URIs directly needs everything to be rebuilt for another context.
>
> Are you talking about prefix assignments? See my comment about diffing metadata below. I also detailed on this point in the design page.

>> Most possible QNames can be directly expressed as a keyword (for instance, the QName 㑦:㒪 can be represented as the readable keyword :㑦/㒪). The keyword function is just a workaround for exotic values. While I know they can exist (e.g. containing \u000A), I've yet to see any QNames in the wild that cannot be represented as a readable keyword.
>
> Seen xhtml?

Yes. I don't like it (and neither did anyone else, hence HTML 5), but that's not relevant. :) (What I mean is: yes, I know about it, and I know it has to be dealt with. I just don't have to like it.) :)

> What about the QName {http://www.w3.org/1999/xhtml}body? Notice that :http://www.w3.org/1999/xhtml/body would be read like (keyword "http:" "/www.w3.org/1999/xhtml/body"). Another point that's already been made on the dev thread.

Not sure what you're trying to get at with this example. The syntax {http://www.w3.org/1999/xhtml}body is a universal name in Clark's notation, and it's used for describing a resolved QName. As Clark points out, it's not valid to use a universal name in XML: you only use it in the data model. In this case, the QName would presumably be either just "body" with the default namespace, or xhtml:body with an in-context namespace of xmlns:xhtml="http://www.w3.org/1999/xhtml". Consequently, the keyword to be constructed should look like either (keyword "body") or (keyword "xhtml" "body"). Somewhere nearby there will be metadata of {:xhtml "http://www.w3.org/1999/xhtml"}.

As an aside, I was curious, so I tried both full URIs and universal names in a couple of XML validators, and was surprised to see that they validated. I've no idea why. The W3C validators reject them (as they are supposed to).

<snip mistake="stupid"/> <!-- the stupid mistake, with reply, mentioned at the top -->
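Coming back to the (keyword "xhtml" "body") example: the mapping I mean from that keyword-plus-metadata pair to the universal name is roughly the following. The function name and the exact shape of the context map are just assumptions for illustration, not anything in data.xml.

(defn clark-name
  "Build the Clark-notation universal name for a keyword tag, given the
  prefix->URI map carried in the element's metadata."
  [tag ns-context]
  (let [prefix (some-> (namespace tag) keyword)   ; e.g. :xhtml, or nil
        uri    (get ns-context prefix)]           ; a nil key would hold the default namespace
    (if uri
      (str "{" uri "}" (name tag))
      (name tag))))

;; (clark-name :xhtml/body {:xhtml "http://www.w3.org/1999/xhtml"})
;; => "{http://www.w3.org/1999/xhtml}body"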
>> I still don't see why the reverse mapping is needed. Is it because I'm storing the QName in a keyword and can look up the current namespace for the URI, while you are storing the fully qualified name?
>
> First, terminology: In xml the namespace _is_ the uri. The thing that you write before the : in the serialization is a prefix. It is only an artifact of serialization, completely meaningless except when you actually read or write xml. So I want the user to be able to "write" xml without javax.xml, just by transforming the tree back to its context-dependent keyworded prefix-representation. So we need a way to find the (a) current prefix for a namespace.

I agree that the QName format is a serialization artifact. The data model is really all URIs. That said, I've used URIs as the data model, and they get in the way. They are not especially useful to work with, slower, and take more memory (those last two are because they aren't usually interned, they're validated, and they have numerous internal fields). My admittedly limited experience is that elements are almost never accessed as URIs, but by name. The fact that data.xml hasn't supported namespaces before now is an example of how little people use URIs from XML. When processing large amounts of data, I'd much rather deal with a stream of keyword-based elements that provide enough context to build the URI than with the full URI for every element that I have to convert to keywords. I'll get to this when I talk about the tiers below.

>> Sorry, I'm not following what you were getting at with this. In this example D and E both get mapped to the same namespace, meaning that <D:foo/> and <E:foo/> can be made to resolve the same way. But in a different context they could be different values.
>
> Which is the reason we need to lift elements out of their context as soon as possible. We don't want an element to change its namespace just because we transplant it into another xml fragment. Chouser went to great lengths about this point, before he realized that this was exactly my goal as well.

Well, I *am* talking about attaching the context to the elements, so the data is there.

>> If both the explicit declarations of namespaces on elements and the current context are stored with the element (one in the :namespaces field, the other in metadata), then this allows resolution to be handled correctly, while also maintaining where each namespace needs to be emitted.
>
> My plan is to only store the metadata. The set of namespaces is implicitly given by the QNames contained within the fragment, and early introduction of necessary xmlns declarations can be achieved by diffing the metadata. See my design document.
>
> Note: I'm talking about the new representation here. The current one will continue to work unchanged.

Do you mean the "current data.xml" that doesn't support namespaces, or the "current one" meaning the code that you have in your github repo?

>> I guess I was uncomfortable with XmlNamespaceImpl because of the fancy structures with mutation. I was attracted to using a stack, since that's what's going on in the parser/emitter.
>
> Don't be fooled by the transients. XmlNamespaceImpl is an immutable, persistent data structure.

OK, I'll spend more time looking at it. I only browsed over the source briefly and didn't try to grok it.
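If I understand the "diffing the metadata" idea correctly, it amounts to something like the sketch below: compare the parent's in-scope prefix->URI map with the child's, and emit xmlns declarations only for bindings that are new or changed. The function name is invented, and I'm assuming the same prefix->URI maps as in my earlier sketches, so treat it as my reading of your plan rather than your design.

(defn xmlns-decls-needed
  "Given the parent's in-scope prefix->URI map and the child's, return only
  the bindings that are new or changed, i.e. the xmlns declarations the
  emitter would have to introduce on the child."
  [parent-context child-context]
  (into {}
        (remove (fn [[prefix uri]] (= uri (get parent-context prefix)))
                child-context)))

;; (xmlns-decls-needed {:a "urn:one"} {:a "urn:one", :b "urn:two"})
;; => {:b "urn:two"}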
>> If you have time (or inclination) for comparison, you can look at mine at https://github.com/quoll/data.xml on the new_namespaces branch. I haven't yet written the code for equality (fn:deep-equal), nor for resolving the URIs for QNames, but it's parsing and emitting, and I think it's correct. Unfortunately, it's all still in one file, as per the master branch.
>
> I've taken a short look, but stopped reading when I realized that you keep the thing in a dynamic var, in an atom that you mutate from the emitter or parser. It might not say anything about the data structure itself, but it has "wrong approach" written all over it.

I don't like mutation myself. I did it for speed of implementation at the time (and out of laziness, as it borrowed from a SAX parser). I'm not suggesting my code as a complete alternative (e.g. there are still missing parts), but to present a different approach. The reason for that approach was that it was the fastest way (for me) to create a stack that runs in parallel to the parser or emitter and carries the context that the parser/emitter is not showing you. Typically, I would pass the current value of the stack into the code that handles the current element, and the return value from the handler would be a tuple of both the returned data and the new stack value. I'll make that change in the next day or so if it will get the code looked at. While I'm at it, I should break it up into namespaces based on functionality, as you have.

> Also I'd prefer if we could focus the discussion on the proposed specification for now. As soon as we agree there, we can start bikeshedding the data structures.

I guess I've been uncomfortable with the 2-tier model, which has influenced my discussion to date. While the model tier allows for operations like deep-equal, I haven't run into any uses which would require it (there are probably others in the Clojure community who have). So my bias has been to treat that tier as a transformation built from the representational tier on an as-needed basis. Also, my initial reading of the proposal was that the representational tier was a stepping stone on the way to the real data to be returned, which was to be the model tier. Perhaps I was mistaken, and no such emphasis is implied.

One reason for believing that the representation tier is simply a transitional step is that it treats namespace declarations on elements as simple attributes. They may be represented syntactically in this way, but the semantics are already established by the time this tier is constructed (you're still expecting to use the Java StAX parser, right?). Also, since I'm happy with keywords, if I don't want to go to the effort of handling everything in the model tier, then that will put a lot of work on me to deal with namespaces that I shouldn't have to worry about.

The data model I was proposing (and is generated with my version of the parser... regardless of how poorly I implemented the stack) would map to the representational tier, providing all the semantics of the model tier without the full resolution. That is: keywords for tags; a namespace map for the declarations made on the element; and a namespace map in metadata for the namespace context inherited from the parent element.

> I hope to implement the emitter, as well as the tree walkers, soon. Then we may finally have a non-hypothetical design to talk about.

I'm stuck on a project until I have a namespace-aware data.xml, so I might as well try it as well. In my case I'll need to fix up the implementation of the namespace stack, and write a transformation function that converts the tree to a fully resolved one, and back.
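Roughly, I picture that resolution step looking something like the following sketch, given the shape I describe above (keyword tags, a :namespaces map of local declarations, and the inherited context under a hypothetical ::ns-context metadata key). The function name is made up and nothing about it is final.

(defn resolve-tree
  "Walk a keyword-tagged tree, merging the inherited context with the
  declarations made on each element, replacing each tag with a
  [uri local-name] pair and recording the in-scope context in metadata."
  ([elem] (resolve-tree elem {}))
  ([elem inherited]
   (if-not (map? elem)
     elem                                   ; text nodes pass through untouched
     (let [context (merge inherited (:namespaces elem))
           prefix  (some-> (namespace (:tag elem)) keyword)
           uri     (get context prefix)]
       (-> elem
           (assoc :tag [uri (name (:tag elem))]
                  :content (mapv #(resolve-tree % context) (:content elem)))
           (vary-meta assoc ::ns-context context))))))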
You may still hate it, but at least it will show what I'm talking about. :)

Paul