Re: losing characters in multi-line XML with Sax and Xerces

2005-05-12 Thread Joseph Kesselman
Standard reminder that SAX may return contiguous text as multiple characters() events; forgetting that is the usual cause of this particular complaint. __ Joe Kesselman, IBM Next-Generation Web Technologies: XML, XSL and more. "The world changed profoundly and u
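
A minimal sketch of the usual fix, assuming a handler that only needs the text found immediately before each end tag. The class name `TextCollector` and the helper `leafTexts` are made up for illustration; the point is that text is appended in characters() and only read in endElement().

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: buffers text across characters() calls.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0);               // start a fresh run of text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length);  // may be called several times per text run
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (buffer.length() > 0) {
            texts.add(buffer.toString());  // only now is the text known to be complete
        }
        buffer.setLength(0);
    }

    static List<String> leafTexts(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(leafTexts("<doc><line>one &amp; two</line><line>three</line></doc>"));
    }
}
```

Entity references and buffer boundaries are typical places where a parser splits text, so a handler that reads the text inside characters() may see only a fragment.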

Re: How to use entities with XML Schema?

2005-05-17 Thread Joseph Kesselman
Schema has no concept of entities, so to do this you have to validate against a DTD (or an internal subset) to expand the entities, then validate again against the schema. I _think_ simply turning on both kinds of validation and having the proper doctype in the source file will do the right thing.

RE: schema validation and default attributes

2005-05-20 Thread Joseph Kesselman
FWIW, this is why the DOM included the "specified" flag on Attr nodes. I think DOM Level 3 may address how/whether one asks the same question about schemas. Of course the real question is whether the default is present in the _output_ DTD/schema, which is not necessarily the same as the input and
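
As a rough illustration of the "specified" flag, the sketch below parses a document whose internal subset supplies an attribute default and asks the Attr node whether the value was actually written in the instance. The class and method names are hypothetical, and the behavior shown assumes a parser that applies internal-subset defaults, as the JDK's bundled parser is expected to do.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SpecifiedDemo {
    // Internal subset that gives <item> a default color attribute.
    static final String DTD = "<!DOCTYPE item [<!ATTLIST item color CDATA \"red\">]>";

    // True if the attribute was written in the document, false if it was defaulted.
    static boolean rootAttrSpecified(String xml, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Attr attr = doc.getDocumentElement().getAttributeNode(attrName);
            return attr.getSpecified();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootAttrSpecified(DTD + "<item/>", "color"));
        System.out.println(rootAttrSpecified(DTD + "<item color=\"red\"/>", "color"));
    }
}
```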

Re: How to add DocumentType node to a Document

2005-05-25 Thread Joseph Kesselman
The official DOM answer is "create the DocumentType first, then use it when creating the Document node." This is because some DOM implementations may specialize themselves differently depending on what kind of document they're processing. Some (not all!) DOMs will also permit you to simply add the
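
The doctype-first order can be sketched with the standard DOMImplementation calls. The element name "note" and the system identifier "note.dtd" are placeholders, not anything from the original thread.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

class DoctypeFirst {
    static Document newNoteDocument() {
        try {
            DOMImplementation impl = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().getDOMImplementation();
            // 1. Create the DocumentType (qualified name, public ID, system ID).
            DocumentType doctype = impl.createDocumentType("note", null, "note.dtd");
            // 2. Hand it to createDocument, which also creates the root element.
            return impl.createDocument(null, "note", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newNoteDocument();
        System.out.println(doc.getDoctype().getName() + " / " + doc.getDocumentElement().getTagName());
    }
}
```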

Re: Validation User-Exits in Xerces

2005-06-03 Thread Joseph Kesselman
The idea of being able to have the schema grammar directly trigger processing, somewhat like YACC and similar grammar-to-action-binding tools, has been proposed in the past. I don't _think_ Xerces explicitly supports it, but I wouldn't mind being wrong. The more usual approach is to plug in SAX- o

RE: Parse XML doc in a string using DOM?

2005-06-21 Thread Joseph Kesselman
Note that is not valid XML; that should have been .

Re: Problem reading newline characters and rewriting them

2005-07-12 Thread Joseph Kesselman
Per the XML Recommendation, XML parsers normalize all newline sequences into the XML newline character (#xA, line feed). There's no information retained about which version of newline was read in.
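
A small check of that normalization, as a sketch: the input mixes CR LF, bare CR, and LF, and the text the application sees contains only LF. The class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class NewlineDemo {
    // Returns the text content of the root element as the parser reports it.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = rootText("<a>one\r\ntwo\rthree\nfour</a>");
        // Show the line breaks explicitly.
        System.out.println(text.replace("\n", "\\n").replace("\r", "\\r"));
    }
}
```

If the original line endings matter, they have to be recorded outside the parser, or written as character references such as `&#13;`, which are not normalized.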

Re: preserve white spaces outside the root-element

2005-07-13 Thread Joseph Kesselman
The DOM spec says the Document node can't contain Text nodes, so there's really no way to record this whitespace in a standard DOM Document.

Re: going crazy with this: org.xml.sax.SAXParseException: Content is not allowed in prolog

2005-07-21 Thread Joseph Kesselman
Check the XML spec for the definition of "prolog" -- I believe this message means you have something not permitted (e.g., non-whitespace text) before the root element of the document.
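
To see the condition in isolation, here is a sketch that feeds the parser a document with stray text before the root element and reports whether parsing failed. The helper name `parseError` is hypothetical, and the exact message wording depends on the parser and locale, so the sketch does not rely on it.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class PrologDemo {
    // Returns null if the document parses, otherwise the parser's error message.
    static String parseError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrows fatal errors, prints nothing
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseError("<a/>"));         // well-formed
        System.out.println(parseError("   <a/>"));      // whitespace before the root is allowed
        System.out.println(parseError("oops<a/>"));     // text before the root is not
    }
}
```

In practice the stray content is often invisible, for example a byte order mark the decoder did not consume, or a file name passed where the document text was expected.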

Re: Parsing an XML document without a root element

2005-08-17 Thread Joseph Kesselman
Write a wrapper document that pulls in this file as an external parsed entity.
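
One way this might look, as a sketch: the wrapper declares an external entity and references it inside a single root element. The names `wrapper`, `body`, and `fragment.xml` are placeholders, and an EntityResolver stands in for the real file so the example is self-contained. It assumes the parser is allowed to load external entities, which is the default in the JDK but is often disabled for untrusted input.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class WrapperDemo {
    // Wrapper document: one root element whose content is the external entity.
    static final String WRAPPER =
            "<!DOCTYPE wrapper [<!ENTITY body SYSTEM \"fragment.xml\">]>"
          + "<wrapper>&body;</wrapper>";

    // Parses a rootless fragment through the wrapper and counts its <item> elements.
    static int countItems(String fragment) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Stand-in for the real file: serve the fragment text for "fragment.xml".
            builder.setEntityResolver((publicId, systemId) ->
                    systemId != null && systemId.endsWith("fragment.xml")
                            ? new InputSource(new StringReader(fragment))
                            : null);
            Document doc = builder.parse(new InputSource(new StringReader(WRAPPER)));
            return doc.getElementsByTagName("item").getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countItems("<item>1</item><item>2</item>"));
    }
}
```

The fragment still has to be a well-formed external parsed entity: balanced tags and no DOCTYPE of its own.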

Re: Including an xml-file into another xml-file

2005-08-17 Thread Joseph Kesselman
Sounds like the file you're trying to read in isn't even a well-formed XML entity. Fix it?

Re: Validate without DOCTYPE

2006-01-18 Thread Joseph Kesselman
The usual/simplest solution is to set up a filtering stream wrapper which prepends the appropriate doctype declaration if one isn't provided in the file, and parse from that. This may not be elegant, but it's simple and it works.
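
A sketch of the prepending step using SequenceInputStream. It assumes the caller already knows the input has no DOCTYPE and no XML declaration (a declaration would have to stay first, so a real filter would need to peek at the stream). The class name, helper names, and the tiny DTD are all invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DoctypePrepender {
    // Returns a stream that yields the doctype declaration, then the original bytes.
    static InputStream withDoctype(InputStream in, String doctypeDecl) {
        byte[] prefix = doctypeDecl.getBytes(StandardCharsets.UTF_8);
        return new SequenceInputStream(new ByteArrayInputStream(prefix), in);
    }

    // Validating parse; any validity or well-formedness error counts as invalid.
    static boolean isValid(InputStream in) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity errors are only reported, so rethrow them
                }
            });
            builder.parse(in);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean validate(String xml, String doctypeDecl) {
        InputStream raw = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        return isValid(withDoctype(raw, doctypeDecl));
    }

    public static void main(String[] args) {
        String doctype = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>]>";
        System.out.println(validate("<note>hi</note>", doctype));
        System.out.println(validate("<note><x/></note>", doctype));
    }
}
```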

Re: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

2006-02-21 Thread Joseph Kesselman
Note too that a well-formed XML document can only have one top-level element -- everything after that is normally discarded -- so that too could be used as a clue for dividing a multiple-document stream. Or you could invent some new marker between documents, and have your input-stream filter use th

RE: How to read multiple XML from socket: cannot change the protocol (Re: How to handle continuous stream of XML)

2006-02-23 Thread Joseph Kesselman
>I believe that you can have PI, comments, whitespace etc after the root >element, is that significant for you ? They can exist in the file. They aren't supposed to be significant to the parser. Obviously, if present, they're a problem for dividing up a stream into multiple documents, which brings

Re: Lossless parsing

2006-02-28 Thread Joseph Kesselman
>I have a case where I want to apply slight changes to a document. The >most part of the document should be left unchanged, though. By >"unchanged" I mean *really* unchanged: In particular the documents >syntactical representation must not be changed. Process it as text? Seriously, if that's the

Re: xerces DOM concurrent access and defer-node-expansion

2006-03-07 Thread Joseph Kesselman
>The FAQ[1], declares xerces DOM implementation is not thread safe. Most DOMs are not threadsafe, as the DOM REC points out. Threadsafety at such a low level of a system tends to be expensive and redundant, and often insufficient since what you're concerned about is safety over a complete transac

Re: xerces DOM concurrent access and defer-node-expansion

2006-03-15 Thread Joseph Kesselman
>Would it work if length and previously accessed position were stored in >ThreadLocal variables? Don't go there. You're starting to talk about impairing performance for all users to protect a few who really should be coding higher-level interlocks in any case.

Re: Parser problem in jdk1.5

2006-03-22 Thread Joseph Kesselman
Remember that text content may be spread over several successive calls to characters(). (Single most common SAX coding error...)

Re: Xlink support

2006-03-24 Thread Joseph Kesselman
>3. The idea is to use XLink to "Include" an XML document into >another. By "Include" I mean reference but still access the node as >if it were in the current document. At least that was the idea I was >given. Sounds like you want to look at XInclude as well as XLink. I think more implementatio

Re: White spaces before Processing Instruction

2006-03-27 Thread Joseph Kesselman
>Can Parser be forgiving about the white spaces before instruction ? Per the XML spec, nothing may precede the XML Declaration except a Byte Order Mark, and the XML Parser should enforce that rule. I'd suggest you set up a stream filter which discards leading space, and parse from that, if you re
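
A possible shape for such a filter, sketched with PushbackInputStream. It assumes an ASCII-compatible encoding such as UTF-8, since it compares raw bytes; UTF-16 input would need different handling. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

class LeadingSpaceFilter {
    // Consumes leading space, tab, CR and LF bytes, then returns the rest of the stream.
    static InputStream skipLeadingWhitespace(InputStream in) {
        try {
            PushbackInputStream pb = new PushbackInputStream(in, 1);
            int b = pb.read();
            while (b == ' ' || b == '\t' || b == '\r' || b == '\n') {
                b = pb.read();
            }
            if (b != -1) {
                pb.unread(b); // put back the first non-whitespace byte
            }
            return pb;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the filtered input and returns the root element's name.
    static String rootName(String xml) {
        try {
            InputStream raw = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(skipLeadingWhitespace(raw))
                    .getDocumentElement().getTagName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName("  \n<?xml version=\"1.0\"?><a/>"));
    }
}
```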

RE: Converting HTML to XML For Printing

2006-03-27 Thread Joseph Kesselman
You could try using the NekoHTML parser (based on Xerces) and feeding its output to Xalan for XSLT processing. I don't think we have a canned off-the-shelf demonstration of that combination, but it ought to be straightforward. I believe the W3C's "tidy" tool can also be persuaded to function as an

[OT] A Sax response to Stax

2006-03-29 Thread Joseph Kesselman
On Tuesday, 03/28/2006 at 09:02 EST, "Dave Brosius" <[EMAIL PROTECTED]> wrote: > I've always wondered why ContentHandler's startElement didn't return a > boolean as to whether child content event notification was desired. Seems > like that would improve sax performance significantly for many applic

Re: [OT] A Sax response to Stax

2006-03-29 Thread Joseph Kesselman
FWIW, I didn't mean "instead", I meant "as well" -- skip-this-node's-descendants is a perfectly reasonable concept. If anyone's seriously pursuing this, it may be worth reviewing the DOM Level 2 Traversal feature, specifically the NodeFilter API, to see how someone else addressed the concept of f
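
For reference, this is roughly how the DOM Level 2 Traversal API expresses "skip this node and its descendants": a NodeFilter returns FILTER_REJECT, and a TreeWalker then never descends into that subtree. The sketch assumes the Document implementation supports the Traversal feature (the cast to DocumentTraversal would fail otherwise); the class and method names are invented.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.traversal.DocumentTraversal;
import org.w3c.dom.traversal.NodeFilter;
import org.w3c.dom.traversal.TreeWalker;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class SkipSubtreeDemo {
    // Lists element names below the root, skipping any subtree rooted at skipName.
    static List<String> visit(String xml, String skipName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Assumption: this DOM supports the Traversal feature.
            DocumentTraversal traversal = (DocumentTraversal) doc;
            NodeFilter filter = node -> node.getNodeName().equals(skipName)
                    ? NodeFilter.FILTER_REJECT   // skip the node and all descendants
                    : NodeFilter.FILTER_ACCEPT;
            TreeWalker walker = traversal.createTreeWalker(
                    doc.getDocumentElement(), NodeFilter.SHOW_ELEMENT, filter, false);
            List<String> names = new ArrayList<>();
            for (Node n = walker.nextNode(); n != null; n = walker.nextNode()) {
                names.add(n.getNodeName());
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(visit("<a><b><c/></b><skip><d/></skip><e/></a>", "skip"));
    }
}
```

Returning FILTER_SKIP instead would hide the node itself but still visit its children, which is the distinction a SAX-level equivalent would also need to draw.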

Re: Notation problems: parser reports undeclared although it is declared in my schema

2006-04-27 Thread Joseph Kesselman
DTD validation occurs before schema validation. If the DTD references a notation, it must define that notation, independently of whether the schema defines it. __ "... Three things are most perilous: Connectors that corrode, Unproven algorithms, and self-modif

Re: Convert HTML to XHTML with namespace prefix using Neko + Xerces

2006-04-27 Thread Joseph Kesselman
>But then I have to pick the content of body tag, already serialized, using > substring operation, Don't use string operations to manipulate XML. Use XML APIs. They're namespace-aware and will Do The Right Things.

Re: NullPointerException when calling Document.importNode(Node importedNode,boolean deep)

2006-05-02 Thread Joseph Kesselman
You might want to try running this under a debugger to see what field is actually coming up as null. I'd be inclined to suspect that the node you're importing from is damaged and/or otherwise isn't properly implementing the DOM APIs, hence is returning null at a time when null isn't expected. If n

Re: Should I use SAX or DOM?

2006-05-22 Thread Joseph Kesselman
http://www.w3.org/DOM/faq.html#SAXandDOM __ "... Three things are most perilous: Connectors that corrode, Unproven algorithms, and self-modifying code! ..." -- "Threes" Rev 1.1 - Duane Elms / Leslie Fish (http://www.ovff.org/pegasus/songs/threes-rev-11.html)

Re: Method getCharacterOffset() of XMLLocator

2006-06-23 Thread Joseph Kesselman
I'd bet the problem here is that you're confusing offset in (Unicode) characters with offset in (file) bytes. When you have an encoding such as UTF-8, where some characters take more than one byte, the difference becomes important.
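
The gap between the two offsets is easy to demonstrate. This sketch converts a character offset within a Java String into the corresponding UTF-8 byte offset; the helper name is made up, and note that Java's own "characters" are UTF-16 code units, which is a further wrinkle for text outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

class OffsetDemo {
    // Number of UTF-8 bytes occupied by the first charOffset chars of text.
    static int utf8ByteOffset(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo";   // 'e' with acute accent takes two bytes in UTF-8
        System.out.println(utf8ByteOffset(text, 1));
        System.out.println(utf8ByteOffset(text, 2));
    }
}
```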

Re: Is the xerces implementation of SAX thread-safe?

2006-06-23 Thread Joseph Kesselman
Are you sure that your own SAX handler code is thread-safe? (Standard reminder of the single most common SAX coding error: if text content is being truncated, you probably forgot to deal with the possibility of several successive calls to characters().)

Re: Validating data against a schema

2006-06-26 Thread Joseph Kesselman
DOM Level 3 introduces some ability to validate subtrees on demand. I'm not sure whether the Xerces implementation of the DOM has added those features.

Re: Using grammarpool with included schemas

2006-07-07 Thread Joseph Kesselman
A namespace name, although it is expressed as a URI, is just a name. Normal XML processing never attempts to retrieve anything from it, so it is never processed by the EntityResolver. (The Semantic Web group may eventually define what, if anything, might be accessible through the namespace UR

Re: XML size validations

2024-03-07 Thread Joseph Kesselman
What concerns are you actually trying to address? For SAX, document length could be limited by running tests in the handler and throwing an exception if "reasonable" count or time is exceeded. And I *think* I remember Xerces adding the ability to limit depth of parsed entity recursion, if you'r
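
The handler-side limit could be sketched like this: count elements in startElement() and abort the parse by throwing once a threshold is passed. The class name, the helper, and the choice of "element count" as the measure are all assumptions for illustration; a real check might track text length, depth, or elapsed time instead. For simplicity the sketch treats any SAXException, including ordinary parse errors, as a rejection.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that stops the parse after too many elements.
class ElementLimitHandler extends DefaultHandler {
    private final int maxElements;
    private int seen;

    ElementLimitHandler(int maxElements) {
        this.maxElements = maxElements;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (++seen > maxElements) {
            // Throwing from a handler callback aborts the parse.
            throw new SAXException("element limit of " + maxElements + " exceeded");
        }
    }

    // True if the document parsed without exceeding the limit.
    static boolean withinLimit(String xml, int maxElements) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)),
                           new ElementLimitHandler(maxElements));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withinLimit("<a><b/><c/></a>", 3));
        System.out.println(withinLimit("<a><b/><c/></a>", 2));
    }
}
```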

Re: Parsing HTML

2025-05-27 Thread Joseph Kesselman
Supporting an HTML DOM, and being able to serialize to HTML, does not necessarily imply being able to parse HTML. As far as I know, that last is not supported by Xerces. I was able to (ab)use the W3C's _tidy_ tool to do some basic HTML parsing. Inelegant but it sufficed for what I needed.