Standard reminder that SAX may return contiguous text as multiple
characters() events; forgetting that is the usual cause of this particular
complaint.
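A minimal sketch of the usual fix, assuming a hypothetical handler class named TextCollector that gathers the text of each "item" element (both names are made up for illustration): append in characters(), and only read the buffer in endElement().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of each <item> element.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> items = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        buffer.setLength(0); // start a fresh buffer at every element start
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so append, never assign.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("item")) {
            items.add(buffer.toString()); // the text is only complete here
        }
    }

    static List<String> collect(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.items;
    }
}
```

This simple version resets the buffer at every start tag, so it only suits elements without mixed content.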
__
Joe Kesselman, IBM Next-Generation Web Technologies: XML, XSL and more.
"The world changed profoundly and u
Schema has no concept of entities, so to do this you have to validate
against a DTD (or an internal subset) to expand the entities, then validate
again against the schema. I _think_ simply turning on both kinds of
validation and having the proper doctype in the source file will do the
right thing.
FWIW, this is why the DOM included the "specified" flag on Attr nodes. I
think DOM Level 3 may address how/whether one asks the same question about
schemas.
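For the DTD case, a small sketch of reading that flag through JAXP. The document, the class name SpecifiedFlagDemo and the attribute names are invented for the example; Attr.getSpecified() is the standard DOM call.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SpecifiedFlagDemo {
    // Illustrative document: "lang" has a DTD default, "id" is written in the instance.
    static final String XML =
        "<!DOCTYPE note [\n"
      + "  <!ELEMENT note (#PCDATA)>\n"
      + "  <!ATTLIST note lang CDATA \"en\" id CDATA #IMPLIED>\n"
      + "]>\n"
      + "<note id=\"n1\">hello</note>";

    // Returns true if the attribute appeared in the source, false if it was defaulted.
    static boolean isSpecified(String attrName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        Attr attr = doc.getDocumentElement().getAttributeNode(attrName);
        return attr.getSpecified();
    }
}
```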
Of course the real question is whether the default is present in the
_output_ DTD/schema, which is not necessarily the same as the input and
The official DOM answer is "create the DocumentType first, then use it when
creating the Document node." This is because some DOM implementations may
specialize themselves differently depending on what kind of document
they're processing.
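Roughly, through the standard DOMImplementation interface. The XHTML identifiers are only sample values, and DoctypeFirst is a made-up class name.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

class DoctypeFirst {
    static Document newXhtmlDocument() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().getDOMImplementation();
        // Step 1: create the DocumentType (sample public/system identifiers).
        DocumentType doctype = impl.createDocumentType(
            "html",
            "-//W3C//DTD XHTML 1.0 Strict//EN",
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        // Step 2: hand it to createDocument so the implementation sees it up front.
        return impl.createDocument("http://www.w3.org/1999/xhtml", "html", doctype);
    }
}
```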
Some (not all!) DOMs will also permit you to simply add the
The idea of being able to have the schema grammar directly trigger
processing, somewhat like YACC and similar grammar-to-action-binding tools,
has been proposed in the past. I don't _think_ Xerces explicitly supports
it, but I wouldn't mind being wrong.
The more usual approach is to plug in SAX- o
Note that is not valid XML; that should have been .
__
Joe Kesselman, IBM Next-Generation Web Technologies: XML, XSL and more.
"The world changed profoundly and unpredictably the day Tim Berners Lee
got bitten by a radioactive spider." -- Rafe Culpin, in r.m.fi
Per the XML Recommendation, XML parsers normalize all newline sequences
into the XML newline character. There's no information retained about which
version of newline was read in.
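A quick way to see this for yourself, as a sketch using the JAXP DOM parser (NewlineDemo is a made-up name): feed it text with mixed line endings and look at what comes back.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class NewlineDemo {
    // Returns the text content of the root element as the parser reports it.
    static String textOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }
}
```

Passing in "<a>one\r\ntwo\rthree\nfour</a>" should give back text in which every line break is a single "\n".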
__
The DOM spec says the Document node can't contain Text nodes, so there's
really no way to record this whitespace in a standard DOM Document.
__
Check the XML spec for the definition of "prolog" -- I believe this message
means you have something not permitted (e.g. non-whitespace text) before the
root element of the document.
__
Write a wrapper document that pulls in this file as an external parsed
entity.
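A sketch of that wrapper, kept self-contained by serving the entity from a string through an EntityResolver. The entity name "content", the identifier "urn:example:fragment" and the class name WrapperDemo are all invented; in practice the SYSTEM identifier would simply point at the real file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WrapperDemo {
    // Hypothetical system identifier standing in for the real file's location.
    static final String FRAGMENT_ID = "urn:example:fragment";

    static Document parseFragment(String fragment) throws Exception {
        // The wrapper declares the fragment as an external parsed entity
        // and references it inside a single root element.
        String wrapper =
            "<!DOCTYPE wrapper [\n"
          + "  <!ENTITY content SYSTEM \"" + FRAGMENT_ID + "\">\n"
          + "]>\n"
          + "<wrapper>&content;</wrapper>";
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // For the demo only: supply the entity text from memory instead of a file.
        builder.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith(FRAGMENT_ID)
                ? new InputSource(new StringReader(fragment))
                : null);
        return builder.parse(new InputSource(new StringReader(wrapper)));
    }
}
```

The fragment still has to be a well-formed external parsed entity, but it may hold several top-level elements.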
__
Sounds like the file you're trying to read in isn't even a well-formed XML
entity. Fix it?
__
The usual/simplest solution is to set up a filtering stream wrapper which
prepends the appropriate doctype declaration if one isn't provided in the
file, and parse from that. This may not be elegant, but it's simple and it
works.
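One possible shape for such a wrapper, as a sketch only. It assumes the input has no XML declaration (which would have to stay first), that any existing doctype shows up within the first 1024 bytes, and that the encoding is ASCII-compatible; DoctypePrepender and withDoctype are made-up names. It needs Java 9 or later for readNBytes.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;

class DoctypePrepender {
    private static final int PEEK = 1024; // assumed to be enough to cover the prolog

    static InputStream withDoctype(InputStream in, String doctypeDecl) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        // Peek at the start of the stream, then rewind.
        buffered.mark(PEEK);
        byte[] head = new byte[PEEK];
        int n = buffered.readNBytes(head, 0, PEEK);
        buffered.reset();
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        if (prefix.contains("<!DOCTYPE")) {
            return buffered; // already has a doctype, pass through unchanged
        }
        // Otherwise put the declaration in front of the original bytes.
        InputStream decl = new ByteArrayInputStream(
            doctypeDecl.getBytes(StandardCharsets.UTF_8));
        return new SequenceInputStream(decl, buffered);
    }
}
```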
__
Note too that a well-formed XML document can only have one top-level
element -- everything after that is normally discarded -- so that too could
be used as a clue for dividing a multiple-document stream.
Or you could invent some new marker between documents, and have your
input-stream filter use th
>I believe that you can have PI, comments, whitespace etc after the root
>element, is that significant for you ?
They can exist in the file. They aren't supposed to be significant to the
parser. Obviously, if present, they're a problem for dividing up a stream
into multiple documents, which brings
>I have a case where I want to apply slight changes to a document. The
>most part of the document should be left unchanged, though. By
>"unchanged" I mean *really* unchanged: In particular the documents
>syntactical representation must not be changed.
Process it as text?
Seriously, if that's the
>The FAQ[1], declares xerces DOM implementation is not thread safe.
Most DOMs are not threadsafe, as the DOM REC points out.
Threadsafety at such a low level of a system tends to be expensive and
redundant, and often insufficient since what you're concerned about is
safety over a complete transac
>Would it work if length and previously accessed position were stored in
>ThreadLocal variables?
Don't go there. You're starting to talk about impairing performance for all
users to protect a few who really should be coding higher-level interlocks
in any case.
_
Remember that text content may be spread over several successive calls to
characters(). (Single most common SAX coding error...)
__
Joe Kesselman -- Beware of Blueshift!
"The world changed profoundly and unpredictably the day Tim Berners Lee
got bitten by a radi
>3. The idea is to use XLink to "Include" an XML document into
>another. By "Include" I mean reference but still access the node as
>if it were in the current document. At least that was the idea I was
>given.
Sounds like you want to look at XInclude as well as XLink. I think more
implementatio
>Can Parser be forgiving about the white spaces before instruction ?
Per the XML spec, nothing may precede the XML Declaration except a Byte
Order Mark, and the XML Parser should enforce that rule.
I'd suggest you set up a stream filter which discards leading space, and
parse from that, if you re
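A sketch of such a filter, working at the Reader level (so it assumes the character encoding is already known). LeadingSpaceFilter is a made-up name, and only the four XML whitespace characters are skipped.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;

class LeadingSpaceFilter {
    // Consumes leading XML whitespace and returns a reader positioned at the
    // first non-whitespace character.
    static Reader skipLeadingWhitespace(Reader in) throws IOException {
        PushbackReader reader = new PushbackReader(in, 1);
        int c;
        while ((c = reader.read()) != -1) {
            boolean xmlSpace = c == ' ' || c == '\t' || c == '\r' || c == '\n';
            if (!xmlSpace) {
                reader.unread(c); // put back the first real character
                break;
            }
        }
        return reader;
    }
}
```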
You could try using the NekoHTML parser (based on Xerces) and feeding its
output to Xalan for XSLT processing. I don't think we have a canned
off-the-shelf demonstration of that combination, but it ought to be
straightforward.
I believe the W3C's "tidy" tool can also be persuaded to function as an
On Tuesday, 03/28/2006 at 09:02 EST, "Dave Brosius" <[EMAIL PROTECTED]>
wrote:
> I've always wondered why ContentHandler's startElement didn't return a
> boolean as to whether child content event notification was desired. Seems
> like that would improve sax performance significantly for many
applic
FWIW, I didn't mean "instead", I meant "as well" --
skip-this-node's-descendants is a perfectly reasonable concept.
If anyone's seriously pursuing this, it may be worth reviewing the DOM
Level 2 Traversal feature, specifically the NodeFilter API, to see how
someone else addressed the concept of f
DTD validation occurs before schema validation. If the DTD references a
notation, it must define that notation, independently of whether the schema
defines it.
__
"... Three things are most perilous: Connectors that corrode,
Unproven algorithms, and self-modif
>But then I have to pick the content of body tag, already serialized, using
> substring operation,
Don't use string operations to manipulate XML. Use XML APIs. They're
namespace-aware and will Do The Right Things.
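For instance, a sketch that picks out a "body" element with the DOM and serializes just that subtree through the standard identity Transformer. BodyExtractor is a made-up name, and matching on the local name "body" in any namespace is an assumption about the document in question.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BodyExtractor {
    static String serializeBody(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // First element whose local name is "body", whatever its namespace.
        Node body = doc.getElementsByTagNameNS("*", "body").item(0);
        // An identity transform serializes the subtree rooted at that node.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(body), new StreamResult(out));
        return out.toString();
    }
}
```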
__
You might want to try running this under a debugger to see what field is
actually coming up as null. I'd be inclined to suspect that the node you're
importing from is damaged and/or otherwise isn't properly implementing the
DOM APIs, hence is returning null at a time when null isn't expected.
If n
http://www.w3.org/DOM/faq.html#SAXandDOM
__
"... Three things are most perilous: Connectors that corrode,
Unproven algorithms, and self-modifying code! ..."
-- "Threes" Rev 1.1 - Duane Elms / Leslie Fish
(http://www.ovff.org/pegasus/songs/threes-rev-11.html)
I'd bet the problem here is that you're confusing offset in (Unicode)
characters with offset in (file) bytes. When you have an encoding such as
UTF-8, where some characters take more than one byte, the difference becomes
important.
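A small illustration of the gap, with a made-up helper that converts a character index into the matching byte offset in the UTF-8 encoding. Note that Java string indexes count UTF-16 code units, which is yet another unit for characters outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

class OffsetDemo {
    // Byte offset, in the UTF-8 encoding of text, of the character at charOffset.
    static int byteOffset(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For "na\u00efve" (naive with a diaeresis on the i), the "v" sits at character offset 3 but byte offset 4, because the accented letter takes two bytes in UTF-8.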
__
Are you sure that your own SAX handler code is thread-safe?
(Standard reminder of the single most common SAX coding error: if text
content is being truncated, you probably forgot to deal with the possibility
of several successive calls to characters().)
__
DOM Level 3 introduces some ability to validate subtrees on demand. I'm not
sure whether the Xerces implementation of the DOM has added those features.
__
A namespace name, although it is expressed as a URI, is just a name. Normal
XML processing never attempts to retrieve anything from it, so it is
never processed by the EntityResolver.
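This is easy to check with a counting resolver. The sketch below uses JAXP; the namespace URI and the class name NamespaceNotFetched are invented for the demo.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class NamespaceNotFetched {
    // Parses the document and reports how many times the EntityResolver was consulted.
    static int resolverCalls(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        AtomicInteger calls = new AtomicInteger();
        builder.setEntityResolver((publicId, systemId) -> {
            calls.incrementAndGet();
            return null; // fall back to default resolution
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return calls.get();
    }
}
```

With a document such as <doc xmlns="http://example.invalid/ns"/> and no DTD, the count stays at zero.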
(The Semantic Web group may eventually define what, if anything, might be
accessible through the namespace UR
What concerns are you actually trying to address?
For SAX, document length could be limited by running tests in the handler and
throwing an exception if "reasonable" count or time is exceeded.
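A sketch of the counting variant, using element count as the "reasonable" measure. LimitingHandler, LimitExceeded and the limit itself are made up for the example; a time check could be added in the same place.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class LimitingHandler extends DefaultHandler {
    // Hypothetical exception type so callers can tell a limit hit from a parse error.
    static class LimitExceeded extends SAXException {
        LimitExceeded(String message) { super(message); }
    }

    private final int maxElements;
    private int seen;

    LimitingHandler(int maxElements) { this.maxElements = maxElements; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Abort the parse as soon as the document grows past the limit.
        if (++seen > maxElements) {
            throw new LimitExceeded("more than " + maxElements + " elements");
        }
    }

    // Returns true if the document parsed without hitting the limit.
    static boolean withinLimit(String xml, int maxElements) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)), new LimitingHandler(maxElements));
            return true;
        } catch (LimitExceeded e) {
            return false;
        }
    }
}
```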
And I *think* I remember Xerces adding the ability to limit depth of parsed
entity recursion, if you'r
Supporting an HTML DOM, and being able to serialize to HTML, does not necessarily
imply being able to parse HTML. As far as I know, that last is not supported
by Xerces.
I was able to (ab)use the W3C's _tidy_ tool to do some basic HTML parsing.
Inelegant but it sufficed for what I needed.