Hi Michael,
I can confirm this. In my case I am working with ANSI documents, and the
encoding reported in the startDocument() method was consistently
"UTF-8", which is wrong.
So the best bet is to read the prolog, or otherwise to rely on the
parser's guessing...
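
For anyone who wants to try the "read the prolog" route, here is a rough sketch of how that could look. It is only an illustration under some assumptions: the class and method names (PrologEncoding, declaredEncoding) are made up for this example, the document is assumed to have no byte order mark, and the declaration is assumed to be written in an ASCII-compatible encoding, so UTF-16 or EBCDIC input is not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xerces or SAX.
public class PrologEncoding {

    // Matches an XML declaration with a version and an encoding pseudo-attribute.
    private static final Pattern DECL = Pattern.compile(
        "^<\\?xml\\s+version\\s*=\\s*(?:\"[^\"]*\"|'[^']*')"
        + "\\s+encoding\\s*=\\s*(?:\"([A-Za-z][A-Za-z0-9._-]*)\"|'([A-Za-z][A-Za-z0-9._-]*)')");

    // Returns the declared encoding name, or null if none is declared.
    public static String declaredEncoding(InputStream in) throws IOException {
        byte[] buf = new byte[256];
        int n = 0;
        // Read up to 256 bytes; a single read() call may return fewer.
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail
        // and the ASCII part of the declaration is preserved as-is.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(head);
        if (!m.find()) {
            return null;
        }
        return m.group(1) != null ? m.group(1) : m.group(2);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><r/>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(new ByteArrayInputStream(doc)));
    }
}
```

Note that this consumes the first bytes of the stream, so the caller would have to reopen the document (or use mark/reset on a buffered stream) before handing it to the parser.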
BR,
Olivier DURAND - odur...@clever-age.com
Clever Age - http://www.clever-age.com
37, Bd des Capucines - 75002 Paris
Tel:..................+33 1 53 34 66 10
FAX:..................+33 1 53 34 65 20
Michael Glavassevich wrote:
Hi Elliotte,
I had a peek at your article and see in the code snippets that what
you're calling the "actual encoding" or "real encoding" actually
isn't. The one passed to startDocument() in XNI is the auto-detected
encoding, the one which Xerces guessed by peeking at the first few
bytes in the document. The actual encoding may not be known until the
XML declaration has been read and at this point it hasn't been read yet.
In SAX it's not legal to read from the Locator in startDocument(), so
any values you get from the Locator in that method may not be correct.
With Xerces they generally won't be, because at the point it calls
startDocument() it hasn't read enough of the document yet to be sure
of what the actual encoding is. If it looked like it was working, you
were probably just getting lucky because the documents you tried were
in UTF-8 or UTF-16. Specifically, the Javadoc [1] says: "Note that the
locator will return correct information only during the invocation SAX
event callbacks after startDocument returns and before endDocument is
called. The application should not attempt to use it at any other
time." So you have to wait until an event following startDocument()
before you can read the encoding (or anything else) from the Locator.
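
To illustrate the point about waiting for a later event, here is a minimal sketch that stores the Locator in setDocumentLocator() and only queries it in the first startElement() call. The class name EncodingSniffer and the detect() helper are made up for this example. It assumes the parser hands back a Locator that implements org.xml.sax.ext.Locator2, which is optional in SAX, so the code checks before casting.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that reads the encoding after startDocument() has returned.
public class EncodingSniffer extends DefaultHandler {
    private Locator locator;
    private String encoding;
    private boolean checked;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Only store the Locator here; do not query it yet.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The first element comes after the XML declaration has been read,
        // so the Locator is allowed to be queried at this point.
        if (!checked) {
            checked = true;
            if (locator instanceof Locator2) {
                encoding = ((Locator2) locator).getEncoding();
            }
        }
    }

    // May be null if the parser does not support Locator2.
    public String getEncoding() {
        return encoding;
    }

    // Parses the given bytes and returns the encoding seen at the first element.
    public static String detect(byte[] xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        EncodingSniffer handler = new EncodingSniffer();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new ByteArrayInputStream(xml)));
        return handler.getEncoding();
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><r>caf\u00e9</r>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(doc));
    }
}
```

The same check placed in startDocument() would, per the above, only see the auto-detected guess.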
Thanks.
[1]
http://xerces.apache.org/xerces2-j/javadocs/api/org/xml/sax/ContentHandler.html#setDocumentLocator(org.xml.sax.Locator)
Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org
Elliotte Harold <elh...@ibiblio.org> wrote on 04/24/2009 08:48:52 AM:
> Do you want the declared encoding or the real encoding? If the
> latter, see here:
>
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> --
> Elliotte Rusty Harold
> elh...@ibiblio.org
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
> For additional commands, e-mail: j-users-h...@xerces.apache.org