RE: EXTERNAL: Re: SAX Parser includes ignorable whitespaces in the character() method

2014-09-15 Thread Zhu, Joe
Michael,
Thanks for your reply. The XSD does not allow mixed content. Attached are my 
test Java code, test XML and test XSD for your reference. 

Also included below are the run logs for the Xerces parser and for the Sun parser. 
When it runs with the Xerces parser, the whitespace is reported in the 
characters() method and nothing is reported in ignorableWhitespace(). 
But when it runs with the Sun parser, the text content is reported in 
characters() and the whitespace is reported in the ignorableWhitespace() 
method, as expected.

Joe

--- Log for Xerces parser ---
factory = org.apache.xerces.jaxp.SAXParserFactoryImpl@110c424
parser = org.apache.xerces.jaxp.SAXParserImpl@1bd2664
startElement howto
characters = "
  "
startElement topic
characters = "
  "
startElement title
characters = "Java"
endElement title
characters = "
  "
...

--- Log for Sun parser ---
factory = com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl@1e8a1f6
parser = com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl@1e152c5
startElement howto
ignorableWhitespace = "
  "
startElement topic
ignorableWhitespace = "
  "
startElement title
characters = "Java"
endElement title
ignorableWhitespace = "
  "
...
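
For readers who want to reproduce logs of this shape, here is a minimal sketch of a logging handler run under XSD validation. It is not the attached SaxIgnorableWhiteSpaceTest.java, which the archive did not preserve; the class name WhitespaceLogDemo and the inline XML and XSD are hypothetical stand-ins. Which callback receives the whitespace depends on the parser that JAXP selects, as the two logs above show.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class WhitespaceLogDemo {
    // Hypothetical stand-ins for the attached test XML and XSD (assumptions).
    static final String XML =
        "<howto>\n  <topic>\n    <title>Java</title>\n  </topic>\n</howto>";
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='howto'><xs:complexType><xs:sequence>"
      + "<xs:element name='topic' maxOccurs='unbounded'><xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Parses XML against XSD and returns a log of the SAX callbacks received. */
    static String run() {
        final StringBuilder log = new StringBuilder();
        try {
            Schema schema = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setSchema(schema); // XSD validation only, no DTD involved
            SAXParser parser = factory.newSAXParser();
            log.append("factory = ").append(factory.getClass().getName()).append('\n');
            log.append("parser = ").append(parser.getClass().getName()).append('\n');
            parser.parse(new InputSource(new StringReader(XML)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    log.append("startElement ").append(qName).append('\n');
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    log.append("endElement ").append(qName).append('\n');
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    log.append("characters = \"").append(ch, start, length).append("\"\n");
                }
                @Override
                public void ignorableWhitespace(char[] ch, int start, int length) {
                    log.append("ignorableWhitespace = \"").append(ch, start, length).append("\"\n");
                }
            });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.print(run());
    }
}
```

To compare implementations, run it once with Xerces on the classpath and once with only the JDK's built-in parser.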


-----Original Message-----
From: Michael Glavassevich [mailto:mrgla...@ca.ibm.com] 
Sent: Friday, September 12, 2014 9:54 AM
To: j-users@xerces.apache.org
Subject: EXTERNAL: Re: SAX Parser includes ignorable whitespaces in the 
character() method

Your XML document requires a DTD with element declarations which specify that 
they contain element-only content. Without that a SAX parser cannot determine 
which whitespaces are 'ignorable'.
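
As an illustration of the point about DTDs, the sketch below parses the same small document twice: once with an internal DTD subset that declares element-only content, using a validating parser, and once with no DTD at all. The class name DtdWhitespaceDemo, the element declarations and the sample document are assumptions made for this example and do not come from the attachments.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class DtdWhitespaceDemo {
    // Hypothetical sample document (assumption, not the attached test XML).
    static final String BODY =
        "<howto>\n  <topic>\n    <title>Java</title>\n  </topic>\n</howto>";

    // Internal DTD subset: howto and topic are declared with element-only content.
    static final String DTD =
        "<!DOCTYPE howto [\n"
      + "  <!ELEMENT howto (topic+)>\n"
      + "  <!ELEMENT topic (title)>\n"
      + "  <!ELEMENT title (#PCDATA)>\n"
      + "]>\n";

    /** Counts ignorableWhitespace() calls and accumulates characters() text. */
    static class Counter extends DefaultHandler {
        int ignorableChunks;
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void ignorableWhitespace(char[] ch, int start, int length) {
            ignorableChunks++;
        }
    }

    static Counter parse(String xml, boolean validating) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(validating); // DTD validation
            Counter counter = new Counter();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), counter);
            return counter;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Counter withDtd = parse(DTD + BODY, true);
        Counter withoutDtd = parse(BODY, false);
        System.out.println("with DTD:    ignorable chunks = " + withDtd.ignorableChunks);
        System.out.println("without DTD: ignorable chunks = " + withoutDtd.ignorableChunks);
    }
}
```

With the DTD present, the formatting whitespace should arrive through ignorableWhitespace() and only the title text through characters(); without it, the parser has no content model to consult and everything arrives through characters().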

Thanks.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

"Zhu, Joe"  wrote on 09/11/2014 07:00:11 PM:

> I am writing an app which needs to access all text content in XML. 
> According to the ContentHandler API, this could be accomplished by 
> using a validating parser and the characters() method.
> 
> But with the Xerces parser, the characters() method could contain 
> ignorable whitespace (XML formatting whitespace). I have no way to 
> tell if the whitespace is ignorable whitespace or is part of the XML
> content.
> 
> Has anybody else run into the problem? I tested with both Xerces 
> 2.9.1 and Xerces 2.11. They behave the same way.
> 
> Joe Zhu


-
To unsubscribe, e-mail: j-users-unsubscr...@xerces.apache.org
For additional commands, e-mail: j-users-h...@xerces.apache.org



SaxIgnorableWhiteSpaceTest.java
Description: SaxIgnorableWhiteSpaceTest.java

[Test XML attachment. The archive stripped the element markup; the root element
referenced http://www.w3.org/2001/XMLSchema-instance. Surviving title and URL pairs:]
  Java          http://www.rgagnon.com/topics/java-xml.html
  PowerBuilder  http://www.rgagnon.com/topics/pb-powerscript.htm
  Javascript    http://www.rgagnon.com/topics/js-language.html
  VBScript      http://www.rgagnon.com/topics/wsh-vbs.html


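[Test XSD attachment. The archive stripped the markup; only the schema namespace
http://www.w3.org/2001/XMLSchema and a pattern value "http://.*" survive.]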


RE: EXTERNAL: Re: SAX Parser includes ignorable whitespaces in the character() method

2014-09-15 Thread Michael Glavassevich
ignorableWhitespace() was only defined for use with DTDs. Sun's 
implementation may be doing something for XSD but there's nothing in the 
specification which requires that. Xerces is behaving correctly.
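
Since behaviour under XSD validation differs between parsers, an application that only needs the text content can avoid depending on ignorableWhitespace() altogether. The following is a sketch of one application-side heuristic, not something proposed in this thread: buffer text per element and drop whitespace-only text in elements that turned out to have child elements. The class name ContentTextHandler is hypothetical, and the heuristic assumes that whitespace-only text next to child elements is formatting, which holds for element-only content but is a guess for mixed content.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ContentTextHandler extends DefaultHandler {
    /** Per-element state: buffered text and whether a child element was seen. */
    private static final class Frame {
        final StringBuilder text = new StringBuilder();
        boolean hasChildElement;
    }

    private final Deque<Frame> stack = new ArrayDeque<>();
    final List<String> collected = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Frame parent = stack.peek();
        if (parent != null) {
            parent.hasChildElement = true;
        }
        stack.push(new Frame());
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // characters() may deliver one text run in several chunks, so buffer it.
        stack.peek().text.append(ch, start, length);
    }

    // ignorableWhitespace() is deliberately not overridden: DefaultHandler
    // ignores it, so whitespace a parser already classified is dropped too.

    @Override
    public void endElement(String uri, String localName, String qName) {
        Frame frame = stack.pop();
        String text = frame.text.toString();
        boolean whitespaceOnly = text.trim().isEmpty();
        // Keep all text of leaf elements; for elements with children keep
        // only text that contains something other than whitespace.
        if (!text.isEmpty() && (!frame.hasChildElement || !whitespaceOnly)) {
            collected.add(text);
        }
    }

    /** Convenience wrapper: parses the given XML and returns the kept text. */
    static List<String> extract(String xml) {
        try {
            ContentTextHandler handler = new ContentTextHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.collected;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Text is emitted when its element ends, so text in mixed content comes out after the text of its child elements rather than in document order. That is acceptable for a sketch but worth knowing before reusing it.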

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

"Zhu, Joe"  wrote on 09/15/2014 09:41:33 AM:

> Michael,
> Thanks for your reply. The XSD does not allow mixed content. 
> Attached are my test Java code, test XML and test XSD for your reference. 
> 
> Also included below are the run logs for the Xerces parser and for the Sun 
> parser. 
> When it runs with the Xerces parser, the whitespace is reported in
> the characters() method and nothing is reported in 
> ignorableWhitespace(). 
> But when it runs with the Sun parser, the text content is reported 
> in characters() and the whitespace is reported in the 
> ignorableWhitespace() method, as expected.



RE: EXTERNAL: Re: SAX Parser includes ignorable whitespaces in the character() method

2014-09-15 Thread Zhu, Joe
Hmm. Why does it distinguish between DTD and XSD? They are both schema 
definitions!

It is useful to be able to distinguish between ignorable whitespace and 
whitespace that is part of the content. Xerces can't, which makes it less 
useful. It also violates the W3C XML Recommendation, as quoted later in this 
message.

Everything I have read implies that ignorable whitespace shall be reported in 
ignorableWhitespace():

--- org.xml.sax.ContentHandler API ---
(http://docs.oracle.com/javase/6/docs/api/index.html?javax/xml/stream/package-summary.html)

ignorableWhitespace
---
Validating Parsers must use this method to report each chunk of whitespace in 
element content (see the W3C XML 1.0 recommendation, section 2.10): 
non-validating parsers may also use this method if they are capable of parsing 
and using content models.

--- W3C XML Recommendation ---
(http://www.w3.org/TR/REC-xml/#sec-white-space) 
---
2.10 White Space Handling
...
An XML processor MUST always pass all characters in a document that are not 
markup through to the application. A validating XML processor MUST also inform 
the application which of these characters constitute white space appearing in 
element content.
---

Joe 

-----Original Message-----
From: Michael Glavassevich [mailto:mrgla...@ca.ibm.com] 
Sent: Monday, September 15, 2014 9:40 AM
To: j-users@xerces.apache.org
Subject: RE: EXTERNAL: Re: SAX Parser includes ignorable whitespaces in the 
character() method

ignorableWhitespace() was only defined for use with DTDs. Sun's implementation 
may be doing something for XSD but there's nothing in the specification which 
requires that. Xerces is behaving correctly.

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

RE: EXTERNAL: Re: SAX Parser includes ignorable whitespaces in the character() method

2014-09-15 Thread Michael Glavassevich
The specification you are quoting from is only concerned with DTDs. See 
the definition of validating XML processors here [1].

If XSD wanted something similar it would need to set the [element content 
whitespace] boolean property [2] to true on character information items, 
but there's nothing in the XSD specification which suggests that XML 
schema processors are supposed to mutate the XML Infoset in this way.

Thanks.

[1] http://www.w3.org/TR/REC-xml/#proc-types
[2] http://www.w3.org/TR/xml-infoset/#infoitem.character

Michael Glavassevich
XML Technologies and WAS Development
IBM Toronto Lab
E-mail: mrgla...@ca.ibm.com
E-mail: mrgla...@apache.org

"Zhu, Joe"  wrote on 09/15/2014 11:37:13 AM:

> Hmm. Why does it distinguish between DTD and XSD? They are both 
> schema definitions!
> 
> It is useful to be able to distinguish between ignorable whitespace
> and whitespace that is part of the content. Xerces can't, which makes 
> it less useful. It also violates the W3C XML Recommendation, as quoted 
> later in this message.
> 
> Everything I have read implies that ignorable whitespace shall be 
> reported in ignorableWhitespace():
> 
> --- org.xml.sax.ContentHandler API ---
> (http://docs.oracle.com/javase/6/docs/api/index.html?javax/xml/stream/package-summary.html) 
> 
> ignorableWhitespace
> ---
> Validating Parsers must use this method to report each chunk of 
> whitespace in element content (see the W3C XML 1.0 recommendation, 
> section 2.10): non-validating parsers may also use this method if 
> they are capable of parsing and using content models.
> 
> --- W3C XML Recommendation ---
> (http://www.w3.org/TR/REC-xml/#sec-white-space) 
> ---
> 2.10 White Space Handling
> ...
> An XML processor MUST always pass all characters in a document that 
> are not markup through to the application. A validating XML 
> processor MUST also inform the application which of these characters
> constitute white space appearing in element content.
> ---
> 
> Joe 
> 
> > E-m