Copilot commented on code in PR #2850:
URL: https://github.com/apache/tika/pull/2850#discussion_r3324852130
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -111,6 +119,9 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
if (detected != null) {
metadata.set(Metadata.CONTENT_TYPE, detected.toString());
}
+ } catch (SAXException e) {
+ balancer.drainOpenElements();
+ throw e;
Review Comment:
XHTMLBalancingHandler is only drained on SAXException.
XMLReaderUtils.parseSAX(...) can also throw IOException/TikaException; in those
cases xhtml.endDocument() in the finally can still throw due to unbalanced open
elements and mask the original failure. Drain the balancer for all exception
types that can escape the parse loop.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/xml/AbstractXML2003Parser.java:
##########
@@ -99,6 +108,8 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
getContentHandler(tagged, metadata, context)));
} catch (SAXException e) {
WriteLimitReachedException.throwIfWriteLimitReached(e);
+ // Close anything the aborted parse left open, then propagate.
+ balancer.drainOpenElements();
throw new TikaException("XML parse error", e);
} finally {
Review Comment:
XHTMLBalancingHandler is only drained on SAXException. The SAX parser call
can also throw IOException (e.g., truncated stream), and then
xhtml.endDocument() in finally may throw due to unbalanced open elements and
mask the original IOException. Drain the balancer for IOException too.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -169,27 +194,44 @@ MediaType getDetectedType() {
@Override
public void startElement(String namespaceURI, String localName, String
qName,
Attributes attrs) throws SAXException {
-
- if (META.equals(localName)) {
- currentHandler = metadataHandler;
- } else if (BODY.equals(localName)) {
- currentHandler = bodyHandler;
- } else if (extractMacros && SCRIPTS.equals(localName)) {
- currentHandler = macroHandler;
- }
-
- //trust the mimetype element if it exists for the subtype
if (DOCUMENT.equals(localName)) {
- String mime = XMLReaderUtils.getAttrValue("mimetype", attrs);
- if (mime != null) {
- if (mime.equals(ODT.toString())) {
- detectedType = FLAT_ODT;
- } else if (mime.equals(ODP.toString())) {
- detectedType = FLAT_ODP;
- } else if (mime.equals(ODS.toString())) {
- detectedType = FLAT_ODS;
+ documentDepth++;
+ //trust the mimetype element if it exists for the subtype.
+ //Only consult the outer <office:document>; inner ones (e.g.,
+ //an embedded chart's document) would otherwise overwrite the
+ //container's detected subtype.
+ if (documentDepth == 1) {
+ String mime = XMLReaderUtils.getAttrValue("mimetype",
attrs);
+ if (mime != null) {
+ if (mime.equals(ODT.toString())) {
+ detectedType = FLAT_ODT;
+ } else if (mime.equals(ODP.toString())) {
+ detectedType = FLAT_ODP;
+ } else if (mime.equals(ODS.toString())) {
+ detectedType = FLAT_ODS;
+ }
}
}
+ } else if (documentDepth == 1 && outerSectionDepth == 0) {
+ // Only outermost-level META/BODY/SCRIPTS triggers a handler
+ // switch. Inner-document occurrences are forwarded to
whichever
+ // outer section's handler is already active.
+ if (META.equals(localName)) {
+ currentHandler = metadataHandler;
+ outerSectionDepth = 1;
+ } else if (BODY.equals(localName)) {
+ currentHandler = bodyHandler;
+ outerSectionDepth = 1;
+ } else if (extractMacros && SCRIPTS.equals(localName)) {
+ currentHandler = macroHandler;
+ outerSectionDepth = 1;
+ }
+ } else if (outerSectionDepth > 0
+ && (META.equals(localName) || BODY.equals(localName)
+ || SCRIPTS.equals(localName))) {
Review Comment:
The nested-section depth counter increments on any <office:scripts> start,
even when extractMacros is false. In that case, outerSectionDepth will
increment but never decrement (endElement only decrements scripts when
extractMacros is true), which can prevent currentHandler from ever switching
back to defaultHandler after leaving <office:body>/<office:meta>.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/RTFParser.java:
##########
@@ -90,11 +91,22 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
TaggedInputStream tagged = new TaggedInputStream(tis);
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata,
context);
xhtml.startDocument();
+ // Wrap xhtml in a balancing handler so the finally's endDocument
+ // doesn't fire on an unbalanced stack if the RTF state machine
+ // emits inconsistent SAX events (e.g., </b> while <p> is topmost
+ // due to state-vs-stack drift on certain corpus files). Without
+ // this, StrictXHTMLValidator's </body> vs <p>/<b> at endDocument
+ // masks the real well-formedness error from inside extract().
+ XHTMLBalancingHandler balancer = new XHTMLBalancingHandler(xhtml);
try {
- parseInline(tis, xhtml, metadata, context);
+ parseInline(tis, balancer, metadata, context);
} catch (IOException e) {
Review Comment:
parseInline(...) can throw TikaException, but this method only drains
XHTMLBalancingHandler for IOException/SAXException. If a TikaException escapes,
xhtml.endDocument() in finally can still throw on an unbalanced stack and mask
the real failure. Drain and rethrow for TikaException as well.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -169,27 +194,44 @@ MediaType getDetectedType() {
@Override
public void startElement(String namespaceURI, String localName, String
qName,
Attributes attrs) throws SAXException {
-
- if (META.equals(localName)) {
- currentHandler = metadataHandler;
- } else if (BODY.equals(localName)) {
- currentHandler = bodyHandler;
- } else if (extractMacros && SCRIPTS.equals(localName)) {
- currentHandler = macroHandler;
- }
-
- //trust the mimetype element if it exists for the subtype
if (DOCUMENT.equals(localName)) {
- String mime = XMLReaderUtils.getAttrValue("mimetype", attrs);
- if (mime != null) {
- if (mime.equals(ODT.toString())) {
- detectedType = FLAT_ODT;
- } else if (mime.equals(ODP.toString())) {
- detectedType = FLAT_ODP;
- } else if (mime.equals(ODS.toString())) {
- detectedType = FLAT_ODS;
+ documentDepth++;
+ //trust the mimetype element if it exists for the subtype.
+ //Only consult the outer <office:document>; inner ones (e.g.,
+ //an embedded chart's document) would otherwise overwrite the
+ //container's detected subtype.
+ if (documentDepth == 1) {
+ String mime = XMLReaderUtils.getAttrValue("mimetype",
attrs);
+ if (mime != null) {
+ if (mime.equals(ODT.toString())) {
+ detectedType = FLAT_ODT;
+ } else if (mime.equals(ODP.toString())) {
+ detectedType = FLAT_ODP;
+ } else if (mime.equals(ODS.toString())) {
+ detectedType = FLAT_ODS;
+ }
}
}
+ } else if (documentDepth == 1 && outerSectionDepth == 0) {
+ // Only outermost-level META/BODY/SCRIPTS triggers a handler
+ // switch. Inner-document occurrences are forwarded to
whichever
+ // outer section's handler is already active.
+ if (META.equals(localName)) {
+ currentHandler = metadataHandler;
+ outerSectionDepth = 1;
+ } else if (BODY.equals(localName)) {
+ currentHandler = bodyHandler;
+ outerSectionDepth = 1;
+ } else if (extractMacros && SCRIPTS.equals(localName)) {
+ currentHandler = macroHandler;
+ outerSectionDepth = 1;
+ }
+ } else if (outerSectionDepth > 0
+ && (META.equals(localName) || BODY.equals(localName)
+ || SCRIPTS.equals(localName))) {
Review Comment:
The nested-section depth counter increments on any <office:scripts> start,
even when extractMacros is false. In that case, outerSectionDepth will
increment but never decrement (endElement only decrements scripts when
extractMacros is true), which can prevent currentHandler from ever switching
back to defaultHandler after leaving <office:body>/<office:meta>.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -111,6 +119,9 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
if (detected != null) {
metadata.set(Metadata.CONTENT_TYPE, detected.toString());
}
+ } catch (SAXException e) {
+ balancer.drainOpenElements();
+ throw e;
Review Comment:
XHTMLBalancingHandler is only drained on SAXException.
XMLReaderUtils.parseSAX(...) can also throw IOException/TikaException; in those
cases xhtml.endDocument() in the finally can still throw due to unbalanced open
elements and mask the original failure. Drain the balancer for all exception
types that can escape the parse loop.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/RTFParser.java:
##########
@@ -90,11 +91,22 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
TaggedInputStream tagged = new TaggedInputStream(tis);
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata,
context);
xhtml.startDocument();
+ // Wrap xhtml in a balancing handler so the finally's endDocument
+ // doesn't fire on an unbalanced stack if the RTF state machine
+ // emits inconsistent SAX events (e.g., </b> while <p> is topmost
+ // due to state-vs-stack drift on certain corpus files). Without
+ // this, StrictXHTMLValidator's </body> vs <p>/<b> at endDocument
+ // masks the real well-formedness error from inside extract().
+ XHTMLBalancingHandler balancer = new XHTMLBalancingHandler(xhtml);
try {
- parseInline(tis, xhtml, metadata, context);
+ parseInline(tis, balancer, metadata, context);
} catch (IOException e) {
Review Comment:
parseInline(...) can throw TikaException, but this method only drains
XHTMLBalancingHandler for IOException/SAXException. If a TikaException escapes,
xhtml.endDocument() in finally can still throw on an unbalanced stack and mask
the real failure. Drain and rethrow for TikaException as well.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/xml/AbstractXML2003Parser.java:
##########
@@ -99,6 +108,8 @@ public void parse(TikaInputStream tis, ContentHandler
handler, Metadata metadata
getContentHandler(tagged, metadata, context)));
} catch (SAXException e) {
WriteLimitReachedException.throwIfWriteLimitReached(e);
+ // Close anything the aborted parse left open, then propagate.
+ balancer.drainOpenElements();
throw new TikaException("XML parse error", e);
} finally {
Review Comment:
XHTMLBalancingHandler is only drained on SAXException. The SAX parser call
can also throw IOException (e.g., truncated stream), and then
xhtml.endDocument() in finally may throw due to unbalanced open elements and
mask the original IOException. Drain the balancer for IOException too.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]