[jira] [Commented] (TIKA-4744) Further xhtml fixes in 4.x

ASF GitHub Bot (Jira) Fri, 29 May 2026 07:08:07 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084378#comment-18084378
 ]


ASF GitHub Bot commented on TIKA-4744:
--------------------------------------

Copilot commented on code in PR #2850:
URL: https://github.com/apache/tika/pull/2850#discussion_r3324852130


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -111,6 +119,9 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
             if (detected != null) {
                 metadata.set(Metadata.CONTENT_TYPE, detected.toString());
             }
+        } catch (SAXException e) {
+            balancer.drainOpenElements();
+            throw e;

Review Comment:
   XHTMLBalancingHandler is only drained on SAXException. 
XMLReaderUtils.parseSAX(...) can also throw IOException/TikaException; in those 
cases xhtml.endDocument() in the finally can still throw due to unbalanced open 
elements and mask the original failure. Drain the balancer for all exception 
types that can escape the parse loop.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/xml/AbstractXML2003Parser.java:
##########
@@ -99,6 +108,8 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
                             getContentHandler(tagged, metadata, context)));
         } catch (SAXException e) {
             WriteLimitReachedException.throwIfWriteLimitReached(e);
+            // Close anything the aborted parse left open, then propagate.
+            balancer.drainOpenElements();
             throw new TikaException("XML parse error", e);
         } finally {

Review Comment:
   XHTMLBalancingHandler is only drained on SAXException. The SAX parser call 
can also throw IOException (e.g., truncated stream), and then 
xhtml.endDocument() in finally may throw due to unbalanced open elements and 
mask the original IOException. Drain the balancer for IOException too.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -169,27 +194,44 @@ MediaType getDetectedType() {
         @Override
         public void startElement(String namespaceURI, String localName, String 
qName,
                                  Attributes attrs) throws SAXException {
-
-            if (META.equals(localName)) {
-                currentHandler = metadataHandler;
-            } else if (BODY.equals(localName)) {
-                currentHandler = bodyHandler;
-            } else if (extractMacros && SCRIPTS.equals(localName)) {
-                currentHandler = macroHandler;
-            }
-
-            //trust the mimetype element if it exists for the subtype
             if (DOCUMENT.equals(localName)) {
-                String mime = XMLReaderUtils.getAttrValue("mimetype", attrs);
-                if (mime != null) {
-                    if (mime.equals(ODT.toString())) {
-                        detectedType = FLAT_ODT;
-                    } else if (mime.equals(ODP.toString())) {
-                        detectedType = FLAT_ODP;
-                    } else if (mime.equals(ODS.toString())) {
-                        detectedType = FLAT_ODS;
+                documentDepth++;
+                //trust the mimetype element if it exists for the subtype.
+                //Only consult the outer <office:document>; inner ones (e.g.,
+                //an embedded chart's document) would otherwise overwrite the
+                //container's detected subtype.
+                if (documentDepth == 1) {
+                    String mime = XMLReaderUtils.getAttrValue("mimetype", 
attrs);
+                    if (mime != null) {
+                        if (mime.equals(ODT.toString())) {
+                            detectedType = FLAT_ODT;
+                        } else if (mime.equals(ODP.toString())) {
+                            detectedType = FLAT_ODP;
+                        } else if (mime.equals(ODS.toString())) {
+                            detectedType = FLAT_ODS;
+                        }
                     }
                 }
+            } else if (documentDepth == 1 && outerSectionDepth == 0) {
+                // Only outermost-level META/BODY/SCRIPTS triggers a handler
+                // switch. Inner-document occurrences are forwarded to 
whichever
+                // outer section's handler is already active.
+                if (META.equals(localName)) {
+                    currentHandler = metadataHandler;
+                    outerSectionDepth = 1;
+                } else if (BODY.equals(localName)) {
+                    currentHandler = bodyHandler;
+                    outerSectionDepth = 1;
+                } else if (extractMacros && SCRIPTS.equals(localName)) {
+                    currentHandler = macroHandler;
+                    outerSectionDepth = 1;
+                }
+            } else if (outerSectionDepth > 0
+                    && (META.equals(localName) || BODY.equals(localName)
+                        || SCRIPTS.equals(localName))) {

Review Comment:
   The nested-section depth counter increments on any <office:scripts> start, 
even when extractMacros is false. In that case, outerSectionDepth will 
increment but never decrement (endElement only decrements scripts when 
extractMacros is true), which can prevent currentHandler from ever switching 
back to defaultHandler after leaving <office:body>/<office:meta>.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/RTFParser.java:
##########
@@ -90,11 +91,22 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
         TaggedInputStream tagged = new TaggedInputStream(tis);
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, 
context);
         xhtml.startDocument();
+        // Wrap xhtml in a balancing handler so the finally's endDocument
+        // doesn't fire on an unbalanced stack if the RTF state machine
+        // emits inconsistent SAX events (e.g., </b> while <p> is topmost
+        // due to state-vs-stack drift on certain corpus files). Without
+        // this, StrictXHTMLValidator's </body> vs <p>/<b> at endDocument
+        // masks the real well-formedness error from inside extract().
+        XHTMLBalancingHandler balancer = new XHTMLBalancingHandler(xhtml);
         try {
-            parseInline(tis, xhtml, metadata, context);
+            parseInline(tis, balancer, metadata, context);
         } catch (IOException e) {

Review Comment:
   parseInline(...) can throw TikaException, but this method only drains 
XHTMLBalancingHandler for IOException/SAXException. If a TikaException escapes, 
xhtml.endDocument() in finally can still throw on an unbalanced stack and mask 
the real failure. Drain and rethrow for TikaException as well.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -169,27 +194,44 @@ MediaType getDetectedType() {
         @Override
         public void startElement(String namespaceURI, String localName, String 
qName,
                                  Attributes attrs) throws SAXException {
-
-            if (META.equals(localName)) {
-                currentHandler = metadataHandler;
-            } else if (BODY.equals(localName)) {
-                currentHandler = bodyHandler;
-            } else if (extractMacros && SCRIPTS.equals(localName)) {
-                currentHandler = macroHandler;
-            }
-
-            //trust the mimetype element if it exists for the subtype
             if (DOCUMENT.equals(localName)) {
-                String mime = XMLReaderUtils.getAttrValue("mimetype", attrs);
-                if (mime != null) {
-                    if (mime.equals(ODT.toString())) {
-                        detectedType = FLAT_ODT;
-                    } else if (mime.equals(ODP.toString())) {
-                        detectedType = FLAT_ODP;
-                    } else if (mime.equals(ODS.toString())) {
-                        detectedType = FLAT_ODS;
+                documentDepth++;
+                //trust the mimetype element if it exists for the subtype.
+                //Only consult the outer <office:document>; inner ones (e.g.,
+                //an embedded chart's document) would otherwise overwrite the
+                //container's detected subtype.
+                if (documentDepth == 1) {
+                    String mime = XMLReaderUtils.getAttrValue("mimetype", 
attrs);
+                    if (mime != null) {
+                        if (mime.equals(ODT.toString())) {
+                            detectedType = FLAT_ODT;
+                        } else if (mime.equals(ODP.toString())) {
+                            detectedType = FLAT_ODP;
+                        } else if (mime.equals(ODS.toString())) {
+                            detectedType = FLAT_ODS;
+                        }
                     }
                 }
+            } else if (documentDepth == 1 && outerSectionDepth == 0) {
+                // Only outermost-level META/BODY/SCRIPTS triggers a handler
+                // switch. Inner-document occurrences are forwarded to 
whichever
+                // outer section's handler is already active.
+                if (META.equals(localName)) {
+                    currentHandler = metadataHandler;
+                    outerSectionDepth = 1;
+                } else if (BODY.equals(localName)) {
+                    currentHandler = bodyHandler;
+                    outerSectionDepth = 1;
+                } else if (extractMacros && SCRIPTS.equals(localName)) {
+                    currentHandler = macroHandler;
+                    outerSectionDepth = 1;
+                }
+            } else if (outerSectionDepth > 0
+                    && (META.equals(localName) || BODY.equals(localName)
+                        || SCRIPTS.equals(localName))) {

Review Comment:
   The nested-section depth counter increments on any <office:scripts> start, 
even when extractMacros is false. In that case, outerSectionDepth will 
increment but never decrement (endElement only decrements scripts when 
extractMacros is true), which can prevent currentHandler from ever switching 
back to defaultHandler after leaving <office:body>/<office:meta>.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/src/main/java/org/apache/tika/parser/odf/FlatOpenDocumentParser.java:
##########
@@ -111,6 +119,9 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
             if (detected != null) {
                 metadata.set(Metadata.CONTENT_TYPE, detected.toString());
             }
+        } catch (SAXException e) {
+            balancer.drainOpenElements();
+            throw e;

Review Comment:
   XHTMLBalancingHandler is only drained on SAXException. 
XMLReaderUtils.parseSAX(...) can also throw IOException/TikaException; in those 
cases xhtml.endDocument() in the finally can still throw due to unbalanced open 
elements and mask the original failure. Drain the balancer for all exception 
types that can escape the parse loop.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/RTFParser.java:
##########
@@ -90,11 +91,22 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
         TaggedInputStream tagged = new TaggedInputStream(tis);
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, 
context);
         xhtml.startDocument();
+        // Wrap xhtml in a balancing handler so the finally's endDocument
+        // doesn't fire on an unbalanced stack if the RTF state machine
+        // emits inconsistent SAX events (e.g., </b> while <p> is topmost
+        // due to state-vs-stack drift on certain corpus files). Without
+        // this, StrictXHTMLValidator's </body> vs <p>/<b> at endDocument
+        // masks the real well-formedness error from inside extract().
+        XHTMLBalancingHandler balancer = new XHTMLBalancingHandler(xhtml);
         try {
-            parseInline(tis, xhtml, metadata, context);
+            parseInline(tis, balancer, metadata, context);
         } catch (IOException e) {

Review Comment:
   parseInline(...) can throw TikaException, but this method only drains 
XHTMLBalancingHandler for IOException/SAXException. If a TikaException escapes, 
xhtml.endDocument() in finally can still throw on an unbalanced stack and mask 
the real failure. Drain and rethrow for TikaException as well.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/xml/AbstractXML2003Parser.java:
##########
@@ -99,6 +108,8 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
                             getContentHandler(tagged, metadata, context)));
         } catch (SAXException e) {
             WriteLimitReachedException.throwIfWriteLimitReached(e);
+            // Close anything the aborted parse left open, then propagate.
+            balancer.drainOpenElements();
             throw new TikaException("XML parse error", e);
         } finally {

Review Comment:
   XHTMLBalancingHandler is only drained on SAXException. The SAX parser call 
can also throw IOException (e.g., truncated stream), and then 
xhtml.endDocument() in finally may throw due to unbalanced open elements and 
mask the original IOException. Drain the balancer for IOException too.





> Further xhtml fixes in 4.x
> --------------------------
>
>                 Key: TIKA-4744
>                 URL: https://issues.apache.org/jira/browse/TIKA-4744
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> In prep for 4.0.0-beta-1, I ran the full corpus with the strict xhtml 
> validator on. This surfaced a few further areas for improvement.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (TIKA-4744) Further xhtml fixes in 4.x

Reply via email to