Copilot commented on code in PR #2885:
URL: https://github.com/apache/tika/pull/2885#discussion_r3383111877


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java:
##########
@@ -614,7 +614,14 @@ private void collectPictureSlides(ShapeContainer container, int slideNum,
         }
         for (HSLFShape shape : shapes) {
             if (shape instanceof HSLFPictureShape) {
-                HSLFPictureData pd = ((HSLFPictureShape) shape).getPictureData();
+                HSLFPictureData pd;
+                try {
+                    pd = ((HSLFPictureShape) shape).getPictureData();
+                } catch (IndexOutOfBoundsException e) {
+                    // corrupt Escher BSE record -- skip page anchoring for this shape
+                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
+                    continue;

Review Comment:
   The new error-handling path, which catches IndexOutOfBoundsException from
   HSLFPictureShape#getPictureData, isn’t covered by the existing unit tests.
   Consider adding a regression test with a minimal corrupt .ppt that triggers
   this exception, to verify that parsing continues past the bad shape and that
   the exception is recorded in the parent metadata as intended.
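
The behavior such a regression test would need to pin down can be sketched without Tika or POI on the classpath. The snippet below is only an illustration of the skip-and-record pattern in the diff: `SkipCorruptShapeSketch`, `collect`, and `sampleShapes` are hypothetical names, a `Supplier<String>` stands in for a shape whose picture data is being read, and a plain list stands in for the parent metadata. A real test would instead parse a corrupt .ppt fixture and inspect the resulting metadata.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the loop in collectPictureSlides.
// None of these names are Tika or POI APIs.
class SkipCorruptShapeSketch {

    // Reads every shape; a shape that throws IndexOutOfBoundsException is
    // recorded and skipped instead of aborting the whole loop.
    static List<String> collect(List<Supplier<String>> shapes, List<Throwable> recorded) {
        List<String> pictures = new ArrayList<>();
        for (Supplier<String> shape : shapes) {
            String data;
            try {
                data = shape.get();
            } catch (IndexOutOfBoundsException e) {
                // Stand-in for recording the exception in the parent metadata.
                recorded.add(e);
                continue;
            }
            pictures.add(data);
        }
        return pictures;
    }

    // Two readable shapes with a simulated corrupt record between them.
    static List<Supplier<String>> sampleShapes() {
        List<Supplier<String>> shapes = new ArrayList<>();
        shapes.add(() -> "picture-1");
        shapes.add(() -> { throw new IndexOutOfBoundsException("corrupt record"); });
        shapes.add(() -> "picture-3");
        return shapes;
    }

    public static void main(String[] args) {
        List<Throwable> recorded = new ArrayList<>();
        List<String> pictures = collect(sampleShapes(), recorded);
        System.out.println(pictures + " recorded=" + recorded.size());
    }
}
```

The two properties worth asserting in the real test are the same ones this sketch exercises: shapes after the corrupt one are still processed, and exactly one exception is recorded for the corrupt shape.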



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
