Programmer-yyds opened a new issue, #881:
URL: https://github.com/apache/poi/issues/881

   **Description**
    When reading embedded images from an XLSX file that contains many large 
images, calling `XSSFPicture.getPictureData()` immediately loads each image 
into memory in full as a `byte[]`.
    This quickly exhausts heap space and leads to an `OutOfMemoryError`.
    In multi-threaded batch reading the problem is worse, because memory usage 
grows linearly with the number of threads.
   
   ------
   
   **Use Case**
    In real-world business scenarios, we often need to associate image data 
with other columns in the same row. For example:
   
   - Column A: Product ID
   - Column B: Product Name
   - Column C: Product Image
   
   When parsing, we need to accurately locate the image using its **row number 
and column number**, and combine it with the text or numeric columns in the 
same row to form a complete business record.
    Therefore, during image parsing, it is necessary to map the image data 
along with its **row and column indices**, rather than simply returning a flat 
collection of images.
    This structure makes it easier to align with other columns’ data, 
especially in batch reading or multi-threaded processing.
   
   ------
   
   **Environment**
   
   - Java 21
   - Apache POI 5.4.0
   - Test file: An XLSX file containing 1000 images, each 1 MB in size
   
   ------
   
   **Steps to Reproduce**
   
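   ```java
   import java.io.IOException;
   import java.nio.file.Files;
   import java.nio.file.Paths;
   import java.util.HashMap;
   import java.util.List;
   import java.util.Map;

   import org.apache.poi.ooxml.POIXMLDocumentPart;
   import org.apache.poi.ss.usermodel.PictureData;
   import org.apache.poi.xssf.usermodel.XSSFClientAnchor;
   import org.apache.poi.xssf.usermodel.XSSFDrawing;
   import org.apache.poi.xssf.usermodel.XSSFPicture;
   import org.apache.poi.xssf.usermodel.XSSFShape;
   import org.apache.poi.xssf.usermodel.XSSFSheet;
   import org.apache.poi.xssf.usermodel.XSSFWorkbook;
   import org.openxmlformats.schemas.drawingml.x2006.spreadsheetDrawing.CTMarker;

   public class PictureRepro {

       // Returns pictures keyed by 1-based row number, then 1-based column number.
       public static Map<Integer, Map<Integer, PictureData>> getPictures(XSSFSheet sheet) {
           List<POIXMLDocumentPart> list = sheet.getRelations();
           Map<Integer, Map<Integer, PictureData>> rowToDataMap = new HashMap<>(list.size());
           for (POIXMLDocumentPart part : list) {
               if (part instanceof XSSFDrawing) {
                   XSSFDrawing drawing = (XSSFDrawing) part;
                   for (XSSFShape shape : drawing.getShapes()) {
                       // A drawing can also hold non-picture shapes; skip them
                       // instead of failing with a ClassCastException.
                       if (!(shape instanceof XSSFPicture)) {
                           continue;
                       }
                       XSSFPicture picture = (XSSFPicture) shape;
                       // Read the stored anchor. getPreferredSize() would recompute
                       // the picture size from the image data and modify the shape.
                       XSSFClientAnchor anchor = picture.getClientAnchor();
                       if (anchor == null) {
                           continue;
                       }
                       CTMarker marker = anchor.getFrom();
                       int row = marker.getRow() + 1;
                       int col = marker.getCol() + 1;

                       // Problem: This loads the entire image into memory as byte[]
                       PictureData pictureData = picture.getPictureData();

                       if (pictureData != null) {
                           rowToDataMap
                                   .computeIfAbsent(row, r -> new HashMap<>())
                                   .put(col, pictureData);
                       }
                   }
               }
           }
           return rowToDataMap;
       }

       public static void main(String[] args) throws IOException {
           try (XSSFWorkbook workbook = new XSSFWorkbook(Files.newInputStream(Paths.get("large_images.xlsx")))) {
               XSSFSheet sheet = workbook.getSheetAt(0);
               Map<Integer, Map<Integer, PictureData>> pictures = getPictures(sheet);
           }
       }
   }
   ```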
   
   ------
   
   **Expected Behavior**
   
   - Provide a **lazy loading** mechanism so that the image data is not loaded 
into memory until explicitly requested
   - Provide an API that returns an `InputStream` instead of a `byte[]`
   - Allow users to skip image parsing and only retrieve positional metadata
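
To make the requested behavior concrete, here is a minimal sketch of what a lazy, stream-based picture handle could look like. The names `LazyPictureSketch`, `StreamOpener`, and `PictureRef` are hypothetical and are not part of Apache POI; the sketch only illustrates keeping the row/column metadata eagerly while deferring access to the image bytes until a stream is explicitly opened.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class LazyPictureSketch {

    /** Hypothetical: opens a fresh stream over the image bytes each time it is called. */
    @FunctionalInterface
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    /** Hypothetical: positional metadata plus a deferred handle to the image bytes. */
    static final class PictureRef {
        final int row;
        final int col;
        private final StreamOpener opener;

        PictureRef(int row, int col, StreamOpener opener) {
            this.row = row;
            this.col = col;
            this.opener = opener;
        }

        /** Nothing is read until this is called; the caller closes the stream. */
        InputStream openStream() throws IOException {
            return opener.open();
        }
    }

    public static void main(String[] args) throws IOException {
        int[] opened = {0}; // counts how often the bytes were actually requested

        // Stand-in for a real image source; a ByteArrayInputStream keeps the demo self-contained.
        PictureRef ref = new PictureRef(2, 3, () -> {
            opened[0]++;
            return new ByteArrayInputStream("fake-image-bytes".getBytes(StandardCharsets.US_ASCII));
        });

        // Positional metadata is available without touching the image data.
        System.out.println("row=" + ref.row + ", col=" + ref.col + ", opened=" + opened[0]);

        // The bytes are only requested here.
        try (InputStream in = ref.openStream()) {
            System.out.println("bytes read: " + in.readAllBytes().length + ", opened=" + opened[0]);
        }
    }
}
```

With the current POI API, one candidate for the opener would be `() -> pictureData.getPackagePart().getInputStream()`, assuming the workbook's package is still open when the stream is requested. Whether that actually avoids holding the image in memory likely depends on how the package was opened (for example from a file rather than from an `InputStream`).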
   
   ------
   
   **Actual Behavior**
   
   - `getPictureData()` immediately loads the entire image into a `byte[]`, 
which can easily cause an `OutOfMemoryError` for large files or when reading 
on multiple threads
   
   ------
   
   **Possible Solutions**
   
   - **Lazy loading**: Only load image data when explicitly requested by the 
user
   - **Streaming read**: Return `InputStream` instead of `byte[]`
   - **Skip mode**: Allow ignoring image parsing when opening the XLSX file
   - **Temporary file storage**: Write image data to temp files to reduce 
memory pressure
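
As an illustration of the temporary-file option above, the following sketch copies an image stream straight to a temp file so that the bytes never need to be held in a single `byte[]`. The class and method names (`TempFileSpillSketch`, `spillToTempFile`) and the file prefix are hypothetical; only standard `java.nio.file` calls are used, and the input stream here is a stand-in for whatever stream a picture API would return.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempFileSpillSketch {

    /**
     * Copies the stream to a newly created temp file and returns its path.
     * The caller is responsible for deleting the file when done.
     */
    static Path spillToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("xlsx-picture-", ".bin");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            Files.deleteIfExists(tmp); // do not leave a partial file behind
            throw e;
        }
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        byte[] fake = new byte[1024]; // stand-in for image data
        Path tmp = spillToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println(tmp + " holds " + Files.size(tmp) + " bytes");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The trade-off is extra disk I/O and the need to clean up the files, in exchange for heap usage that no longer scales with image size.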


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

