[Bug 61371] 20^20 hidden cells in a .xlsx file causes heap memory full exception

bugzilla Fri, 04 Aug 2017 01:32:06 -0700

https://bz.apache.org/bugzilla/show_bug.cgi?id=61371


Javen O'Neal <one...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |NEEDINFO

--- Comment #10 from Javen O'Neal <one...@apache.org> ---
Attachment 35199 references a different file, uploaded to Google Sheets, which
is 2.75 MB rather than 8 KB in size.
Unzipping the 2.75 MB file yields 27.9 MB of content (a 90% compression ratio
is typical for workbooks that do not contain embedded objects).
xl/worksheets/sheet1.xml accounts for 27.9 MB of that (leaving a few KB for the
rest of the extracted files).

Inspecting the XML, I see that the workbook defines 1 cell at A1 and 1 million
rows:
<sheetFormatPr defaultColWidth="0" defaultRowHeight="15" zeroHeight="1"/>
<cols>
  <col min="1" max="1" width="9.140625" customWidth="1"/>
  <col min="2" max="16384" width="9.140625" hidden="1"/>
</cols>
<sheetData>
  <row r="1" spans="1:1">
    <c r="A1" t="s"><v>0</v></c>
  </row>
  <row r="2" spans="1:1" hidden="1"/>
  <row r="3" spans="1:1" hidden="1"/>
  ...
  <row r="16" spans="1:1" hidden="1"/>
  <row r="17" hidden="1"/>
  ...
  <row r="1048556" hidden="1"/>
  <row r="1048557" hidden="1"/>
</sheetData>

POI uses a TreeMap<Integer, XSSFRow> (row number to row) to store these rows
for fast random and sequential access, at the cost of some memory. The Oracle
Java 8 docs state that TreeMap is implemented as a self-balancing pointer-based
red-black tree. This should be fine.

To test that your JVM can handle 1 million items in a TreeMap, do something
like the following:
import java.util.Random;
import java.util.SortedMap;
import java.util.TreeMap;

public class TreeMapMemoryTest {
    private static final Random rand = new Random();

    private static Object createFakeRow(int nbytes) {
        byte[] row = new byte[nbytes];
        rand.nextBytes(row);
        return row;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Object> rows = new TreeMap<>();
        for (int r = 1; r <= 1048557; r++) {
            // create an object that consumes 10 KB of RAM in place of a real XSSFRow
            rows.put(r, createFakeRow(10 * 1024));
        }
    }
}

If that works without issue, then the next focus is on what XMLBeans is doing
as it's reading sheet1.xml.
You could look at how POI unzips the file into memory and reads each XML file
into an XML DOM using XMLBeans and how much extra memory is consumed by the CT
classes.
Before we blame XMLBeans, we'd have to fairly compare it with JAXB.
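
One rough way to see how much heap a load step costs is to compare used heap
before and after it while the result is still reachable. The sketch below is
only an approximation: System.gc() is a hint, the HeapDelta class and its
method names are made up for illustration, and the byte array in main() stands
in for whatever call actually opens the workbook.

```java
import java.util.function.Supplier;

final class HeapDelta {
    /** Heap currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the loader and returns the approximate heap growth it caused. */
    static <T> long measure(Supplier<T> loader) {
        System.gc(); // only a hint; the JVM may ignore it
        long before = usedHeapBytes();
        T loaded = loader.get();
        System.gc();
        long after = usedHeapBytes();
        // Touch the result after measuring so it stays reachable until here.
        if (loaded == null) {
            throw new IllegalStateException("loader returned null");
        }
        return after - before;
    }

    public static void main(String[] args) {
        // Placeholder workload (assumption): 16 MB array instead of a workbook.
        long delta = measure(() -> new byte[16 * 1024 * 1024]);
        System.out.println("approx. heap growth: " + delta + " bytes");
    }
}
```

Running the same measurement around the TreeMap test and around the real
workbook load would give two numbers to compare, which helps separate the cost
of the row map from the cost of the XML parsing.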

Here's a Google Drive link to this file since the file exceeds the file size
limit of bugzilla:
https://drive.google.com/file/d/0B2v9cndcBwIWeURpNXVrSjJYbDg/view

Based on your reported 9 GB of RAM, the amortized size of each XSSFRow would be
roughly 10 KB.
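
The 10 KB figure is just the reported heap divided by the row count. A quick
check, assuming "9 GB" means 9 GiB and that essentially all of it went to rows
(both are assumptions, and the class name is only illustrative):

```java
final class RowSizeEstimate {
    /** Average bytes available per row if the whole heap were spent on rows. */
    static long bytesPerRow(long heapBytes, long rowCount) {
        return heapBytes / rowCount;
    }

    public static void main(String[] args) {
        long heapBytes = 9L * 1024 * 1024 * 1024; // assumed: 9 GiB
        long rowCount = 1_048_557L;               // last row number in sheet1.xml
        // prints "9216 bytes per row"
        System.out.println(bytesPerRow(heapBytes, rowCount) + " bytes per row");
    }
}
```

That is about 9 KB per row, which rounds to the 10 KB used for the fake rows in
the TreeMap test.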

To make sure this really isn't a bug in POI's handling of hidden rows or
columns, we'd need to test for OOM on a workbook with the same number of rows
defined but all of them visible.
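
A comparison file could be produced by regenerating the <sheetData> element
without the hidden="1" attributes. This is a minimal sketch, not a complete
workbook writer: the class and method names are hypothetical, the output would
still have to replace <sheetData> in xl/worksheets/sheet1.xml and be re-zipped,
and the zeroHeight and hidden attributes on <sheetFormatPr> and <cols> would
likely need to be dropped as well.

```java
final class VisibleRowsSheetData {
    /** Builds a sheetData fragment: one cell at A1, then empty visible rows 2..lastRow. */
    static String build(int lastRow) {
        StringBuilder sb = new StringBuilder();
        sb.append("<sheetData>\n");
        sb.append("  <row r=\"1\" spans=\"1:1\">\n");
        sb.append("    <c r=\"A1\" t=\"s\"><v>0</v></c>\n");
        sb.append("  </row>\n");
        for (int r = 2; r <= lastRow; r++) {
            // Same shape as the rows in the original file, minus hidden="1".
            sb.append("  <row r=\"").append(r).append("\"/>\n");
        }
        sb.append("</sheetData>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // Small demo; pass 1048557 to match the row count of the reported file.
        System.out.print(build(5));
    }
}
```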

I think you have enough ideas here to try to figure out where the problem is,
which is needed before a potential fix can be written.

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
For additional commands, e-mail: dev-h...@poi.apache.org
