Hi,

we are experiencing problems with the memory consumption of HSSF and, in
particular, of XSSF documents, which is far too large for our application.
We'd appreciate someone taking the time to look into this; perhaps we're
just doing something the wrong way.


Our application has the following constraints:

We are generating Excel spreadsheets as yet another output option to a 
reporting tool similar to Jasper or Birt. The data arrives in streaming
manner.

Our users produce large documents and we can't tell whether that is silly or 
not. We don't know how big the largest documents will be but the report we
were given produces a sheet of 150,000 rows x 33 cells.

We are asked to use XSSF over HSSF in order to get all the data into a single
sheet instead of creating spillover sheets every 65,536 rows.

We need different styles, colspan, rowspan, etc., because the output is
supposed to resemble the layout of the report as closely as possible. This
keeps us from using the CSV trick. For the same reason, we suspect that the
XML zip injection trick (see "Streaming xlsx files" at
http://www.realdevelopers.com/blog/code/excel , which can also be found on
this forum) cannot be applied either.
Is this assumption correct? The XML for the data looks straightforward, but
what about other issues like cell styles?
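To make the question concrete, the data part of such an injected sheet would be written roughly like this (a sketch only; the class and method names are ours, not a real API):

```java
/** Rough sketch of streaming the sheet-data part of an xlsx: one <row>
 *  element per report row. Cell styles, our open question, would need an
 *  s="..." attribute per cell referencing an entry in styles.xml, and
 *  merged regions would need <mergeCells> entries. */
public class SheetXmlWriter {
    // Renders one <row> element with inline numeric cells.
    public static String rowXml(int rowIndex, double[] values) {
        StringBuilder sb = new StringBuilder();
        sb.append("<row r=\"").append(rowIndex + 1).append("\">");
        for (int c = 0; c < values.length; c++) {
            sb.append("<c r=\"").append(columnName(c)).append(rowIndex + 1)
              .append("\"><v>").append(values[c]).append("</v></c>");
        }
        return sb.append("</row>").toString();
    }

    // Zero-based column index to Excel column name (0 -> "A", 26 -> "AA").
    static String columnName(int col) {
        StringBuilder sb = new StringBuilder();
        for (int c = col; c >= 0; c = c / 26 - 1) {
            sb.insert(0, (char) ('A' + c % 26));
        }
        return sb.toString();
    }
}
```

The `<v>` values themselves are trivial to stream; it is everything around them (styles, merges, shared strings) that makes us doubt the trick applies to our reports.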

Our documents are square and contain no empty rows or columns.



We ran some tests with HSSF and XSSF. The test document contains a matrix
of cells, each cell containing a "double" value; no cell styles are set.
To measure the memory consumption, we ran the test program with varying
column counts (8, 16 and 33) and with as many rows as possible before
running into an OutOfMemoryError. We ran the tests against the default
heap space (-Xmx64m), against 256 MB, and against 1.1 GB, which is the limit
for Windows.


The relevant part of the code is as follows (the entire program is 104 lines
long and can be posted on request):


import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import java.io.FileOutputStream;
import java.io.IOException;

public class PoiTest
{
    public static void main(String[] args) throws IOException
    {
        if(args.length!=4) usage("need four command arguments");

        Workbook workBook=createWorkbook(args[0]);
        boolean isHType=workBook instanceof HSSFWorkbook;

        int rows=parseInt(args[1],"Failed to parse rows value as integer");
        int cols=parseInt(args[2],"Failed to parse cols value as integer");
        boolean saveFile=parseInt(args[3],"Failed to parse saveFile value as integer")!=0;

        Sheet sheet=workBook.createSheet("Main Sheet");
        int sheetNo=0;
        int rowIndexInSheet=0;
        double value=0;
        for(int rowIndex=0;rowIndex<rows;rowIndex++)
        {
            // HSSF sheets hold at most 0x10000 (65,536) rows; spill over into a new sheet.
            if(isHType&&sheetNo!=rowIndex/0x10000)
            {
                sheet=workBook.createSheet("Spillover from sheet "+(++sheetNo));
                rowIndexInSheet=0;
            }

            Row row=sheet.createRow(rowIndexInSheet);
            for(int colIndex=0;colIndex<cols;colIndex++)
            {
                Cell cell=row.createCell(colIndex);
                cell.setCellValue(value++);
            }
            rowIndexInSheet++;
        }

        if(saveFile)
        {
            FileOutputStream out=new FileOutputStream("PoiTest."+(isHType?"xls":"xlsx"));
            workBook.write(out);
            out.close();
        }
    }

    private static void usage(String message)
    {
        System.err.println(message);
        System.err.println("usage: PoiTest HSSF|XSSF <rows> <cols> <saveFile: 0|1>");
        System.exit(1);
    }

    private static Workbook createWorkbook(String type)
    {
        if(type.equals("HSSF")) return new HSSFWorkbook();
        if(type.equals("XSSF")) return new XSSFWorkbook();
        usage("unknown workbook type: "+type);
        return null; // never reached
    }

    private static int parseInt(String text,String message)
    {
        try { return Integer.parseInt(text); }
        catch(NumberFormatException e) { usage(message); return 0; }
    }
}


The good news is that processing time and memory consumption grow linearly
with the document size. The bad news is that the memory consumption in
particular is much too high for us. The values are:


HSSF: 77 bytes/cell

XSSF: about 630 bytes/cell



Translated into rows and cells, this means that with a heap space of 256 MB
one can produce 101,000 rows using HSSF but only 12,300 rows using XSSF.
Using XSSF we can't even get past the 65,536-row limit with the maximum of
1.1 GB of heap space.
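As a sanity check, these row counts follow from simply dividing the heap by the per-cell cost times the column count (the helper below and its name are ours):

```java
/** Back-of-the-envelope check of the row counts quoted above. */
public class HeapBudget {
    // Rows that fit into a heap at a measured per-cell cost and column count.
    public static long rowsFor(long heapBytes, int bytesPerCell, int cols) {
        return heapBytes / ((long) bytesPerCell * cols);
    }
}
```

256 MB / (77 x 33) is about 105,000 rows and 256 MB / (630 x 33) is about 12,900 rows, which matches the measured 101,000 and 12,300 within the usual bookkeeping overhead.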



In order to understand the numbers, we wrote a "naive" model using
TreeMaps for both rows and cells and got figures similar to the HSSF
values. By replacing the TreeMap in the rows with an array, the memory
consumption dropped to 37 bytes per cell.
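For reference, the two naive variants look roughly like this (a sketch only; the class names are ours):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Sketch of the two naive models compared above. */
public class NaiveModels {
    // Variant 1: TreeMaps for both rows and cells (~80 bytes/cell in our tests).
    static class TreeModelSheet {
        final TreeMap<Integer, TreeMap<Integer, Double>> rows = new TreeMap<>();
        void set(int row, int col, double value) {
            rows.computeIfAbsent(row, r -> new TreeMap<>()).put(col, value);
        }
    }
    // Variant 2: a plain array per row (~37 bytes/cell): enough for our dense,
    // write-once, left-to-right/top-to-bottom sheets.
    static class ArrayModelSheet {
        final List<double[]> rows = new ArrayList<>();
        void addRow(double[] cells) { rows.add(cells); }
    }
}
```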
We naively assumed that a cell needed only two member variables: a
reference to the row (or sheet) and a reference to the value object. We
looked at the class HSSFCell to see what member variables are actually
used and found the following:

    _book        (reference to the containing workbook)
    _sheet       (reference to the containing sheet)
    _cellType    (the cell's data type)
    _record      (the underlying value record)
    _stringValue (cached string value)
    _comment     (optional cell comment)

There are references to both the sheet (_sheet) and the workbook
(_book). Wouldn't it be possible to remove _book and implement
getBoundWorkbook() as getSheet().getWorkbook()?


The values are apparently stored in _record, with _cellType catering for
the different data types (double, date, string, ...). Why not get rid of
the type field and query the record for the type: getCellType() {
return _record.getCellType(); }? The case of a style being set before a
value can be handled by assigning a "type only" value.
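A sketch of what we mean, with a hypothetical value-record interface of our own (the numeric type constants match Cell.CELL_TYPE_NUMERIC and Cell.CELL_TYPE_BLANK in POI 3.7):

```java
/** Sketch: derive the cell type from the record instead of a separate field. */
public class TypedCell {
    // Hypothetical value record that knows its own type.
    interface ValueRecord { int getCellType(); }

    static class NumberRecord implements ValueRecord {
        final double value;
        NumberRecord(double value) { this.value = value; }
        public int getCellType() { return 0; } // CELL_TYPE_NUMERIC
    }
    // "Type only" record used when a style is set before any value.
    static class BlankRecord implements ValueRecord {
        public int getCellType() { return 3; } // CELL_TYPE_BLANK
    }

    private ValueRecord record = new BlankRecord();

    void setValue(double v) { record = new NumberRecord(v); }

    // No _cellType field: the type is queried from the record.
    int getCellType() { return record.getCellType(); }
}
```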


It seems that the member variable _stringValue is used to store string
values. Couldn't these be stored in _record?


The member variable _comment apparently stores a cell comment.
Assuming that on average there are more values than comments, one could
surely find a more efficient storage strategy. For example, one could
introduce extra value types so that for every cell record type there is a
commented and a non-commented version (e.g. DoubleCellValueRecord,
CommentedDoubleCellValueRecord).


Looking at the storage method used in the rows (HSSFRow) to store the cells,
there is also potential for simple memory optimization. Currently, the cells
are stored in an array that grows its capacity by doubling, starting with an
initial size of 5. A spreadsheet of 81 columns and 400,000 rows therefore
gets 160 slots per row and wastes 79 of them (79 x 400,000, roughly 32
million unused entries). Keeping a list of the row widths seen so far would
make the allocation much faster and avoid the waste. Aren't most sheets
square, so that the list would have only one entry?




All these remarks are made based on a quick glance at the code so 
there might be a very good explanation why things need to be the way 
they are.


Regarding XSSF, there seems to be a more basic problem: can an all-purpose
(XMLBeans-based) model ever be as efficient as a custom model? Can the
memory consumption realistically be lowered from the current 630 bytes/cell
to 37 bytes/cell without a significant loss of performance (which isn't
great to begin with)? An optimized model can exploit the knowledge that
sheets are commonly square, that they tend to be taller than wide, that
there is usually far more data than comments, that the cell style tends to
be the same over larger rectangular areas, and so on. Can the all-purpose
model ever match the efficiency of a model that takes these issues into
account? Wouldn't it likely treat comments the same way as values and
manage cell styles wastefully, since it doesn't know anything about the
typical use of styles?


A solution that might solve the problem would be a common in-memory model
for both HSSF and XSSF, with two separate serializers for the two formats.
We would also appreciate the possibility of plugging in a custom model,
since the general model needs to be efficient for random access, which we
don't need at all in our application. Instead, our model is written
left-to-right, top-to-bottom, and it is read only at the very end, when the
document is saved to disk. A model designed for this access pattern only
can be implemented very efficiently.
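To illustrate what such a write-once model could look like (a sketch only; the XML-ish output is a placeholder, not real BIFF or OOXML serialization):

```java
import java.io.IOException;
import java.io.Writer;

/** Sketch of the write-once model described above: each row is serialized
 *  the moment it is finished, so memory use is bounded by a single row,
 *  not by the whole sheet. */
public class WriteOnceSheet {
    private final Writer out;
    private int rowIndex;

    public WriteOnceSheet(Writer out) { this.out = out; }

    // Appends one finished row and forgets it.
    public void appendRow(double[] cells) throws IOException {
        out.write("<row r=\"" + (++rowIndex) + "\">");
        for (double v : cells) out.write("<c><v>" + v + "</v></c>");
        out.write("</row>");
    }
}
```

Random access is impossible in such a model, but our reporting use case never needs it.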
 

Find all the results of the tests in the tables below (all tests were done
with POI version 3.7):



#Space tests

#time java  -Xmx64m  PoiTest        HSSF   25200 33 0 #  77 byte/cell, 104 %
#time java  -Xmx64m  NaiveModelTest NAIVE  24310 33 0 #  80 byte/cell, 100 %
#time java  -Xmx64m  PoiTest        XSSF    3050 33 0 # 636 byte/cell,  12 %

#time java -Xmx256m  PoiTest        HSSF  101000 33 0 #  76 byte/cell, 104 %
#time java -Xmx256m  NaiveModelTest NAIVE  97300 33 0 #  80 byte/cell, 100 %
#time java -Xmx256m  NaiveModelTest ARRAY 210000 33 0 #  37 byte/cell, 216 % !
#time java -Xmx256m  PoiTest        XSSF   12300 33 0 # 631 byte/cell,  13 %

#time java -Xmx256m  PoiTest        HSSF  192500 16 0 #  83 byte/cell, 100 %
#time java -Xmx256m  NaiveModelTest NAIVE 193000 16 0 #  83 byte/cell, 100 %
#time java -Xmx256m  PoiTest        XSSF   25000 16 0 # 640 byte/cell,  13 %

#time java -Xmx256m  PoiTest        HSSF  336000  8 0 #  95 byte/cell,  93 %
#time java -Xmx256m  NaiveModelTest NAIVE 361000  8 0 #  83 byte/cell, 100 %
#time java -Xmx256m  PoiTest        XSSF   48000 8 0  # 640 byte/cell,  13 %

#time java -Xmx1100m PoiTest        HSSF  434000 33 0 #  77 byte/cell, 104 %
#time java -Xmx1100m NaiveModelTest NAIVE 417000 33 0 #  80 byte/cell, 100 %
#time java -Xmx1100m PoiTest        XSSF   53100 33 0 # 628 byte/cell,  13 %

#Speed tests (test document generation in memory without saving to disk)

#time java -Xmx1100m PoiTest        HSSF  380000 33 0 # (16s),  784,000 cells/s, 113 %
#time java -Xmx1100m NaiveModelTest NAIVE 380000 33 0 # (18s),  667,000 cells/s, 100 %
#time java -Xmx1100m NaiveModelTest ARRAY 380000 33 0 # ( 7s), 1,791,000 cells/s, 269 %
#time java -Xmx1100m PoiTest        XSSF   50000 33 0 # (36s),   45,800 cells/s,   7 % (5.8 % of HSSF performance) !

#Speed tests including saving to the disk

#time java -Xmx256m  PoiTest        HSSF  50000 33 1 # ( 13s, 31   MB), 127,000 cells/s, 19   byte/cell on disk
#time java -Xmx256m  NaiveModelTest NAIVE 50000 33 1 # ( 45s, 59   MB),  37,000 cells/s, 35   byte/cell on disk
#time java -Xmx256m  NaiveModelTest ARRAY 50000 33 1 # ( 25s, 42   MB),  66,000 cells/s, 25   byte/cell on disk
#time java -Xmx1100m PoiTest        XSSF  40000 33 1 # (110s,  4.6 MB),  12,000 cells/s,  3.5 byte/cell on disk !



I’m looking forward to your comments. Thanks a lot for your time,

Alex



--
View this message in context: 
http://apache-poi.1045710.n5.nabble.com/HSSF-and-XSSF-memory-usage-some-numbers-tp4312784p4312784.html
Sent from the POI - User mailing list archive at Nabble.com.
