Re: HSSF and XSSF memory usage, some numbers

Nick Burch Tue, 19 Apr 2011 04:50:29 -0700

On Tue, 19 Apr 2011, Alex Geller wrote:

We need different styles, colspan, rowspan, etc. because the output is
supposed to resemble the layout of the report as closely as possible. This
keeps us from using the csv trick. For the same reason, we suspect that the
XML zip injection trick (see  http://www.realdevelopers.com/blog/code/excel
Streaming xlsx files ) that can also be found on this forum
cannot be applied either.
Is this assumption correct? The XML for the data looks straightforward but
what about other issues like cell styles?

Take a look at the BigGridDemo - it may do what you need it to. The idea
there is to use the friendly UserModel code to generate fiddly bits like
fonts, styles, formatting etc. Then, generate the data with low level
xml streaming, and merge the two.

This should hopefully let you generate a fairly rich file, with lots of
data, without using much memory.
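As a rough illustration of the "stream the data, then merge" step, here is a minimal sketch that uses only the JDK. It is not the BigGridDemo code itself. The class and method names, the sheet entry path and the style index are all assumptions made for the example, and it assumes a template .xlsx (with the styles already defined through the UserModel) has been saved beforehand.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: names are illustrative, not part of POI.
class SheetSubstitutionSketch {

    /** Streams rows straight to the writer, so no cell objects are kept in memory. */
    static void writeSheetXml(Writer out, int rows, int cols, int styleIndex) throws IOException {
        out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.write("<worksheet xmlns=\"http://schemas.openxmlformats.org/spreadsheetml/2006/main\"><sheetData>");
        for (int r = 1; r <= rows; r++) {
            out.write("<row r=\"" + r + "\">");
            for (int c = 0; c < cols; c++) {
                // "s" refers to a cell style assumed to exist in the template's styles part.
                out.write("<c r=\"" + columnName(c) + r + "\" s=\"" + styleIndex + "\"><v>" + (r * c) + "</v></c>");
            }
            out.write("</row>");
        }
        out.write("</sheetData></worksheet>");
    }

    /** Converts a zero-based column index to A, B, ..., Z, AA, AB, ... */
    static String columnName(int index) {
        StringBuilder sb = new StringBuilder();
        for (int n = index; n >= 0; n = n / 26 - 1) {
            sb.insert(0, (char) ('A' + n % 26));
        }
        return sb.toString();
    }

    /** Copies every entry of the template zip except sheetEntry, then appends the streamed sheet. */
    static byte[] substitute(byte[] templateZip, String sheetEntry,
                             int rows, int cols, int styleIndex) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(templateZip));
             ZipOutputStream out = new ZipOutputStream(result)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                if (entry.getName().equals(sheetEntry)) {
                    continue; // drop the placeholder sheet from the template
                }
                out.putNextEntry(new ZipEntry(entry.getName()));
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                out.closeEntry();
            }
            out.putNextEntry(new ZipEntry(sheetEntry));
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writeSheetXml(writer, rows, cols, styleIndex);
            writer.flush(); // flush only: closing the writer would close the zip stream
            out.closeEntry();
        }
        return result.toByteArray();
    }
}
```

For a real report the byte arrays would be replaced by file streams, so that neither the template nor the output has to fit in memory, and cell text would need XML escaping.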

Translated into rows and cells this means that with a heap space of 265 MB,
one can produce 101,000 rows using HSSF and only 12,300 rows
using XSSF. Using XSSF we can't even get over the 65535 limit with the
maximum of 1.1 GB heap space.

The usual answer for XSSF is either to use something along the lines of
BigGridDemo, or just bump up your heap size (8gb memory modules
usually cost something like half a day's billable rate for a programmer,
so you can buy a lot of memory for the price of someone optimising the
code...)

There are references to both the sheet (_sheet) and the workbook
(_book). Isn’t it possible to remove _book and implement getBoundWorkbook()
as getSheet().getWorkbook()?

Possibly. Are you able to use your test rig to check the performance
impact of this?
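The change being proposed can be sketched with stand-in classes. These are not the real HSSF classes. The names are hypothetical and only show the cell dropping its own workbook field and reaching the workbook through its sheet.

```java
// Hypothetical stand-ins for the real classes, for illustration only.
class DelegationSketch {

    static class Workbook { }

    static class Sheet {
        private final Workbook workbook;

        Sheet(Workbook workbook) {
            this.workbook = workbook;
        }

        Workbook getWorkbook() {
            return workbook;
        }
    }

    /** The cell keeps only the sheet reference; there is no separate _book field. */
    static class Cell {
        private final Sheet sheet;

        Cell(Sheet sheet) {
            this.sheet = sheet;
        }

        Sheet getSheet() {
            return sheet;
        }

        /** One extra pointer hop per call instead of one extra field per cell. */
        Workbook getBoundWorkbook() {
            return getSheet().getWorkbook();
        }
    }
}
```

The saving is at most one reference per cell, and object alignment in the JVM may absorb part of it, which is why measuring it on a test rig matters.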

The values are apparently stored in _record based on _cellType to cater
for the different data types (double, date string ..). Why not get rid
of the type field and query the value for the type (getCellType() {
return _record.getCellType(); })? The case of setting a style before a
value can be handled by assigning a "type only" value.

I think we've generally gone for the simplest option. If you can see how
this'd work and would save memory, please send in a patch and we'll look
at applying it.

It seems that the member variable _stringValue is used to store string
values. Couldn't this be stored in _record?

We need to store the parsed form somewhere. Wouldn't it be the same memory
use no matter if we stored it against the cell or the cell's record?

The member variable _comment apparently stores a cell comment.

Finding a cell's comment is a bit tricky, so we cache it once we locate
it.

Assuming that on average there are more values than comments one could
surely find a more efficient storage strategy. As an example one could
introduce extra value types so that for every cell record type there is a
commented and a non commented version (e.g. DoubleCellValueRecord,
CommentedDoubleCellValueRecord).

Hmm, that doesn't look very clean to me. One thing that we could do is
push the cache down into the sheet, since that's where the records are
stored. If we used a map there to cache the comments once created from
records, that'd probably help with the memory footprint, wouldn't it?
Assuming so, please send in a patch and I'll review + apply.
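A minimal sketch of what a sheet-level comment cache might look like, using stand-in classes rather than the real HSSF ones. All names here are hypothetical, and the record lookup is replaced by a plain map so the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins, not the real HSSF classes.
class SheetCommentCacheSketch {

    static class Comment {
        final String text;

        Comment(String text) {
            this.text = text;
        }
    }

    static class Sheet {
        // Stand-in for the comment records held by the sheet.
        private final Map<Long, String> commentRecords = new HashMap<>();
        // Cache of comment objects already built from records, keyed by (row, column).
        private final Map<Long, Comment> commentCache = new HashMap<>();
        int recordScans = 0; // counts how often the "tricky" lookup ran

        private static long key(int row, int col) {
            return ((long) row << 32) | (col & 0xFFFFFFFFL);
        }

        void addCommentRecord(int row, int col, String text) {
            commentRecords.put(key(row, col), text);
        }

        Comment getCellComment(int row, int col) {
            long k = key(row, col);
            Comment cached = commentCache.get(k);
            if (cached == null) {
                cached = findCommentInRecords(k);
                if (cached != null) {
                    commentCache.put(k, cached); // only commented cells occupy the cache
                }
            }
            return cached;
        }

        private Comment findCommentInRecords(long k) {
            recordScans++;
            String text = commentRecords.get(k);
            return text == null ? null : new Comment(text);
        }
    }

    /** The cell no longer carries a _comment field; it asks its sheet instead. */
    static class Cell {
        private final Sheet sheet;
        private final int row;
        private final int col;

        Cell(Sheet sheet, int row, int col) {
            this.sheet = sheet;
            this.row = row;
            this.col = col;
        }

        Comment getCellComment() {
            return sheet.getCellComment(row, col);
        }
    }
}
```

In this sketch a cell without a comment costs nothing in the cache, but it is looked up in the records again on every call, so a real patch would need to decide whether misses should be remembered too.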

Looking at the storage method used in the rows (HSSFRow) to store the
cells, there is also potential for simple memory optimization.
Currently, the rows are stored in a vector that grows the capacity by
doubling, starting with an initial size of 5. A spread sheet of 81
columns and 400,000 rows wastes (79*400,000=32MB).
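The arithmetic behind that figure can be checked with a small sketch. It assumes exactly the growth rule described above (start at 5, double until the row fits). The class and method names are made up for the example.

```java
// Hypothetical helper that reproduces the doubling arithmetic described above.
class RowCapacitySketch {

    /** Capacity reached by doubling from initialCapacity until all cells fit. */
    static int capacityFor(int cells, int initialCapacity) {
        int capacity = initialCapacity;
        while (capacity < cells) {
            capacity *= 2;
        }
        return capacity;
    }

    /** Total number of allocated but unused slots across all rows. */
    static long unusedSlots(int cellsPerRow, int rows, int initialCapacity) {
        return (long) (capacityFor(cellsPerRow, initialCapacity) - cellsPerRow) * rows;
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(81, 5));          // 5, 10, 20, 40, 80, 160 -> 160
        System.out.println(unusedSlots(81, 400_000, 5)); // 79 * 400,000 = 31600000
    }
}
```

That is 31.6 million unused slots. The number of bytes depends on how large a reference is on the JVM in use: at 4 bytes per reference it would come to roughly 126 MB, so the 32MB quoted above looks like a count of slots rather than bytes.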

I'd be tempted to switch this to just using an ArrayList, instead of
handling it ourselves. We could probably also do something smart with the
sizing of the row when reading in, because we can probably figure out then
how many cells we have. Would that help for your case? If so, please
either send in a patch, or give me a shout and I'd be happy to tackle that.

Keeping a list of the row widths seen so far can make the allocation
much faster and avoid the waste. Aren't most sheets square so that the
list would have only one entry?

Not sure where that logic is to check, but if you'd like to send in a
patch I'll happily review it, or point me at the code and I'll look and
comment :)

Regarding XSSF it seems that there is a more basic problem. Can an
all-purpose (xmlbeans) model be as efficient as a custom model?


Almost certainly not. It's a hell of a lot quicker to code though!

Can the memory consumption realistically be lowered from the current 630
bytes/cell to 37 bytes/cell without significant loss of performance
(which isn't great to begin with)?

It's not impossible that something specific and lightweight could be coded
up for a few hot bits, though I've never tried it. The issue is that at
the moment, most of the people volunteering their time to work on POI
can't spend as long as they'd like working on POI. The resource that's
short is programmer time, and for us it isn't memory (there are
workarounds like BigGridDemo that work well enough).

If the XSSF memory is a problem for you, and if you have some programmer
time to throw at it, we'd love for you to help! However so far everyone
who's hit problems has either switched to BigGridDemo, or thrown a grand
at their favourite server manufacturer and bought 16gb of memory to make
the problem go away...

A solution that would perhaps solve the problem would be to have a
common in-memory model for both HSSF and XSSF and just have two separate
serializers for the different formats.

The two formats probably aren't quite close enough for this. We've got
common interfaces, but the code underneath is different enough that it
needs different logic. Some bits are common, see the concrete classes in
org.apache.poi.ss.usermodel and ss.util for those, but the rest currently
needs to differ.

Otherwise, thanks for doing all this checking! And if you have some time
to help work on solutions, we'd love for you to help and send in patches
to improve things :)


Cheers
Nick
