Hi,
I am glad that you are raising this subject since I had a prophylactic look
into this a while back.

If I understand you correctly, your concern is that the shared
strings table currently used in SXSSF can exhaust memory.

Let me sketch out what I have done so far on this topic (It's been a
while so I might be inaccurate on some details).

The current implementation stores shared strings in a HashMap. This can
exhaust memory on very large sheets.
The first solution that came to my mind was to write
every string to the file without checking for uniqueness (I am not sure
whether the Excel format actually requires the strings in the shared
strings part to be unique or not).
That would create a large file, but the subsequent compression would
probably take care of that. There may be problems when Excel loads the
file; perhaps it runs out of memory, but perhaps not. That would have to
be tried.
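To make that no-deduplication variant concrete, here is a rough sketch (hypothetical class and method names, not POI's actual API) of streaming every string straight into the shared strings part and letting each cell reference the running index of the entry just written:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: append every cell string to the shared strings XML
// without any uniqueness check. The caller stores the returned running
// index in the cell, so duplicates simply occupy multiple entries.
class NonUniqueSharedStringsWriter implements Closeable {
    private final Writer out;
    private int count = 0;

    NonUniqueSharedStringsWriter(OutputStream os) {
        out = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8));
    }

    /** Appends the string and returns the index the cell should reference. */
    int add(String s) throws IOException {
        out.write("<si><t>");
        out.write(escape(s));
        out.write("</t></si>");
        return count++;
    }

    // Minimal XML escaping; '&' must be replaced before '<' and '>'.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Memory stays constant regardless of sheet size, at the cost of a fatter (pre-compression) file.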

The next thing I explored was doing string interning on disk.
I tried three methods:
1) By trie (An unoptimized radix tree)
2) By B-Tree
3) By a disk hash map
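For concreteness, the contract that all three variants share could look like this (hypothetical interface, with a heap-backed version only as a reference; the trie, B-tree, and disk hash map would each implement the same interface against a file):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical common contract for the three on-disk interning strategies:
// map each distinct string to a stable shared-strings-table index.
interface StringInterner {
    int intern(String s);   // returns the index for s, assigning a new one on first sight
    int size();             // number of distinct strings seen so far
}

// Heap-backed reference implementation, useful for correctness checks;
// the disk variants would behave identically but store entries on disk.
class HeapStringInterner implements StringInterner {
    private final Map<String, Integer> indexByString = new HashMap<>();
    private int next = 0;

    @Override
    public int intern(String s) {
        Integer idx = indexByString.get(s);
        if (idx == null) {
            idx = next++;
            indexByString.put(s, idx);
        }
        return idx;
    }

    @Override
    public int size() {
        return indexByString.size();
    }
}
```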

I benchmarked these three and it turned out that the hash gave the best
performance (I will share the code with anyone interested upon request).
Normally B-trees are very good, but since the strings do not have a fixed
length we need an extra indirection to a string table, and that ruins the
performance (minimizing the number of seeks is crucial for performance).

Implementation based on HashMap:
For a test I took the original code of java.util.HashMap and created
from that a class called DiskHashMap.
That class uses an implementation of "RandomAccessStorage" to store the
hash map's data.
I created two implementations of RandomAccessStorage, namely
ByteArrayRandomAccessStorage and FileRandomAccessStorage.
The first uses the heap in the form of a growing byte array and the latter
uses the disk.
Since both implement the same interface, we could use the much faster
byte-array-based storage until a certain threshold is reached and then
switch to the file-based storage and continue from there.
The threshold can be specified very accurately (e.g. 10 MB), since
at all times we know exactly how many bytes the byte array occupies and
we don't rely on unpredictable JVM heap space numbers.
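A minimal sketch of that storage abstraction might look like this (invented names and signatures, since I am reconstructing from memory): a single seek/read/write interface with a heap-backed implementation; a file-backed twin wrapping java.io.RandomAccessFile would implement the same interface, and a wrapper could copy the bytes across once the threshold is crossed.

```java
import java.io.IOException;
import java.util.Arrays;

// Hypothetical storage abstraction: positional read/write plus an exact
// byte count, so the spill-to-disk threshold can be checked precisely.
interface RandomAccessStorage {
    void write(long pos, byte[] src, int off, int len) throws IOException;
    void read(long pos, byte[] dst, int off, int len) throws IOException;
    long size();
}

// Heap-backed implementation using a growing byte array.
class ByteArrayRandomAccessStorage implements RandomAccessStorage {
    private byte[] buf = new byte[1024];
    private long size = 0;

    @Override
    public void write(long pos, byte[] src, int off, int len) {
        long end = pos + len;
        if (end > buf.length) {
            // Grow geometrically to keep amortized cost low.
            buf = Arrays.copyOf(buf, Math.max(buf.length * 2, (int) end));
        }
        System.arraycopy(src, off, buf, (int) pos, len);
        size = Math.max(size, end);
    }

    @Override
    public void read(long pos, byte[] dst, int off, int len) {
        System.arraycopy(buf, (int) pos, dst, off, len);
    }

    @Override
    public long size() {
        return size;   // exact byte count, independent of JVM heap statistics
    }
}
```

When size() exceeds the configured threshold (say 10 MB), the bytes are copied once into the file-backed implementation and all further operations go to disk.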

You say that you experimented with "MapDB" but I have personally never
used it, so I can't say whether it is applicable to this problem. The disk
hash map I have made has no memory usage worth mentioning and should be
quite solid since it is dead simple.
Shooting from the hip, I would say that a "whole database" might be a
bit of overkill for such a small problem, and it also carries the risk
that you cannot be sure the main issue (a very low memory footprint)
has been properly addressed.
I don't know yet how much code in SXSSF needs to be changed to get the
maximum benefit from this. Naively, I would say that we keep regular Java
strings in the rows of SXSSF's sliding window, and when we flush
we use the disk hash map as a string interner to compute the string
table index, which we then write to the file. Then, when the document is
complete, we create an XML representation of the disk hash map file.
Looking at the current code I see that things are different (apparently
there is a CTRst instance for every interned string), but I don't know
the existing code well enough (I didn't make that patch in SXSSF) to know
if there are lots of issues that I have not considered in my naive approach.
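The naive flush path above could be sketched like this (invented names, not POI's actual API; a LinkedHashMap stands in for the disk hash map, and iterating its entries in index order at the end would produce the shared strings XML):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rows hold plain Java strings while they sit in
// SXSSF's sliding window; on flush each string is interned and only the
// resulting shared-strings index is written into the sheet XML.
class FlushSketch {
    // Stand-in for the disk hash map; preserves insertion (= index) order.
    private final Map<String, Integer> interner = new LinkedHashMap<>();

    int intern(String s) {
        Integer idx = interner.get(s);
        if (idx == null) {
            idx = interner.size();
            interner.put(s, idx);
        }
        return idx;
    }

    /** Renders one flushed row, replacing each string by its table index. */
    String flushRow(int rowNum, List<String> cells) {
        StringBuilder sb = new StringBuilder("<row r=\"").append(rowNum).append("\">");
        for (String cell : cells) {
            sb.append("<c t=\"s\"><v>").append(intern(cell)).append("</v></c>");
        }
        return sb.append("</row>").toString();
    }
}
```

The key point is that after a row is flushed, only the integer indexes survive in memory; the strings themselves live in the interner (on disk, in the real design).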

At this point we really need Nick, Yegor or someone else to give their advice.

Kind regards,
Alex



--
View this message in context: 
http://apache-poi.1045710.n5.nabble.com/Using-MapDB-to-reduce-memory-footprint-of-shared-strings-table-in-SXSSF-tp5717375p5717378.html
Sent from the POI - Dev mailing list archive at Nabble.com.
