On Mon, 15 Dec 2014, Alex Geller wrote:
The next thing I explored was doing string interning on disk.
I tried three methods:
1) By trie (An unoptimized radix tree)
2) By B-Tree
3) By a disk hash map

I benchmarked all three, and the hash map gave the best performance
(I will share the code with anyone interested upon request).
Normally B-Trees are very good, but since the strings do not have a fixed
length we need an extra indirection to a string table, and that ruins the
performance (minimizing the number of seeks is crucial here).
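
To make the disk hash map idea concrete, here is a minimal sketch in Java. It is not the benchmarked code: the class name, the two-file layout (a fixed-size slot file plus an append-only string file) and the linear probing are all assumptions chosen for illustration. Each slot holds a fixed-width offset into the string file, so a lookup costs roughly one seek into the slots plus one seek into the string data per probe.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: names and on-disk layout are assumptions, not the code discussed above.
final class DiskStringInterner implements AutoCloseable {
    private static final int SLOT_BYTES = 8;  // each slot stores (string offset + 1); 0 means empty
    private final RandomAccessFile slots;     // fixed-size open-addressing hash table
    private final RandomAccessFile strings;   // append-only records: int length, then UTF-8 bytes
    private final int capacity;

    DiskStringInterner(File slotFile, File stringFile, int capacity) throws IOException {
        this.capacity = capacity;
        this.slots = new RandomAccessFile(slotFile, "rw");
        this.strings = new RandomAccessFile(stringFile, "rw");
        this.slots.setLength(0);
        this.strings.setLength(0);
        this.slots.write(new byte[capacity * SLOT_BYTES]); // explicitly zero-fill the table
    }

    // Convenience factory backed by temporary files (assumption, for demonstration only).
    static DiskStringInterner createTemp(int capacity) {
        try {
            File s = File.createTempFile("intern", ".slots");
            File d = File.createTempFile("intern", ".strings");
            s.deleteOnExit();
            d.deleteOnExit();
            return new DiskStringInterner(s, d, capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the offset of the string in the string file, appending it if it is new.
    long intern(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        try {
            int start = Math.floorMod(Arrays.hashCode(utf8), capacity);
            for (int probe = 0; probe < capacity; probe++) {
                long slotPos = (((long) start + probe) % capacity) * SLOT_BYTES;
                slots.seek(slotPos);
                long stored = slots.readLong();
                if (stored == 0) {                      // empty slot: append and claim it
                    long offset = strings.length();
                    strings.seek(offset);
                    strings.writeInt(utf8.length);
                    strings.write(utf8);
                    slots.seek(slotPos);
                    slots.writeLong(offset + 1);
                    return offset;
                }
                if (Arrays.equals(readAt(stored - 1), utf8)) {
                    return stored - 1;                  // already interned
                }
            }
            throw new IllegalStateException("hash table full");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Resolves an id previously returned by intern() back to its string.
    String lookup(long id) {
        try {
            return new String(readAt(id), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private byte[] readAt(long offset) throws IOException {
        strings.seek(offset);
        byte[] buf = new byte[strings.readInt()];
        strings.readFully(buf);
        return buf;
    }

    @Override
    public void close() {
        try {
            slots.close();
            strings.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real implementation would also need resizing, a persistent header and some caching; this sketch only shows why a fixed-width slot plus a separate string record keeps the seek count per lookup small.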

It's probably worth asking the Apache Lucene community for advice on this. If anyone has exhaustively tested the best ways to store and retrieve blocks of text on disk and in memory, under an Apache license, it'll be that lot!

(They have quite a lot of storage-related code that can be plugged in as needed, along with advice on what to use where, so we might be able to take a class or two that they recommend and use it for people with very large string tables.)

Nick
