Is there a Trie-based term index? Seems like this would be smaller, and very fast on non-leading wildcards.

On 07/09/2013 02:34 PM, Uwe Schindler wrote:
Hi,

You can replace the terms with their hashes directly in the analyzer chain. Just
write a custom TermToBytesRefAttribute that hashes the term to a
constant-length byte[] (using an AttributeFactory)! :-) This would give you all
the features of hashed, constant-length terms, but you would lose prefix and
wildcard queries. In fact, NumericTokenStream is doing this for numeric values!
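
As a rough illustration of hashing a term to a constant-length byte[], here is
a minimal standalone Java sketch. It uses only the JDK, not the Lucene
attribute API; the class name HashedTerm, the helper hashTerm, and the choice
of SHA-256 truncated to 8 bytes are assumptions made for illustration. A
truncated digest is not a perfect hash, so collisions remain possible in
principle.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

class HashedTerm {
    // Hypothetical key width; any fixed size up to 32 bytes works with SHA-256.
    static final int KEY_LENGTH = 8;

    // Maps a term of any length to a fixed-length byte[] key.
    static byte[] hashTerm(String term) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(term.getBytes(StandardCharsets.UTF_8));
            // Truncate the 32-byte digest to the constant key length.
            return Arrays.copyOf(digest, KEY_LENGTH);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] a = hashTerm("lucene");
        byte[] b = hashTerm("lucene");
        byte[] c = hashTerm("lucenes");
        // Key length is the same regardless of term length.
        System.out.println(a.length + " " + c.length);
        // The same term always produces the same key.
        System.out.println(Arrays.equals(a, b));
        // A longer term sharing a prefix produces an unrelated key.
        System.out.println(Arrays.equals(a, c));
    }
}
```

Equal terms map to equal keys, so exact-match lookups keep working. The key
for a prefix says nothing about the keys of longer terms, which is why prefix
and wildcard queries are lost.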

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-----Original Message-----
From: Adrien Grand [mailto:jpou...@gmail.com]
Sent: Tuesday, July 09, 2013 11:25 PM
To: java-user@lucene.apache.org
Subject: Re: posting list strings

Hi,

Lucene stores the string because it may need it to run prefix or range
queries. We don't have a hash-based terms dictionary right now but I know
some people wrote one since they don't need support for these queries, see
for instance the Earlybird paper[1]. Then if you can find a perfect hashing
function, you can just replace your terms by their hash.
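
To make the trade-off above concrete, here is a small sketch that contrasts a
sorted dictionary keyed by the term string with a hash-keyed one. It is a toy
model and not how Lucene stores terms; the class TermsDictionarySketch and its
methods are hypothetical, and String.hashCode() stands in for the hash
function even though it is not a perfect hash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TermsDictionarySketch {
    // Sorted dictionary: the term string itself is the key.
    private final TreeMap<String, int[]> sorted = new TreeMap<>();
    // Hash-based dictionary: only the hash is kept, the term string is gone.
    private final Map<Integer, int[]> hashed = new HashMap<>();

    void add(String term, int... docIds) {
        sorted.put(term, docIds);
        // Illustration only: String.hashCode() can collide.
        hashed.put(term.hashCode(), docIds);
    }

    // Prefix query: walk the key range [prefix, prefix + '\uffff').
    List<String> prefixQuery(String prefix) {
        return new ArrayList<>(
            sorted.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    // Exact lookup is all the hash-based dictionary can answer.
    int[] exactLookup(String term) {
        return hashed.get(term.hashCode());
    }

    public static void main(String[] args) {
        TermsDictionarySketch dict = new TermsDictionarySketch();
        dict.add("lucene", 1, 2);
        dict.add("lucid", 3);
        dict.add("solr", 4);
        // Terms starting with "luc", found through the sorted keys.
        System.out.println(dict.prefixQuery("luc"));
        // Exact term found through its hash.
        System.out.println(dict.exactLookup("solr")[0]);
        // A bare prefix has no entry in the hash-based dictionary.
        System.out.println(dict.exactLookup("luc"));
    }
}
```

The sorted map answers the prefix query by scanning a contiguous key range,
which needs the original strings in order. The hash map has no such ordering,
so it can only serve exact-term lookups.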

[1] http://www.umiacs.umd.edu/~jimmylin/publications/Busch_etal_ICDE2012.pdf

--
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
