Greetings All.

I'd like to index data corresponding to different versions of the same file. These files consist of PDF documents, Word documents, and the like. To ensure that no information is lost, I'd like to create a new Lucene document for every version of (or change to) a file. Each version of a file will have text added and removed; however, there is likely to be a high degree of data duplication across the different versions. Assuming this indexed data is largely tokenized, to what extent will Lucene compress it? Will it take into account that the data already exists in the index? I am worried about our index size growing too large under this strategy (i.e. creating a new Lucene document for every version of a file).
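
For concreteness, here is a rough sketch of the indexing strategy I have in mind. The field names ("path", "version", "contents") and the class are just placeholders I made up for illustration, not part of any existing code:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class VersionIndexer {

    // One Lucene document per version of a file.
    static void indexVersion(IndexWriter writer, String path,
                             int version, String extractedText) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new StringField("version", Integer.toString(version), Field.Store.YES));
        // Tokenized body text; Field.Store.NO so the (largely duplicated)
        // text is not also stored verbatim in the index.
        doc.add(new TextField("contents", extractedText, Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("version-index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            indexVersion(writer, "/docs/report.pdf", 1, "first version text ...");
            indexVersion(writer, "/docs/report.pdf", 2, "second version text with edits ...");
        }
    }
}

So every version goes in as its own document, and the "contents" field will carry mostly the same tokens from one version to the next. My question is how much of that repetition the index itself will absorb.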

Many thanks for your consideration.

Jamie




