Greetings All.

I'd like to index data corresponding to different versions of the same file. These files consist of PDF documents, Word documents, and the like. To ensure that no information is lost, I'd like to create a new Lucene document for every version of (or change to) a file. Each version of a file will have text added and removed; however, there is likely to be a high degree of data duplication across the different versions. Assuming this indexed data is largely tokenized, to what extent will Lucene compress it? Will it take into account that the data already exists in the index? I am worried about our index size growing too large under this strategy (i.e. creating a new Lucene document for every version of a file).
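
For concreteness, here is a rough sketch of the indexing strategy I have in mind. The field names ("path", "version", "contents") and the class are just placeholders I made up for illustration, not part of any existing code:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class VersionIndexer {

    // One Lucene document per version of a file.
    static void indexVersion(IndexWriter writer, String path,
                             int version, String extractedText) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("path", path, Field.Store.YES));
        doc.add(new StringField("version", Integer.toString(version), Field.Store.YES));
        // Tokenized body text; Field.Store.NO so the (largely duplicated)
        // text is not also stored verbatim in the index.
        doc.add(new TextField("contents", extractedText, Field.Store.NO));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("version-index"));
             IndexWriter writer = new IndexWriter(dir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            indexVersion(writer, "/docs/report.pdf", 1, "first version text ...");
            indexVersion(writer, "/docs/report.pdf", 2, "second version text with edits ...");
        }
    }
}

So every version goes in as its own document, and the "contents" field will carry mostly the same tokens from one version to the next. My question is how much of that repetition the index itself will absorb.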

Many thanks for your consideration.

Jamie




