On 2/25/2011 12:26 AM, Lokendra Singh wrote:
Hi all,

I am seeking some guidelines for directly converting an already existing index into a Lucene index. The index available to me is a set of <value1, value2> pairs, where each pair is:
< word ,  fileName >
i.e. a word as 'value1', and 'value2' being the fileName containing that word.

A word might appear in several fileNames, and the same file can contain multiple copies of a word. For example, the following index is possible:
< "my"  , "file1" >
< "you" , "file2" >
< "my",  "file2" >
< "my", "file1">

My actual problem is that the index available to me is very large, so I am a bit reluctant to create a 'Document' object for each file, because for that I would have to read through all the pairs first and store them in memory. Alternatively, I would have to 'update' the 'Document' object of a particular file while iterating through the pairs of my index, and this 'update', again, is a costly operation.

Please correct me if my understanding of Lucene is wrong, or suggest alternative approaches.

Regards
Lokendra


Er, sorry for the blank email, hit the wrong button!

There are basically two ways to do this:

1) Buffer everything in RAM and then write it all at once - this is probably the quickest way to do it, but the most resource intensive and prone to failure (an OOM will lose all work, for example).
2) Iterate through the list, collecting some number of values and then periodically committing them to the index.

There's not really any other way: you either write it out in chunks or you write it out all at once. However, there is some leeway in how you iterate through your old index. Iterating through the entire index and buffering everything in RAM and writing it all out at once is, like you said, probably prohibitively resource intensive. You could, on the other hand, iterate through the index and only collect values for a particular file, then commit that, then iterate again. I would imagine this is a much slower approach, but it will be less memory intensive.
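
A rough, untested sketch of that per-file approach might look like the following. I'm assuming a Lucene 3.x-era API here (IndexWriterConfig, Field.Index constants); the List<String[]> of pairs, the "file" and "contents" field names, and the PerFileConverter class are just illustrative stand-ins for however your old index is actually read:

import java.io.File;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerFileConverter {

    // 'pairs' stands in for however you read the old index:
    // pairs.get(i)[0] = word, pairs.get(i)[1] = fileName.
    public static void convert(List<String[]> pairs, File indexDir) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);

        // First pass: collect the distinct file names.
        Set<String> files = new LinkedHashSet<String>();
        for (String[] pair : pairs) {
            files.add(pair[1]);
        }

        // One extra pass per file: gather that file's words, write one Document, commit.
        for (String file : files) {
            StringBuilder contents = new StringBuilder();
            for (String[] pair : pairs) {
                if (pair[1].equals(file)) {
                    contents.append(pair[0]).append(' ');
                }
            }
            Document doc = new Document();
            doc.add(new Field("file", file, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("contents", contents.toString(), Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.commit(); // only one file's words are ever buffered at a time
        }
        writer.close();
    }
}

Each commit only ever holds one file's words in memory, at the cost of re-scanning the pair list once per distinct file.
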

Personally, the way I'd approach this problem: I'd iterate through the old index in one pass. Every time I encountered a new file, I'd create a new Document and store it somewhere (something trivial like a Map<String, Document> where the key is the filename). I'd also ensure that the Documents have a field called "file" so that I could easily query them later. Every iteration, I'd continue to add to the Documents, and every n iterations I'd commit all the Documents to the index (presumably calling IndexWriter.updateDocument). By tuning the number of iterations that triggers an index write, you can adjust the balance between RAM usage and CPU/IO time spent. n=1 would obviously be the most CPU/IO intensive, n=inf would be the most RAM intensive, and the "sweet spot" for your requirements is very probably somewhere between those two points.
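
For what it's worth, a rough sketch of that scheme could look like this, again assuming a Lucene 3.x-era API; the Iterator<String[]> over the pairs, the "contents" field, and the flush-every-n parameter are my own illustrative assumptions, not anything prescribed by Lucene:

import java.io.File;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BufferedConverter {

    // 'pairs' stands in for a streaming read of the old index:
    // next()[0] = word, next()[1] = fileName.
    public static void convert(Iterator<String[]> pairs, File indexDir, int n) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);

        // One Document per file, keyed by file name, built up as pairs stream past.
        Map<String, Document> docs = new HashMap<String, Document>();
        long seen = 0;

        while (pairs.hasNext()) {
            String[] pair = pairs.next();
            String word = pair[0], fileName = pair[1];

            Document doc = docs.get(fileName);
            if (doc == null) {
                doc = new Document();
                doc.add(new Field("file", fileName, Field.Store.YES, Field.Index.NOT_ANALYZED));
                docs.put(fileName, doc);
            }
            // Multiple "contents" fields on one Document are indexed together.
            doc.add(new Field("contents", word, Field.Store.NO, Field.Index.ANALYZED));

            // Every n pairs, push the current state of every buffered Document to disk.
            // updateDocument first deletes any earlier version with the same "file" term,
            // so re-writing a fuller Document in a later flush does not leave duplicates.
            if (++seen % n == 0) {
                flush(writer, docs);
            }
        }
        flush(writer, docs);
        writer.close();
    }

    private static void flush(IndexWriter writer, Map<String, Document> docs) throws Exception {
        for (Map.Entry<String, Document> entry : docs.entrySet()) {
            writer.updateDocument(new Term("file", entry.getKey()), entry.getValue());
        }
        writer.commit();
    }
}

If your old index happens to list all the pairs for a file contiguously, you could also drop a finished Document from the map after it's flushed to cap memory further; the sketch above just mirrors the scheme described.
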

How big is this old index, by the way? Have you run tests to confirm that the memory limit or CPU cost of either method is actually a problem? I think you may be surprised at the speeds you get, if you haven't run tests already.

Thanks,
Eddie
