On 2/25/2011 12:26 AM, Lokendra Singh wrote:
Hi all,

I am seeking some guidelines for directly converting an already existing index into a Lucene index. The index available to me is a set of <value1, value2> pairs, where each pair is:
< word ,  fileName >
i.e. a word as 'value1', and 'value2' being the fileName containing that word.

A word might appear in several fileNames, and the same file can contain multiple copies of a word. For example, the following index is possible:
< "my"  , "file1" >
< "you" , "file2" >
< "my",  "file2" >
< "my", "file1">

My actual problem is that the index available to me is very large, so I am a bit reluctant to create a 'Document' object for each file, because for that I would have to read through all the pairs first and store them in memory. Alternatively, I would have to 'update' the 'Document' object of a particular file while iterating through the pairs of my index, and this 'update', again, is a costly operation.

Please correct me if my understanding of Lucene is wrong, or suggest alternative approaches.

Regards
Lokendra


Er, sorry for the blank email, hit the wrong button!

There are basically two ways to do this:

1) Buffer everything in RAM and then write it all at once - this is probably the quickest way to do it, but the most resource intensive and prone to failure (an OOM will lose all work, for example).
2) Iterate through the list, collecting some number of values and then periodically committing them to the index.

There's not really any other way: you either write it out in chunks or you write it out all at once. However, there is some leeway in how you iterate through your old index. Iterating through the entire index and buffering everything in RAM and writing it all out at once is, like you said, probably prohibitively resource intensive. You could, on the other hand, iterate through the index and only collect values for a particular file, then commit that, then iterate again. I would imagine this is a much slower approach, but it will be less memory intensive.
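
A rough, untested sketch of that per-file approach might look like the following. I'm assuming a Lucene 3.x-era API here (IndexWriterConfig, Field.Index constants); the List<String[]> of pairs, the "file" and "contents" field names, and the PerFileConverter class are just illustrative stand-ins for however your old index is actually read:

import java.io.File;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PerFileConverter {

    // 'pairs' stands in for however you read the old index:
    // pairs.get(i)[0] = word, pairs.get(i)[1] = fileName.
    public static void convert(List<String[]> pairs, File indexDir) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);

        // First pass: collect the distinct file names.
        Set<String> files = new LinkedHashSet<String>();
        for (String[] pair : pairs) {
            files.add(pair[1]);
        }

        // One extra pass per file: gather that file's words, write one Document, commit.
        for (String file : files) {
            StringBuilder contents = new StringBuilder();
            for (String[] pair : pairs) {
                if (pair[1].equals(file)) {
                    contents.append(pair[0]).append(' ');
                }
            }
            Document doc = new Document();
            doc.add(new Field("file", file, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("contents", contents.toString(), Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.commit(); // only one file's words are ever buffered at a time
        }
        writer.close();
    }
}

Each commit only ever holds one file's words in memory, at the cost of re-scanning the pair list once per distinct file.
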

Personally, the way I'd approach this problem: I'd iterate through the old index in one pass. Every time I encountered a new file, I'd create a new Document and store it somewhere (something trivial like a Map<String, Document> where the key is the filename). I'd also ensure that the Documents have a field called "file" so that I could easily query them later. Every iteration, I'd continue to add to the Documents, and every n iterations I'd commit all the Documents to the index (presumably calling IndexWriter.updateDocument). By tuning the number of iterations that triggers an index write, you can adjust the balance between RAM usage and CPU/IO time spent. n=1 would obviously be the most CPU/IO intensive, n=inf would be the most RAM intensive, and the "sweet spot" for your requirements is very probably somewhere between those two points.
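
For what it's worth, a rough sketch of that scheme could look like this, again assuming a Lucene 3.x-era API; the Iterator<String[]> over the pairs, the "contents" field, and the flush-every-n parameter are my own illustrative assumptions, not anything prescribed by Lucene:

import java.io.File;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BufferedConverter {

    // 'pairs' stands in for a streaming read of the old index:
    // next()[0] = word, next()[1] = fileName.
    public static void convert(Iterator<String[]> pairs, File indexDir, int n) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35));
        IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), cfg);

        // One Document per file, keyed by file name, built up as pairs stream past.
        Map<String, Document> docs = new HashMap<String, Document>();
        long seen = 0;

        while (pairs.hasNext()) {
            String[] pair = pairs.next();
            String word = pair[0], fileName = pair[1];

            Document doc = docs.get(fileName);
            if (doc == null) {
                doc = new Document();
                doc.add(new Field("file", fileName, Field.Store.YES, Field.Index.NOT_ANALYZED));
                docs.put(fileName, doc);
            }
            // Multiple "contents" fields on one Document are indexed together.
            doc.add(new Field("contents", word, Field.Store.NO, Field.Index.ANALYZED));

            // Every n pairs, push the current state of every buffered Document to disk.
            // updateDocument first deletes any earlier version with the same "file" term,
            // so re-writing a fuller Document in a later flush does not leave duplicates.
            if (++seen % n == 0) {
                flush(writer, docs);
            }
        }
        flush(writer, docs);
        writer.close();
    }

    private static void flush(IndexWriter writer, Map<String, Document> docs) throws Exception {
        for (Map.Entry<String, Document> entry : docs.entrySet()) {
            writer.updateDocument(new Term("file", entry.getKey()), entry.getValue());
        }
        writer.commit();
    }
}

If your old index happens to list all the pairs for a file contiguously, you could also drop a finished Document from the map after it's flushed to cap memory further; the sketch above just mirrors the scheme described.
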

How big is this old index, by the way? Have you run tests to confirm that the memory limit or CPU cost of either method is actually a problem? I think you may be surprised at the speeds you get, if you haven't run tests already.

Thanks,
Eddie
