Eddie and Aditya: Thanks a lot for your suggestions! The collect-and-commit approach looks good.
@Eddie: My old index is on the order of ~1 million pairs. Though I haven't run any such tests yet, I will be running them shortly for analysis.

Regards,
Lokendra

On Fri, Feb 25, 2011 at 2:23 PM, Edward Drapkin <edwa...@wolfram.com> wrote:

> On 2/25/2011 12:26 AM, Lokendra Singh wrote:
>
>> Hi all,
>>
>> I am seeking some guidelines for directly converting an already
>> existing index into a Lucene index. The index available to me is a set
>> of <value1, value2> pairs, where each pair is:
>>
>> < word, fileName >
>>
>> i.e. a word as 'value1', and 'value2' being the fileName containing
>> that word.
>>
>> A word might appear in several fileNames, and the same file can
>> contain multiple copies of a word. For example, the following index is
>> possible:
>>
>> < "my", "file1" >
>> < "you", "file2" >
>> < "my", "file2" >
>> < "my", "file1" >
>>
>> My actual problem is that the index available to me is very large, so
>> I am a bit reluctant to create a 'Document' object for each file: I
>> would either have to read through all the pairs first and store them
>> in memory, or 'update' the 'Document' object of a particular file
>> while iterating through the pairs of my index, and this 'update',
>> again, is a costly operation.
>>
>> Please correct me if my understanding of Lucene is wrong, or suggest
>> alternative approaches.
>>
>> Regards,
>> Lokendra
>
> Er, sorry for the blank email, hit the wrong button!
>
> There are basically two ways to do this:
>
> 1) Buffer everything in RAM and then write it all out at once - this is
> probably the quickest way to do it, but the most resource intensive and
> prone to failure (an OOM will lose all work, for example).
>
> 2) Iterate through the list, collecting some number of values and
> periodically committing them to the index.
>
> There's not really any other way: you either write it out in chunks or
> you write it all out at once. However, there is some leeway in how you
> iterate through your old index. Iterating through the entire index,
> buffering everything in RAM, and writing it all out at once is, as you
> said, probably prohibitively resource intensive. You could, on the
> other hand, iterate through the index collecting values for only one
> particular file, commit that, then iterate again. I would imagine this
> is a much slower approach, but it will be less memory intensive.
>
> Personally, the way I'd approach this problem is to iterate through the
> old index in one pass. Every time I encountered a new file, I'd create
> a new Document and store it somewhere (something trivial like a
> Map<String, Document> where the key is the filename). I'd also ensure
> that the Documents have a field called "file" so that I could easily
> query them later. Every iteration, I'd continue to add to the
> Documents, and every n iterations I'd commit all the Documents to the
> index (ostensibly by calling IndexWriter.updateDocument); a sketch of
> this approach appears at the end of this thread. By tuning the number
> of iterations that triggers an index write, you can adjust the balance
> between RAM usage and CPU/IO time spent: n=1 would obviously be the
> most CPU/IO intensive, n=inf would be the most RAM intensive, and the
> "sweet spot" for your requirements is very probably somewhere between
> those two points.
>
> How big is this old index, by the way? Have you run tests to confirm
> that the memory limit or CPU cost of either method is actually a
> problem? I think you may be surprised at the speeds you get, if you
> haven't run tests already.
>
> Thanks,
> Eddie
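
For concreteness, here is a minimal sketch of the collect-and-commit approach Eddie describes, written against the Lucene 3.0-era API that was current at the time of this thread. The Pair class, the field names "file" and "word", and the flush interval n are illustrative assumptions, not anything taken from the original messages.

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class PairIndexConverter {

    // Hypothetical holder for one <word, fileName> entry of the old index.
    static class Pair {
        final String word;
        final String fileName;
        Pair(String word, String fileName) {
            this.word = word;
            this.fileName = fileName;
        }
    }

    public static void convert(Iterable<Pair> pairs, File indexDir, int n)
            throws IOException {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(indexDir),
                new StandardAnalyzer(Version.LUCENE_30),
                true,                                  // create a new index
                IndexWriter.MaxFieldLength.UNLIMITED);

        // One buffered Document per file, keyed by filename.
        Map<String, Document> buffer = new HashMap<String, Document>();
        int seen = 0;

        for (Pair p : pairs) {
            Document doc = buffer.get(p.fileName);
            if (doc == null) {
                doc = new Document();
                // Stored, un-analyzed key field so the document can be
                // queried (and replaced) by filename later.
                doc.add(new Field("file", p.fileName,
                        Field.Store.YES, Field.Index.NOT_ANALYZED));
                buffer.put(p.fileName, doc);
            }
            // A Document may hold many fields with the same name, so a
            // repeated word simply becomes another "word" field.
            doc.add(new Field("word", p.word,
                    Field.Store.NO, Field.Index.NOT_ANALYZED));

            // Every n pairs, write the buffered Documents out and free RAM.
            if (++seen % n == 0) {
                flush(writer, buffer);
            }
        }
        flush(writer, buffer);  // write out whatever is left
        writer.close();
    }

    private static void flush(IndexWriter writer, Map<String, Document> buffer)
            throws IOException {
        for (Document doc : buffer.values()) {
            // updateDocument deletes any existing document whose "file"
            // term matches, then adds the new one.
            writer.updateDocument(new Term("file", doc.get("file")), doc);
        }
        buffer.clear();
        writer.commit();
    }
}

One caveat with this sketch: because updateDocument replaces any earlier document with the same "file" term, a file whose pairs straddle two flushes loses the words from the earlier batch. The sketch therefore assumes the old index's pairs are grouped by file; if they are not, the buffer would need to retain partially built Documents across flushes, trading RAM for correctness, which is exactly the balance Eddie describes tuning with n.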