Hi all, In a clustered environment I search the index from the web application. In the web application I am creating IndexReader on each request. is it expensive to do like this? I read somewhere in the web that try using the same reader as much as possible. Can i keep the initially created IndexReader in the session/application scopes and use the same for each request? Any other idea?
Viay On 5/3/10, Vijay Veeraraghavan <vijay.raghava...@gmail.com> wrote: > dear all, > > as replied below, does searching again for the document in the index > and if found skip the indexing else index it, is this not similar to > indexing all pdf documents once again, is not this overhead? As I am > not going to index the details of the pdf (so if an indexed pdf was > recreated i need not reindex it) but just the paths of the documents. > > Vijay > >>> Hey there, >>> >>> you might have to implement a some kind of unique identifier using an >>> indexed lucene field. When you are indexing you should fire a query with >>> the >>> uuid of your document (maybe the path to you pdf document) and check if >>> the >>> document is in the index already. You could also do a boolean query >>> combining UUID, timestamp and / or a hash value to see if the document >>> has >>> been changed. if so you can simply update the document by its UUID >>> (something like indexwriter.updateDocument(new Term("uuid", >>> value),document);) >>> >>> Unfortunately you have to implement this yourself but it should not be >>> that >>> much of a deal. >>> >>> simon >>> >>> On Mon, May 3, 2010 at 9:21 AM, Vijay Veeraraghavan < >>> vijay.raghava...@gmail.com> wrote: >>> >>>> Dear all, >>>> I am using lucene 3.0 to index the pdf reports that I generate >>>> dynamically. I index the pdf file name (without extension), file path >>>> and its absolute path as fields. I search with the file name without >>>> extension; it retrieves a list, as usually 2 or more files are present >>>> in the same name in different sub directories. As I create the index >>>> for the first time it updates, assuming 100 pdf files in different >>>> directories, the files meta info. If again I do indexing, while my >>>> report generator scheduler has the produced 500 more pdf files >>>> totaling to 600 files in different directories, I wish to index only >>>> the new files to the index. But presently it’s doing the whole thing >>>> again (600 files). How to implement this functionality? Think of the >>>> thousands of pdf files created on each run. >>>> >>>> P.S: I cannot keep the meta-info of generated pdf files in the java >>>> memory, as it exceeds thousands in a single run, and update the index >>>> looping this list. >>>> >>>> new IndexWriter(FSDirectory.open(this.indexDir), new StandardAnalyzer( >>>> Version.LUCENE_CURRENT), true, >>>> >>>> IndexWriter.MaxFieldLength.LIMITED); >>>> >>>> is the boolean parameter is for this purpose? Please guide me. >>>> >>>> -- >>>> Thanks >>>> Vijay Veeraraghavan >>>> >>>> >>>> >>>> -- >>>> Thanks & Regards >>>> Vijay Veeraraghavan >>>> >>>> --------------------------------------------------------------------- >>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >>>> For additional commands, e-mail: java-user-h...@lucene.apache.org >>>> >>>> >>> >> >> >> -- >> Thanks & Regards >> Vijay Veeraraghavan >> > > > -- > Thanks & Regards > Vijay Veeraraghavan > -- Thanks & Regards Vijay Veeraraghavan --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org