We are using Lucene 3.6 to perform incremental indexing, based on an algorithm we found on the web.
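Stripped of the Lucene calls, the algorithm spelled out in the steps that follow is a merge of two sorted UID sequences. The sketch below is only an illustration of that comparison, not the code we run: the class name UidMergeSketch, the Plan holder, and the sample UID strings are all hypothetical, and it assumes both lists are sorted with the same ordering (here String.compareTo).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class UidMergeSketch {

    /** Hypothetical result holder; not part of Lucene or of the original code. */
    public static final class Plan {
        public final List<String> toDelete = new ArrayList<>();
        public final List<String> toAdd = new ArrayList<>();
        public final List<String> unchanged = new ArrayList<>();
    }

    /**
     * Compares UIDs already in the index with UIDs of the files on disk.
     * Assumption: both lists are sorted by String.compareTo.
     */
    public static Plan plan(List<String> indexedUids, List<String> fileUids) {
        Plan plan = new Plan();
        Iterator<String> it = indexedUids.iterator();
        String current = it.hasNext() ? it.next() : null;

        for (String fileUid : fileUids) {
            // Indexed UIDs smaller than the file UID have no matching file: delete them.
            while (current != null && current.compareTo(fileUid) < 0) {
                plan.toDelete.add(current);
                current = it.hasNext() ? it.next() : null;
            }
            if (current != null && current.equals(fileUid)) {
                // Same path and timestamp: the file has not changed.
                plan.unchanged.add(fileUid);
                current = it.hasNext() ? it.next() : null;
            } else {
                // New file, or an updated file whose old UID was deleted above.
                plan.toAdd.add(fileUid);
            }
        }
        // Indexed UIDs left over belong to files that no longer exist.
        while (current != null) {
            plan.toDelete.add(current);
            current = it.hasNext() ? it.next() : null;
        }
        return plan;
    }

    public static void main(String[] args) {
        // Made-up UIDs of the form "path@timestamp", for illustration only.
        Plan p = plan(Arrays.asList("a@1", "b@1", "c@1"),
                      Arrays.asList("a@1", "b@2", "d@1"));
        System.out.println("delete=" + p.toDelete + " add=" + p.toAdd
                + " unchanged=" + p.unchanged);
    }
}
```

The merge only works if the index side and the file side agree on the sort order; if one side is ordered differently, unchanged files are treated as deleted and re-added.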
1. For each file that we index, we create a UID field to associate with it. The UID is calculated from the file path and the last-updated time.

2. When reindexing, we use the following lines to obtain a UID iterator that lets us iterate through the UIDs in alphabetical order:

    IndexReader reader = IndexReader.open(writer, true);
    TermEnum uidIter = reader.terms(new Term("uid", ""));

3. We then sort the files to be indexed and compare each file's UID with the UID returned by the UID iterator. If the UIDs are the same, the file has not changed. If the iterator's UID is less than the file's UID, the file associated with the iterator's UID has been deleted, so we delete it from the index. If the iterator's UID is greater than the file's UID, the file is either newly added or an old document that has been updated, so we add the document to the index.

Here is the code snippet for the algorithm:

    private void indexDirectory(File docDir, File catalogDir) {
        try {
            Directory dir = FSDirectory.open(catalogDir);
            boolean indexExists = IndexReader.indexExists(dir);
            IndexWriter writer = getIndexWriter(dir);
            IndexReader reader = null;
            TermEnum uidIter = null;
            if (indexExists) {
                reader = IndexReader.open(writer, true);
                uidIter = reader.terms(new Term("uid", "")); // init uid iterator
            }

            // 2: AddNewAndUpdatedDocs
            // Adds all new and updated (removed above) documents to the index
            updateDocumentIndexes(uidIter, writer, docDir, results);

            // Clean up index entries that were not iterated over: they belong to
            // files deleted from the file system but not yet removed from the index.
            if (uidIter != null) {
                cleanupIndexes(uidIter, writer, results);
            }

            writer.commit();
            if (indexExists) {
                uidIter.close();
                reader.close();
            }
            writer.close();
        } catch (IOException ex) {
            writeUserMessage(Level.ERROR, "Index failed for directory: " + docDir.getPath(), ex);
        }
    }

    private void updateDocumentIndexes(TermEnum uidIter, IndexWriter writer,
            File fileToBeIndexed, IndexingResults results) {
        try {
            if (uidIter != null) {
                String docUid = FileDocument.uid(fileToBeIndexed);

                // Delete stale documents whose UID sorts before this file's UID.
                while (uidIter.term() != null
                        && "uid".equals(uidIter.term().field())
                        && uidIter.term().text().compareTo(docUid) < 0) {
                    writer.deleteDocuments(uidIter.term());
                    uidIter.next();
                }

                if (uidIter.term() != null
                        && "uid".equals(uidIter.term().field())
                        && uidIter.term().text().compareTo(docUid) == 0) {
                    // Same UID: the file is unchanged.
                    uidIter.next();
                    results.incrementUnchangedFiles();
                } else {
                    if (uidIter.term() != null) {
                        if (isIndexableFile(fileToBeIndexed.getName())) {
                            Document doc = FileDocument.Document(fileToBeIndexed);
                            writer.addDocument(doc);
                            results.incrementIndexedFiles();
                        }
                    } else {
                        addDocument(writer, fileToBeIndexed, results);
                    }
                }
            } else {
                addDocument(writer, fileToBeIndexed, results);
            }
        } catch (IOException fnfe) {
            results.incrementErrors();
            fnfe.printStackTrace();
            logger.log(Level.ERROR, " Unable to process document at: " + fileToBeIndexed.getPath(), fnfe);
        } catch (Exception ex) {
            results.incrementErrors();
            ex.printStackTrace();
            logger.log(Level.ERROR, " Unable to process document at: " + fileToBeIndexed.getPath(), ex);
        }
    }

Now we are trying to upgrade to Lucene 4.7. The call reader.terms(new Term("uid", "")) is no longer supported in 4.7. I tried to work around it by following the Apache Lucene Migration Guide (http://lucene.apache.org/core/4_0_0/MIGRATE.html).
Instead of reader.terms(new Term("uid", "")), I used the following:

    Fields fields = MultiFields.getFields(reader);
    if (fields != null) {
        Terms terms = fields.terms("uid");
        if (terms != null) {
            uidIter = terms.iterator(null);
        }
    }

However, I found that the terms uidIter iterates over are no longer in alphabetical order, which breaks the algorithm. Is there any way to work around this?

Thank you!

--
View this message in context: http://lucene.472066.n3.nabble.com/Incremental-Indexing-in-Lucene-4-7-tp4126620.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.