I noticed the following code, which builds the docMap array in 
SegmentMergeInfo.java for the case where some documents have been deleted 
from an index:

    // build array which maps document numbers around deletions 
    if (reader.hasDeletions()) {
      int maxDoc = reader.maxDoc();
      docMap = new int[maxDoc];
      int j = 0;
      for (int i = 0; i < maxDoc; i++) {
        if (reader.isDeleted(i))
          docMap[i] = -1;
        else
          docMap[i] = j++;
      }
    }

For a very large index in which some documents get deleted or replaced, this 
requires a lot of memory: an int[] for 100 million documents takes 
4 bytes x 100,000,000 = ~381 MB. Is there any reason why it was implemented 
this way?

It seems like this could instead use a much smaller array that tracks only 
the deleted document numbers. The new document number could still be computed 
efficiently from that smaller array, e.g. by binary-searching a sorted list 
of deleted document numbers and subtracting the count of deletions that fall 
before the document. Has anyone else done this, or has such a change been 
considered for the Lucene code?
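To make the idea concrete, here is a minimal sketch of what I have in mind. 
The class name DeletedDocMap and its methods are hypothetical, not anything 
in the Lucene source; it just keeps a sorted array of deleted document 
numbers and remaps via binary search:

    import java.util.Arrays;

    public class DeletedDocMap {
      // sorted list of deleted document numbers; size is proportional
      // to the number of deletions, not to maxDoc
      private final int[] deletedDocs;

      public DeletedDocMap(int[] deletedDocs) {
        this.deletedDocs = deletedDocs.clone();
        Arrays.sort(this.deletedDocs);
      }

      /** Returns the remapped document number, or -1 if the doc is deleted. */
      public int map(int doc) {
        int idx = Arrays.binarySearch(deletedDocs, doc);
        if (idx >= 0)
          return -1;                      // the document itself is deleted
        int deletionsBefore = -idx - 1;   // insertion point = count of
                                          // deleted docs below this one
        return doc - deletionsBefore;
      }
    }

With deletions {2, 5}, map(0) returns 0, map(3) returns 2, map(2) returns -1, 
and map(6) returns 4, matching what the full docMap array would produce, at 
the cost of an O(log d) lookup (d = number of deletions) instead of O(1).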

Lokesh
