I have one more question about term vector positions and offsets being 
preserved. My co-worker is working on updating the documents in an index with a 
field that contains a numerical value derived from the term frequencies and 
inverse document frequencies of terms in the document. His first pass at doing 
this calculates these values, writes them along with document ids to a text 
file and then updates the documents by reading lines from the file, searching 
for the document that contains the id, adding the field to the document, and 
replacing the document in the index. Some of the fields in these documents have 
term vectors with offsets and positions. After the revised document is updated 
in the index, those fields' term vector offsets and positions are still found. 
After closing the searcher, reader and writer that are used in this process, 
the fields that have term vectors no longer have positions and offsets in them. 
His code looks like this:

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36, _analyzer);
IndexWriter writer = new IndexWriter(indexDir, config);
IndexReader reader = IndexReader.open(writer, true);
IndexSearcher searcher = new IndexSearcher(reader);

while ((s = in.readLine()) != null) {
    String[] tokens = s.split(",");
    float fieldValue = Float.parseFloat(tokens[1].trim());
    NumericField nField = new NumericField("freqVal", Field.Store.YES, true);
    nField.setFloatValue(fieldValue);
    String docId = tokens[0].trim();
    Term docIdTerm = new Term("DocId", docId);
    TermQuery query = new TermQuery(docIdTerm);
    TopDocs hits = searcher.search(query, 2);
  
    if (hits.scoreDocs.length != 1) {
        throw new Exception("Unexpected number of documents in index with docId = "
                + docId);
    }
    int docNum = hits.scoreDocs[0].doc;
    Document doc = searcher.doc(docNum);
    doc.add(nField);
    writer.updateDocument(docIdTerm, doc);
}
displayTermVectorInfo(indexDir);   // for debugging
writer.close();
displayTermVectorInfo(indexDir);   // for debugging
reader.close();
searcher.close();

private static void displayTermVectorInfo(Directory dir)
        throws IOException, CorruptIndexException {
    IndexReader reader = null;

    try {
        reader = IndexReader.open(dir);

        for (int i = 0; i < reader.numDocs(); i++) {
            Document doc = reader.document(i);
            List<Fieldable> docFields = doc.getFields();

            for (Fieldable field : docFields) {
                TermFreqVector termFreqVector =
                        reader.getTermFreqVector(i, field.name());

                if (termFreqVector != null
                        && termFreqVector instanceof TermPositionVector) {
                    TermPositionVector termPositionVector =
                            (TermPositionVector) termFreqVector;
                    System.out.println("Field " + field.name());

                    for (int j = 0; j < termFreqVector.size(); j++) {
                        TermVectorOffsetInfo[] offsets =
                                termPositionVector.getOffsets(j);
                        
                        for (TermVectorOffsetInfo offsetInfo : offsets) {
                            System.out.println("offset: "
                                    + offsetInfo.getStartOffset() + " "
                                    + offsetInfo.getEndOffset());
                        }
                    }
                    for (int k = 0; k < termFreqVector.size(); k++) {
                        int[] positions =
                                termPositionVector.getTermPositions(k);

                        for (int position : positions) {
                            System.out.println("position: " + position);
                        }
                    }
                }
            }
        }
    } finally {
        if (reader != null) {
            reader.close();
        }
    }
}

The first time displayTermVectorInfo is called, it displays offsets and 
positions for the fields that have term vectors with offsets and positions. The 
second time it is called, it doesn't display anything because none of the term 
vectors satisfy termFreqVector instanceof TermPositionVector. Is it supposed to 
work this way? What is it about closing the writer that alters the term vectors 
in the affected fields? Is there a way to add a field to the documents in an 
index in which this doesn't occur?
Thanks,
Mike


-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, July 20, 2012 5:59 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

On Fri, Jul 20, 2012 at 8:24 PM, Mike O'Leary <tmole...@uw.edu> wrote:
> Hi Robert,
> I'm not trying to determine whether a document has term vectors, I'm trying 
> to determine whether the term vectors that are in the index have offsets and 
> positions stored.

Right: what I'm trying to tell you is that storing offsets and positions is 
not an index-wide setting for a field: it's per-document.

I think all the tools you are using to check these values are not doing it 
correctly:
1. DumpIndex is wrongly using values from the Document returned by 
IndexReader.document(), but that doesn't and never did retrieve these values 
(it would be 2 extra disk seeks per document to figure out the term vector 
flags).
2. I haven't looked at Luke, but it's probably printing the "global" bits from 
FieldInfos. It used to be that we wrote some bits for these options. I don't 
even know what the purpose was: since these options can be turned on or off 
at a per-document level, index-wide bits make no sense.
Because of this we stopped writing these bits in 3.6 (we only write into 
FieldInfos whether the field has any term vectors at all), and that's probably 
what's confusing you there.

Again, if you really want to validate that a specific document has 
offsets/positions in its term vectors, you need to check that specific document 
with IndexReader.getTermFreqVector. There is no other way, since this can be 
controlled on a per-document basis for a field.


--
lucidimagination.com

