Hi Robert,
I put together the following two small applications to isolate the problem I 
am having from my own software and any bugs it may contain. One of them, 
CreateTestIndex, comes with the Lucene in Action book's source code, which you 
can download from Manning Publications. I changed it slightly: I removed a 
custom analyzer that is irrelevant to what I am looking at, fixed a few 
warnings about deprecated functions, and added a loop that writes each field's 
name and its TermVector, offset and position settings to the console.

The other application is called DumpIndex; I got it from a web site about six 
months ago. I changed a few lines to get rid of deprecated-function warnings 
and added the same field-information output line to it.

What I am seeing is this: when I run CreateTestIndex, at the point where the 
fields have just been created, added to a document, and are about to be added 
to the index, the fields specified with Field.TermVector.WITH_POSITIONS_OFFSETS 
correctly report true for field.isTermVectorStored(), 
field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector(). 
When I run DumpIndex on the resulting index, those same fields report true for 
field.isTermVectorStored() and false for the other two methods.
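
As a further sanity check on what was actually written to the index (rather 
than on the flags of fields read back from stored documents), the term vectors 
can be inspected directly through IndexReader. This is only a rough sketch 
against the 3.x API; the class name, the index-path argument and the default 
field name "title" are placeholders:

```java
package myLucene;

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.store.FSDirectory;

// Rough sketch: reports, per document, whether the given field's term vector
// was written with positions and offsets. Class/field names are placeholders.
public class CheckTermVectors {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    String field = (args.length > 1 ? args[1] : "title");
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;
      TermFreqVector tfv = reader.getTermFreqVector(i, field);
      if (tfv == null) {
        System.out.println("doc " + i + ": no term vector for \"" + field + "\"");
      } else if (tfv instanceof TermPositionVector && tfv.size() > 0) {
        TermPositionVector tpv = (TermPositionVector) tfv;
        // getOffsets(termIndex) returns null when offsets were not stored
        System.out.println("doc " + i + ": positions stored, offsets "
            + (tpv.getOffsets(0) != null ? "stored" : "not stored"));
      } else {
        System.out.println("doc " + i + ": term vector only, no positions/offsets");
      }
    }
    reader.close();
  }
}
```

This reads the vector data itself instead of trusting the Field flags, so it 
should show whether positions and offsets really made it to disk.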
Thanks,
Mike

This is the source code for CreateTestIndex:

////////////////////////////////////////////////////////////////////////////////
package myLucene;

/**
 * Copyright Manning Publications Co.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
*/

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;
import java.util.Date;
import java.util.List;
import java.util.ArrayList;
import java.text.ParseException;

public class CreateTestIndex {
  
  public static Document getDocument(String rootDir, File file) throws IOException {
    Properties props = new Properties();
    props.load(new FileInputStream(file));

    Document doc = new Document();

    // category comes from relative path below the base directory
    String category = file.getParent().substring(rootDir.length());    //1
    category = category.replace(File.separatorChar, '/');              //1

    String isbn = props.getProperty("isbn");         //2
    String title = props.getProperty("title");       //2
    String author = props.getProperty("author");     //2
    String url = props.getProperty("url");           //2
    String subject = props.getProperty("subject");   //2

    String pubmonth = props.getProperty("pubmonth"); //2

    System.out.println(title + "\n" + author + "\n" + subject + "\n" + pubmonth +
                       "\n" + category + "\n---------");

    doc.add(new Field("isbn",                     // 3
                      isbn,                       // 3
                      Field.Store.YES,            // 3
                      Field.Index.NOT_ANALYZED)); // 3
    doc.add(new Field("category",                 // 3
                      category,                   // 3
                      Field.Store.YES,            // 3
                      Field.Index.NOT_ANALYZED)); // 3
    doc.add(new Field("title",                    // 3
                      title,                      // 3
                      Field.Store.YES,            // 3
                      Field.Index.ANALYZED,       // 3
                      Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
    doc.add(new Field("title2",                   // 3
                      title.toLowerCase(),        // 3
                      Field.Store.YES,            // 3
                      Field.Index.NOT_ANALYZED_NO_NORMS,   // 3
                      Field.TermVector.WITH_POSITIONS_OFFSETS));  // 3

    // split multiple authors into unique field instances
    String[] authors = author.split(",");            // 3
    for (String a : authors) {                       // 3
      doc.add(new Field("author",                    // 3
                        a,                           // 3
                        Field.Store.YES,             // 3
                        Field.Index.NOT_ANALYZED,    // 3
                        Field.TermVector.WITH_POSITIONS_OFFSETS));   // 3
    }

    doc.add(new Field("url",                               // 3
                      url,                                 // 3
                      Field.Store.YES,                     // 3
                      Field.Index.NOT_ANALYZED_NO_NORMS)); // 3
    doc.add(new Field("subject",                     // 3  //4
                      subject,                       // 3  //4
                      Field.Store.YES,               // 3  //4
                      Field.Index.ANALYZED,          // 3  //4
                      Field.TermVector.WITH_POSITIONS_OFFSETS)); // 3  //4

    doc.add(new NumericField("pubmonth",          // 3
                             Field.Store.YES,     // 3
                             true).setIntValue(Integer.parseInt(pubmonth)));   // 3

    Date d; // 3
    try { // 3
      d = DateTools.stringToDate(pubmonth); // 3
    } catch (ParseException pe) { // 3
      throw new RuntimeException(pe); // 3
    }                                             // 3
    doc.add(new NumericField("pubmonthAsDay")      // 3
                 .setIntValue((int) (d.getTime()/(1000*3600*24))));   // 3

    for(String text : new String[] {title, subject, author, category}) {   // 3 // 5
      doc.add(new Field("contents", text,                             // 3 // 5
                        Field.Store.NO, Field.Index.ANALYZED,         // 3 // 5
                        Field.TermVector.WITH_POSITIONS_OFFSETS));    // 3 // 5
    }

    List<Fieldable> fields = doc.getFields();
    
    for (Fieldable field : fields) {
        System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
                        field.isStoreOffsetWithTermVector() + " " +
                        field.isStorePositionWithTermVector());
    }
    return doc;
  }

  private static void findFiles(List<File> result, File dir) {
    for(File file : dir.listFiles()) {
      if (file.getName().endsWith(".properties")) {
        result.add(file);
      } else if (file.isDirectory()) {
        findFiles(result, file);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    String dataDir = args[0];
    String indexDir = args[1];
    List<File> results = new ArrayList<File>();
    findFiles(results, new File(dataDir));
    System.out.println(results.size() + " books to index");
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
        new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter w = new IndexWriter(dir, config);
    for(File file : results) {
      Document doc = getDocument(dataDir, file);
      w.addDocument(doc);
    }
    w.close();
    dir.close();
  }
}

/*
  #1 Get category
  #2 Pull fields
  #3 Add fields to Document instance
  #4 Flag subject field
  #5 Add catch-all contents field
*/
////////////////////////////////////////////////////////////////////////////////
And for DumpIndex:
////////////////////////////////////////////////////////////////////////////////
package myLucene;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;

import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;

import org.apache.lucene.store.FSDirectory;

import java.io.File;
import java.io.IOException;

import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/**
 * Dumps a Lucene index as XML. Dumps all documents with their fields and
 * values to stdout.
 * 
 * Blog post at
 * http://ktulu.com.ar/blog/2009/10/12/dumping-lucene-indexes-as-xml/
 * 
 * @author Luis Parravicini
 */
public class DumpIndex {
        /**
         * Reads the index from the directory passed as argument, or "index" if
         * no arguments are given.
         */
        public static void main(String[] args) throws Exception {
                String index = (args.length > 0 ? args[0] : "index");

                new DumpIndex(index).dump();
        }

        private String dir;

        public DumpIndex(String dir) {
                this.dir = dir;
        }

        public void dump() throws XMLStreamException, FactoryConfigurationError,
                        CorruptIndexException, IOException {
                XMLStreamWriter out =
                                XMLOutputFactory.newInstance().createXMLStreamWriter(System.out);
                IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));

                out.writeStartDocument();
                out.writeStartElement("documents");

                // note: iterating to numDocs() assumes the index has no deletions;
                // with deletions, iterate to maxDoc() and skip isDeleted(i) docs
                for (int i = 0; i < reader.numDocs(); i++) {
                        dumpDocument(reader.document(i), out);
                }
                out.writeEndElement();
                out.writeEndDocument();
                out.flush();
                reader.close();
        }

        private void dumpDocument(Document document, XMLStreamWriter out)
                        throws XMLStreamException {
                out.writeStartElement("document");

                for (Fieldable field : document.getFields()) {
                        System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
                                        field.isStoreOffsetWithTermVector() + " " +
                                        field.isStorePositionWithTermVector());
                
                        out.writeStartElement("field");
                        out.writeAttribute("name", field.name());
                        out.writeAttribute("value", field.stringValue());
                        out.writeEndElement();
                }
                out.writeEndElement();
        }
}
////////////////////////////////////////////////////////////////////////////////
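
For what it's worth, Lucene's bundled CheckIndex tool can also report on term 
vectors without any custom code; its per-segment output includes a term 
vectors test. Something along these lines should work (the lucene-core jar 
path here is only an example for a 3.6.0 setup):

```shell
# Run Lucene's built-in index checker against the test index;
# the jar path and index path are placeholders for the real ones
java -cp lucene-core-3.6.0.jar org.apache.lucene.index.CheckIndex /path/to/index
```

That gives a view of the on-disk index that is independent of both of the 
programs above.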

-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: Friday, July 20, 2012 6:11 AM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

Hi Mike:

I wrote up some tests last night against 3.6, trying to find some way to 
reproduce what you are seeing, e.g. adding additional segments with the field 
specified without term vectors, without tv offsets, omitting TF, then merging 
them and checking everything out. I couldn't find any problems.

Can you provide more information?

On Thu, Jul 19, 2012 at 7:16 PM, Mike O'Leary <tmole...@uw.edu> wrote:
> I created an index using Lucene 3.6.0 in which I specified that a certain 
> text field in each document should be indexed, stored, analyzed with no 
> norms, with term vectors, offsets and positions. Later I looked at that index 
> in Luke, and it said that term vectors were created for this field, but 
> offsets and positions were not. The code I used for indexing couldn't be 
> simpler. It looks like this for the relevant field:
>
> doc.add(new Field("ReportText", reportTextContents, Field.Store.YES, 
> Field.Index.ANALYZED_NO_NORMS, 
> Field.TermVector.WITH_POSITIONS_OFFSETS));
>
> The indexer adds these documents to the index and commits them. I ran the 
> indexer in a debugger and watched the Lucene code set the Field instance 
> variables called storeTermVector, storeOffsetWithTermVector and 
> storePositionWithTermVector to true for this field.
>
> When the indexing was done, I ran a simple program in a debugger that opens 
> an index, reads each document and writes out its information as XML. The 
> values of storeOffsetWithTermVector and storePositionWithTermVector in the 
> ReportText Field objects were false. Is there something other than specifying 
> Field.TermVector.WITH_POSITIONS_OFFSETS when constructing a Field that needs 
> to be done in order for offsets and positions to be saved in the index? Or 
> are there circumstances under which the Field.TermVector setting for a Field 
> object is ignored? This doesn't make sense to me, and I could swear that 
> offsets and positions were being saved in some older indexes I created that I 
> unfortunately no longer have around for comparison. I'm sure that I am just 
> overlooking something or have made some kind of mistake, but I can't see what 
> it is at the moment. Thanks for any help or advice you can give me.
> Mike



--
lucidimagination.com
