Hi Robert,
I'm not trying to determine whether a document has term vectors; I'm trying to determine whether the term vectors that are in the index have offsets and positions stored. Shouldn't the Field instance variables storeOffsetWithTermVector and storePositionWithTermVector be set to true for a field that is defined to store offsets and positions in term vectors? They are set to true in 3.5, but not in 3.6. When I open an index that I created with 3.6 in Luke, it says the fields in question have term vectors enabled, but that offsets and positions are not stored. Maybe once term vectors with offsets and positions are created, the values of storeOffsetWithTermVector and storePositionWithTermVector no longer matter, but I'd like to find out for sure whether offsets and positions are being handled correctly in 3.6, because I need to produce indexes that a co-worker can use with a UI that does fast term vector highlighting, and I'd like to be sure the indexes I create work for him.
Thanks,
Mike
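One way to verify what the index actually contains, rather than what the Field objects rebuilt from stored fields report, is to ask the reader for the term vector itself, which is what Robert suggests below. This is a minimal sketch against the Lucene 3.x API, not code from this thread; the index path and field name are command-line placeholders, and it only inspects the first document and first term:

```java
import java.io.File;
import java.util.Arrays;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.store.FSDirectory;

public class CheckTermVectors {
  public static void main(String[] args) throws Exception {
    // args[0] = index directory, args[1] = field name (placeholders)
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    try {
      TermFreqVector tfv = reader.getTermFreqVector(0, args[1]);
      if (tfv == null) {
        System.out.println("no term vector stored for field " + args[1]);
      } else if (tfv instanceof TermPositionVector) {
        // A TermPositionVector means positions and/or offsets actually made it
        // into the index, regardless of what the stored-field flags report.
        TermPositionVector tpv = (TermPositionVector) tfv;
        System.out.println("positions of first term: "
            + Arrays.toString(tpv.getTermPositions(0)));
        TermVectorOffsetInfo[] offsets = tpv.getOffsets(0);
        System.out.println("offsets stored: " + (offsets != null));
      } else {
        System.out.println("term vector present, but without positions or offsets");
      }
    } finally {
      reader.close();
    }
  }
}
```

If the field was indexed with Field.TermVector.WITH_POSITIONS_OFFSETS and this prints a non-null offsets array, the index itself is fine and only the flags on Field objects retrieved via IndexReader.document are misleading.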
-----Original Message-----
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, July 20, 2012 4:05 PM
To: java-user@lucene.apache.org
Subject: Re: Problem with TermVector offsets and positions not being preserved

I think it's wrong for DumpIndex to look at term vector information from the Document that was retrieved from IndexReader.document; that's basically just a way of getting access to your stored fields. This tool should be using something like IndexReader.getTermFreqVector for the document to determine whether it has term vectors.

On Fri, Jul 20, 2012 at 5:10 PM, Mike O'Leary <tmole...@uw.edu> wrote:
> Hi Robert,
> I put together the following two small applications to try to separate the problem I am having from my own software and any bugs it contains. One of the applications is called CreateTestIndex, and it comes with the Lucene in Action book's source code that you can download from Manning Publications. I changed it a tiny bit to get rid of a special analyzer that is irrelevant to what I am looking at, to get rid of a few warnings about deprecated functions, and to add a loop that writes the names of fields and their TermVector, offset, and position settings to the console.
>
> The other application is called DumpIndex; I got it from a web site somewhere about six months ago. I changed a few lines to get rid of deprecated function warnings and added the same line of code to it that writes field information to the console.
>
> What I am seeing is that when I run CreateTestIndex, when the fields are first created, added to a document, and are about to be added to the index, the fields for which Field.TermVector.WITH_POSITIONS_OFFSETS is specified correctly print out that the values of field.isTermVectorStored(), field.isStoreOffsetWithTermVector() and field.isStorePositionWithTermVector() are true.
> When I run DumpIndex on the index that was created, those fields print out true for field.isTermVectorStored() and false for the other two functions.
> Thanks,
> Mike
>
> This is the source code for CreateTestIndex:
>
> ////////////////////////////////////////////////////////////////////////////////
> package myLucene;
>
> /**
>  * Copyright Manning Publications Co.
>  *
>  * Licensed under the Apache License, Version 2.0 (the "License");
>  * you may not use this file except in compliance with the License.
>  * You may obtain a copy of the License at
>  *
>  *     http://www.apache.org/licenses/LICENSE-2.0
>  *
>  * Unless required by applicable law or agreed to in writing, software
>  * distributed under the License is distributed on an "AS IS" BASIS,
>  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
>  * See the License for the specific language governing permissions and
>  * limitations under the License.
>  */
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Fieldable;
> import org.apache.lucene.document.NumericField;
> import org.apache.lucene.document.DateTools;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.FSDirectory;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriterConfig;
> import org.apache.lucene.util.Version;
>
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
> import java.util.Properties;
> import java.util.Date;
> import java.util.List;
> import java.util.ArrayList;
> import java.text.ParseException;
>
> public class CreateTestIndex {
>
>   public static Document getDocument(String rootDir, File file) throws IOException {
>     Properties props = new Properties();
>     props.load(new FileInputStream(file));
>
>     Document doc = new Document();
>
>     // category comes from relative path below the base directory
>     String category = file.getParent().substring(rootDir.length());    //1
>     category = category.replace(File.separatorChar, '/');              //1
>
>     String isbn = props.getProperty("isbn");                           //2
>     String title = props.getProperty("title");                         //2
>     String author = props.getProperty("author");                       //2
>     String url = props.getProperty("url");                             //2
>     String subject = props.getProperty("subject");                     //2
>     String pubmonth = props.getProperty("pubmonth");                   //2
>
>     System.out.println(title + "\n" + author + "\n" + subject + "\n" +
>                        pubmonth + "\n" + category + "\n---------");
>
>     doc.add(new Field("isbn",                                          //3
>                       isbn,
>                       Field.Store.YES,
>                       Field.Index.NOT_ANALYZED));
>     doc.add(new Field("category",                                      //3
>                       category,
>                       Field.Store.YES,
>                       Field.Index.NOT_ANALYZED));
>     doc.add(new Field("title",                                         //3
>                       title,
>                       Field.Store.YES,
>                       Field.Index.ANALYZED,
>                       Field.TermVector.WITH_POSITIONS_OFFSETS));
>     doc.add(new Field("title2",                                        //3
>                       title.toLowerCase(),
>                       Field.Store.YES,
>                       Field.Index.NOT_ANALYZED_NO_NORMS,
>                       Field.TermVector.WITH_POSITIONS_OFFSETS));
>
>     // split multiple authors into unique field instances
>     String[] authors = author.split(",");                              //3
>     for (String a : authors) {
>       doc.add(new Field("author",
>                         a,
>                         Field.Store.YES,
>                         Field.Index.NOT_ANALYZED,
>                         Field.TermVector.WITH_POSITIONS_OFFSETS));
>     }
>
>     doc.add(new Field("url",                                           //3
>                       url,
>                       Field.Store.YES,
>                       Field.Index.NOT_ANALYZED_NO_NORMS));
>     doc.add(new Field("subject",                                       //3 //4
>                       subject,
>                       Field.Store.YES,
>                       Field.Index.ANALYZED,
>                       Field.TermVector.WITH_POSITIONS_OFFSETS));
>
>     doc.add(new NumericField("pubmonth",                               //3
>                              Field.Store.YES,
>                              true).setIntValue(Integer.parseInt(pubmonth)));
>
>     Date d;                                                            //3
>     try {
>       d = DateTools.stringToDate(pubmonth);
>     } catch (ParseException pe) {
>       throw new RuntimeException(pe);
>     }
>     doc.add(new NumericField("pubmonthAsDay")                          //3
>             .setIntValue((int) (d.getTime()/(1000*3600*24))));
>
>     for (String text : new String[] {title, subject, author, category}) {  //3 //5
>       doc.add(new Field("contents", text,
>                         Field.Store.NO, Field.Index.ANALYZED,
>                         Field.TermVector.WITH_POSITIONS_OFFSETS));
>     }
>
>     List<Fieldable> fields = doc.getFields();
>     for (Fieldable field : fields) {
>       System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
>                          field.isStoreOffsetWithTermVector() + " " +
>                          field.isStorePositionWithTermVector());
>     }
>     return doc;
>   }
>
>   private static void findFiles(List<File> result, File dir) {
>     for (File file : dir.listFiles()) {
>       if (file.getName().endsWith(".properties")) {
>         result.add(file);
>       } else if (file.isDirectory()) {
>         findFiles(result, file);
>       }
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     String dataDir = args[0];
>     String indexDir = args[1];
>     List<File> results = new ArrayList<File>();
>     findFiles(results, new File(dataDir));
>     System.out.println(results.size() + " books to index");
>     Directory dir = FSDirectory.open(new File(indexDir));
>     IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
>         new StandardAnalyzer(Version.LUCENE_36));
>     IndexWriter w = new IndexWriter(dir, config);
>     for (File file : results) {
>       Document doc = getDocument(dataDir, file);
>       w.addDocument(doc);
>     }
>     w.close();
>     dir.close();
>   }
> }
>
> /*
>   #1 Get category
>   #2 Pull fields
>   #3 Add fields to Document instance
>   #4 Flag subject field
>   #5 Add catch-all contents field
>   #6 Custom analyzer to override multi-valued position increment
> */
> ////////////////////////////////////////////////////////////////////////////////
>
> And for DumpIndex:
>
> ////////////////////////////////////////////////////////////////////////////////
> package myLucene;
>
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Fieldable;
>
> import org.apache.lucene.index.CorruptIndexException;
> import org.apache.lucene.index.IndexReader;
>
> import org.apache.lucene.store.FSDirectory;
>
> import java.io.File;
> import java.io.IOException;
>
> import javax.xml.stream.FactoryConfigurationError;
> import javax.xml.stream.XMLOutputFactory;
> import javax.xml.stream.XMLStreamException;
> import javax.xml.stream.XMLStreamWriter;
>
> /**
>  * Dumps a Lucene index as XML. Dumps all documents with their fields and
>  * values to stdout.
>  *
>  * Blog post at
>  * http://ktulu.com.ar/blog/2009/10/12/dumping-lucene-indexes-as-xml/
>  *
>  * @author Luis Parravicini
>  */
> public class DumpIndex {
>   /**
>    * Reads the index from the directory passed as argument, or "index"
>    * if no arguments are given.
>    */
>   public static void main(String[] args) throws Exception {
>     String index = (args.length > 0 ? args[0] : "index");
>     new DumpIndex(index).dump();
>   }
>
>   private String dir;
>
>   public DumpIndex(String dir) {
>     this.dir = dir;
>   }
>
>   public void dump() throws XMLStreamException, FactoryConfigurationError,
>       CorruptIndexException, IOException {
>     XMLStreamWriter out =
>         XMLOutputFactory.newInstance().createXMLStreamWriter(System.out);
>     IndexReader reader = IndexReader.open(FSDirectory.open(new File(dir)));
>
>     out.writeStartDocument();
>     out.writeStartElement("documents");
>     for (int i = 0; i < reader.numDocs(); i++) {
>       dumpDocument(reader.document(i), out);
>     }
>     out.writeEndElement();
>     out.writeEndDocument();
>     out.flush();
>     reader.close();
>   }
>
>   private void dumpDocument(Document document, XMLStreamWriter out)
>       throws XMLStreamException {
>     out.writeStartElement("document");
>     for (Fieldable field : document.getFields()) {
>       System.out.println(field.name() + " " + field.isTermVectorStored() + " " +
>                          field.isStoreOffsetWithTermVector() + " " +
>                          field.isStorePositionWithTermVector());
>       out.writeStartElement("field");
>       out.writeAttribute("name", field.name());
>       out.writeAttribute("value", field.stringValue());
>       out.writeEndElement();
>     }
>     out.writeEndElement();
>   }
> }
> ////////////////////////////////////////////////////////////////////////////////
>
> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Friday, July 20, 2012 6:11 AM
> To: java-user@lucene.apache.org
> Subject: Re: Problem with TermVector offsets and positions not being preserved
>
> Hi Mike:
>
> I wrote up some tests last night against 3.6 trying to find some way to reproduce what you are seeing, e.g. adding additional segments with the field specified without term vectors, without tv offsets, omitting TF, and merging them and checking everything out. I couldn't find any problems.
>
> Can you provide more information?
>
> On Thu, Jul 19, 2012 at 7:16 PM, Mike O'Leary <tmole...@uw.edu> wrote:
>> I created an index using Lucene 3.6.0 in which I specified that a certain text field in each document should be indexed, stored, analyzed with no norms, with term vectors, offsets and positions. Later I looked at that index in Luke, and it said that term vectors were created for this field, but offsets and positions were not. The code I used for indexing couldn't be simpler. It looks like this for the relevant field:
>>
>> doc.add(new Field("ReportText", reportTextContents, Field.Store.YES,
>>                   Field.Index.ANALYZED_NO_NORMS,
>>                   Field.TermVector.WITH_POSITIONS_OFFSETS));
>>
>> The indexer adds these documents to the index and commits them. I ran the indexer in a debugger and watched the Lucene code set the Field instance variables storeTermVector, storeOffsetWithTermVector and storePositionWithTermVector to true for this field.
>>
>> When the indexing was done, I ran a simple program in a debugger that opens an index, reads each document and writes out its information as XML. The values of storeOffsetWithTermVector and storePositionWithTermVector in the ReportText Field objects were false.
>> Is there something other than specifying Field.TermVector.WITH_POSITIONS_OFFSETS when constructing a Field that needs to be done in order for offsets and positions to be saved in the index? Or are there circumstances under which the Field.TermVector setting for a Field object is ignored? This doesn't make sense to me, and I could swear that offsets and positions were being saved in some older indexes I created that I unfortunately no longer have around for comparison. I'm sure that I am just overlooking something or have made some kind of mistake, but I can't see what it is at the moment. Thanks for any help or advice you can give me.
>> Mike
>
> --
> lucidimagination.com
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
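For anyone who wants to settle the underlying question at the index level, which fields actually recorded term vectors with positions and offsets, the 3.x reader can report this directly from the field infos, independent of any retrieved Field instance. A minimal sketch against the Lucene 3.x API, not code from this thread; the index path argument is a placeholder:

```java
import java.io.File;
import java.util.Collection;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class ListTermVectorFields {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
    try {
      // Field infos record what was actually written to the index,
      // independent of the flags on any Field rebuilt from stored fields.
      Collection<String> withVectors = reader.getFieldNames(
          IndexReader.FieldOption.TERMVECTOR);
      Collection<String> withPosAndOffsets = reader.getFieldNames(
          IndexReader.FieldOption.TERMVECTOR_WITH_POSITION_OFFSET);
      System.out.println("term vectors: " + withVectors);
      System.out.println("term vectors + positions + offsets: " + withPosAndOffsets);
    } finally {
      reader.close();
    }
  }
}
```

If a field such as ReportText appears in the second list, offsets and positions were preserved in the index, which is what the FastVectorHighlighter use case needs.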