Hello community, dear Grant,
I have built a JUnit test case that illustrates the problem - there, I try to cut out the right substring with the offset values given by Lucene - and fail :(

A few remarks: in this example, the 'é' from 'Bosé' makes the '\w' pattern not match - unlike in StandardAnalyzer, it is treated as a delimiter sign.

Analysis: it seems that Lucene calculates the offset values by adding a virtual delimiter between every field value. But Lucene forgets the last characters of a field value when these are analyzer-specific delimiter characters. (I assume this because of DocumentWriter, line 245: 'if(lastToken != null) offset += lastToken.endOffset() + 1;'.) With this line of code, only the end offset of the last token is considered - potential trimmed delimiter chars are forgotten.

Thus, a solution would be:
1. Add a single delimiter char between the field values
2. Subtract (from the Lucene offset) the count of analyzer-specific delimiters that appear at the end of all field values before the match

For this, one needs to know what a delimiter for a specific analyzer is. The other possibility, of course, is to change the behaviour inside Lucene, because the current offset values are more or less useless / hard to use (I currently have no idea how to get the analyzer-specific delimiter chars). A sketch of this accounting, and of a possible re-analysis workaround, follows below the quoted thread and after the attached test case.

For me, this looks like a bug - am I wrong? Any ideas/hints/remarks? I would be very happy about any :)

Greetings

Christian

Grant Ingersoll wrote:
> Hi Christian,
>
> Is there any way you can post a complete, self-contained example,
> preferably as a JUnit test? I think it would be useful to know more
> about how you are indexing (i.e. what Analyzer, etc.).
> The offsets should be taken from whatever is set on the Token during
> analysis. I, too, am trying to remember where in the code this is
> taking place.
>
> Also, what version of Lucene are you using?
>
> -Grant
>
> On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
>
>> Hello,
>>
>> I have an index with an 'actor' field; for each actor there exists a
>> single field value entry, e.g.
>>
>> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition <movie_actors>
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
>> movie_actors:Miguel Bosé
>> movie_actors:Anna Lizaran (as Ana Lizaran)
>> movie_actors:Raquel Sanchís
>> movie_actors:Angelina Llongueras
>>
>> I try to get the term offset, e.g. for 'angelina', with
>>
>> termPositionVector = (TermPositionVector)
>>     reader.getTermFreqVector(docNumber, "movie_actors");
>> int iTermIndex = termPositionVector.indexOf("angelina");
>> TermVectorOffsetInfo[] termOffsets =
>>     termPositionVector.getOffsets(iTermIndex);
>>
>> I get one TermVectorOffsetInfo for the field - with offset numbers
>> that are bigger than one single field entry.
>> I guessed that Lucene gives the offset number for the situation that
>> all values were concatenated, which would be the single (virtual) string:
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
>>
>> This fits in nearly no situation, so my second guess was that Lucene
>> adds some virtual delimiters between the single field entries for
>> offset calculation.
>> I added a delimiter, so the result would be:
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
>>
>> (note the ' ' between each actor name)
>>
>> ...this also does not fit in every situation - now there are too many
>> delimiters. So I further guessed that Lucene does not add a delimiter
>> in every situation, and added one only when the last character of an
>> entry was an alphanumerical one:
>>
>> StringBuilder strbAttContent = new StringBuilder();
>> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
>> {
>>     strbAttContent.append(strAttValue);
>>     if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
>>         strbAttContent.append(' ');
>> }
>>
>> where I get the (virtual) result entry:
>>
>> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
>>
>> This fits in ~96% of all my queries... but it is still not 100% the
>> way Lucene calculates the offset value for fields with multiple value
>> entries.
>>
>> ...maybe the problem is that there are special characters inside my
>> database (e.g. the 'é' in 'Bosé') where my '\w' does not match.
>> I have looked at this specific situation, but considering this one
>> character does not solve the problem.
>>
>> How does Lucene calculate these offsets? I also searched inside the
>> source code, but can't find the correct place.
>>
>> Thanks in advance!
>>
>> Christian Reuschling
>>
>> --
>> ______________________________________________________________________________
>> Christian Reuschling, Dipl.-Ing.(BA)
>> Software Engineer
>>
>> Knowledge Management Department
>> German Research Center for Artificial Intelligence DFKI GmbH
>> Trippstadter Straße 122, D-67663 Kaiserslautern, Germany
>>
>> Phone: +49.631.20575-125
>> mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/
>>
>> ------------Legal Company Information Required by German Law------------------
>> Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
>> Dr. Walter Olthoff
>> Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
>> Amtsgericht Kaiserslautern, HRB 2313
>> ______________________________________________________________________________
>
> --------------------------
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
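To make the suspected accounting concrete: below is a minimal, self-contained sketch (the class and method names are made up for illustration; this is not Lucene API) that computes, for each value of a multi-valued field, the base offset Lucene would seem to add to that value's token offsets - assuming the quoted DocumentWriter behaviour of advancing by lastToken.endOffset() + 1 after every value.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class OffsetAccounting
{
    // Sketch, not Lucene API: reproduce the per-value base offsets under the
    // assumption that DocumentWriter does 'offset += lastToken.endOffset() + 1'
    // after each value of a multi-valued field.
    public static List<Integer> computeBaseOffsets(Analyzer analyzer, String strFieldName,
            String[] straValues) throws IOException
    {
        List<Integer> liBaseOffsets = new ArrayList<Integer>();
        int iOffset = 0;
        for (String strValue : straValues)
        {
            liBaseOffsets.add(iOffset);

            // re-analyze the value to find the end offset of its last token
            TokenStream stream = analyzer.tokenStream(strFieldName, new StringReader(strValue));
            Token lastToken = null;
            for (Token token = stream.next(); token != null; token = stream.next())
                lastToken = token;
            stream.close();

            // mirrors DocumentWriter, line 245 - trailing characters that the
            // analyzer dropped (e.g. the final ')' of an entry) are lost here,
            // which is exactly the suspected source of the offset mismatch
            if (lastToken != null)
                iOffset += lastToken.endOffset() + 1;
        }
        return liBaseOffsets;
    }
}

With StandardAnalyzer and the five actor names from the test case below, this should yield the base offsets 0, 40, 52, 81 and 96 - none of which the three concatenation heuristics in the test reproduce in every case, because the dropped trailing ')' of the first and third entries happens to be compensated by the virtual '+ 1' delimiter, while the other entries really do get a virtual delimiter appended.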
package org.dynaq.index;

import static org.junit.Assert.assertTrue;

import java.io.IOException;
import java.util.LinkedList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermPositionVector;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class TokenizerTest
{
    IndexReader m_indexReader;
    Analyzer m_analyzer = new StandardAnalyzer();

    @Before
    public void createIndex() throws CorruptIndexException, LockObtainFailedException, IOException
    {
        Directory ramDirectory = new RAMDirectory();

        // we create a first little set of actor names
        LinkedList<String> llActorNames = new LinkedList<String>();
        llActorNames.add("Mayrata O'Wisiedo (as Mairata O'Wisiedo)");
        llActorNames.add("Miguel Bosé");
        llActorNames.add("Anna Lizaran (as Ana Lizaran)");
        llActorNames.add("Raquel Sanchís");
        llActorNames.add("Angelina Llongueras");

        // store them into a single document with multiple values for one field
        Document testDoc = new Document();
        for (String strActorsName : llActorNames)
        {
            Field testEntry = new Field("movie_actors", strActorsName, Field.Store.YES,
                    Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            testDoc.add(testEntry);
        }

        // now we write it into the index
        IndexWriter indexWriter = new IndexWriter(ramDirectory, true, m_analyzer, true);
        indexWriter.addDocument(testDoc);
        indexWriter.close();

        m_indexReader = IndexReader.open(ramDirectory);
    }

    @Test
    public void checkOffsetValues() throws ParseException, IOException
    {
        // first, we search for 'angelina'
        String strSearchTerm = "Angelina";
        Searcher searcher = new IndexSearcher(m_indexReader);
        QueryParser parser = new QueryParser("movie_actors", m_analyzer);
        Query query = parser.parse(strSearchTerm);
        Hits hits = searcher.search(query);
        Document resultDoc = hits.doc(0);
        String[] straValues = resultDoc.getValues("movie_actors");

        // now, we get the field values and build a single string value out of them
        StringBuilder strbSimplyConcatenated = new StringBuilder();
        StringBuilder strbWithDelimiters = new StringBuilder();
        StringBuilder strbWithDelimitersAfterAlphaNumChar = new StringBuilder();
        for (String strActorName : straValues)
        {
            // first situation: we simply concatenate all field value entries
            strbSimplyConcatenated.append(strActorName);

            // second: we add a single delimiter char between the field values
            strbWithDelimiters.append(strActorName).append('$');

            // third try: we add a single delimiter, but only if the last char of the
            // preceding actor name was an alphanumeric char
            strbWithDelimitersAfterAlphaNumChar.append(strActorName);
            String strLastChar = strActorName.substring(strActorName.length() - 1, strActorName.length());
            if (strLastChar.matches("\\w"))
                strbWithDelimitersAfterAlphaNumChar.append('$');
        }

        // this is the offset value from Lucene.
        // It should be the place of 'angelina' in one of the concatenated value strings above.
        TermPositionVector termPositionVector =
                (TermPositionVector) m_indexReader.getTermFreqVector(0, "movie_actors");
        int iTermIndex = termPositionVector.indexOf(strSearchTerm.toLowerCase());
        TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);
        int iStartOffset = termOffsets[0].getStartOffset();
        int iEndOffset = termOffsets[0].getEndOffset();

        // we create the substrings according to the offset value given by Lucene
        String strSubString1 = strbSimplyConcatenated.substring(iStartOffset, iEndOffset);
        String strSubString2 = strbWithDelimiters.substring(iStartOffset, iEndOffset);
        String strSubString3 = strbWithDelimitersAfterAlphaNumChar.substring(iStartOffset, iEndOffset);

        System.out.println("Offset value: " + iStartOffset + "-" + iEndOffset);
        System.out.println("simply concatenated:");
        System.out.println(strbSimplyConcatenated);
        System.out.println("SubString for offset: '" + strSubString1 + "'");
        System.out.println();
        System.out.println("with delimiters:");
        System.out.println(strbWithDelimiters);
        System.out.println("SubString for offset: '" + strSubString2 + "'");
        System.out.println();
        System.out.println("with delimiter after alphanum character:");
        System.out.println(strbWithDelimitersAfterAlphaNumChar);
        System.out.println("SubString for offset: '" + strSubString3 + "'");

        // is the offset value correct for one of the concatenated strings?
        // this fails for all situations
        assertTrue(strSubString1.equals(strSearchTerm) || strSubString2.equals(strSearchTerm)
                || strSubString3.equals(strSearchTerm));

        /*
         * Comments: in this example, the 'é' from 'Bosé' makes the '\w' pattern not match - unlike in
         * StandardAnalyzer, it is treated as a delimiter sign.
         *
         * Analysis: it seems that Lucene calculates the offset values by adding a virtual delimiter between
         * every field value. But Lucene forgets the last characters of a field value when these are
         * analyzer-specific delimiter characters. (I assume this because of DocumentWriter, line 245:
         * 'if(lastToken != null) offset += lastToken.endOffset() + 1;'.) With this line of code, only the
         * end offset of the last token is considered - potential trimmed delimiter chars are forgotten.
         *
         * Thus, a solution would be:
         * 1. Add a single delimiter char between the field values
         * 2. Subtract (from the Lucene offset) the count of analyzer-specific delimiters that appear at the
         *    end of all field values before the match
         *
         * For this, one needs to know what a delimiter for a specific analyzer is.
         */
    }

    @After
    public void closeIndex() throws CorruptIndexException, IOException
    {
        m_indexReader.close();
    }
}
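Given the accounting above, a possible workaround suggests itself - again only a sketch under the same assumption, not a tested fix: instead of guessing delimiter characters with a '\w' heuristic, re-analyze the stored values with the same analyzer and map the term vector offsets back into the value they fall into. extractMatch is a hypothetical helper that would live next to computeBaseOffsets in the OffsetAccounting sketch above:

    // Hypothetical helper (same assumptions as above): translate the global
    // term vector offsets back into a substring of the stored field value.
    public static String extractMatch(Analyzer analyzer, String strFieldName, String[] straValues,
            int iStartOffset, int iEndOffset) throws IOException
    {
        List<Integer> liBaseOffsets = computeBaseOffsets(analyzer, strFieldName, straValues);

        // pick the last value whose base offset does not exceed the match start
        int iValue = 0;
        for (int i = 0; i < liBaseOffsets.size(); i++)
            if (liBaseOffsets.get(i) <= iStartOffset)
                iValue = i;

        // make the global offsets local to the matching value
        int iBase = liBaseOffsets.get(iValue);
        return straValues[iValue].substring(iStartOffset - iBase, iEndOffset - iBase);
    }

With the test data above, extractMatch(m_analyzer, "movie_actors", straValues, iStartOffset, iEndOffset) should return 'Angelina' (the match starts at global offset 96, which is the base offset of the fifth value), because the base offsets come from the analyzer itself rather than from a delimiter guess.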