Issue with indexed tokens position
Hi,
Lucene doesn't find the following value; there seems to be some issue with PhraseQuery.

indexed value: pink-I
Indexed tokens: 1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6]
(ex. explanation: "pink" is a term, "0->5" its term position)

The value is indexed in a field called "fieldName". My Lucene search with the query [fieldName:"pink i"] can't find the indexed value above.

Can anyone help me out here?

Thx in advance,
Jelda
RE: Issue with indexed tokens position
Strangely, my Lucene query fieldName:"pinki i" finds the document (note the "i" after "pinki").

Jelda

> -Original Message-
> From: Ramana Jelda [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 17, 2007 12:33 PM
> To: java-user@lucene.apache.org
> Subject: Issue with indexed tokens position
>
> Hi,
> Lucene doesn't find following value. Some issues with PhraseQuery.
>
> indexed value: pink-I
> Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6]
> (ex. explanation: "pink" is a term "0->5" term-position)
>
> And I have indexed in a field called "fieldName".
> My lucene search with the query [fieldName:"pink i"] can't
> find above indexed value.
>
> Can anyone help me out here.
>
> Thx in advance,
> Jelda
Re: Issue with indexed tokens position
You'd get much better answers if you posted a concise example (or possibly code snippets), especially including the analyzers you used. Have you used Luke to examine your index and see if it's indexed as you expect? Best Erick On 8/17/07, Ramana Jelda <[EMAIL PROTECTED]> wrote: > > Strangely.. > My lucene query: fieldName:"pinki i" finds document. (see "i" > in "pinki") > > Jelda > > > -Original Message- > > From: Ramana Jelda [mailto:[EMAIL PROTECTED] > > Sent: Friday, August 17, 2007 12:33 PM > > To: java-user@lucene.apache.org > > Subject: Issue with indexed tokens position > > > > Hi, > > Lucene doesn't find following value. Some issues with PhraseQuery. > > > > indexed value: pink-I > > Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6] > > (ex. explanation: > > "pink" is a term "0->5" term-position) > > > > And I have indexed in a field called "fieldName". > > My lucene search with the query [fieldName:"pink i"] can't > > find above indexed value. > > > > Can anyone help me out here. > > > > Thx in advance, > > Jelda > > > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
RE: Issue with indexed tokens position
Hi Erick,
Thanks. I'll try my best to provide pseudo code here.

Indexed value: "pink-i"

I have used a custom analyzer. It looks a little bit like the following:

public class KeyWordFilter extends TokenFilter {
    public KeyWordFilter(TokenStream in) {
        super(in);
        keywordStack = new LinkedList();
    }

    org.apache.lucene.analysis.Token next() {
        if (keywordStack.size() > 0) {
            return (Token) keywordStack.poll();
        }
        // token = "pink-i"
        makeTokens(token);
    }

    void makeTokens(Token token) {
        // make the following tokens and add them to the stack:
        // [(pink,0,5,type=HYPENWORD_DIVIDED),
        //  (pinki,0,5,type=HYPENWORD_DIVIDED,posIncr=0),
        //  (i,5,6,type=HYPENWORD_DIVIDED)]
    }
}

I am 100% sure that there is a problem with the token positions: the PhraseQuery "pink i" is not working, whereas the PhraseQuery "pinki i" works. It seems the positions are totally ignored by PhraseQuery.

Any thoughts?

Thx,
Jelda

> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Friday, August 17, 2007 3:31 PM
> To: java-user@lucene.apache.org
> Subject: Re: Issue with indexed tokens position
>
> You'd get much better answers if you posted a concise example
> (or possibly code snippets), especially including the
> analyzers you used.
>
> Have you used Luke to examine your index and see if it's
> indexed as you expect?
>
> Best
> Erick
>
> [...]
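For reference, a fleshed-out version of what the pseudo code above seems to describe, written against the Lucene 2.x TokenFilter API. The class name, the type constant and the exact offsets are illustrative, not the actual code from the thread:

    import java.io.IOException;
    import java.util.LinkedList;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    // Splits "pink-i" into "pink", the concatenation "pinki" (same position,
    // posIncr = 0) and "i", queuing the extra tokens for later next() calls.
    public class HyphenWordFilter extends TokenFilter {

        public static final String HYPHENWORD_DIVIDED = "HYPHENWORD_DIVIDED";

        private final LinkedList queue = new LinkedList();

        public HyphenWordFilter(TokenStream in) {
            super(in);
        }

        public Token next() throws IOException {
            if (!queue.isEmpty()) {
                return (Token) queue.removeFirst();
            }
            Token token = input.next();
            if (token == null) {
                return null;
            }
            String text = token.termText();
            int hyphen = text.indexOf('-');
            if (hyphen <= 0 || hyphen == text.length() - 1) {
                return token;                                // nothing to split
            }
            int start = token.startOffset();
            String left = text.substring(0, hyphen);
            String right = text.substring(hyphen + 1);

            // the concatenated form shares the position of the left part
            Token combined = new Token(left + right, start, token.endOffset(), HYPHENWORD_DIVIDED);
            combined.setPositionIncrement(0);
            queue.add(combined);
            // the right part gets the next position
            queue.add(new Token(right, start + hyphen + 1, token.endOffset(), HYPHENWORD_DIVIDED));
            return new Token(left, start, start + hyphen, HYPHENWORD_DIVIDED);
        }
    }

With this token layout ("pink" and "pinki" at the same position, "i" one position later) an exact phrase query on "pink i" would be expected to match.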
Re: getting term offset information for fields with multiple value entiries
Hello community, dear Grant,

I have built a JUnit test case that illustrates the problem - there, I try to cut out the right substring with the offset values given by Lucene, and fail :(

A few remarks: in this example, the 'é' from 'Bosé' means the '\w' pattern doesn't match - unlike in StandardAnalyzer, it is treated as a delimiter character.

Analysis: it seems that Lucene calculates the offset values by adding a virtual delimiter between every field value, but it loses the last characters of a field value when these are analyzer-specific delimiter characters. (I conclude this from DocumentWriter, line 245: 'if(lastToken != null) offset += lastToken.endOffset() + 1;' - with this line, only the end offset of the last token is considered, so any trailing, trimmed delimiter characters are forgotten.)

Thus, a workaround would be:
1. Add a single delimiter char between the field values
2. Subtract (from the Lucene offset) the count of analyzer-specific delimiters at the end of all field values before the match

For this, one needs to know what counts as a delimiter for a specific analyzer. The other possibility, of course, is to change the behaviour inside Lucene, because the current offset values are more or less useless / hard to use (I currently have no idea how to get the analyzer-specific delimiter chars).

For me, this looks like a bug - am I wrong? Any ideas/hints/remarks would be very welcome :)

Greetings
Christian

Grant Ingersoll schrieb: > Hi Christian, > > Is there anyway you can post a complete, self-contained example > preferably as a JUnit test? I think it would be useful to know more > about how you are indexing (i.e. what Analyzer, etc.) > The offsets should be taken from whatever is set in on the Token during > Analysis. I, too, am trying to remember where in the code this is > taking place > > Also, what version of Lucene are you using? > > -Grant > > On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote: > > Hello, > > I have an index with an 'actor' field, for each actor there exists an > single field value entry, e.g. > > stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition > > > movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) > movie_actors:Miguel Bosé > movie_actors:Anna Lizaran (as Ana Lizaran) > movie_actors:Raquel Sanchís > movie_actors:Angelina Llongueras > > I try to get the term offset, e.g. for 'angelina' with > > termPositionVector = (TermPositionVector) > reader.getTermFreqVector(docNumber, "movie_actors"); > int iTermIndex = termPositionVector.indexOf("angelina"); > TermVectorOffsetInfo[] termOffsets = > termPositionVector.getOffsets(iTermIndex); > > > I get one TermVectorOffsetInfo for the field - with offset numbers > that are bigger than one single > Field entry. > I guessed that Lucene gives the offset number for the situation that > all values were concatenated, > which is for the single (virtual) string: > > movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna > Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras > > This fits in nearly no situation, so my second guess was that lucene > adds some virtual delimiters between the single > field entries for offset calculation. 
I added a delimiter, so the > result would be: > > movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna > Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras > (note the ' ' between each actor name) > > ..this also fits not for each situation - there are too much > delimiters there now, so, further, I guessed that Lucene don't add > a delimiter in each situation. So I added only one when the last > character of an entry was no alphanumerical one, with: > StringBuilder strbAttContent = new StringBuilder(); > for (String strAttValue : m_luceneDocument.getValues(strFieldName)) > { >strbAttContent.append(strAttValue); >if(strbAttContent.substring(strbAttContent.length() - > 1).matches("\\w")) > strbAttContent.append(' '); > } > > where I get the result (virtual) entry: > movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna > Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras > > this fits in ~96% of all my queriesbut still its not 100% the way > lucene calculates the offset value for fields with multiple > value entries. > > > ..maybe the problem is that there are special characters inside my > database (e.g. the 'é' at 'Bosé'), where my '\w' don't matches. > I have looked to this specific situation, but considering this one > character don't solves the problem. > > > How do Lucene calculates these offsets? I also searched inside the > source code, but can't find the correct place. > > > Thanks in advance! > > Christian Reuschling > > > > > > -- > __ > > Christian Reuschling, Dipl.-Ing.(BA) > Software E
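For anyone who wants to experiment with the rule described above (offset base of value N+1 = offset base of value N + end offset of its last token + 1), here is a small sketch that computes those bases per field value and cuts a match back out of the right value. This is only a guess at the behaviour quoted from DocumentWriter, not an official API, and it will still be off whenever the trailing-delimiter problem described above applies; class and method names are illustrative:

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;

    public class MultiValueOffsets {

        // base[i] = global offset at which field value i is assumed to start
        public static int[] valueBases(Analyzer analyzer, String field, String[] values)
                throws IOException {
            int[] bases = new int[values.length];
            int base = 0;
            for (int i = 0; i < values.length; i++) {
                bases[i] = base;
                TokenStream ts = analyzer.tokenStream(field, new StringReader(values[i]));
                Token last = null;
                Token t;
                while ((t = ts.next()) != null) {
                    last = t;
                }
                ts.close();
                if (last != null) {
                    base += last.endOffset() + 1;   // the "+1" virtual delimiter quoted above
                }
            }
            return bases;
        }

        // cut the matched text out of the value that the global offsets fall into
        public static String extract(String[] values, int[] bases, int globalStart, int globalEnd) {
            for (int i = values.length - 1; i >= 0; i--) {
                if (globalStart >= bases[i]) {
                    return values[i].substring(globalStart - bases[i], globalEnd - bases[i]);
                }
            }
            return null;
        }
    }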
ArrayIndexOutOfBoundsException
When I add a field containing a really long term I get an AIOOBE. Is this a documented feature?

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer(Collections.emptySet()), true);
        StringBuffer buf = new StringBuffer(65535);
        for (int i = 0; i < 32767; i++) {
            buf.append("ha");
        }
        Document doc = new Document();
        doc.add(new Field("f", "three tokens here " + buf.toString(),
                Field.Store.NO, Field.Index.TOKENIZED));
        iw.addDocument(doc);
        iw.close();
        dir.close();
    }

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1462)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1285)
    at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1215)
    at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:936)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2147)
Custom SynonymMap
Hi all, I'd like to add more words into SynonymMap for my application, but the HashMap that holds all the words is not visible (private). Is there any other Class that I can use to implement SynonymAnalyzer? I am using Lucene version 2.2.0 Antonius Ng
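If a full SynonymMap isn't strictly required, one way around the visibility problem is to back a small synonym filter with your own Map (term -> String[] of synonyms). A rough sketch against the Lucene 2.2 TokenFilter API, with the class name and map layout purely illustrative:

    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.Map;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class SimpleSynonymFilter extends TokenFilter {

        private final Map synonyms;                       // String term -> String[] synonyms
        private final LinkedList pending = new LinkedList();

        public SimpleSynonymFilter(TokenStream in, Map synonyms) {
            super(in);
            this.synonyms = synonyms;
        }

        public Token next() throws IOException {
            if (!pending.isEmpty()) {
                return (Token) pending.removeFirst();
            }
            Token token = input.next();
            if (token == null) {
                return null;
            }
            String[] syns = (String[]) synonyms.get(token.termText());
            if (syns != null) {
                for (int i = 0; i < syns.length; i++) {
                    // synonyms are stacked at the same position as the original term
                    Token syn = new Token(syns[i], token.startOffset(), token.endOffset());
                    syn.setPositionIncrement(0);
                    pending.add(syn);
                }
            }
            return token;
        }
    }

An Analyzer would then wrap it around its usual chain, e.g. new SimpleSynonymFilter(new LowerCaseFilter(new StandardTokenizer(reader)), map).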
Re: Issue with indexed tokens position
Sure. I'd recommend that you start by taking out your custom tokenizer and looking at what Lucene does rather than what you've tried to emulate. For instance, the StandardTokenizer returns offsets where each token starts one past the end of the previous token. That is, the following program (Lucene 2.1):

    import java.io.Reader;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class Analysis {
        public static void main(String[] args) {
            try {
                Reader r = new StringReader("this is some text");
                Tokenizer tzer = new StandardTokenizer(r);
                Token t;
                while ((t = tzer.next()) != null) {
                    System.out.println(String.format("Text: %s, start: %d, end: %d",
                            t.termText(), t.startOffset(), t.endOffset()));
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

outputs:

    Text: this, start: 0, end: 4
    Text: is, start: 5, end: 7
    Text: some, start: 8, end: 12
    Text: text, start: 13, end: 17

Which, if I'm reading your code correctly, is different from your example, where the end of one token is the same offset as the beginning of the next token. So the off-by-one error you're claiming is perhaps the result of an off-by-one error in your tokenizer.

In general, a lot of people depend on offset positions and phrase queries, so I'd be very surprised if something this basic were out there without anyone being aware of it. But you never know.

Of course, I may be way off. If so, can you post a self-contained program using standard analyzers/tokenizers illustrating the problem? Most often, when I try to create such a thing I can't, and it then points me back to my own code..

Best
Erick

On 8/17/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
>
> Hi Erick,
> Thanks.
> Here I try here my best to provide Pseudo code.
> [...]
Re: ArrayIndexOutOfBoundsException
Hmmm ... good catch. With DocumentsWriter there is a max term length (currently 16384 chars). I think we should fix it to raise a clearer exception? I'll open an issue. Mike On Fri, 17 Aug 2007 19:53:09 +0200, "karl wettin" <[EMAIL PROTECTED]> said: > When I add a field containing a really long term I get an AIOOBE. Is > this a documented feature? > >public static void main(String[] args) throws Exception { > RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer > (Collections.emptySet()), true); > StringBuffer buf = new StringBuffer(65535); > for (int i=0; i<32767; i++) { >buf.append("ha"); > } > Document doc = new Document(); > doc.add(new Field("f", "three tokens here " + buf.toString(), > Field.Store.NO, Field.Index.TOKENIZED)); > iw.addDocument(doc); > iw.close(); > dir.close(); >} > > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.addPosition(DocumentsWriter.java:1462) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.invertField(DocumentsWriter.java:1285) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.processField(DocumentsWriter.java:1215) > at org.apache.lucene.index.DocumentsWriter > $ThreadState.processDocument(DocumentsWriter.java:936) > at org.apache.lucene.index.DocumentsWriter.addDocument > (DocumentsWriter.java:2147) > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
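Until that clearer exception (or a configurable limit) lands, one workaround is to keep oversized tokens out of the index with a small filter. The 16384-character figure below is the value Mike quotes, not a documented contract, and the class name is illustrative; if memory serves, the stock org.apache.lucene.analysis.LengthFilter can do the same job with a min/max pair:

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class MaxTermLengthFilter extends TokenFilter {

        private final int maxLength;

        public MaxTermLengthFilter(TokenStream in, int maxLength) {
            super(in);
            this.maxLength = maxLength;       // e.g. something below 16384
        }

        public Token next() throws IOException {
            Token token;
            while ((token = input.next()) != null) {
                if (token.termText().length() <= maxLength) {
                    return token;             // pass ordinary tokens through
                }
                // silently drop oversized tokens instead of hitting the AIOOBE
            }
            return null;
        }
    }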
Re: ArrayIndexOutOfBoundsException
Ignore the part about "much longer strings", I overlooked that this was a single term But it still works on my machine, Lucene 2.1... Erick On 8/17/07, Michael McCandless <[EMAIL PROTECTED]> wrote: > > > Hmmm ... good catch. With DocumentsWriter there is a max term length > (currently 16384 chars). I think we should fix it to raise a clearer > exception? I'll open an issue. > > Mike > > On Fri, 17 Aug 2007 19:53:09 +0200, "karl wettin" <[EMAIL PROTECTED]> > said: > > When I add a field containing a really long term I get an AIOOBE. Is > > this a documented feature? > > > >public static void main(String[] args) throws Exception { > > RAMDirectory dir = new RAMDirectory(); > > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer > > (Collections.emptySet()), true); > > StringBuffer buf = new StringBuffer(65535); > > for (int i=0; i<32767; i++) { > >buf.append("ha"); > > } > > Document doc = new Document(); > > doc.add(new Field("f", "three tokens here " + buf.toString(), > > Field.Store.NO, Field.Index.TOKENIZED)); > > iw.addDocument(doc); > > iw.close(); > > dir.close(); > >} > > > > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException > > at java.lang.System.arraycopy(Native Method) > > at org.apache.lucene.index.DocumentsWriter$ThreadState > > $FieldData.addPosition(DocumentsWriter.java:1462) > > at org.apache.lucene.index.DocumentsWriter$ThreadState > > $FieldData.invertField(DocumentsWriter.java:1285) > > at org.apache.lucene.index.DocumentsWriter$ThreadState > > $FieldData.processField(DocumentsWriter.java:1215) > > at org.apache.lucene.index.DocumentsWriter > > $ThreadState.processDocument(DocumentsWriter.java:936) > > at org.apache.lucene.index.DocumentsWriter.addDocument > > (DocumentsWriter.java:2147) > > > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: ArrayIndexOutOfBoundsException
I've added MUCH larger strings to a document without any problem, but it was an FSDir. I admit that it is kind of "interesting" that this happens just as you cross the magic number. But I tried it on my machine and it works just fine, go figure .. Erick On 8/17/07, karl wettin <[EMAIL PROTECTED]> wrote: > > When I add a field containing a really long term I get an AIOOBE. Is > this a documented feature? > >public static void main(String[] args) throws Exception { > RAMDirectory dir = new RAMDirectory(); > IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer > (Collections.emptySet()), true); > StringBuffer buf = new StringBuffer(65535); > for (int i=0; i<32767; i++) { >buf.append("ha"); > } > Document doc = new Document(); > doc.add(new Field("f", "three tokens here " + buf.toString(), > Field.Store.NO, Field.Index.TOKENIZED)); > iw.addDocument(doc); > iw.close(); > dir.close(); >} > > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException > at java.lang.System.arraycopy(Native Method) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.addPosition(DocumentsWriter.java:1462) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.invertField(DocumentsWriter.java:1285) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.processField(DocumentsWriter.java:1215) > at org.apache.lucene.index.DocumentsWriter > $ThreadState.processDocument(DocumentsWriter.java:936) > at org.apache.lucene.index.DocumentsWriter.addDocument > (DocumentsWriter.java:2147) > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Re: Custom SynonymMap
Try searching the mail archives for SynonymMap, as I know this was discussed a while ago but don't remember the specifics. Erick On 8/17/07, Antonius Ng <[EMAIL PROTECTED]> wrote: > > Hi all, > > I'd like to add more words into SynonymMap for my application, but the > HashMap that holds all the words is not visible (private). > > Is there any other Class that I can use to implement SynonymAnalyzer? I am > using Lucene version 2.2.0 > > Antonius Ng >
Lucene and DRBD
I'm currently trying to figure out how I could add Lucene-based search functionality to an existing system. Though the application is hosted on multiple boxes, they do NOT share a SAN where we could put the index directory. Each of the nodes needs to update Lucene documents, but it's not going to be a common use case -- probably 100x a day across the 7-8M documents.

Has anyone here tried storing the Lucene index on top of DRBD? I'm curious to hear your experience in setting up and maintaining such a solution. Were there any performance issues?

DRBD: http://en.wikipedia.org/wiki/DRBD

Thanks,
Jeff
Re: Document Similarities lucene(particularly using doc id's)
Hi,

On Aug 16, 2007, at 2:20 PM, Lokeya wrote:

Hi All, I have the following set up: a) Indexed set of docs. b) Ran 1st query and got top docs. c) Fetched the ids from those and stored them in a data structure. d) Ran 2nd query, got top docs, fetched ids and stored them in a data structure. Now I have 2 sets of doc ids (set 1) and (set 2). I want to find out the document content similarity between these 2 sets (just using the doc id information which I have).

Not sure what you mean here. What do the doc ids have to do with the content?

Qn 1: Is it possible using any Lucene APIs? In that case can you point me to the appropriate APIs? I did a search at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html but couldn't find anything.

It is possible if you use term vectors (see IndexReader.getTermFreqVector). You will need to store (when you construct your Field) and load the term vectors and then calculate the similarity. A common way of doing this is by calculating the cosine of the angle between the two vectors.

-Grant

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
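A rough sketch of the term vector approach described here, assuming the field was indexed with term vectors enabled; it uses raw term frequencies only (no idf or length normalization), and the class name is illustrative:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    public class TermVectorSimilarity {

        // cosine of the angle between the term frequency vectors of two documents
        public static double cosine(IndexReader reader, int docA, int docB, String field)
                throws IOException {
            Map a = toMap(reader.getTermFreqVector(docA, field));
            Map b = toMap(reader.getTermFreqVector(docB, field));

            double dot = 0.0;
            for (Iterator it = a.entrySet().iterator(); it.hasNext();) {
                Map.Entry e = (Map.Entry) it.next();
                Integer other = (Integer) b.get(e.getKey());
                if (other != null) {
                    dot += ((Integer) e.getValue()).intValue() * other.intValue();
                }
            }
            double denom = norm(a) * norm(b);
            return denom == 0.0 ? 0.0 : dot / denom;
        }

        private static Map toMap(TermFreqVector tfv) {
            Map map = new HashMap();
            if (tfv == null) {
                return map;                   // no term vector stored for this doc/field
            }
            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();
            for (int i = 0; i < terms.length; i++) {
                map.put(terms[i], new Integer(freqs[i]));
            }
            return map;
        }

        private static double norm(Map m) {
            double sum = 0.0;
            for (Iterator it = m.values().iterator(); it.hasNext();) {
                int f = ((Integer) it.next()).intValue();
                sum += (double) f * f;
            }
            return Math.sqrt(sum);
        }
    }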
Modification to IndexWriter / IndexReader
I've noticed a few threads on this so far... maybe it's useful or maybe somebody's already done this, or maybe it's insane and bug-prone. Anyways, our application requires lucene to act as a non-critical database, as in each record is composed of denormalized data derived from the real DBMS. The index can be regenerated at any time from the database. However, information added to the index must be searchable immediately after being added. The index is written to concurrently by many users. Therefore, flushing the IndexWriter to disk, and re-opening a IndexReader is not really feasible. Therefore, I worked up this hack to compensate. Note that this solution precludes multiple readers from reading an index. Also, a reader cannot be allowed to delete documents (but really, why can you delete using a reader, anyway? Or has this been deprecated?) Essentially, a IndexWriter owns a IndexReader, and to obtain a reader, you call Indexwriter.getReader(). Whenever the writer is written to, a new reader is formed, composed of the IndexWriter's SegmentInfos (since a reader and writer essentially share copies of both of these structures anyways). It's essentially an in-memory swap rather than reading the segment infos back from disk after the writer has written them. I've attached the patch based on the current dev code. Basically it implements doAfterFlush(), and adds getReader() and addNotifier() methods. The notifier is simply so that anybody using a Searcher can be notified that the underlying reader has changed, and the Searcher should be re-opened. Something like this: writer.addNotifier(new WriterUpdateNotifier() { public void onUpdate(IndexWriter writer, IndexReader r) { // The reader and writer has been updated, rebuild the searchers readers[readers.length - 1] = r; try { reader = new MultiReader(readers); } catch (IOException e) { e.printStackTrace(); } reopenSearcher(); } }); This is currently working well in a production system and is working quite well. It has been load tested, and well, our users are load testing it for us as well :-). However, see my previous post about the ArrayIndexOutOfBoundsException, although I don't see how this could be the cause... but maybe, since nobody else gets the problem. However, I haven't modified the writer at all, and I am never modifying the index with the Reader. So feel free to tell me this is crazy... I'm just throwing it out there. Thanks 266,285d268 < /** Zigtag added **/ < private IndexReader reader; < < private List notifiers = new ArrayList(); < < public IndexReader getReader() throws IOException < { < if (reader == null) < { < reader = IndexReader.open(directory); < } < return reader; < } < < public void addNotifier(WriterUpdateNotifier notifier) < { < notifiers.add(notifier); < } < /** END Zigtag added **/ < 1857,1914c1840,1842 < // Zigtag added to this class: < void doAfterFlush() < throws IOException < { < final boolean closeDirectory = false; < reader = (IndexReader) new SegmentInfos.FindSegmentsFile(directory) < { < < protected Object doBody(String segmentFileName) throws CorruptIndexException, IOException < { < < SegmentInfos infos = segmentInfos; < //infos.read(directory, segmentFileName); < < IndexReader reader; < < if (infos.size() == 1) < { // index is optimized < reader = SegmentReader.get(infos, infos.info(0), closeDirectory); < } < else < { < < // To reduce the chance of hitting FileNotFound < // (and having to retry), we open segments in < // reverse because IndexWriter merges & deletes < // the newest segments first. 
< < IndexReader[] readers = new IndexReader[infos.size()]; < for (int i = infos.size() - 1; i >= 0; i--) < { < try < { < readers[i] = SegmentReader.get(infos.info(i)); < } < catch (IOException e) < { < // Close all readers we had opened: < for (i++; i < infos.size(); i++) < { < readers[i].close(); < } < throw e; < } <
Re: getting term offset information for fields with multiple value entiries
What version of Lucene are you using? On Aug 17, 2007, at 12:44 PM, [EMAIL PROTECTED] wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Hello community, dear Grant I have build a JUnit test case that illustrates the problem - there, I try to cut out the right substring with the offset values given from Lucene - and fail :( A few remarks: In this example, the 'é' from 'Bosé' makes that the '\w' pattern don't matches - it is recognized, unlike in StandardAnalyzer - as delimiter sign. Analysis: It seems that Lucene calculates the offset values by adding a virtual delimiter between every field value. But Lucene forgets the last characters of a field value when these are analyzer-specific delimiter values. (I seem this because of DocumentWriter, line 245: 'if(lastToken != null) offset += lastToken.endOffset() + 1;)' With this line of code, only the end offset of the last token is considered - by forgetting potential, trimmed delimiter chars. Thus, solving would be: 1. Add a single delimiter char between the field values 2. Substract (from the Lucene Offset) the count of analyzer- specific delimiters that are at the end of all field values before the match For this, someone needs to know what a delimiter for an specific analyzer is. The other possibility of course is to change the behaviour inside Lucene, because the current offset values are more or less useless / hard to use (I currently have no idea how to get analyzer-specific delimiter chars). For me, this looks like a bug - am I wrong? Any ideas/hints/remarks? I would be very lucky about :) Greetings Christian Grant Ingersoll schrieb: Hi Christian, Is there anyway you can post a complete, self-contained example preferably as a JUnit test? I think it would be useful to know more about how you are indexing (i.e. what Analyzer, etc.) The offsets should be taken from whatever is set in on the Token during Analysis. I, too, am trying to remember where in the code this is taking place Also, what version of Lucene are you using? -Grant On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote: Hello, I have an index with an 'actor' field, for each actor there exists an single field value entry, e.g. stored/ compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorP osition movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) movie_actors:Miguel Bosé movie_actors:Anna Lizaran (as Ana Lizaran) movie_actors:Raquel Sanchís movie_actors:Angelina Llongueras I try to get the term offset, e.g. for 'angelina' with termPositionVector = (TermPositionVector) reader.getTermFreqVector(docNumber, "movie_actors"); int iTermIndex = termPositionVector.indexOf("angelina"); TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex); I get one TermVectorOffsetInfo for the field - with offset numbers that are bigger than one single Field entry. I guessed that Lucene gives the offset number for the situation that all values were concatenated, which is for the single (virtual) string: movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras This fits in nearly no situation, so my second guess was that lucene adds some virtual delimiters between the single field entries for offset calculation. 
I added a delimiter, so the result would be: movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras (note the ' ' between each actor name) ..this also fits not for each situation - there are too much delimiters there now, so, further, I guessed that Lucene don't add a delimiter in each situation. So I added only one when the last character of an entry was no alphanumerical one, with: StringBuilder strbAttContent = new StringBuilder(); for (String strAttValue : m_luceneDocument.getValues(strFieldName)) { strbAttContent.append(strAttValue); if(strbAttContent.substring(strbAttContent.length() - 1).matches("\\w")) strbAttContent.append(' '); } where I get the result (virtual) entry: movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras this fits in ~96% of all my queriesbut still its not 100% the way lucene calculates the offset value for fields with multiple value entries. ..maybe the problem is that there are special characters inside my database (e.g. the 'é' at 'Bosé'), where my '\w' don't matches. I have looked to this specific situation, but considering this one character don't solves the problem. How do Lucene calculates these offsets? I also searched inside the source code, but can't find the correct place. Thanks in advance! Christian Reuschling -- _ _ Christian Reuschling, Dipl.-Ing.(BA) Software Engineer Knowledge Management Department German Research Center fo
RE: Issue with indexed tokens position
: My lucene query: fieldName:"pinki i" finds document. (see "i" in "pinki")

I'm guessing that in this debugging output you provided...

: > indexed value: pink-I
: > Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6]
: > (ex. explanation:
: > "pink" is a term "0->5" term-position)

...the "1" is the position of "pink", "2" is the position of "pinki", and "3" is the position of "i". The numbers you are referring to as term positions actually look like start and end offsets. The offsets aren't used in phrase queries -- only the positions.

Your problem appears to be that you are using a non-sloppy phrase query and expecting it to match two tokens with a position gap of 1 between them. You could either use sloppier queries (ie: "pink i"~2) or change your analyzer so the position increment between "pink" and "pinki" is 0.

-Hoss
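In code, the two options look roughly like this (the field name and terms are the ones from the thread, and the class is just a sketch):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PhraseQuery;
    import org.apache.lucene.search.Query;

    public class PhraseFixExamples {

        // Option 1, query side: the equivalent of fieldName:"pink i"~2,
        // which tolerates the one-position gap between "pink" and "i".
        public static Query sloppyPhrase() {
            PhraseQuery q = new PhraseQuery();
            q.add(new Term("fieldName", "pink"));
            q.add(new Term("fieldName", "i"));
            q.setSlop(2);
            return q;
        }

        // Option 2, index side: in the custom filter, give the concatenated
        // token a position increment of 0, e.g.
        //     pinkiToken.setPositionIncrement(0);
        // so "pink" and "pinki" share a position, "i" follows directly, and
        // the exact phrase "pink i" matches without any slop.
    }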
Re: formalizing a query
Hi,

I have done it using this:

final QueryParser filterQueryParser = new QueryParser("", new KeywordAnalyzer());
hits = indexSearcher.search(query, new QueryWrapperFilter(filterQueryParser.parse(filterQuery)));

where filterQuery = "(field1:query1 AND field2:query2) OR (field1:query3 AND field2:query4)"

If there are other methods that do this in a more professional way, please comment.

Thanks

Sagar Naik-2 wrote: > > Hey, > > I think u can try : > > MultiFieldQueryParser.parse(String[] queries, String[] fields, > BooleanClause.Occur[] flags, > Analyzer analyzer) > > The flags arrray will get u ORs and ANDs in places u need > > - Sagar Naik > > Abu Abdulla alhanbali wrote: >> Thanks for the help, >> >> please provide the code to do that. >> >> I tried with this one but it didn't work: >> >> Query filterQuery = MultiFieldQueryParser.parse(new String{query1, >> query2, >> query3, query4, }, new String{field1, field2, field1, field2, ... }, >> new KeywordAnalyzer()); >> >> this results in: >> >> field1:query1 OR field2:query2 OR >> field1:query3 OR field2:query4 ... etc >> >> and NOT: >> >> (field1:query1 AND field2:query2) OR >> (field1:query3 AND field2:query4) ... etc >> >> please help. >> >> >> On 8/10/07, Erick Erickson <[EMAIL PROTECTED]> wrote: >> >>> I *strongly* suggest you get a copy of Luke. It'll allow you to form >>> queries >>> and see the results and you can then answer this kind of question as >>> well >>> as many others. >>> >>> Meanwhile, please see >>> http://lucene.apache.org/java/docs/queryparsersyntax.html >>> >>> Erick >>> >>> On 8/10/07, Abu Abdulla alhanbali <[EMAIL PROTECTED]> wrote: >>> Hi, I need your help in formalizing this query: (field1:query1 AND field2:query2) OR (field1:query3 AND field2:query4) OR (field1:query5 AND field2:query6) OR (field1:query7 AND field2:query8) ... etc Please give the code since I'm new to lucene how we can use MultiFieldQueryParser or any parser to do the job greatly appreciated >>
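Another option is to skip the parser entirely and build the clauses programmatically with BooleanQuery; a sketch, where the pairs array and method name are illustrative. The result can be wrapped in a QueryWrapperFilter exactly like the parsed version above:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class PairQueryBuilder {

        // builds (field1:p[0] AND field2:p[1]) OR ... for each pair p
        public static BooleanQuery build(String field1, String field2, String[][] pairs) {
            BooleanQuery or = new BooleanQuery();
            for (int i = 0; i < pairs.length; i++) {
                BooleanQuery and = new BooleanQuery();
                and.add(new TermQuery(new Term(field1, pairs[i][0])), BooleanClause.Occur.MUST);
                and.add(new TermQuery(new Term(field2, pairs[i][1])), BooleanClause.Occur.MUST);
                or.add(and, BooleanClause.Occur.SHOULD);
            }
            return or;
        }
    }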
Re: query question
testn, here is my code but the thing is strange is that by Luke I can't reach my goal as well, look, I have a field (Indexed, Tokenized and Stored) this field has a wide variety of values from numbers to characters, I give the query patientResult:oxalate but the result is no document (using WhitespaceAnalyzer) but I expect to have values like Ca. Oxalate:few and Ca. Oxalate:many in following code, Context and Dispatcher are parts of interceptor pattern in which I change the given values if they are number and has nothing to do with queries with string values public class ExtendedQueryParser extends MultiFieldQueryParser { private Log logger = LogFactory.getLog(ExtendedQueryParser.class); /** * if true, overrides the getRangeQuery() method and treat with dates just like other strings, but * if false, everything will normally proceed just like its super class. */ private boolean asString; private Class clazz; public ExtendedQueryParser(String[] fields,Analyzer analyzer,Class clazz) { super(fields,analyzer); //this.asString = asString; this.clazz = clazz; } @Override protected org.apache.lucene.search.Query getRangeQuery(String field, String part1, String part2, boolean inclusive) throws ParseException { String val1 = part1; String val2 = part2; String fieldName = field; try { Dispatcher dispatcher = Dispatcher.getInstance(); Context c = new Context(); c.setClazz(clazz); c.setFieldData(MetadataHelper.getIndexField(clazz,field)); c.setValue(val1); dispatcher.beforeQuery(c); val1 = c.getWorkingValue(); c.setValue(val2); dispatcher.beforeQuery(c); val2 = c.getWorkingValue(); fieldName = c.getChangedFieldName(); logger.debug("Query text translated to "+fieldName+":["+val1+ " TO " + val2+"]"); } catch (Exception e) { e.printStackTrace(); } BooleanQuery.setMaxClauseCount(5120);//5 * 1024 return new RangeQuery(new Term(fieldName, val1),new Term(fieldName, val2),inclusive); } @Override protected org.apache.lucene.search.Query getFieldQuery(String field, String queryText) throws ParseException { logger.debug("FieldQuery no slop:"+queryText); String val = queryText; String fieldName = field; try { Dispatcher dispatcher = Dispatcher.getInstance(); Context c = new Context(); c.setClazz(clazz); c.setFieldData(MetadataHelper.getIndexField(clazz,field)); c.setValue(val); dispatcher.beforeQuery(c); val = c.getWorkingValue(); fieldName = c.getChangedFieldName(); logger.debug("Query text translated to "+fieldName+ ":" + val); } catch (Exception e) { e.printStackTrace(); } logger.debug("TermQuery..."); setLowercaseExpandedTerms(false); TermQuery termQuery = new TermQuery(new Term(fieldName, val)); return termQuery;//(field,val); } @Override protected org.apache.lucene.search.Query getFuzzyQuery(String arg0, String arg1, float arg2) throws ParseException { logger.debug("FuzzyQuery Text:"+arg1); return super.getFuzzyQuery(arg0, arg1, arg2); } @Override protected org.apache.lucene.search.Query getPrefixQuery(String field, String text) throws ParseException { logger.debug("PrefixQuery Text:"+text); //PrefixQuery prefixQuery = new PrefixQuery(new Term(field,text)); setLowercaseExpandedTerms(false); return super.getPrefixQuery(field,text); } @Override protected org.apache.lucene.search.Query getWildcardQuery(String field, String text) throws ParseException { logger.debug("WildcardQuery:"+text); setLowercaseExpandedTerms(false); //WildcardQuery doesn't need to perform any translation on its numbers return super.getWildcardQuery(field, text); } @Override protected Query getFieldQuery(String field, String queryText, int 
slop) throws ParseException { logger.debug("PhraseQuery :"+queryText+" with slop:"+slop); String val = queryText; String fieldName = field; try { Dispatcher dispatcher = Dispatcher.getInstance(); Context c = new Context(); c.setClazz(clazz); c.setFieldData(MetadataHelper.getIndexField(clazz,field)); c.setValue(val); dispatcher.beforeQuery(c); val = c.getWorkingValue(); fieldName = c.getChangedFieldName(); logger.debug("Query text translated to "+fieldName+":"+val+""); } catch (Exception e) { e.printStackTrace(); } PhraseQuery phraseQuery = new PhraseQuery(); phraseQuery.add(new Term(fieldName, val));
Deleting the result from a query or a filter and not a documents specified by Term
Hi,

Is there a way to delete the documents matched by a query or a filter, rather than documents specified by a Term? I have seen some explanations here but I do not know how to do it:
http://www.nabble.com/Batch-deletions-of-Records-from-index-tf615674.html#a1644740

Thanks in advance
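One way to do this with the 2.x API is to run the query, collect the matching internal document ids, and delete them through an IndexReader. A sketch, with the class and method names illustrative; note that no IndexWriter may hold the write lock on the index while the reader performs the deletes, and a Filter could be handled the same way by walking the BitSet returned by its bits(reader) method:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.Directory;

    public class DeleteByQuery {

        public static int deleteByQuery(Directory dir, Query query) throws IOException {
            IndexReader reader = IndexReader.open(dir);
            try {
                IndexSearcher searcher = new IndexSearcher(reader);
                Hits hits = searcher.search(query);

                // collect the ids first: Hits re-runs the search lazily while
                // you iterate, so deleting during iteration could skip matches
                int n = hits.length();
                int[] ids = new int[n];
                for (int i = 0; i < n; i++) {
                    ids[i] = hits.id(i);
                }
                for (int i = 0; i < n; i++) {
                    reader.deleteDocument(ids[i]);   // marks the doc as deleted
                }
                searcher.close();
                return n;
            } finally {
                reader.close();                      // flushes the deletions
            }
        }
    }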