Issue with indexed tokens position

2007-08-17 Thread Ramana Jelda
Hi,
Lucene doesn't find the following value. There seems to be an issue with PhraseQuery.

indexed value: pink-I
Indexed tokens: 1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6] (example explanation:
"pink" is a term, "0->5" its term position)

I have indexed this in a field called "fieldName".
My Lucene search with the query [fieldName:"pink i"] can't find the value
indexed above.

Can anyone help me out here?

Thx in advance,
Jelda






RE: Issue with indexed tokens position

2007-08-17 Thread Ramana Jelda
Strangely...
My Lucene query fieldName:"pinki i" finds the document (note the "i" in "pinki").

Jelda

> -Original Message-
> From: Ramana Jelda [mailto:[EMAIL PROTECTED] 
> Sent: Friday, August 17, 2007 12:33 PM
> To: java-user@lucene.apache.org
> Subject: Issue with indexed tokens position
> 
> Hi,
> Lucene doesn't find following value. Some issues with PhraseQuery.
> 
> indexed value: pink-I
> Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6] 
> (ex. explanation:
> "pink" is a term "0->5" term-position)
> 
> And I have indexed in a field called "fieldName".
> My lucene search with the query [fieldName:"pink i"] can't 
> find above indexed value.
> 
> Can anyone help me out here.
> 
> Thx in advance,
> Jelda
> 
> 
> 
> 





Re: Issue with indexed tokens position

2007-08-17 Thread Erick Erickson
You'd get much better answers if you posted a concise example
(or possibly code snippets), especially including the analyzers you
used.

Have you used Luke to examine your index and see if it's indexed as
you expect?

Best
Erick

On 8/17/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
>
> Strangely..
> My lucene query: fieldName:"pinki i"  finds document. (see "i"
> in  "pinki")
>
> Jelda
>
> > -Original Message-
> > From: Ramana Jelda [mailto:[EMAIL PROTECTED]
> > Sent: Friday, August 17, 2007 12:33 PM
> > To: java-user@lucene.apache.org
> > Subject: Issue with indexed tokens position
> >
> > Hi,
> > Lucene doesn't find following value. Some issues with PhraseQuery.
> >
> > indexed value: pink-I
> > Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6]
> > (ex. explanation:
> > "pink" is a term "0->5" term-position)
> >
> > And I have indexed in a field called "fieldName".
> > My lucene search with the query [fieldName:"pink i"] can't
> > find above indexed value.
> >
> > Can anyone help me out here.
> >
> > Thx in advance,
> > Jelda
> >
> >
> >
> >
>
>
>
>


RE: Issue with indexed tokens position

2007-08-17 Thread Ramana Jelda
Hi Erick,
Thanks.
Here I try my best to provide pseudo-code.

Indexed Value: "pink-i"

I have used a custom Analyzer. My analyzer looks a little bit like the
following:
public class KeyWordFilter extends TokenFilter {

    private final LinkedList keywordStack;

    public KeyWordFilter(TokenStream in) {
        super(in);
        keywordStack = new LinkedList();
    }

    public org.apache.lucene.analysis.Token next() throws IOException {
        // first drain any tokens left over from the previous input token
        if (keywordStack.size() > 0) {
            return (Token) keywordStack.poll();
        }
        Token token = input.next();   // e.g. token = "pink-i"
        if (token == null) {
            return null;
        }
        makeTokens(token);
        return (Token) keywordStack.poll();
    }

    void makeTokens(Token token) {
        // make the following tokens and add them to the stack:
        // [(pink,0,5,type=HYPENWORD_DIVIDED),
        //  (pinki,0,5,type=HYPENWORD_DIVIDED,posIncr=0),
        //  (i,5,6,type=HYPENWORD_DIVIDED)]
    }
}


I am 100% sure that there is a problem with token positions: the PhraseQuery
"pink i" is not working, whereas the PhraseQuery "pinki i" works.
It seems the positions are totally ignored by PhraseQuery.

Any thoughts?

Thx,
Jelda
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED] 
> Sent: Friday, August 17, 2007 3:31 PM
> To: java-user@lucene.apache.org
> Subject: Re: Issue with indexed tokens position
> 
> You'd get much better answers if you posted a concise example 
> (or possibly code snippets), especially including the 
> analyzers you used.
> 
> Have you used Luke to examine your index and see if it's 
> indexed as you expect?
> 
> Best
> Erick
> 
> On 8/17/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> >
> > Strangely..
> > My lucene query: fieldName:"pinki i"  finds document. (see "i"
> > in  "pinki")
> >
> > Jelda
> >
> > > -Original Message-
> > > From: Ramana Jelda [mailto:[EMAIL PROTECTED]
> > > Sent: Friday, August 17, 2007 12:33 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Issue with indexed tokens position
> > >
> > > Hi,
> > > Lucene doesn't find following value. Some issues with PhraseQuery.
> > >
> > > indexed value: pink-I
> > > Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6] (ex. 
> > > explanation:
> > > "pink" is a term "0->5" term-position)
> > >
> > > And I have indexed in a field called "fieldName".
> > > My lucene search with the query [fieldName:"pink i"] can't find 
> > > above indexed value.
> > >
> > > Can anyone help me out here.
> > >
> > > Thx in advance,
> > > Jelda
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> 





Re: getting term offset information for fields with multiple value entries

2007-08-17 Thread duiduder
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hello community, dear Grant

I have built a JUnit test case that illustrates the problem - there, I try to
cut out the right substring with the offset values given by Lucene - and fail :(

A few remarks:

In this example, the 'é' from 'Bosé' causes the '\w' pattern not to match - it
is treated, unlike in StandardAnalyzer, as a delimiter character.

Analysis: it seems that Lucene calculates the offset values by adding a virtual
delimiter between every field value.
But Lucene forgets the last characters of a field value when these are
analyzer-specific delimiter characters. (I assume this because of DocumentWriter,
line 245: 'if (lastToken != null) offset += lastToken.endOffset() + 1;')
With this line of code, only the end offset of the last token is considered, so
any trailing delimiter characters that were trimmed away are lost.

Thus, a solution would be:
1. Add a single delimiter char between the field values
2. Subtract (from the Lucene offset) the count of analyzer-specific delimiters
   that are at the end of all field values before the match

For this, one needs to know what counts as a delimiter for a specific analyzer.

The other possibility, of course, is to change the behaviour inside Lucene,
because the current offset values are more or less useless / hard to use (I
currently have no idea how to get the analyzer-specific delimiter chars).

For me, this looks like a bug - am I wrong?

Any ideas/hints/remarks? I would be very happy about that :)

Greetings

Christian



Grant Ingersoll schrieb:
> Hi Christian,
> 
> Is there anyway you can post a complete, self-contained example
> preferably as a JUnit test?  I think it would be useful to know more
> about how you are indexing (i.e. what Analyzer, etc.)
> The offsets should be taken from whatever is set in on the Token during
> Analysis.  I, too, am trying to remember where in the code this is
> taking place
> 
> Also, what version of Lucene are you using?
> 
> -Grant
> 
> On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:
> 
> Hello,
> 
> I have an index with an 'actor' field, for each actor there exists an
> single field value entry, e.g.
> 
> stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition
> 
> 
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
> movie_actors:Miguel Bosé
> movie_actors:Anna Lizaran (as Ana Lizaran)
> movie_actors:Raquel Sanchís
> movie_actors:Angelina Llongueras
> 
> I try to get the term offset, e.g. for 'angelina' with
> 
> termPositionVector = (TermPositionVector)
> reader.getTermFreqVector(docNumber, "movie_actors");
> int iTermIndex = termPositionVector.indexOf("angelina");
> TermVectorOffsetInfo[] termOffsets =
> termPositionVector.getOffsets(iTermIndex);
> 
> 
> I get one TermVectorOffsetInfo for the field - with offset numbers
> that are bigger than one single
> Field entry.
> I guessed that Lucene gives the offset number for the situation that
> all values were concatenated,
> which is for the single (virtual) string:
> 
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
> Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras
> 
> This fits in nearly no situation, so my second guess was that lucene
> adds some virtual delimiters between the single
> field entries for offset calculation. I added a delimiter, so the
> result would be:
> 
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna
> Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras
> (note the ' ' between each actor name)
> 
> ..this also fits not for each situation - there are too much
> delimiters there now, so, further, I guessed that Lucene don't add
> a delimiter in each situation. So I added only one when the last
> character of an entry was no alphanumerical one, with:
> StringBuilder strbAttContent = new StringBuilder();
> for (String strAttValue : m_luceneDocument.getValues(strFieldName))
> {
>strbAttContent.append(strAttValue);
>if(strbAttContent.substring(strbAttContent.length() -
> 1).matches("\\w"))
>   strbAttContent.append(' ');
> }
> 
> where I get the result (virtual) entry:
> movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna
> Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras
> 
> this fits in ~96% of all my queriesbut still its not 100% the way
> lucene calculates the offset value for fields with multiple
> value entries.
> 
> 
> ..maybe the problem is that there are special characters inside my
> database (e.g. the 'é' at 'Bosé'), where my '\w' don't matches.
> I have looked to this specific situation, but considering this one
> character don't solves the problem.
> 
> 
> How do Lucene calculates these offsets? I also searched inside the
> source code, but can't find the correct place.
> 
> 
> Thanks in advance!
> 
> Christian Reuschling
> 
> 
> 
> 
> 
> --
> __
> 
> Christian Reuschling, Dipl.-Ing.(BA)
> Software E

ArrayIndexOutOfBoundsException

2007-08-17 Thread karl wettin
When I add a field containing a really long term I get an AIOOBE. Is  
this a documented feature?


  public static void main(String[] args) throws Exception {
    RAMDirectory dir = new RAMDirectory();
    IndexWriter iw = new IndexWriter(dir,
        new StandardAnalyzer(Collections.emptySet()), true);

    StringBuffer buf = new StringBuffer(65535);
    for (int i = 0; i < 32767; i++) {
      buf.append("ha");
    }
    Document doc = new Document();
    doc.add(new Field("f", "three tokens here " + buf.toString(),
        Field.Store.NO, Field.Index.TOKENIZED));

    iw.addDocument(doc);
    iw.close();
    dir.close();
  }

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.addPosition(DocumentsWriter.java:1462)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.invertField(DocumentsWriter.java:1285)
	at org.apache.lucene.index.DocumentsWriter$ThreadState$FieldData.processField(DocumentsWriter.java:1215)
	at org.apache.lucene.index.DocumentsWriter$ThreadState.processDocument(DocumentsWriter.java:936)
	at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:2147)






Custom SynonymMap

2007-08-17 Thread Antonius Ng
Hi all,

I'd like to add more words into SynonymMap for my application, but the
HashMap that holds all the words is not visible (private).

Is there any other class that I can use to implement a SynonymAnalyzer? I am
using Lucene version 2.2.0.

Antonius Ng


Re: Issue with indexed tokens position

2007-08-17 Thread Erick Erickson
Sure. I'd recommend that you start by taking out your custom
tokenizer and looking at what Lucene does rather than what you've
tried to emulate. For instance, the StandardTokenizer returns start
offsets that are one more than the end offset of the previous token. That is,
the following program (Lucene 2.1)


import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;


public class Analysis
{
    public static void main(String[] args)
    {
        try {
            Reader r = new StringReader("this is some text");
            Tokenizer tzer = new StandardTokenizer(r);

            Token t;

            while ((t = tzer.next()) != null) {
                System.out.println(
                    String.format(
                        "Text: %s, start: %d, end: %d",
                        t.termText(),
                        t.startOffset(),
                        t.endOffset()));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

outputs:
Text: this, start: 0, end: 4
Text: is, start: 5, end: 7
Text: some, start: 8, end: 12
Text: text, start: 13, end: 17


Which, if I'm reading your code correctly, is different: in your example the
end offset of one token is the same as the start offset of the next token.
So the off-by-one error you're claiming is perhaps the result of
an off-by-one error in your tokenizer.

In general, a lot of people depend on offset positions and phrase queries,
so I'd be very surprised if something this basic is out there without anyone
being aware of it. But you never know.

Of course, I may be way off. If so, can you post a self-contained program
using standard analyzers/tokenizers illustrating the problem? Most often,
when I try to create such a thing I can't, and it then points me back to
my own code...

Best
Erick

On 8/17/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
>
> Hi Erick,
> Thanks.
> Here I try here my best to provide Pseudo code.
>
> Indexed Value: "pink-i"
>
> I have used a Custom Analyzer. My Analyzer looks a littlebit like
> following..
> public class KeyWordFilter extends TokenFilter{
> public KeyWordFilter(TokenStream in) {
> super(in);
> keywordStack = new LinkedList();
> }
>
> org.apache.lucene.analysis.Token next(){
> if(keywordStack.size() > 0){
> return (Token) keywordStack.poll();
> }
> //token = "pink-i"
> makeTokens(token);
> }
>
> void makeTokens(Token token){
> //make following tokens and add to stack..
> //[(pink,0,5,type=HYPENWORD_DIVIDED),
> (pinki,0,5,type=HYPENWORD_DIVIDED,posIncr=0),
> (i,5,6,type=HYPENWORD_DIVIDED)]
> }
> }
>
>
> I am 100% sure that there is a problem with token-positions. And
> PhraseQuery
> "pink i" is not working where as PhraseQuery "pinki i" works.
> And it seems positions are totally ignored by PhraseQuery.
>
> Any thoughts?
>
> Thx,
> Jelda
> > -Original Message-
> > From: Erick Erickson [mailto:[EMAIL PROTECTED]
> > Sent: Friday, August 17, 2007 3:31 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Issue with indexed tokens position
> >
> > You'd get much better answers if you posted a concise example
> > (or possibly code snippets), especially including the
> > analyzers you used.
> >
> > Have you used Luke to examine your index and see if it's
> > indexed as you expect?
> >
> > Best
> > Erick
> >
> > On 8/17/07, Ramana Jelda <[EMAIL PROTECTED]> wrote:
> > >
> > > Strangely..
> > > My lucene query: fieldName:"pinki i"  finds document. (see "i"
> > > in  "pinki")
> > >
> > > Jelda
> > >
> > > > -Original Message-
> > > > From: Ramana Jelda [mailto:[EMAIL PROTECTED]
> > > > Sent: Friday, August 17, 2007 12:33 PM
> > > > To: java-user@lucene.apache.org
> > > > Subject: Issue with indexed tokens position
> > > >
> > > > Hi,
> > > > Lucene doesn't find following value. Some issues with PhraseQuery.
> > > >
> > > > indexed value: pink-I
> > > > Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6] (ex.
> > > > explanation:
> > > > "pink" is a term "0->5" term-position)
> > > >
> > > > And I have indexed in a field called "fieldName".
> > > > My lucene search with the query [fieldName:"pink i"] can't find
> > > > above indexed value.
> > > >
> > > > Can anyone help me out here.
> > > >
> > > > Thx in advance,
> > > > Jelda
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > >

Re: ArrayIndexOutOfBoundsException

2007-08-17 Thread Michael McCandless

Hmmm ... good catch.  With DocumentsWriter there is a max term length
(currently 16384 chars).  I think we should fix it to raise a clearer
exception?  I'll open an issue.
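
(Not part of Mike's reply: until that's fixed, one workaround is to drop
oversized terms at analysis time. A minimal sketch, assuming Lucene 2.x where
org.apache.lucene.analysis.LengthFilter is available; the class name
MaxTermLengthAnalyzer and the 16383-char cutoff are illustrative choices.)

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LengthFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Wraps StandardAnalyzer and silently skips terms longer than 16383 chars,
// keeping every indexed term below DocumentsWriter's current limit.
public class MaxTermLengthAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LengthFilter(delegate.tokenStream(fieldName, reader), 1, 16383);
    }
}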

Mike

On Fri, 17 Aug 2007 19:53:09 +0200, "karl wettin" <[EMAIL PROTECTED]> said:
> When I add a field containing a really long term I get an AIOOBE. Is  
> this a documented feature?
> 
>public static void main(String[] args) throws Exception {
>  RAMDirectory dir = new RAMDirectory();
>  IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer 
> (Collections.emptySet()), true);
>  StringBuffer buf = new StringBuffer(65535);
>  for (int i=0; i<32767; i++) {
>buf.append("ha");
>  }
>  Document doc = new Document();
>  doc.add(new Field("f", "three tokens here " + buf.toString(),  
> Field.Store.NO, Field.Index.TOKENIZED));
>  iw.addDocument(doc);
>  iw.close();
>  dir.close();
>}
> 
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at org.apache.lucene.index.DocumentsWriter$ThreadState 
> $FieldData.addPosition(DocumentsWriter.java:1462)
>   at org.apache.lucene.index.DocumentsWriter$ThreadState 
> $FieldData.invertField(DocumentsWriter.java:1285)
>   at org.apache.lucene.index.DocumentsWriter$ThreadState 
> $FieldData.processField(DocumentsWriter.java:1215)
>   at org.apache.lucene.index.DocumentsWriter 
> $ThreadState.processDocument(DocumentsWriter.java:936)
>   at org.apache.lucene.index.DocumentsWriter.addDocument 
> (DocumentsWriter.java:2147)
> 
> 
> 




Re: ArrayIndexOutOfBoundsException

2007-08-17 Thread Erick Erickson
Ignore the part about "much longer strings", I overlooked that this
was a single term

But it still works on my machine, Lucene 2.1...

Erick

On 8/17/07, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
>
> Hmmm ... good catch.  With DocumentsWriter there is a max term length
> (currently 16384 chars).  I think we should fix it to raise a clearer
> exception?  I'll open an issue.
>
> Mike
>
> On Fri, 17 Aug 2007 19:53:09 +0200, "karl wettin" <[EMAIL PROTECTED]>
> said:
> > When I add a field containing a really long term I get an AIOOBE. Is
> > this a documented feature?
> >
> >public static void main(String[] args) throws Exception {
> >  RAMDirectory dir = new RAMDirectory();
> >  IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer
> > (Collections.emptySet()), true);
> >  StringBuffer buf = new StringBuffer(65535);
> >  for (int i=0; i<32767; i++) {
> >buf.append("ha");
> >  }
> >  Document doc = new Document();
> >  doc.add(new Field("f", "three tokens here " + buf.toString(),
> > Field.Store.NO, Field.Index.TOKENIZED));
> >  iw.addDocument(doc);
> >  iw.close();
> >  dir.close();
> >}
> >
> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
> >   at java.lang.System.arraycopy(Native Method)
> >   at org.apache.lucene.index.DocumentsWriter$ThreadState
> > $FieldData.addPosition(DocumentsWriter.java:1462)
> >   at org.apache.lucene.index.DocumentsWriter$ThreadState
> > $FieldData.invertField(DocumentsWriter.java:1285)
> >   at org.apache.lucene.index.DocumentsWriter$ThreadState
> > $FieldData.processField(DocumentsWriter.java:1215)
> >   at org.apache.lucene.index.DocumentsWriter
> > $ThreadState.processDocument(DocumentsWriter.java:936)
> >   at org.apache.lucene.index.DocumentsWriter.addDocument
> > (DocumentsWriter.java:2147)
> >
> >
> >
>
>
>


Re: ArrayIndexOutOfBoundsException

2007-08-17 Thread Erick Erickson
I've added MUCH larger strings to a document without any problem,
but it was an FSDir. I admit that it is kind of "interesting" that this
happens just as you cross the magic number.

But I tried it on my machine and it works just fine, go figure ..

Erick

On 8/17/07, karl wettin <[EMAIL PROTECTED]> wrote:
>
> When I add a field containing a really long term I get an AIOOBE. Is
> this a documented feature?
>
>public static void main(String[] args) throws Exception {
>  RAMDirectory dir = new RAMDirectory();
>  IndexWriter iw = new IndexWriter(dir, new StandardAnalyzer
> (Collections.emptySet()), true);
>  StringBuffer buf = new StringBuffer(65535);
>  for (int i=0; i<32767; i++) {
>buf.append("ha");
>  }
>  Document doc = new Document();
>  doc.add(new Field("f", "three tokens here " + buf.toString(),
> Field.Store.NO, Field.Index.TOKENIZED));
>  iw.addDocument(doc);
>  iw.close();
>  dir.close();
>}
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
> at java.lang.System.arraycopy(Native Method)
> at org.apache.lucene.index.DocumentsWriter$ThreadState
> $FieldData.addPosition(DocumentsWriter.java:1462)
> at org.apache.lucene.index.DocumentsWriter$ThreadState
> $FieldData.invertField(DocumentsWriter.java:1285)
> at org.apache.lucene.index.DocumentsWriter$ThreadState
> $FieldData.processField(DocumentsWriter.java:1215)
> at org.apache.lucene.index.DocumentsWriter
> $ThreadState.processDocument(DocumentsWriter.java:936)
> at org.apache.lucene.index.DocumentsWriter.addDocument
> (DocumentsWriter.java:2147)
>
>
>
>


Re: Custom SynonymMap

2007-08-17 Thread Erick Erickson
Try searching the mail archives for SynonymMap, as I know this was
discussed a while ago but don't remember the specifics.
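
(Not part of Erick's reply: since the contrib SynonymMap's internal HashMap is
private, a common alternative is a small TokenFilter backed by your own Map. A
minimal sketch against the Lucene 2.2 Token/TokenFilter API; SimpleSynonymFilter
and the word -> List-of-synonyms Map layout are illustrative, not existing
Lucene classes.)

import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Emits each token unchanged, then any synonyms from a caller-supplied Map
// (word -> List of synonym strings) at the same position (increment 0), so
// phrase queries keep working.
public class SimpleSynonymFilter extends TokenFilter {

    private final Map synonyms;
    private final LinkedList pending = new LinkedList();

    public SimpleSynonymFilter(TokenStream in, Map synonyms) {
        super(in);
        this.synonyms = synonyms;
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        List syns = (List) synonyms.get(token.termText());
        if (syns != null) {
            for (Iterator it = syns.iterator(); it.hasNext();) {
                Token syn = new Token((String) it.next(),
                        token.startOffset(), token.endOffset(), "SYNONYM");
                syn.setPositionIncrement(0);   // stack on top of the original
                pending.add(syn);
            }
        }
        return token;
    }
}

A custom Analyzer would then wrap its TokenStream with
new SimpleSynonymFilter(stream, yourMap).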

Erick



On 8/17/07, Antonius Ng <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> I'd like to add more words into SynonymMap for my application, but the
> HashMap that holds all the words is not visible (private).
>
> Is there any other Class that I can use to implement SynonymAnalyzer? I am
> using Lucene version 2.2.0
>
> Antonius Ng
>


Lucene and DRBD

2007-08-17 Thread Jeff Gutierrez
I'm currently trying to figure out how I could provide Lucene-based search
functionality to an existing system. Though the application is hosted on
multiple boxes, they do NOT share a SAN where we could put the index directory.
Each of the nodes needs to update Lucene documents, but it's not going to be a
common use case -- probably 100x a day out of the 7-8M documents.

Has anyone here tried storing the Lucene index on top of DRBD? I'm curious to 
hear your experience in setting up and maintaining such a solution. Were there 
any performance issues?

DRBD
http://en.wikipedia.org/wiki/DRBD

Thanks,

Jeff




Re: Document Similarities lucene(particularly using doc id's)

2007-08-17 Thread Grant Ingersoll

Hi,


On Aug 16, 2007, at 2:20 PM, Lokeya wrote:



Hi All,

I have the following set up: a) indexed a set of docs, b) ran the 1st query and
got the top docs, c) fetched the ids from those and stored them in a data
structure,

d) ran the 2nd query, got the top docs, fetched the ids and stored them in a
data structure.

Now I have 2 sets of doc ids (set 1) and (set 2).

I want to find out the document content similarity between these 2 sets (just
using the doc id information which I have).



Not sure what you mean here.  What do the doc ids have to do with the  
content?


Qn 1: Is it possible using any Lucene APIs? In that case, can you point me
to the appropriate APIs. I did a search at
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html
but couldn't find anything.



It is possible if you use term vectors (see IndexReader.getTermFreqVector).
You will need to store the term vectors (when you construct your Field), load
them, and then calculate the similarity. A common way of doing this is by
calculating the cosine of the angle between the two vectors.
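
(Not part of Grant's reply: a rough sketch of that approach, assuming the field
was indexed with term vectors enabled, e.g. Field.TermVector.YES, and using the
Lucene 2.2 TermFreqVector API; the class name DocCosine is illustrative, and raw
term frequencies are used where a TF-IDF weighting would be a refinement.)

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class DocCosine {

    // Cosine similarity between the term vectors of two documents for one
    // field; assumes term vectors were stored for that field.
    public static double cosine(IndexReader reader, int docA, int docB,
                                String field) throws IOException {
        Map a = toMap(reader.getTermFreqVector(docA, field));
        Map b = toMap(reader.getTermFreqVector(docB, field));
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Object term : a.keySet()) {
            int fa = ((Integer) a.get(term)).intValue();
            normA += fa * fa;
            Integer fb = (Integer) b.get(term);
            if (fb != null) {
                dot += fa * fb.intValue();
            }
        }
        for (Object freq : b.values()) {
            int fb = ((Integer) freq).intValue();
            normB += fb * fb;
        }
        return (normA == 0.0 || normB == 0.0)
                ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static Map toMap(TermFreqVector tfv) {
        Map map = new HashMap();
        String[] terms = tfv.getTerms();
        int[] freqs = tfv.getTermFrequencies();
        for (int i = 0; i < terms.length; i++) {
            map.put(terms[i], new Integer(freqs[i]));
        }
        return map;
    }
}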


-Grant

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Modification to IndexWriter / IndexReader

2007-08-17 Thread Scott Montgomerie
I've noticed a few threads on this so far... maybe it's useful or maybe
somebody's already done this, or maybe it's insane and bug-prone.

Anyways, our application requires Lucene to act as a non-critical
database, in that each record is composed of denormalized data derived
from the real DBMS. The index can be regenerated at any time from the
database. However, information added to the index must be searchable
immediately after being added. The index is written to concurrently by
many users, so flushing the IndexWriter to disk and re-opening an
IndexReader is not really feasible. Therefore, I worked up this hack
to compensate.

Note that this solution precludes multiple readers from reading an
index.  Also, a reader cannot be allowed to delete documents (but
really, why can you delete using a reader, anyway?  Or has this been
deprecated?)

Essentially, an IndexWriter owns an IndexReader, and to obtain a reader,
you call IndexWriter.getReader(). Whenever the writer is written to, a
new reader is formed from the IndexWriter's SegmentInfos (since
a reader and a writer essentially share copies of these structures
anyway). It's essentially an in-memory swap rather than reading the
segment infos back from disk after the writer has written them.

I've attached the patch based on the current dev code.  Basically it
implements doAfterFlush(), and adds getReader() and addNotifier()
methods.  The notifier is simply so that anybody using a Searcher can be
notified that the underlying reader has changed, and the Searcher should
be re-opened.

Something like this:

writer.addNotifier(new WriterUpdateNotifier()
{
    public void onUpdate(IndexWriter writer, IndexReader r)
    {
        // The reader and writer have been updated; rebuild the searchers
        readers[readers.length - 1] = r;
        try
        {
            reader = new MultiReader(readers);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
        reopenSearcher();
    }
});

This is currently working well in a production system. It has been load
tested, and, well, our users are load testing it for us as well :-). However,
see my previous post about the ArrayIndexOutOfBoundsException, although I
don't see how this could be the cause... but maybe, since nobody else gets
the problem. I haven't modified the writer at all, and I am never modifying
the index with the Reader.

So feel free to tell me this is crazy... I'm just throwing it out there.

Thanks
266,285d268
< /** Zigtag added **/
< private IndexReader reader;
< 
< private List notifiers = new ArrayList();
< 
< public IndexReader getReader() throws IOException
< {
< if (reader == null)
< {
< reader = IndexReader.open(directory);
< }
< return reader;
< }
< 
< public void addNotifier(WriterUpdateNotifier notifier)
< {
< notifiers.add(notifier);
< }
< /** END Zigtag added **/
< 
1857,1914c1840,1842
<   // Zigtag added to this class:
< void doAfterFlush()
< throws IOException
< {
< final boolean closeDirectory = false;
< reader = (IndexReader) new SegmentInfos.FindSegmentsFile(directory)
< {
< 
< protected Object doBody(String segmentFileName) throws CorruptIndexException, IOException
< {
< 
< SegmentInfos infos = segmentInfos;
< //infos.read(directory, segmentFileName);
< 
< IndexReader reader;
< 
< if (infos.size() == 1)
< {  // index is optimized
< reader = SegmentReader.get(infos, infos.info(0), closeDirectory);
< }
< else
< {
< 
< // To reduce the chance of hitting FileNotFound
< // (and having to retry), we open segments in
< // reverse because IndexWriter merges & deletes
< // the newest segments first.
< 
< IndexReader[] readers = new IndexReader[infos.size()];
< for (int i = infos.size() - 1; i >= 0; i--)
< {
< try
< {
< readers[i] = SegmentReader.get(infos.info(i));
< }
< catch (IOException e)
< {
< // Close all readers we had opened:
< for (i++; i < infos.size(); i++)
< {
< readers[i].close();
< }
< throw e;
< }
< 

Re: getting term offset information for fields with multiple value entries

2007-08-17 Thread Grant Ingersoll

What version of Lucene are you using?



RE: Issue with indexed tokens position

2007-08-17 Thread Chris Hostetter

: My lucene query: fieldName:"pinki i"  finds document. (see "i" in  "pinki")

I'm guessing that in this debugging output you provided...

: > indexed value: pink-I
: > Indexed tokens:1: [pink:0->5] 2: [pinki:0->5] 3: [i:5->6]
: > (ex. explanation:
: > "pink" is a term "0->5" term-position)

...that the "1" is the position of "pink", "2" is the position of "pinki",
and "3" is the position of "i" ... the numbers you are refering to as
term-positions actually look like start and end offsets.

The offsets aren't used in phrase queries -- only the positions. Your
problem appears to be that you are using a non-sloppy phrase query and
expecting it to match two tokens with a position gap of 1 between them.

You could either use sloppier queries (i.e. "pink i"~2) or change your
analyzer so the position increment between "pink" and "pinki" is 0.
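
(Not part of Hoss's reply: a sketch of that second option inside the custom
filter's makeTokens(), assuming the Lucene 2.2 Token API; variable names are
illustrative.)

// Split "pink-i" into three stacked tokens: "pinki" shares the position of
// "pink" (position increment 0), and "i" advances by one, so the phrase
// query "pink i" sees "pink" and "i" at adjacent positions.
Token pink  = new Token("pink",  0, 5, "HYPENWORD_DIVIDED");
Token pinki = new Token("pinki", 0, 5, "HYPENWORD_DIVIDED");
pinki.setPositionIncrement(0);
Token tokenI = new Token("i", 5, 6, "HYPENWORD_DIVIDED");

keywordStack.add(pink);
keywordStack.add(pinki);
keywordStack.add(tokenI);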




-Hoss





Re: formalizing a query

2007-08-17 Thread Abu Abdulla

Hi,

I have done it using this:

final QueryParser filterQueryParser = new QueryParser("", new KeywordAnalyzer());

hits = indexSearcher.search(query,
    new QueryWrapperFilter(filterQueryParser.parse(filterQuery)));

where filterQuery = "(field1:query1 AND field2:query2) OR (field1:query3 AND
field2:query4)"

If there are other methods that can do it in a more professional way, please
comment.

Thanks


Sagar Naik-2 wrote:
> 
> Hey,
> 
> I think u can try :
> 
> MultiFieldQueryParser.parse(String[] queries, String[] fields, 
> BooleanClause.Occur[] flags,
>   Analyzer analyzer)
> 
> The flags arrray will get u ORs and ANDs in places u need
> 
> - Sagar Naik
> 
> Abu Abdulla alhanbali wrote:
>> Thanks for the help,
>>
>> please provide the code to do that.
>>
>> I tried with this one but it didn't work:
>>
>> Query filterQuery = MultiFieldQueryParser.parse(new String{query1,
>> query2,
>> query3, query4,  }, new String{field1, field2, field1, field2, ... },
>> new KeywordAnalyzer());
>>
>> this results in:
>>
>> field1:query1 OR field2:query2 OR
>> field1:query3 OR field2:query4 ... etc
>>
>> and NOT:
>>
>> (field1:query1 AND field2:query2) OR
>> (field1:query3 AND field2:query4) ... etc
>>
>> please help.
>>
>>
>> On 8/10/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>>   
>>> I *strongly* suggest you get a copy of Luke. It'll allow you to form
>>> queries
>>> and see the results and you can then answer this kind of question as
>>> well
>>> as many others.
>>>
>>> Meanwhile, please see
>>> http://lucene.apache.org/java/docs/queryparsersyntax.html
>>>
>>> Erick
>>>
>>> On 8/10/07, Abu Abdulla alhanbali <[EMAIL PROTECTED]> wrote:
>>> 
 Hi,

 I need your help in formalizing this query:

 (field1:query1 AND field2:query2) OR
 (field1:query3 AND field2:query4) OR
 (field1:query5 AND field2:query6) OR
 (field1:query7 AND field2:query8) ... etc

 Please give the code since I'm new to lucene
 how we can use MultiFieldQueryParser or any parser to do the job

 greatly appreciated

   
>>
>>   
> 
> 
> -- 
> Always vizz it us @ visvo.com
> 
> 
> -- 
> This message has been scanned for viruses and
> dangerous content and is believed to be clean.
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/formalizing-a-query-tf4246564.html#a12210481
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: query question

2007-08-17 Thread Mohammad Norouzi
testn,

here is my code, but the strange thing is that I can't reach my goal with Luke
either.

Look, I have a field (indexed, tokenized and stored); this field has a wide
variety of values, from numbers to characters. I give the query
patientResult:oxalate but the result is no documents (using
WhitespaceAnalyzer), although I expect to get values like "Ca. Oxalate:few" and
"Ca. Oxalate:many".
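
(Not part of the original message: a small sketch of how WhitespaceAnalyzer
tokenizes such a value, which may explain the empty result; it assumes the
Lucene 2.2 WhitespaceTokenizer/Token API and a hypothetical class name.)

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class WhitespaceCheck {
    public static void main(String[] args) throws Exception {
        Tokenizer tzer = new WhitespaceTokenizer(new StringReader("Ca. Oxalate:few"));
        Token t;
        while ((t = tzer.next()) != null) {
            // prints "Ca." and "Oxalate:few" -- WhitespaceAnalyzer splits only
            // on whitespace and keeps case, so no term "oxalate" gets indexed.
            System.out.println(t.termText());
        }
    }
}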

In the following code, Context and Dispatcher are parts of an interceptor
pattern in which I change the given values if they are numbers; they have
nothing to do with queries on string values.


public class ExtendedQueryParser extends MultiFieldQueryParser {
private Log logger = LogFactory.getLog(ExtendedQueryParser.class);
/**
 * if true, overrides the getRangeQuery() method and treat with dates
just like other strings, but
 * if false, everything will normally proceed just like its super class.

 */
private boolean asString;
private Class clazz;

public ExtendedQueryParser(String[] fields,Analyzer analyzer,Class
clazz) {
super(fields,analyzer);
//this.asString = asString;
this.clazz = clazz;
}

@Override
protected org.apache.lucene.search.Query getRangeQuery(String field,
String part1, String part2, boolean inclusive) throws ParseException {
String val1 = part1;
String val2 = part2;
String fieldName = field;
try {
Dispatcher dispatcher = Dispatcher.getInstance();
Context c = new Context();
c.setClazz(clazz);
c.setFieldData(MetadataHelper.getIndexField(clazz,field));
c.setValue(val1);
dispatcher.beforeQuery(c);
val1 = c.getWorkingValue();

c.setValue(val2);
dispatcher.beforeQuery(c);
val2 = c.getWorkingValue();
fieldName = c.getChangedFieldName();
logger.debug("Query text translated to "+fieldName+":["+val1+ "
TO " + val2+"]");

} catch (Exception e) {
e.printStackTrace();
}

BooleanQuery.setMaxClauseCount(5120);//5 * 1024
return new RangeQuery(new Term(fieldName, val1),new Term(fieldName,
val2),inclusive);
}

@Override
protected org.apache.lucene.search.Query getFieldQuery(String field,
String queryText) throws ParseException {
logger.debug("FieldQuery no slop:"+queryText);
String val = queryText;
String fieldName = field;
try {
Dispatcher dispatcher = Dispatcher.getInstance();
Context c = new Context();
c.setClazz(clazz);
c.setFieldData(MetadataHelper.getIndexField(clazz,field));
c.setValue(val);
dispatcher.beforeQuery(c);
val = c.getWorkingValue();
fieldName = c.getChangedFieldName();
logger.debug("Query text translated to "+fieldName+ ":" + val);

} catch (Exception e) {
e.printStackTrace();
}

logger.debug("TermQuery...");
setLowercaseExpandedTerms(false);
TermQuery termQuery = new TermQuery(new Term(fieldName, val));

return termQuery;//(field,val);
}

@Override
protected org.apache.lucene.search.Query getFuzzyQuery(String arg0,
String arg1, float arg2) throws ParseException {
logger.debug("FuzzyQuery Text:"+arg1);
return super.getFuzzyQuery(arg0, arg1, arg2);
}

@Override
protected org.apache.lucene.search.Query getPrefixQuery(String field,
String text) throws ParseException {
logger.debug("PrefixQuery Text:"+text);
//PrefixQuery prefixQuery = new PrefixQuery(new Term(field,text));
setLowercaseExpandedTerms(false);
return super.getPrefixQuery(field,text);
}

@Override
protected org.apache.lucene.search.Query getWildcardQuery(String field,
String text) throws ParseException {
logger.debug("WildcardQuery:"+text);
setLowercaseExpandedTerms(false);
        // WildcardQuery doesn't need to perform any translation on its numbers
return super.getWildcardQuery(field, text);
}

@Override
protected Query getFieldQuery(String field, String queryText, int slop)
throws ParseException {
logger.debug("PhraseQuery :"+queryText+" with slop:"+slop);
String val = queryText;
String fieldName = field;
try {
Dispatcher dispatcher = Dispatcher.getInstance();
Context c = new Context();
c.setClazz(clazz);
c.setFieldData(MetadataHelper.getIndexField(clazz,field));
c.setValue(val);
dispatcher.beforeQuery(c);
val = c.getWorkingValue();
fieldName = c.getChangedFieldName();
logger.debug("Query text translated to "+fieldName+":"+val+"");

} catch (Exception e) {
e.printStackTrace();
}
PhraseQuery phraseQuery = new PhraseQuery();
phraseQuery.add(new Term(fieldName, val));
 

Deleting the results from a query or a filter and not documents specified by Term

2007-08-17 Thread Abu Abdulla alhanbali
Hi,

Is there a way to delete the results of a query or a filter, rather than
documents specified by a Term? I have seen some explanations here but I do not
know how to do it:

http://www.nabble.com/Batch-deletions-of-Records-from-index-tf615674.html#a1644740

Thanks in advanced