How to index word-pairs and phrases

2008-02-18 Thread Ghinwa Choueiter
Hi,

I am new to Lucene and have been reading the documentation. I would like to use 
Lucene to query a song database by lyrics. The queries could contain typos, 
wrong words, word contractions (can't versus cannot), etc.

I would like to create an inverted list by word pairs and possibly phrases and 
not just by isolated words. For example:
   < d1, d10, d27>
   
...

OR even
 
 <...>
...

It seems to me that, by default, the index in Lucene stores statistics for 
isolated words. The Lucene documentation refers to the word "Term" all the time 
and seems to imply that "Term" can be a word or a phrase, but I can't see how 
IndexWriter can read a document and index it by word pairs. 

Thank you in advance for the answers, and my apologies if I did not get the 
terminology quite right.

-Ghinwa


RE: How to index word-pairs and phrases

2008-02-19 Thread Ghinwa Choueiter

What about the code in
\contrib\analyzers\src\java\org\apache\lucene\analysis\ngram?

Does this tokenizer do what I need?

thank you,
-Ghinwa

On Tue, 19 Feb 2008, Steven A Rowe wrote:


Mark,

The ShingleFilter contrib has not been committed yet - it's still here:

  https://issues.apache.org/jira/browse/LUCENE-400

Steve

On 02/19/2008 at 2:33 AM, markharw00d wrote:

Further to Grant's useful background - there is an analyzer specifically
for multi-word terms in "contrib". See
Lucene\contrib\analyzers\src\java\org\apache\lucene\analysis\shingle

Cheers
Mark

Hi Ghinwa,

A Term is simply a unit of tokenization that has been indexed for a
Field, produced by a TokenStream.  In the demo on the main site,
this can be seen in the file called IndexFiles.java on line 56:

    IndexWriter writer = new IndexWriter(INDEX_DIR,
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);

The key being that the StandardAnalyzer is used to get a TokenStream.
The TokenStream.next() method returns a Token.  All a Term is, is a
Token that has been indexed for a given Field.  In a nutshell, though,
a Term is whatever you want it to be, based on your Analyzer.  It could
be a whole document as a single string or it could be a string
containing a single character.  In reality, it usually corresponds to a
single word.
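
To make "a Term is whatever you want it to be" concrete, here is a minimal
plain-Java sketch, with no Lucene dependency, of how adjacent word pairs could
be emitted as terms and collected into an inverted list. The class name
WordPairIndex, the lowercase/split-on-non-letters tokenization, and the sample
lyrics are all assumptions made for illustration; in a real Lucene setup this
work would happen inside the Analyzer rather than in hand-written code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: not part of Lucene.
final class WordPairIndex {
    // Inverted list: word-pair term -> sorted set of document ids.
    private final Map<String, TreeSet<String>> postings = new TreeMap<>();

    // Lowercase the text, split on anything that is not a letter or apostrophe,
    // and join each adjacent pair of words into one term.
    static List<String> wordPairs(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            pairs.add(words.get(i) + " " + words.get(i + 1));
        }
        return pairs;
    }

    // Record every word pair of the document under the given id.
    void add(String docId, String text) {
        for (String pair : wordPairs(text)) {
            postings.computeIfAbsent(pair, k -> new TreeSet<>()).add(docId);
        }
    }

    // Return the ids of all documents containing the word pair.
    List<String> lookup(String pair) {
        TreeSet<String> docs = postings.get(pair.toLowerCase(Locale.ROOT));
        return docs == null ? new ArrayList<>() : new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        WordPairIndex index = new WordPairIndex();
        index.add("d1", "The little boy sang");
        index.add("d10", "One little boy");
        System.out.println(index.lookup("little boy"));
    }
}
```

Each word pair is an ordinary term as far as the inverted list is concerned,
which is the point of the explanation above: the index does not care whether a
term came from one word or two.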

Have a look at the wiki, also, as there are many talks and articles
explaining all of this.  Lucene In Action, while outdated, is also an
excellent book for explaining this stuff (with the exception that some
of the code examples no longer work on the latest Lucene version).

Getting beyond that, you will need to look into adding spell checking
(in Lucene's "contrib" area) and other features like stemming and
synonyms to handle the various issues you bring up.

Hope that helps,
Grant


--
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






-
To unsubscribe, e-mail: [EMAIL PROTECTED] For
additional commands, e-mail: [EMAIL PROTECTED]










searching for "Nothing"

2008-03-01 Thread Ghinwa Choueiter
Hi,

I am trying to do a search as follows (this is a very simplified example):

I want to search for: (1) the little boy or (2) one little boy or (3) little boy

Can I write the query as:
"the OR one OR "" " AND "little" AND "boy"

Note that what I mean by "" is "Nothing".

thank you,
-Ghinwa

PS. I know that I can write the query as "the little boy" OR "one little boy" 
OR "little boy"
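
The expansion in the PS can be generated mechanically. The sketch below is
plain Java that only builds the query string; the class and method names
(OptionalPrefixQuery, expandOptionalPrefix) are hypothetical, and the result
would still have to be handed to a query parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds a query string, does not talk to Lucene.
final class OptionalPrefixQuery {
    // Produces: "p1 rest" OR "p2 rest" OR ... OR "rest"
    static String expandOptionalPrefix(List<String> prefixes, String rest) {
        List<String> phrases = new ArrayList<>();
        for (String prefix : prefixes) {
            phrases.add("\"" + prefix + " " + rest + "\"");
        }
        // The "Nothing" alternative: the phrase with no leading word.
        phrases.add("\"" + rest + "\"");
        return String.join(" OR ", phrases);
    }

    public static void main(String[] args) {
        System.out.println(
            expandOptionalPrefix(Arrays.asList("the", "one"), "little boy"));
    }
}
```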



Re: searching for "Nothing"

2008-03-02 Thread Ghinwa Choueiter

Thank you. You were right. Indexing by "" does not do what I need.

How would one represent a null index? Perhaps another way of asking the 
question is: what query would return all the documents in the database 
(an all-pass filter)?

thank you for your patience!
-Ghinwa

On Sunday, March 02, 2008 at 8:36 AM, Erick Erickson wrote:

The best and most deterministic way to answer this kind of question is
to download Luke and look at the explained query. That will show you
exactly what effect different analyzers have and the exact structure
of the resulting query.

Offhand, I don't think your rewrite will work.

Best
Erick




Scoring a query with OR's

2008-03-09 Thread Ghinwa Choueiter

Hi,

I had a look at the scoring equation and read the scoring online document:
http://lucene.apache.org/java/docs/scoring.html#Scoring

It is clear to me how the scoring equation would work for a query that 
contains AND: 
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html


But what exactly happens when there are ORs, e.g. (life OR place OR time)?


The scoring equation can compute a score for life, place, and time 
separately, but what does it do with them then? Does it also add them?


Many thanks,
-Ghinwa 






Re: Scoring a query with OR's

2008-03-09 Thread Ghinwa Choueiter
But shouldn't the coord factor kick in with AND instead of OR? I understand 
why you would want to use coord in the case of AND, where you give more 
reward to documents that contain most of the terms in the query. However, in 
the case of OR, it should not matter whether all the OR operands are in the 
document, should it?


-Ghinwa

On Sunday, March 09, 2008 at 1:22 PM, Erik Hatcher wrote:


On Mar 9, 2008, at 12:39 PM, Ghinwa Choueiter wrote:
But what exactly happens when there are ORs, e.g. (life OR place OR time)?


The scoring equation can compute a score for life, place, and time 
separately, but what does it do with them then? Does it also add them?


The coord factor kicks in then:

<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/DefaultSimilarity.html#coord(int,%20int)>


The formula listed here should help too:

<http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc//org/apache/lucene/search/Similarity.html>


Erik
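
As a rough illustration of the reply above, the sketch below adds the scores
of the OR operands that are present and multiplies the sum by a coord factor
of the same shape as DefaultSimilarity.coord(int, int), i.e. the fraction of
query clauses that matched. It is plain Java, not Lucene code: the class name
OrScoringSketch is hypothetical, the per-term scores are made-up inputs, it
uses double where Lucene uses float, and it leaves out every other factor of
the formula in the Similarity javadoc (queryNorm, boosts, and so on).

```java
// Hypothetical illustration only: not part of Lucene.
final class OrScoringSketch {
    // Fraction of the query's clauses that the document matched.
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // termScores[i] is the already-computed score of OR operand i,
    // or 0 if the document does not contain that term.
    static double score(double[] termScores) {
        double sum = 0.0;
        int overlap = 0;
        for (double s : termScores) {
            if (s > 0) {
                sum += s;      // operand scores are added together
                overlap++;     // and each match raises the coord factor
            }
        }
        if (overlap == 0) {
            return 0.0;        // no operand matched, so the document is not a hit
        }
        return coord(overlap, termScores.length) * sum;
    }

    public static void main(String[] args) {
        // (life OR place OR time) against a document containing life and time.
        System.out.println(score(new double[] {0.5, 0.0, 0.25}));
    }
}
```

Under these assumptions a document that contains more of the OR operands gets
both a larger sum and a larger coord factor, which is the behaviour the
original question asks about.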





Re: Scoring a query with OR's

2008-03-19 Thread Ghinwa Choueiter

Hi,

I emailed a question earlier about the difference between OR and AND in a 
Boolean query. For what I am trying to do, I need AND to behave like an 
OR (what I like to call a "soft AND"), and I need OR to behave like a 
logical OR, meaning that I don't want to reward documents that contain more 
of the OR operands. It is easy for me to fix the AND, but is there a 
straightforward way of fixing the OR?


Many thanks!
-Ghinwa

On Sun, 9 Mar 2008, Mark Miller wrote:

I have been trying to understand all of this better myself, so while I am no 
expert, here is my take:


Lucene is really a combined Vector Space / Boolean Model search engine.

At its core, Lucene is essentially a Vector Space Model search engine: 
scoring is done by comparing a query term vector to each of the document term 
vectors. However, on top of this, Lucene allows a Boolean Model by 
constraining results using a BooleanQuery.


So when Lucene finds the score for "mark OR mandy", the idea is the same as 
for "mark AND mandy". The difference is that the BooleanQuery treats Must and 
Should clauses differently: if a term is labeled Must but is not in the 
document, the document won't match. If a Should term is not in the document, 
the BooleanQuery does not exclude the document on that account; the term 
simply contributes 0 towards the similarity score. The BooleanQuery in effect 
clamps down on top of the Vector Space term vector similarity scoring, 
allowing for a hybrid system.


The coord factor essentially boosts the term vector similarity score based on 
how many query terms are in the document. Term overlap is already taken into 
account during the term vector similarity part, but apparently users don't 
like how that ranks; e.g., users intuitively think that sharing more terms 
between document and query is more important than sharing fewer, very highly 
weighted terms. So basically, coord is just trying to reorder things a bit 
based on reported user expectations.
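
The Must/Should distinction described above can be sketched as a matching
rule, separate from scoring. This is plain Java written for illustration, not
Lucene's implementation: the names BooleanClauseSketch, Occur and matches are
hypothetical, and a document is modelled simply as a set of terms. It assumes
that a query made only of Should clauses needs at least one of them present.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not part of Lucene.
final class BooleanClauseSketch {
    enum Occur { MUST, SHOULD }

    // A missing MUST term rejects the document; a missing SHOULD term does not.
    // With no MUST clauses at all, at least one SHOULD term has to be present.
    static boolean matches(Map<String, Occur> clauses, Set<String> docTerms) {
        boolean hasMust = false;
        boolean anyShouldPresent = false;
        for (Map.Entry<String, Occur> clause : clauses.entrySet()) {
            boolean present = docTerms.contains(clause.getKey());
            if (clause.getValue() == Occur.MUST) {
                hasMust = true;
                if (!present) {
                    return false;
                }
            } else if (present) {
                anyShouldPresent = true;
            }
        }
        return hasMust || anyShouldPresent;
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("mark", "miller"));

        Map<String, Occur> and = new LinkedHashMap<>();   // mark AND mandy
        and.put("mark", Occur.MUST);
        and.put("mandy", Occur.MUST);

        Map<String, Occur> or = new LinkedHashMap<>();    // mark OR mandy
        or.put("mark", Occur.SHOULD);
        or.put("mandy", Occur.SHOULD);

        System.out.println(matches(and, doc));
        System.out.println(matches(or, doc));
    }
}
```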


- Mark


