RE: Fastest way to perform 'like' searches

2007-08-09 Thread Ard Schrijvers

Thanks Daniel, 

I understand how it can be done. The only thing that bothers me is that
expanding the "*" might result in many terms, and that in turn might imply
a performance hit. I'll see what the impact is.
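
For reference, a minimal sketch of the expansion being discussed, assuming
Lucene 2.x APIs; the "title" field, the fixed terms "foo"/"bar", and the
"qu" prefix are made-up placeholders:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.MultiPhraseQuery;

public class WildcardPhrase {

    // Builds a MultiPhraseQuery for the phrase: foo bar qu*
    public static MultiPhraseQuery build(IndexReader reader) throws IOException {
        MultiPhraseQuery query = new MultiPhraseQuery();
        query.add(new Term("title", "foo")); // fixed positions of the phrase
        query.add(new Term("title", "bar"));

        // Expand the "qu" prefix by term enumeration; TermEnum is sorted,
        // so we stop at the first term that no longer matches the prefix.
        List expansions = new ArrayList();
        TermEnum enumerator = reader.terms(new Term("title", "qu"));
        try {
            do {
                Term t = enumerator.term();
                if (t == null || !t.field().equals("title")
                        || !t.text().startsWith("qu")) {
                    break;
                }
                expansions.add(t);
            } while (enumerator.next());
        } finally {
            enumerator.close();
        }

        // All expansions share the last position of the phrase.
        query.add((Term[]) expansions.toArray(new Term[expansions.size()]));
        return query;
    }
}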

Regards Ard

> 
> On Wednesday 08 August 2007 10:28, Ard Schrijvers wrote:
> 
> > Does anybody know a more efficient way? A PhraseQuery might get me
> > somewhere, isn't it?
> 
> No, you need to use MultiPhraseQuery, and you will need to first expand
> the terms with the "*" yourself (e.g. using term enumeration).
> 
> > as a phrase is analyzed according to some analyzer it might strip the
> > 'the' as a stopword, implying that *oo bar qu* would also match, right?
> 
> Stopwords need to be removed everywhere, i.e. also from phrases; this way
> they generally work as expected.
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de
> 




frequent phrases

2007-08-09 Thread Akanksha Baid
I was wondering if there is a "search based" method to find the top-k
frequent phrases in a set of documents (I do not have a particular phrase
in mind, so PhraseQuery can probably be ruled out).
I have implemented something that works using termvectors and termpositions
but the performance is not great so far since I am basically iterating
multiple times and hacking my way around. I was wondering if an API exists
for finding frequent phrases and/or if someone could point me to some code
for the same.
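
For concreteness, a minimal sketch of the kind of term-vector iteration I
mean, assuming Lucene 2.x and a field indexed with
Field.TermVector.WITH_POSITIONS ("contents" is a placeholder field name):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermPositionVector;

public class BigramCounter {

    // Returns a map of "termA termB" -> frequency for one document.
    public static Map countBigrams(IndexReader reader, int docId)
            throws IOException {
        // Requires the field to be indexed with term vectors and positions.
        TermPositionVector vector =
            (TermPositionVector) reader.getTermFreqVector(docId, "contents");
        String[] terms = vector.getTerms();

        // Invert the vector: token position -> term text.
        Map positionToTerm = new HashMap();
        int maxPosition = 0;
        for (int i = 0; i < terms.length; i++) {
            int[] positions = vector.getTermPositions(i);
            for (int j = 0; j < positions.length; j++) {
                positionToTerm.put(new Integer(positions[j]), terms[i]);
                maxPosition = Math.max(maxPosition, positions[j]);
            }
        }

        // Count adjacent pairs; a missing position is a gap (e.g. a stopword).
        Map counts = new HashMap();
        for (int p = 0; p < maxPosition; p++) {
            String a = (String) positionToTerm.get(new Integer(p));
            String b = (String) positionToTerm.get(new Integer(p + 1));
            if (a == null || b == null) {
                continue;
            }
            String bigram = a + " " + b;
            Integer c = (Integer) counts.get(bigram);
            counts.put(bigram, new Integer(c == null ? 1 : c.intValue() + 1));
        }
        return counts;
    }
}

Merging these per-document maps and keeping the k largest counts would then
give the top-k list.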

Thanks.


Re: frequent phrases

2007-08-09 Thread karl wettin


On 9 Aug 2007, at 09.34, Akanksha Baid wrote:

> I was wondering if there is a "search based" method to find the top-k
> frequent phrases in a set of documents (I do not have a particular phrase
> in mind, so PhraseQuery can probably be ruled out).
> I have implemented something that works using termvectors and termpositions
> but the performance is not great so far since I am basically iterating
> multiple times and hacking my way around. I was wondering if an API exists
> for finding frequent phrases and/or if someone could point me to some code
> for the same.

I think this is the closest thing available in the issue tracker:

https://issues.apache.org/jira/browse/LUCENE-725

--
karl




Re: frequent phrases

2007-08-09 Thread karl wettin


On 9 Aug 2007, at 09.34, Akanksha Baid wrote:

> I was wondering if there is a "search based" method to find the top-k
> frequent phrases in a set of documents (I do not have a particular phrase
> in mind, so PhraseQuery can probably be ruled out).
> I have implemented something that works using termvectors and termpositions
> but the performance is not great so far since I am basically iterating
> multiple times and hacking my way around. I was wondering if an API exists
> for finding frequent phrases and/or if someone could point me to some code
> for the same.

By the way, if you explain what you aim to use it for, then you might get
more elaborate responses.


--
karl






Re: frequent phrases

2007-08-09 Thread mark harwood
The "CollocationFinder" code attached to this issue may be more suited:
http://issues.apache.org/jira/browse/LUCENE-474

Again, not exactly sure of your use case.

Cheers
Mark

- Original Message 
From: karl wettin <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 9 August, 2007 11:16:35 AM
Subject: Re: frequent phrases


On 9 Aug 2007, at 09.34, Akanksha Baid wrote:

> I was wondering if there is a "search based" method to find the top-k
> frequent phrases in a set of documents (I do not have a particular phrase
> in mind, so PhraseQuery can probably be ruled out).
> I have implemented something that works using termvectors and termpositions
> but the performance is not great so far since I am basically iterating
> multiple times and hacking my way around. I was wondering if an API exists
> for finding frequent phrases and/or if someone could point me to some code
> for the same.

I think this is the closest thing available in the issue tracker:

https://issues.apache.org/jira/browse/LUCENE-725

-- 
karl




special handling of certain terms with embedded periods

2007-08-09 Thread Donna L Gresh
Is there a good way to handle the following scenario:

I have certain terms with embedded periods for which I want to leave them
intact (not split at the periods). For example in my application a
particular skill might be SAP.FIN (SAP financial), and it should not be
split into SAP and FIN. Is there a way to specify a list of terms such as
these which should not be split? I am currently using my own
"SynonymAnalyzer", for which the token stream looks like below (pretty
standard I think) and where engine is a custom SynonymEngine where I
provide the synonyms. Is there a typical way to handle this situation?

public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new SnowballFilter(
        new SynonymFilter(
            new StopFilter(
                new LowerCaseFilter(
                    new StandardFilter(
                        new StandardTokenizer(reader))),
                StandardAnalyzer.STOP_WORDS),
            engine),
        "English");
    return result;
}

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]


Re: special handling of certain terms with embedded periods

2007-08-09 Thread karl wettin


On 9 Aug 2007, at 16.36, Donna L Gresh wrote:

> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For example in my application a
> particular skill might be SAP.FIN (SAP financial), and it should not be
> split into SAP and FIN. Is there a way to specify a list of terms such as
> these which should not be split?

Updating the standard analyzer BNF to allow terms with punctuation is not a
big deal. If there is a list of terms you want to allow, you would handle
them in a TokenFilter. See StandardTokenizer and StandardFilter.

You might save a couple of clock ticks by implementing a BNF rule rather
than a filter though.
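
For example, a minimal sketch of such a filter, assuming the Lucene 2.x
Token API. StandardTokenizer already keeps "sap.fin" in one token (via its
HOST rule), so this filter splits dotted terms back apart unless they are
on a protected list; the list contents are hypothetical, and the filter is
meant to run after LowerCaseFilter:

import java.io.IOException;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class DottedTermFilter extends TokenFilter {
    private final Set protectedTerms = new HashSet();
    private final LinkedList pending = new LinkedList();

    public DottedTermFilter(TokenStream input) {
        super(input);
        // Hypothetical protected skill code; assumed already lower-cased.
        protectedTerms.add("sap.fin");
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String text = token.termText();
        if (text.indexOf('.') < 0 || protectedTerms.contains(text)) {
            return token; // undotted or protected: pass through unchanged
        }
        // Split an unprotected dotted term into its parts, keeping offsets.
        int searchFrom = 0;
        String[] parts = text.split("\\.");
        for (int i = 0; i < parts.length; i++) {
            if (parts[i].length() == 0) {
                continue;
            }
            int idx = text.indexOf(parts[i], searchFrom);
            int start = token.startOffset() + idx;
            pending.add(new Token(parts[i], start, start + parts[i].length()));
            searchFrom = idx + parts[i].length();
        }
        return pending.isEmpty() ? next() : (Token) pending.removeFirst();
    }
}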


--
karl




Re: special handling of certain terms with embedded periods

2007-08-09 Thread Erick Erickson
Some possibilities...
- write your own tokenizer and/or filter. If you alter your BNF,
  you'll have to maintain it in later releases.
- use some simple transformations for the input *before* tokenizing
  (a sketch follows this list).
- there's been some discussion that StandardAnalyzer (and, I assume,
  the Standard* beasts) are slower than the other analyzers, so you
  may be better off eschewing them.
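
A minimal sketch of the second option, with hypothetical protected terms:
rewrite each one to a period-free placeholder before the text reaches the
tokenizer, and apply the same rewrite to query strings so both sides agree:

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ProtectedTerms {
    private static final Map REPLACEMENTS = new HashMap();
    static {
        // Hypothetical protected terms mapped to period-free placeholders.
        REPLACEMENTS.put("SAP.FIN", "sapfin");
        REPLACEMENTS.put("SAP.HR", "saphr");
    }

    // Rewrite protected terms so the tokenizer never sees the periods;
    // call this on documents before indexing and on queries before parsing.
    public static String rewrite(String text) {
        for (Iterator it = REPLACEMENTS.entrySet().iterator(); it.hasNext();) {
            Map.Entry e = (Map.Entry) it.next();
            String from = (String) e.getKey();
            String to = (String) e.getValue();
            int idx;
            while ((idx = text.indexOf(from)) >= 0) {
                text = text.substring(0, idx) + to
                        + text.substring(idx + from.length());
            }
        }
        return text;
    }
}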

Best
Erick


On 8/9/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For
> example in my application a particular skill might be SAP.FIN (SAP
> financial), and it should not be split into
> SAP and FIN. Is there a way to specify a list of terms such as these which
> should not be split? I am
> currently using my own "SynonymAnalyzer" for which the token stream looks
> like below
> (pretty standard I think) and where engine is a custom SynonymEngine
> where I provide the synonyms.
> Is there a typical way to handle this situation?
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new SnowballFilter(
>         new SynonymFilter(
>             new StopFilter(
>                 new LowerCaseFilter(
>                     new StandardFilter(
>                         new StandardTokenizer(reader))),
>                 StandardAnalyzer.STOP_WORDS),
>             engine),
>         "English");
>     return result;
> }
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>


Re: special handling of certain terms with embedded periods

2007-08-09 Thread Donna L Gresh
Thanks.
In this case it actually looks like I was trying to solve a problem
that doesn't exist (not an unusual occurrence in my experience)
since the StandardAnalyzer does not appear to split the terms
if the period has no white space following. I was a bit misled by
the additional complication that I am using the MoreLikeThis
class to construct the query, and it seemed to be dropping the
SAP.FIN term, apparently because it actually never appears in
my index to be searched, only in my input queries. In fact I may
decide to do some acronym expansion of this to allow it to
match things that *do* appear in my index.

But your point about the StandardAnalyzer being slow is
well-taken, and I'll keep that in mind. Also, the straightforward
substitution before indexing and searching is a reasonable
approach to keep in mind.

Thanks-
Donna

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]




"Erick Erickson" <[EMAIL PROTECTED]> 
08/09/2007 12:09 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: special handling of certain terms with embedded periods






Some possibilities...
- write your own tokenizer and/or filter. If you alter your BNF,
  you'll have to maintain it in later releases.
- use some simple transformations for the input *before* tokenizing.
- there's been some discussion that StandardAnalyzer (and, I assume,
  the Standard* beasts) are slower than the other analyzers, so you
  may be better off eschewing them.

Best
Erick


On 8/9/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Is there a good way to handle the following scenario:
>
> I have certain terms with embedded periods for which I want to leave them
> intact (not split at the periods). For example in my application a
> particular skill might be SAP.FIN (SAP financial), and it should not be
> split into SAP and FIN. Is there a way to specify a list of terms such as
> these which should not be split? I am currently using my own
> "SynonymAnalyzer", for which the token stream looks like below (pretty
> standard I think) and where engine is a custom SynonymEngine where I
> provide the synonyms. Is there a typical way to handle this situation?
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>     TokenStream result = new SnowballFilter(
>         new SynonymFilter(
>             new StopFilter(
>                 new LowerCaseFilter(
>                     new StandardFilter(
>                         new StandardTokenizer(reader))),
>                 StandardAnalyzer.STOP_WORDS),
>             engine),
>         "English");
>     return result;
> }
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>



Re: special handling of certain terms with embedded periods

2007-08-09 Thread Mark Miller

Donna L Gresh wrote:
> But your point about the StandardAnalyzer being slow is
> well-taken, and I'll keep that in mind.

A new StandardAnalyzer that is 6x faster was recently committed on the
trunk. Should be in the next release.

- Mark




Re: What is the contrib/surround/src/java purpose

2007-08-09 Thread Paul Elschot
Ard,

It's the source code of an alternative query language on top of Lucene 
that allows boolean queries and span (distance) queries.
The minimal documentation is in the Java API documentation
on the lucene java site under contrib: Surround Parser, and in 
the surround.txt file here:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/surround.txt?view=log

Regards,
Paul Elschot


On Wednesday 08 August 2007 12:20, Ard Schrijvers wrote:
> Hello,
> 
> without having to dive into the code, I was hoping somebody could tell me
> what this contrib block does? I can't seem to find any documentation or
> relevant hits when searching for it,
> 
> Thanks in advance,
> 
> Regards Ard
> 



Help Regarding Fuzzy Scoring

2007-08-09 Thread Jami Kapla

Hello,

I have been using Lucene recently (newbie) and reading about scoring.
However, I have a result that still does not make sense to me.

I have an index created from a single field (entree) in one table in my DB.

I index it with the StandardAnalyzer.

I also query it with the StandardAnalyzer, as simply as possible, like:

QueryParser qp = new QueryParser("title", new StandardAnalyzer());

org.apache.lucene.search.Query query = qp.parse( qStr );
hits = searcher.search( query );

Here are the possible responses that I know are in the index:

Chicken
Lemon chicken
Chicken parmesan
Chicken marsala
Chicken cordon blue
Chicken and dumplings
chickpeas
Bbq chicken
Garlic chicken
Teriyaki chicken
Chlcken picatta


If qStr="Chicken", thus a basic TermQuery, it works as expected and
returns the result:

Chicken             score=0.9994   (sort of thought this would be 1.0 but
                                    at least it found and scored the right one)
Lemon chicken       score=0.625
Chicken parmesan    score=0.625
... so on


However, if I make it a fuzzy search by setting qStr="Chicken~", the
results I get are:

Chlcken picatta     score=1.0
Chicken             score=0.4026232
Lemon chicken       score=0.2516395
Chicken parmesan    score=0.2516395
... so on

Most likely this is a lack of my understanding of the fundamentals but I
would think it should come up with Chicken as the highest score still
considering it is an exact match.


Any clue as to why this is would be greatly appreciated.

Meanwhile I am still reading docs and searching the web.

-Jami



RE: What is the contrib/surround/src/java purpose

2007-08-09 Thread Ard Schrijvers
Thanks Paul,

I didn't look at the surround.txt, though I must admit it is a pretty cryptic 
explanation :-)

Regards Ard

> Ard,
> 
> It's the source code of an alternative query language on top of Lucene
> that allows boolean queries and span (distance) queries.
> The minimal documentation is in the Java API documentation
> on the lucene java site under contrib: Surround Parser, and in
> the surround.txt file here:
> http://svn.apache.org/viewvc/lucene/java/trunk/contrib/surround/surround.txt?view=log
> 
> Regards,
> Paul Elschot
> 
> 
> On Wednesday 08 August 2007 12:20, Ard Schrijvers wrote:
> > Hello,
> > 
> > without having to dive into the code, I was hoping somebody could tell
> > me what this contrib block does? I can't seem to find any documentation
> > or relevant hits when searching for it,
> > 
> > Thanks in advance,
> > 
> > Regards Ard
> > 



Re: Help Regarding Fuzzy Scoring

2007-08-09 Thread Grant Ingersoll

Try using the explain() method to give more insight.
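
For example, a minimal sketch assuming the Hits-based search from the
original mail ("title" is the field name used there):

import java.io.IOException;

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class ExplainHits {

    // Print the explanation tree for every hit; the fuzzy edit-distance
    // similarity and the idf of each expanded term show up as separate
    // lines in the output.
    public static void dump(Searcher searcher, Query query) throws IOException {
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Explanation explanation = searcher.explain(query, hits.id(i));
            System.out.println(hits.doc(i).get("title"));
            System.out.println(explanation.toString());
        }
    }
}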


On Aug 9, 2007, at 4:08 PM, Jami Kapla wrote:


> Hello,
>
> I have been using Lucene recently (newbie) and reading about scoring.
> However, I have a result that still does not make sense to me.
>
> I have an index created from a single field (entree) in one table in my DB.
>
> I index it with the StandardAnalyzer.
>
> I also query it with the StandardAnalyzer, as simply as possible, like:
>
> QueryParser qp = new QueryParser("title", new StandardAnalyzer());
>
> org.apache.lucene.search.Query query = qp.parse( qStr );
> hits = searcher.search( query );
>
> Here are the possible responses that I know are in the index:
>
> Chicken
> Lemon chicken
> Chicken parmesan
> Chicken marsala
> Chicken cordon blue
> Chicken and dumplings
> chickpeas
> Bbq chicken
> Garlic chicken
> Teriyaki chicken
> Chlcken picatta
>
> If qStr="Chicken", thus a basic TermQuery, it works as expected and
> returns the result:
>
> Chicken             score=0.9994   (sort of thought this would be 1.0 but
>                                     at least it found and scored the right one)
> Lemon chicken       score=0.625
> Chicken parmesan    score=0.625
> ... so on
>
> However, if I make it a fuzzy search by setting qStr="Chicken~", the
> results I get are:
>
> Chlcken picatta     score=1.0
> Chicken             score=0.4026232
> Lemon chicken       score=0.2516395
> Chicken parmesan    score=0.2516395
> ... so on
>
> Most likely this is a lack of my understanding of the fundamentals but I
> would think it should come up with Chicken as the highest score still
> considering it is an exact match.
>
> Any clue as to why this is would be greatly appreciated.
>
> Meanwhile I am still reading docs and searching the web.
>
> -Jami



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: Nested Fields

2007-08-09 Thread Kai Hu
Hi, Spencer,
Why not translate the XML to an object when indexing, and translate the
object back to XML when searching?

kai

-Original Message-
From: Spencer Tickner [mailto:[EMAIL PROTECTED]]
Sent: Saturday, 4 August 2007 6:01
To: java-user@lucene.apache.org
Subject: Nested Fields

Hi, and thanks in advance for any help.

I'm fairly new to lucene so excuse the ignorance. I'm attempting to
field XML documents with nested fields. So:

<foo>
  <bar>This</bar>
  <bat>That</bat>
</foo>

would give me hits for:

bar:This
bat:That
foo:ThisThat

The only way I can see a way of doing this now is to field each
element I want to field and all of its descendant text, repeating the
same text multiple times for nested fields, then finally have a
catch-all field that would field the text for the whole document, so
users not doing field searches would programmatically be forced to
search on that field to not get multiple hits. This seems inefficient
to me and I am hoping that someone has some advice or can point me to
a better way.

Thanks,

Spencer




formalizing a query

2007-08-09 Thread Abu Abdulla alhanbali
Hi,

I need your help in formalizing this query:

(field1:query1 AND field2:query2) OR
(field1:query3 AND field2:query4) OR
(field1:query5 AND field2:query6) OR
(field1:query7 AND field2:query8) ... etc

Please give the code since I'm new to lucene. How can we use
MultiFieldQueryParser or any other parser to do the job?

greatly appreciated
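
A minimal sketch of one way to build this programmatically, assuming
Lucene 2.x: one nested BooleanQuery per AND pair, with the pairs OR'ed
together at the top level. TermQuery assumes each queryN is a single
already-analyzed term; for free text, parse each clause with QueryParser
instead:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PairQueryBuilder {

    // pairs[i][0] is the field1 value, pairs[i][1] the field2 value.
    public static BooleanQuery build(String[][] pairs) {
        BooleanQuery top = new BooleanQuery();
        for (int i = 0; i < pairs.length; i++) {
            BooleanQuery pair = new BooleanQuery();
            pair.add(new TermQuery(new Term("field1", pairs[i][0])),
                    BooleanClause.Occur.MUST);   // AND within the pair
            pair.add(new TermQuery(new Term("field2", pairs[i][1])),
                    BooleanClause.Occur.MUST);
            top.add(pair, BooleanClause.Occur.SHOULD); // OR between pairs
        }
        return top;
    }
}

Usage would look like:

searcher.search(PairQueryBuilder.build(new String[][] {
    { "query1", "query2" }, { "query3", "query4" } }));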


Update boost factor for indexed document using setBoost()

2007-08-09 Thread rohit saini
Hi,
could you please tell me how to update the boost factor of an already
indexed document using setBoost().

Thanks & regards,
Rohit

-- 
VANDE - MATRAM


Re: Nested Fields

2007-08-09 Thread Jeff French

Spencer, it seems inefficient to me too, but that's pretty much what I did
for tables embedded within a document. 

I used a SAX parser to parse the document and kept track of the table
elements I saw. When I received an endElement, I added the text I had
buffered up in the characters() method to the buffer for each parent
element. Then I removed the current element and added its content as a
Field.
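
A minimal sketch of that idea, assuming Lucene 2.x and plain SAX (not my
exact code; the names are illustrative):

import java.util.Stack;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class NestedFieldHandler extends DefaultHandler {
    private final Document doc = new Document();
    private final Stack names = new Stack();
    private final Stack buffers = new Stack(); // one StringBuffer per open element

    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        names.push(qName);
        buffers.push(new StringBuffer());
    }

    public void characters(char[] ch, int start, int length) {
        if (!buffers.isEmpty()) {
            ((StringBuffer) buffers.peek()).append(ch, start, length);
        }
    }

    public void endElement(String uri, String localName, String qName) {
        String name = (String) names.pop();
        String text = ((StringBuffer) buffers.pop()).toString();
        // Index this element's accumulated text under its own name.
        doc.add(new Field(name, text, Field.Store.NO, Field.Index.TOKENIZED));
        if (!buffers.isEmpty()) {
            // Propagate the text into the parent's buffer, so ancestor
            // fields end up containing all descendant text.
            ((StringBuffer) buffers.peek()).append(text);
        }
    }

    public Document getDocument() {
        return doc;
    }
}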

I should add that I am also fairly new to Lucene, so just because I did it
that way doesn't mean it's the best or even a good way.

Jeff


Spencer Tickner wrote:
> 
> I'm fairly new to lucene so excuse the ignorance. I'm attempting to
> field XML documents with nested fields. So:
> 
> <foo>
>   <bar>This</bar>
>   <bat>That</bat>
> </foo>
> 
> ...
> 
> The only way I can see a way of doing this now is to field each
> element I want to field and all of its descendant text, ...
> 




Re: Nested Fields

2007-08-09 Thread Lukas Vlcek
Hi,

Have you checked the Compass framework (built on top of Lucene)? This
might be interesting for you:
http://www.opensymphony.com/compass/versions/1.2M3/html/core-xsem.html

BR
Lukas


On 8/10/07, Jeff French <[EMAIL PROTECTED]> wrote:
>
>
> Spencer, it seems inefficient to me too, but that's pretty much what I did
> for tables embedded within a document.
>
> I used a SAX parser to parse the document and kept track of the table
> elements I saw. When I received an endElement, I added the text I had
> buffered up in the characters() method to the buffer for each parent
> element. Then I removed the current element and added its content as a
> Field.
>
> I should add that I am also fairly new to Lucene, so just because I did it
> that way doesn't mean it's the best or even a good way.
>
> Jeff
>
>
> Spencer Tickner wrote:
> >
> > I'm fairly new to lucene so excuse the ignorance. I'm attempting to
> > field XML documents with nested fields. So:
> >
> > <foo>
> >   <bar>This</bar>
> >   <bat>That</bat>
> > </foo>
> >
> > ...
> >
> > The only way I can see a way of doing this now is to field each
> > element I want to field and all of its descendant text, ...
> >
>