Re: Indexer.Java problem

2009-02-20 Thread Seid Mohammed
Thanks Erick, I have got the latest INDEXER example from LIA2 working properly. Thanks a lot. Seid M On 2/19/09, Michael McCandless wrote: > > The early access version of LIA2 (accessible at > http://www.manning.com/hatcher3/) > has updated this example to work with recent Lucene releases (though

RE: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-20 Thread Philip Puffinburger
>some changes were made to the StandardTokenizer.jflex grammar (you can svn >diff the two URLs fairly trivially) to better deal with correctly >identifying >word characters, but from what i can tell that should have reduced the number >of splits, not increased them. > >it's hard to tell from you

Re: 2.3.2 -> 2.4.0 StandardTokenizer issue

2009-02-20 Thread Chris Hostetter
: In 2.3.2 if the token "Cómo" came through this it would get changed to : "como" by the time it made it through the filters. In 2.4.0 this isn't : the case. It treats this one token as two so we get "co" and "mo". So : instead of search "como" or "Cómo" to get all the hits we now have t

queryNorm affect on score

2009-02-20 Thread Peter Keegan
The explanations of scores from the same document returned from 2 similar queries differ in an unexpected way. There are 2 fields involved, 'contents' and 'literals'. The 'literals' field has setBoost = 0. As you can see from the explanations below, the total weight of the matching terms from the 'li

Confidence scores at search time

2009-02-20 Thread Ken Williams
Hi, Has there been any work done on getting confidence scores at runtime, so that scores of documents can be compared across queries? I found one reference in the mailing list to some work in 2003, but couldn't find any follow-up: http://osdir.com/ml/jakarta.lucene.user/2003-12/msg00093.html

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Robert Muir
Yusuf, You are 100% correct: it is bad that this uses a custom tokenizer. This was my motivation for attacking it from this angle: https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel (unfinished) otherwise, at some point jflex rules

Re: searching a sentence or paragraph

2009-02-20 Thread Grant Ingersoll
I'm not sure why using a PhraseQuery allows you to search within a sentence. PhraseQuery just makes sure that the terms appear next to each other (or within some slop), but it isn't aware of sentence or paragraph boundaries. See http://www.lucidimagination.com/search/document/6a5dfb8df2ce

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Grant Ingersoll
It's been a few years since I've worked on Arabic, but it sounds reasonable. Care to submit a patch with unit tests showing the StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute On Feb 20, 2009, at 6:22 AM, Yusuf Aaji wrote: Hi

Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Yusuf Aaji
Hi Everyone, My question is related to the Arabic analysis package under: org.apache.lucene.analysis.ar It is cool and it is doing a great job, but it uses a special tokenizer: ArabicLetterTokenizer The problem with this tokenizer is that it fails to handle emails, URLs and acronyms the