Re: Why has PerFieldAnalyzerWrapper been made final in Lucene 3.1 ?

2011-05-03 Thread Israel Tsadok
On Tue, May 3, 2011 at 7:03 PM, Paul Taylor wrote: > We subclassed PerFieldAnalyzerWrapper as follows: > > public class PerFieldEntityAnalyzer extends PerFieldAnalyzerWrapper { > >public PerFieldEntityAnalyzer(Class indexFieldClass) { >super(new StandardUnaccentAnalyzer()); > >

AW: AW: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
I know this is just an example. But even the WhitespaceAnalyzer takes the words apart, which I don't want. I would like the phrases as they are (maximum 3 words, e.g. "Merlot del Ticino", ...) to be n-gram-ed. I hence want to have the n-grams. Mer Merl Merlo Merlot Merlot Merlot d ... Regards Cl
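The n-grams listed above can be produced by treating the whole field value as a single token and taking its leading substrings. Below is a minimal plain-Java sketch of that idea, without Lucene. The class and method names (`PhraseEdgeNGrams`, `frontEdgeNGrams`) and the gram sizes 3 to 8 are assumptions chosen for illustration; they do not come from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Lucene class: front edge n-grams over an unsplit phrase.
final class PhraseEdgeNGrams {

    // Returns every prefix of `phrase` whose length lies between minGram and maxGram (inclusive).
    // The phrase is never split on whitespace, so spaces end up inside the grams.
    static List<String> frontEdgeNGrams(String phrase, int minGram, int maxGram) {
        if (minGram < 1 || maxGram < minGram) {
            throw new IllegalArgumentException("need 1 <= minGram <= maxGram");
        }
        List<String> grams = new ArrayList<String>();
        int upper = Math.min(maxGram, phrase.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(phrase.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Brackets make the trailing space in "Merlot " visible.
        for (String gram : frontEdgeNGrams("Merlot del Ticino", 3, 8)) {
            System.out.println("[" + gram + "]");
        }
    }
}
```

With these sizes the sketch yields "Mer", "Merl", "Merlo", "Merlot", "Merlot " (with a trailing space) and "Merlot d", which corresponds to the sequence in the message.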

Re: AW: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
Clemens - that's just an example. Stick another tokenizer in there, like WhitespaceTokenizer, for example. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Clemens Wyss > To:

AW: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
But doesn't the KeywordTokenizer extract single words out of the stream? I would like to create n-grams on the stream (field content) as it is... > -Original Message- > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] > Sent: Tuesday, 3 May 2011 21:31 > To: java-user

Re: AW: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
Clemens, Something a la: public TokenStream tokenStream (String fieldName, Reader r) { return new EdgeNGramTokenFilter(new KeywordTokenizer(r), EdgeNGramTokenFilter.Side.FRONT, 1, 4); } Check out page 265 of Lucene in Action 2. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nu

AW: AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
What does a simple Analyzer look like that just "n-grams" the docs/fields? class SimpleNGramAnalyzer extends Analyzer { @Override public TokenStream tokenStream ( String fieldName, Reader reader ) { EdgeNGramTokenFilter... ??? } } > -Original Message- > From: Otis Gospodnetic [mailto:

Why has PerFieldAnalyzerWrapper been made final in Lucene 3.1 ?

2011-05-03 Thread Paul Taylor
We subclassed PerFieldAnalyzerWrapper as follows: public class PerFieldEntityAnalyzer extends PerFieldAnalyzerWrapper { public PerFieldEntityAnalyzer(Class indexFieldClass) { super(new StandardUnaccentAnalyzer()); for(Object o : EnumSet.allOf(indexFieldClass)) {

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Michael McCandless
On Tue, May 3, 2011 at 7:43 AM, Tomislav Poljak wrote: > Hi, > > 2011/5/3 Michael McCandless : >> I feel like we are back to Basic ;) >> >> If you keep running line 40 over and over on the same memory index, do >> you see a slowdown? > > Yes. I've tested running same query list (~3,5 k queries) on

Re: Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Robert Muir
On Tue, May 3, 2011 at 10:29 AM, Paul Taylor wrote: > I assume this would be the correct way to fix the code for 3.1.0 > Yes, that's correct. > public float computeNorm(String field, FieldInvertState state) { > > > //This will match both artist and label aliases and is applicable to > both

Re: Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Paul Taylor
On 03/05/2011 15:06, Robert Muir wrote: On Tue, May 3, 2011 at 9:57 AM, Paul Taylor wrote: How can I convert this Similarity method to use 3.1 (currently using 3.0.3)? I understand I have to replace lengthNorm() with computeNorm(), but fieldName is not a provided parameter in computeNorm()

Re: Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Robert Muir
On Tue, May 3, 2011 at 9:57 AM, Paul Taylor wrote: > How can I convert this Similarity method to use 3.1 (currently using > 3.0.3)? I understand I have to replace lengthNorm() with computeNorm(), > but fieldName is not a provided parameter in computeNorm() and > FieldInvertState does not cont

Problem modifying Similarity class to work with lucene 3.1.0

2011-05-03 Thread Paul Taylor
How can I convert this Similarity method to use 3.1 (currently using 3.0.3)? I understand I have to replace lengthNorm() with computeNorm(), but fieldName is not a provided parameter in computeNorm() and FieldInvertState does not contain the fieldname either. I need the field because I onl

Re: ComplexPhraseQueryParser with multiple fields

2011-05-03 Thread Chris Salem
That seems to work. Thank you! Sincerely, Chris Salem Development Team Main Sequence Technologies, Inc.

Re: How to fix the number of searched terms for a field

2011-05-03 Thread Erick Erickson
Why do you want to do this? I'm wondering if this is an XY problem... See: http://people.apache.org/~hossman/#xyproblem Best Erick On Tue, May 3, 2011 at 7:55 AM, harsh srivastava wrote: > Hi All, > > > I want to know any inbuilt method in lucene that can help me to fix the > number of searched

How to fix the number of searched terms for a field

2011-05-03 Thread harsh srivastava
Hi All, Is there any built-in method in Lucene that can help me fix the number of searched terms for a given field? For example, suppose I have given content:(text1 text2 text3 text4 text5) to search and want to limit it to 3 words only, i.e. content:(text1 text2 text3). Please help. Thanks, Harsh
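One way to read this request is as plain pre-processing of the term list before the query string is handed to a parser. Below is a minimal sketch, assuming the terms are whitespace-separated and the query uses the `field:(...)` grouping from the message. `TermLimiter` and `limitTerms` are hypothetical names, not Lucene API, and whether this matches the actual requirement is open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: keeps only the first maxTerms terms.
final class TermLimiter {

    // Builds "field:(t1 t2 ...)" from at most maxTerms whitespace-separated terms.
    static String limitTerms(String field, String terms, int maxTerms) {
        List<String> kept = new ArrayList<String>();
        for (String term : terms.trim().split("\\s+")) {
            if (kept.size() == maxTerms) {
                break;
            }
            if (!term.isEmpty()) {
                kept.add(term);
            }
        }
        StringBuilder query = new StringBuilder(field).append(":(");
        for (int i = 0; i < kept.size(); i++) {
            if (i > 0) {
                query.append(' ');
            }
            query.append(kept.get(i));
        }
        return query.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(limitTerms("content", "text1 text2 text3 text4 text5", 3));
    }
}
```

For the input in the message this gives content:(text1 text2 text3). The sketch does not understand query syntax such as quotes or operators; it only counts whitespace-separated words.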

Anyway to not bother scoring less good matches ?

2011-05-03 Thread Paul Taylor
I'm receiving a number of searches with many ORs so that the total number of matches is huge (> 1 million) although only the first 20 results are required. Analysis shows most time is spent scoring the results. Now it seems to me if you are sending a query with 10 OR components, documents that matc

RE: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Uwe Schindler
> Hi, > > 2011/5/3 Michael McCandless : > > I feel like we are back to Basic ;) > > > > If you keep running line 40 over and over on the same memory index, do > > you see a slowdown? > > Yes. I've tested running same query list (~3,5 k queries) on the same > MemoryIndex instance and after a while

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Tomislav Poljak
Hi, 2011/5/3 Michael McCandless : > I feel like we are back to Basic ;) > > If you keep running line 40 over and over on the same memory index, do > you see a slowdown? Yes. I've tested running same query list (~3,5 k queries) on the same MemoryIndex instance and after a while iterations get slow

Re: AW: "fuzzy prefix" search

2011-05-03 Thread Otis Gospodnetic
Hi, I didn't read this thread closely, but just in case: * Is this something you can handle with synonyms? * If this is for English and you are trying to handle typos, there is a list of common English misspellings out there that you could use for this perhaps. * Have you considered n-gramming yo

AW: "fuzzy prefix" search

2011-05-03 Thread Biedermann,S.,Fa. Post Direkt
I don't know. But changing it now would cause trouble in many applications... For our applications we reimplemented fuzzy query so that we can pass along an org.apache.lucene.search.spell.StringDistance instance that holds the similarity algorithm of choice. -- Sven -Ursprüngliche Nachr

Re: Speed up payload loading?

2011-05-03 Thread Michael McCandless
On Tue, May 3, 2011 at 5:35 AM, Chris Bamford wrote: > Hi, > > I have been experimenting with using an int payload as a unique identifier, > one per Document. I have successfully loaded them in using the TermPositions > API with something like: > > public static void loadPayloadIntArray(Index

Re: The MoreLikeThisHandler could include highlighting ?

2011-05-03 Thread Koji Sekiguchi
(11/03/01 21:16), Amel Fraisse wrote: Hello, Could the MoreLikeThisHandler include highlighting? Is it correct to define a MoreLikeThisHandler like this: true contenu Thank you for your help. Amel. Amel, 1. I think you shou

AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
Is this calculation intended or a bug? > -Original Message- > From: Biedermann,S.,Fa. Post Direkt [mailto:s.biederm...@postdirekt.de] > Sent: Tuesday, 3 May 2011 12:00 > To: java-user@lucene.apache.org > Subject: AW: "fuzzy prefix" search > > I had a look into the 3.0 implemen

AW: "fuzzy prefix" search

2011-05-03 Thread Biedermann,S.,Fa. Post Direkt
I had a look into the 3.0 implementation. The calculation of the similarity is 1 - (edit distance / min(string 1 length, string 2 length)), as opposed to the Levenshtein in spellchecker: 1 - (edit distance / max(string 1 length, string 2 length)). So, the similarity is 1 - ( 3 / mi
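The difference between the two normalizations is easiest to see on the "mer" / "merlot" pair discussed in this thread. The sketch below implements only the two formulas as quoted in the message, on top of a textbook Levenshtein distance. It is not Lucene's source code; `FuzzySimilarity` and its method names are made up for illustration, and the comparison is case-sensitive and assumes non-empty strings.

```java
// Hypothetical illustration of the two formulas quoted above; not Lucene code.
final class FuzzySimilarity {

    // Classic Levenshtein distance (insert, delete, substitute each cost 1), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // 1 - (edit distance / min(length 1, length 2)), the variant attributed to the 3.0 fuzzy query.
    static float similarityMin(String a, String b) {
        return 1.0f - (float) editDistance(a, b) / Math.min(a.length(), b.length());
    }

    // 1 - (edit distance / max(length 1, length 2)), the variant attributed to the spellchecker.
    static float similarityMax(String a, String b) {
        return 1.0f - (float) editDistance(a, b) / Math.max(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(editDistance("mer", "merlot"));
        System.out.println(similarityMin("mer", "merlot"));
        System.out.println(similarityMax("mer", "merlot"));
    }
}
```

For "mer" and "merlot" the edit distance is 3, so the min-based formula gives 1 - 3/3 = 0.0 while the max-based formula gives 1 - 3/6 = 0.5.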

Re: MultiPhraseQuery slowing down over time in Lucene 3.1

2011-05-03 Thread Michael McCandless
I feel like we are back to Basic ;) If you keep running line 40 over and over on the same memory index, do you see a slowdown? Mike http://blog.mikemccandless.com On Mon, May 2, 2011 at 1:19 PM, Otis Gospodnetic wrote: > Hi, > > I think this describes what's going on: > > 10 load N stored quer

Re: "fuzzy prefix" search

2011-05-03 Thread Ian Lea
Then why not do that? Add a PrefixQuery and a FuzzyQuery to a BooleanQuery and use that. -- Ian. On Tue, May 3, 2011 at 10:25 AM, Clemens Wyss wrote: >>PrefixQuery > I'd like the combination of prefix and fuzzy ;-) because people could also > type "menlo" or "märl" and in any of these cases

AW: "fuzzy prefix" search

2011-05-03 Thread Biedermann,S.,Fa. Post Direkt
Have you tried Query q = new FuzzyQuery( new Term( "test", "Mer" ), 0.499f); Sven -Original Message- From: Clemens Wyss [mailto:clemens...@mysign.ch] Sent: Tuesday, 3 May 2011 10:57 To: java-user@lucene.apache.org Subject: AW: "fuzzy prefix" search Sorry for coming back

Speed up payload loading?

2011-05-03 Thread Chris Bamford
Hi, I have been experimenting with using an int payload as a unique identifier, one per Document. I have successfully loaded them in using the TermPositions API with something like: public static void loadPayloadIntArray(IndexReader reader, Term term, int[] intArray, int from, int to) thro

AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
>PrefixQuery I'd like the combination of prefix and fuzzy ;-) because people could also type "menlo" or "märl" and in any of these cases I'd like to get a hit on Merlot (for suggesting Merlot) > -Ursprüngliche Nachricht- > Von: Ian Lea [mailto:ian@gmail.com] > Gesendet: Dienstag, 3.

Re: "fuzzy prefix" search

2011-05-03 Thread Ian Lea
I'd assumed that FuzzyQuery wouldn't ignore case but I could be wrong. What would be the edit distance between "mer" and "merlot"? Would it be less than 1.5 which I reckon would be the value of length(term)*0.5 as detailed in the javadocs? Seems unlikely, but I don't really know anything about th

AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
Unfortunately lowercasing doesn't help. Also, doesn't the FuzzyQuery ignore casing? > -Original Message- > From: Ian Lea [mailto:ian@gmail.com] > Sent: Tuesday, 3 May 2011 11:06 > To: java-user@lucene.apache.org > Subject: Re: "fuzzy prefix" search > > Mer != mer. The la

Re: "fuzzy prefix" search

2011-05-03 Thread Ian Lea
Mer != mer. The latter will be what is indexed because StandardAnalyzer calls LowerCaseFilter. -- Ian. On Tue, May 3, 2011 at 9:56 AM, Clemens Wyss wrote: > Sorry for coming back to my issue. Can anybody explain why my "simple" unit > test below fails? Any hint/help appreciated. > > Directory

AW: "fuzzy prefix" search

2011-05-03 Thread Clemens Wyss
Sorry for coming back to my issue. Can anybody explain why my "simple" unit test below fails? Any hint/help appreciated. Directory directory = new RAMDirectory(); IndexWriter indexWriter = new IndexWriter( directory, new StandardAnalyzer( Version.LUCENE_31 ), IndexWriter.MaxFieldLength.UNLIMITED

Re: questions about the index

2011-05-03 Thread Bernd Fehling
Well, it is not only with a huge index. It is only if ReplicationHandler is in use on a master. If ReplicationHandler is configured to replicateAfter startup it first sends a commit via IndexWriter to have a "stable" index. The left over of this operation is the write.lock. So removing replicateA

Re: Lucene spending alot of time in BooleanScorer2

2011-05-03 Thread Paul Taylor
On 02/05/2011 23:36, Paul Taylor wrote: Hi Nearing completion on a new version of a lucene search component for the http://www.musicbrainz.org music database and having a problem with performance. There are a number of indexes each built from data in a database, there is one index for albums,

Re: questions about the index

2011-05-03 Thread Bernd Fehling
Hi Mike, thanks for the info. As far as I know a write.lock is created by an IndexWriter. So I have to dig into why an IndexWriter is created just on starting Solr with an optimized index. The problem is that this happens only with a huge index. And also old parts of the index are not cleaned up. May

Re: Inquiring part-of-speech (POS) tagging indexing and searching

2011-05-03 Thread Grijesh
As you have seen the example code for PartOfSpeechTaggingFilter at http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/package-summary.html You can use a custom analyzer to inject "metadata" tokens into the index at the same position as the source tokens. For example, given t