RE: SweetSpotSimilarity

2012-02-28 Thread Chris Hostetter
: i'll try to get some graphs committed and linked to from the javadocs that make it more clear how tweaking the settings affects the formula
http://svn.apache.org/viewvc?rev=1294920&view=rev -Hoss

RE: SweetSpotSimilarity

2012-02-28 Thread Chris Hostetter
: A picture -- or more precisely a graph -- would be worth a 1000 words.
fair enough. I think the reason i never committed one initially was because the formula in the javadocs was trivial to plot in gnuplot...
gnuplot> min=0
gnuplot> max=2
gnuplot> base=1.3
gnuplot> xoffset=10
gnuplot> set
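The values quoted in the message (min=0, max=2, base=1.3, xoffset=10) match the parameter names of the hyperbolic tf function given in the SweetSpotSimilarity javadocs. The snippet is cut off before the plot command, so treating that as the formula being plotted is an assumption. Under that assumption, a minimal Python sketch that evaluates the same curve could look like this; the function name and the `tanh` rewrite are illustrative and are not part of Lucene.

```python
import math


def hyperbolic_tf(x, tf_min=0.0, tf_max=2.0, base=1.3, xoffset=10.0):
    """Evaluate the hyperbolic tf curve for a term frequency x.

    The javadoc form is
        min + (max - min) / 2 * ((b^d - b^-d) / (b^d + b^-d) + 1)
    with d = x - xoffset. The fraction equals tanh(d * ln(b)), which is
    used here because it avoids overflow for large d.
    """
    d = x - xoffset
    return tf_min + (tf_max - tf_min) / 2.0 * (math.tanh(d * math.log(base)) + 1.0)


if __name__ == "__main__":
    # Print a few points of the curve; it rises from tf_min toward tf_max
    # and crosses the midpoint at x == xoffset.
    for x in (0, 5, 10, 15, 20):
        print(x, round(hyperbolic_tf(x), 4))
```

With these settings the curve is close to `min` for small frequencies, passes through the midpoint of `min` and `max` at `xoffset`, and flattens out near `max`, which is the "sweet spot" shape the graphs are meant to show.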

How to exclude words with overlapping patterns?

2012-02-28 Thread Yung-chung Lin
Hi all, I have a question. Is there a way to distinguish queries like 'hotel' and 'hotel restaurant', queries with overlapping patterns, effectively? For example, if I want the search to return 'hotel' in the top 100 results while 'hotel restaurant' results come after those of 'hotel', when I sear

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
> Wow, that was quick!  Thanks! The power of open source and coffee break, combined... > I don't think we'll have too many terms per query term - as I said earlier, > we're restricting the expansions to those with an edit distance of 1.  But > this looks cool anyway. Shouldn't make much of a d

Re: Building FST-like automaton queries

2012-02-28 Thread Alan Woodward
Wow, that was quick! Thanks! I don't think we'll have too many terms per query term - as I said earlier, we're restricting the expansions to those with an edit distance of 1. But this looks cool anyway. On 28 Feb 2012, at 16:01, Dawid Weiss wrote: > The issue has a patch -- feel free to try

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
The issue has a patch -- feel free to try it out. Dawid On Tue, Feb 28, 2012 at 4:48 PM, Dawid Weiss wrote: > I filed an issue for that. > https://issues.apache.org/jira/browse/LUCENE-3832 > > I'll try to port it myself actually. It shouldn't be a big problem. > > Dawid > > On Tue, Feb 28, 2012

n-gram frequencies

2012-02-28 Thread xavier aimé
Dear List, I need for example to know the frequency of the phrase "phd finger protein 6" - not only the number of documents where this phrase appears. With a SimpleAnalyzer or another, must I parse each hit, each document, each position for each term and compute all these data, or is there
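To make the distinction in the question concrete, here is a small Python sketch that computes both numbers for a phrase over in-memory documents: how many documents contain it, and how many times it occurs in total. It is a brute-force illustration only; the whitespace tokenization and the function name are assumptions, and it does not use Lucene or answer whether the index can supply the count directly.

```python
def phrase_counts(docs, phrase):
    """Return (document_frequency, total_occurrences) for a phrase.

    Tokenization here is lowercase + whitespace split, a stand-in for
    whatever analyzer the real index uses.
    """
    target = phrase.lower().split()
    n = len(target)
    if n == 0:
        return 0, 0
    doc_freq = 0
    total = 0
    for text in docs:
        tokens = text.lower().split()
        # Count every position where the full token sequence matches.
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
        )
        if hits:
            doc_freq += 1
        total += hits
    return doc_freq, total


if __name__ == "__main__":
    docs = [
        "phd finger protein 6 binds phd finger protein 6",
        "an unrelated abstract",
        "PHD finger protein 6 mutations",
    ]
    print(phrase_counts(docs, "phd finger protein 6"))
```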

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
I filed an issue for that. https://issues.apache.org/jira/browse/LUCENE-3832 I'll try to port it myself actually. It shouldn't be a big problem. Dawid On Tue, Feb 28, 2012 at 2:31 PM, Michael McCandless wrote: > Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for > the insert/

Re: Building FST-like automaton queries

2012-02-28 Thread Dawid Weiss
> For steps 2 and 3 you shouldn't use FST at all.  Instead, for 2) use > BasicAutomata.makeString(String) on each of your expanded terms, then > BasicOperations.union on all of those automata to make a single How many input strings do you have? The API Mike mentioned is from a port of the Brics li
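The quoted advice is to build one automaton per expanded term and union them into a single automaton. As a language-neutral illustration of that idea (not Lucene's `BasicAutomata` or `BasicOperations` API), the Python sketch below builds a trie, which is a deterministic automaton that accepts exactly a finite set of strings. The function names and the dictionary representation are assumptions made for this example.

```python
def build_union(terms):
    """Build a trie accepting exactly the given terms.

    Each node is a dict mapping a character to the next node. The empty
    string is used as a key to mark accepting states, which is safe
    because a single character is never the empty string.
    """
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def accepts(automaton, s):
    """Return True if the automaton accepts the whole string s."""
    node = automaton
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return "" in node


if __name__ == "__main__":
    union = build_union(["clog", "dog"])
    for candidate in ("clog", "dog", "clo", "dogs"):
        print(candidate, accepts(union, candidate))
```

A trie shares prefixes only. A minimized automaton would also share suffixes, which matters more as the number of input strings grows.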

Re: Building FST-like automaton queries

2012-02-28 Thread Alan Woodward
>> >> We're only allowing expansions within an edit distance of 1, which should >> keep the numbers of terms down. > > Ahh, ok. So even if the term has two occurrences of cl, only one of > them is allowed to substitute d? Yes, exactly - "cloocl" will be expanded to "doocl" and "clood" only. I
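The expansion rule described in this message (apply a single substitution at a time, so "cloocl" yields only "doocl" and "clood") can be sketched in Python as follows. The confusion table and function name are hypothetical; the thread does not show the actual list of OCR errors or how it is stored.

```python
def expand_once(term, confusions):
    """Return every variant of term that differs by exactly one substitution.

    confusions maps a source fragment to the fragments it may be misread
    as, e.g. {"cl": ["d"]}. Each occurrence of a source fragment is
    replaced on its own, never together with another occurrence.
    """
    variants = []
    for src, replacements in confusions.items():
        start = term.find(src)
        while start != -1:
            for rep in replacements:
                variants.append(term[:start] + rep + term[start + len(src):])
            # Move on to the next occurrence of the same fragment.
            start = term.find(src, start + 1)
    return variants


if __name__ == "__main__":
    table = {"cl": ["d"]}  # hypothetical one-entry confusion table
    print(expand_once("cloocl", table))
    print(expand_once("clog", table))
```

Because each variant changes one occurrence only, a term with k matching positions produces k variants per replacement rather than 2^k, which is what keeps the number of expanded terms small.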

Re: Building FST-like automaton queries

2012-02-28 Thread Michael McCandless
On Tue, Feb 28, 2012 at 8:42 AM, Alan Woodward wrote: > > On 28 Feb 2012, at 13:31, Michael McCandless wrote: > >> Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for >> the insert/delete/transposition changes... >> >> Is the number of edits smallish?  Ie you're not concerned abo

Re: Building FST-like automaton queries

2012-02-28 Thread Alan Woodward
On 28 Feb 2012, at 13:31, Michael McCandless wrote: > Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for > the insert/delete/transposition changes... > > Is the number of edits smallish? Ie you're not concerned about > combinatoric explosion of step 1? We're only allowing ex

Re: Building FST-like automaton queries

2012-02-28 Thread Michael McCandless
Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for the insert/delete/transposition changes... Is the number of edits smallish? Ie you're not concerned about combinatoric explosion of step 1? For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use BasicAutomata.mak

Building FST-like automaton queries

2012-02-28 Thread Alan Woodward
Hello, I'm trying to create a Lucene Query that will take a term and expand it to include common OCR errors (for example, 'cl' is often misread as 'd', so a search for 'clog' should also hit 'dog'). My plan is to do this by generating all the possible variants of a term, using an existing list

Re: [Bulk] RE: RE: Date time as String or Numeric field

2012-02-28 Thread Ganesh
Thanks. I use this field for Rangequery and sort. I think it is best to use Int to gain some heap. Regards Ganesh - Original Message - From: "Uwe Schindler" To: Sent: Tuesday, February 28, 2012 5:08 PM Subject: [Bulk] RE: RE: Date time as String or Numeric field > Hi, > > The long

RE: RE: Date time as String or Numeric field

2012-02-28 Thread Uwe Schindler
Hi, The long or int size mostly only affects the size of e.g. FieldCache during sorting (which doubles its size). The term dictionary's size depends on the number of unique terms and that does not really change by the data type. The size of the values is of minor importance because how the data is
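As a rough worked example of the point about value width, the Python sketch below checks whether a timestamp fits in a 32-bit int at second resolution and estimates the size of a sort cache that holds one value per document. The one-value-per-document model, the document count, and the function names are assumptions for illustration; the 4-byte and 8-byte widths are those of Java's `int` and `long`.

```python
from datetime import datetime, timezone

INT_MAX = 2**31 - 1  # largest value of a signed 32-bit int


def epoch_seconds(dt):
    """Seconds since 1970-01-01 UTC for a timezone-aware datetime."""
    return int(dt.timestamp())


def cache_bytes(num_docs, bytes_per_value):
    """Estimated size of a per-document value array (assumed model)."""
    return num_docs * bytes_per_value


if __name__ == "__main__":
    ts = epoch_seconds(datetime(2012, 2, 28, tzinfo=timezone.utc))
    print("fits in int at second resolution:", ts <= INT_MAX)
    print("fits in int at millisecond resolution:", ts * 1000 <= INT_MAX)

    docs = 10_000_000  # hypothetical index size
    print("int cache bytes: ", cache_bytes(docs, 4))
    print("long cache bytes:", cache_bytes(docs, 8))
```

Under this model the choice between int and long changes only the per-document array used for sorting, not the number of unique terms, which is consistent with the message's remark that the term dictionary is largely unaffected.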

Re: RE: Date time as String or Numeric field

2012-02-28 Thread Ganesh
I tried NumericField with an Integer value and a Long value. There is no difference in space and heap utilization. Should there be one? Are both the same? Regards Ganesh - Original Message - From: "Uwe Schindler" To: Sent: Tuesday, February 28, 2012 3:52 PM Subject: [Bulk] RE: Date time as String

Re: QueryParser strange behavior

2012-02-28 Thread Ian Lea
Then I don't know. Something trivial like white space? What does line.equals("Jesus Christ") say? -- Ian. On Mon, Feb 27, 2012 at 7:42 PM, Damerian wrote: > On 27/2/2012 11:45 AM, Ian Lea wrote: >> >> Does your analyzer look for a field called content, not contents? >> >> >> -- >> Ian

RE: Date time as String or Numeric field

2012-02-28 Thread Uwe Schindler
Hi, NumericField takes more space on disk and possibly more heap (because the term dictionary is larger), but is much faster on RANGE searches (NumericRangeQuery). Depending on index size this can be hundreds of times faster. If you don't want to do numeric searches (like range from...to) but only so

Date time as String or Numeric field

2012-02-28 Thread Ganesh
Hello all, I was using DateTime as String and now I am using NumericField. Using NumericField takes more heap and storage space than the earlier String version. Is it better to move to NumericField or stick with String? I am using this field for search and sort. Regards Ganesh

Re: facet vs group search

2012-02-28 Thread Shai Erera
If I understand 'group search' correctly, you mean grouping search results by some criteria? The main difference between grouping search results and faceted search is that when you group search results by some criteria, your request is something like "give me the top 3 results from each movie categ