Custom Similarity

2011-10-08 Thread Joel Halbert
Hi, Does anyone have a modified scoring (Similarity) function they would care to share? I'm searching web page documents and find the default Similarity seems to assign too much weight to documents with frequent occurrence of a single term from the query and not enough weight to documents that co

Re: Query to always prefer adjacent terms.

2011-09-13 Thread Joel Halbert
spanquery/ for > good info. > > http://lucene.apache.org/java/3_3_0/queryparsersyntax.html tells you > how to use boosting if you are using the query parser. > > > -- > Ian. > > On Tue, Sep 13, 2011 at 2:26 PM, Joel Halbert wrote: > > Hi Folks, > > >

Query to always prefer adjacent terms.

2011-09-13 Thread Joel Halbert
Hi Folks, What is the simplest method of constructing a multi-term query such that the highest-scoring document is always one that contains all terms in the query adjacent to each other? i.e. if I search for "federal reserve" I would prefer documents that contain "Ben Bernanke is the chairman
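
One possible approach, following the query parser boosting syntax mentioned in the reply to this thread, is to search for the individual terms and add the same terms again as a boosted quoted phrase. The sketch below only builds the query string; the class name AdjacentBoostQuery, the method build and the boost value 10 are assumptions made for illustration, and it assumes the terms are plain words with no query-syntax characters. A boost makes adjacent matches score higher but does not strictly guarantee they always rank first.

```java
import java.util.Arrays;
import java.util.List;

public class AdjacentBoostQuery {

    /**
     * Builds a query-parser string of the form:
     *   term1 term2 "term1 term2"^boost
     * The bare terms match documents containing any of the words; the
     * quoted phrase adds extra score when the words are adjacent.
     */
    public static String build(List<String> terms, int phraseBoost) {
        String joined = String.join(" ", terms);
        if (terms.size() < 2) {
            // A single term has no adjacency to reward.
            return joined;
        }
        return joined + " \"" + joined + "\"^" + phraseBoost;
    }

    public static void main(String[] args) {
        // prints: federal reserve "federal reserve"^10
        System.out.println(build(Arrays.asList("federal", "reserve"), 10));
    }
}
```

The resulting string would then be handed to the query parser; the boost value itself is something to tune against real data.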

RE: FastVectorHighlighter.getBestFragments returning null

2011-05-27 Thread Joel Halbert
Joel On Fri, 2011-05-27 at 13:56 +0200, Pierre GOSSE wrote: > Hi, > > Maybe it is related to: > https://issues.apache.org/jira/browse/LUCENE-3087 > > Pierre > > -Original Message- > From: Joel Halbert [mailto:j...@su3analytics.com] > Sent: Fri

FastVectorHighlighter.getBestFragments returning null

2011-05-27 Thread Joel Halbert
Hi, I'm using Lucene 3.0.3. I'm extracting snippets using FastVectorHighlighter; for some snippets (I think always when searching for exact, quoted matches) the fragment is null. The code looks like: query = QueryParser.escape(query); if (exact) {

Re: FastVectorHighlighter and field compression

2011-03-08 Thread Joel Halbert
Thanks Koji, I didn't think it was possible as it stands. On Mon, 2011-03-07 at 21:38 +0900, Koji Sekiguchi wrote: > (11/03/07 1:16), Joel Halbert wrote: > > Hi, > > > > I'm using FastVectorHighlighter for highlighting, 3.0.3. > > > > At the moment t

FastVectorHighlighter and field compression

2011-03-06 Thread Joel Halbert
Hi, I'm using FastVectorHighlighter for highlighting, 3.0.3. At the moment this is highlighting a field which is stored, but not compressed. It all works perfectly. I'd like to compress the field that is being highlighted, but it seems like the new way to compress a stored field is to apply it a

Re: scoring adjacent terms without proximity search

2009-11-02 Thread Joel Halbert
age----- From: Joel Halbert Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: scoring adjacent terms without proximity search Date: Sat, 31 Oct 2009 08:38:29 + Thank you all for your suggestions, I shall have a little think about the best way forward, and report

Re: scoring adjacent terms without proximity search

2009-10-31 Thread Joel Halbert
Thank you all for your suggestions, I shall have a little think about the best way forward, and report back if I do anything interesting that works well. In answer to Grant's question, why not use PhraseQuery, we do not want to have an artificial upper limit on the slop, i.e. we do want to includ

scoring adjacent terms without proximity search

2009-10-30 Thread Joel Halbert
Hi, Without using a proximity search i.e. "cheese sandwich"~5 What's the best way of up-scoring results in which the search terms are closer to each other? E.g. so if I search for: content:cheese content:sandwich How do you ensure that a document with content: "Toasted Cheese Sandwich" scores

Re: similarity function

2009-10-28 Thread Joel Halbert
I suppose this could be summarised as: "how do I set the score of each document result to be the score of the field that best matches the search terms"? -Original Message----- From: Joel Halbert Reply-To: java-user@lucene.apache.org To: Lucene Users Subject: similarit

similarity function

2009-10-28 Thread Joel Halbert
Hi, Given a query with multiple terms, e.g. fish oil, and searching across multiple fields e.g. query= fieldA:fish fieldA:oil fieldB:fish fieldB:oil etc... I don't want to give any more weight to documents that match the same word multiple times (either in the same, or different fields). I am

Re: Where to download lucene-analyzers and lucene-highlighter?

2009-09-26 Thread Joel Halbert
Hi Peng - they are both within the contrib dir in your Lucene package download, e.g. lucene-2.4.0/contrib/highlighter/*.jar lucene-2.4.0/contrib/analyzers/*.jar - Original Message - From: "Peng Yu" To: java-user@lucene.apache.org Sent: Saturday, 26 September, 2009 12:11:02 GMT +00:00 GMT B

Re: metrics for index ~100M docs

2009-09-24 Thread Joel Halbert
e-java/PoweredBy>Best Erick On Thu, Sep 24, 2009 at 11:17 AM, Joel Halbert wrote: > Hi, > > Does anyone know of any recent metrics & stats on building out an index > of ~100mm documents (each doc approx 5k). I'm looking for approx stats > on time to build, time to

metrics for index ~100M docs

2009-09-24 Thread Joel Halbert
Hi, Does anyone know of any recent metrics & stats on building out an index of ~100mm documents (each doc approx 5k). I'm looking for approx stats on time to build, time to query and infrastructure requirements (number of machines & spec) to reasonably support an index of such a size. Thanks, J

Displaying search result data - stored fields vs external source

2009-09-15 Thread Joel Halbert
Hi, When using Lucene I always consider two approaches to displaying search result data to users: 1. Store any fields that we index and display to users in the Lucene Documents themselves. When we perform a search simply retrieve the data to be displayed from the Lucene documents themselves. or

Re: Synchronizing Lucene indexes across 2 application servers

2009-06-19 Thread Joel Halbert
me both the servers have up-to-date indexes. I was thinking what > could be the best architecture/design strategy to do so given the fact that > any of the 2 application servers could be serving search request depending > upon its availability. > > Any inputs please? > > Thanks for

Re: Lucene performance: is search time linear to the index size?

2009-06-19 Thread Joel Halbert

Re: London Open Source Search meetup - Mon 15th June

2009-06-15 Thread Joel Halbert
Hi Rich - from what time? -Original Message- From: Richard Marr Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: London Open Source Search meetup - Mon 15th June Date: Fri, 12 Jun 2009 12:54:30 +0100 Hi all, Just a quick reminder that this is happening

Re: relevance function for scores

2009-05-27 Thread Joel Halbert
ave a good idea to get the distributions less than some reasonable time? On 2009. 05. 26, at 8:15 PM, Joel Halbert wrote: > Yes, something like this might work, although rather than having a > cutoff determined by the difference between two successive document > scores (Doc(n) and D

Re: relevance function for scores

2009-05-26 Thread Joel Halbert
e thing to check is that the scores are indeed sorted in descending > order to begin with. For example, I don't think the hits in > TopDocCollector and its brethren are strictly ordered this way (no?). > > -Babak > > On Mon, May 18, 2009 at 6:52 AM, Joel Halbert wrote:

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
TrieRangeQuery - thanks for the tip. -Original Message- From: Michael McCandless Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Does Lucene fail fast on boolean queries? Date: Thu, 21 May 2009 11:39:23 -0400 On Thu, May 21, 2009 at 10:58 AM, Joel

Re: Parsing large xml files

2009-05-21 Thread Joel Halbert
Try http://piccolo.sourceforge.net/; it is small and fast. -Original Message- From: Michael Barbarelli Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Parsing large xml files Date: Thu, 21 May 2009 15:52:00 +0100 Why not use an XML pull parser? I recommen

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
"doc=5" can be asked for by Lucene. Also note that this is an internal implementation detail -- Lucene could easily change to do batch processing of AND'd queries in which case docs 5,10 could easily be iterated on. So I wouldn't "rely" on this in your app. Mike On Thu, M

Re: Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
: java-user@lucene.apache.org To: java-user@lucene.apache.org Subject: Re: Does Lucene fail fast on boolean queries? Date: Thu, 21 May 2009 10:29:57 -0400 Yes. As soon as Lucene sees that the Name docID iteration has ended, the search will break. Mike On Thu, May 21, 2009 at 8:44 AM, Joel Halbert

Re: hit highlighting in lucene ?

2009-05-21 Thread Joel Halbert
dles non-eng and eng in equally good ways? Or any other ideas on the same ? Thanks, KK. On Thu, May 21, 2009 at 6:18 PM, Joel Halbert wrote: > The highlighter should be language independent. So long as you are > consistent with your use of Analyzer between > indexing/query/highlighting

Re: hit highlighting in lucene ?

2009-05-21 Thread Joel Halbert
The highlighter should be language independent, so long as you are consistent with your use of Analyzer between indexing/query/highlighting. As for the most appropriate Analyzer to use for your local language, this is a separate question - especially if you are using stop word and stemming filters

Does Lucene fail fast on boolean queries?

2009-05-21 Thread Joel Halbert
Hi, When Lucene performs a Boolean query, say: Field Name = Male AND Field Age = 30 assuming the resultant docs for each portion of the query were: Matching docs for: Name = 1,2 Matching docs for: Age = 1,2,5,10 Will Lucene stop searching for documents matching the Age term once it has found
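
The replies in this thread describe the search stopping as soon as one clause's docID iteration has ended. A minimal sketch of that idea is shown below, using the doc IDs from the question. It is not Lucene's actual implementation (which the thread notes is an internal detail that could change); the class name ConjunctionSketch and the method intersect are hypothetical, and each clause is assumed to supply its matching doc IDs in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

public class ConjunctionSketch {

    /**
     * Intersects two ascending doc ID lists. The loop ends as soon as
     * either list is exhausted, so trailing IDs in the longer list are
     * never examined.
     */
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);   // doc matches both clauses
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;              // advance the clause that is behind
            } else {
                j++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] name = {1, 2};
        int[] age = {1, 2, 5, 10};
        // Docs 5 and 10 are never visited: the Name list ends after doc 2.
        System.out.println(intersect(name, age)); // prints [1, 2]
    }
}
```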

RangeQuery & TooManyClausesException : Lucene 2.4

2009-05-20 Thread Joel Halbert
Hi, Looking at the docs for the 2.4 codebase, for RangeQuery http://lucene.apache.org/java/2_4_0/api/index.html?org/apache/lucene/search/RangeQuery.html there is a comment that a TooManyClauses exception is no longer thrown. Does this mean that it is now safe to use RangeQuery without worrying a

Re: Too many results with RegexQuery

2009-05-18 Thread Joel Halbert
"but in some cases the search returns too many results" do you *really* mean you get "too many results"? or do you actually mean you get a "too many terms" exception due to the query expansion? -Original Message- From: Huntsman84 Reply-To: java-user@lucene.apache.org To: java-user@lucen

Re: relevance function for scores

2009-05-18 Thread Joel Halbert
ce function for scores Date: Mon, 18 May 2009 09:50:10 -0400 In that case, I'll have to defer to folks who actually know something about that part of the code. Erick On Mon, May 18, 2009 at 9:25 AM, Joel Halbert wrote: > Hi Erick, > > Thanks for the pointer. Sorry if the q

Re: relevance function for scores

2009-05-18 Thread Joel Halbert
ou can examine the scores and put them in buckets any way you want, all you're doing is spinning through a small data structure performing some calculations. HTH Erick On Mon, May 18, 2009 at 8:52 AM, Joel Halbert wrote: > Hi, > > I'd like to apply a score filter. I realise

relevance function for scores

2009-05-18 Thread Joel Halbert
Hi, I'd like to apply a score filter. I realise that filtering by absolute (i.e. anything less than x) scores is pretty meaningless. In my case I want to filter based on relative score - or on some function of score which looks for clustering of documents around certain score values. Context: I
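
One simple way to filter on relative rather than absolute score is to keep only hits whose score is at least some fraction of the top hit's score. The sketch below is an illustration under assumptions, not a method proposed in the thread: the class name RelativeScoreFilter, the method cutoff and the ratio 0.5 are all hypothetical, and the scores are assumed to be already sorted in descending order.

```java
public class RelativeScoreFilter {

    /**
     * Returns how many leading hits to keep from a descending score
     * array: every hit whose score is at least minRatio times the
     * top score.
     */
    public static int cutoff(float[] scores, float minRatio) {
        if (scores.length == 0) {
            return 0;
        }
        float threshold = scores[0] * minRatio;
        int keep = 0;
        while (keep < scores.length && scores[keep] >= threshold) {
            keep++;
        }
        return keep;
    }

    public static void main(String[] args) {
        float[] scores = {2.0f, 1.8f, 1.0f, 0.2f};
        // Threshold is 1.0, so the first three hits are kept.
        System.out.println(cutoff(scores, 0.5f)); // prints 3
    }
}
```

A gap-based variant (cutting where the drop between successive scores is largest) would follow the same shape; which rule suits a given index is something to judge against real result sets.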

Re: analysis filter wrapper

2009-05-14 Thread Joel Halbert
You can use your Analyzer to get a token stream from any text you give it, just like Lucene does. Something like: String text = "your list of words to analyze and tokenize"; TokenStream ts = YOUR_ANALYZER.tokenStream(null, new StringReader(text)); Token token = new Token(); while((ts.next(tok

RE: Upper limit on document field value length ?

2009-05-13 Thread Joel Halbert
/IndexWriter. MaxFieldLength.html And the corresponding IndexWriter ctors. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Joel Halbert [mailto:j...@su3analytics.com] > Sent: Wednesday, May 13, 200

Upper limit on document field value length ?

2009-05-13 Thread Joel Halbert
Is there a limit to the size of a field which Lucene will index? i.e. for very large field values are only the first n tokens or n characters indexed? If so is there a way of upping/removing this limit? Rgs, Joel

Filters - at what stage are they applied?

2009-02-19 Thread Joel Halbert
Hi, By way of clarification, when a filter is used with a search query, is the filter applied only to documents that matched the search query or is it applied to all documents in the index before the query is executed? Rgs, Joel

Re: what's the best practice for getting "next page" of hits?

2009-02-19 Thread Joel Halbert
Out of interest, if the index is entirely in memory (using a RAMDir) is there any significant difference in performance between options (a) and (b) as outlined below? Rgs, Joel -Original Message- From: Ganesh Reply-To: java-user@lucene.apache.org To: java-user@lucene.apache.org, rolaren..

Pattern for maintaining FSDirectory copy of RAMDirectory

2009-02-16 Thread Joel Halbert
Hi, I have a RAMDirectory based index. The document source for the index is a database table, where content to be indexed is stored alongside a status (pending_index, indexed, pending_delete, deleted). Each time the application is started, and periodically thereafter, all documents from the databa

Re: search(Query query, HitCollector results)

2009-02-15 Thread Joel Halbert
Presumably there is no score ordering to the hit IDs Lucene delivers to a HitCollector? i.e. they are delivered in the order they are found and score is neither ascending nor descending i.e. the next score could be higher or lower than the previous one? -Original Message- From: Mark Miller

Term precedence

2009-02-15 Thread Joel Halbert
When constructing a query, using a series of terms e.g. Term1=X, Term2=Y etc... does it make sense, as in SQL, to place the most restrictive term query first? i.e. if I know that the query will be mainly constrained by the value of Term1, does having this as the first in the query make the exec

Upper limit on number of Fields

2009-02-15 Thread Joel Halbert
Hi, Is there any practical limit on the number of fields that can be maintained on an index? My index looks something like this, 1 million documents. For each group of 1000 documents I might have 10 indexed fields. This would mean in total about 10,000 fields. Am I going to run into any issues her

Optimal Solution for Unique Field Values

2009-02-15 Thread Joel Halbert
Hi, I'm looking for an optimal solution for extracting unique field values. The rub is that I want to be able to perform this for a unique subset of documents...as per the example: I have an index with Field1 and Field2. I want "all unique values of Field1 where Field2=X". Other than actually p