RE: Using Lucene to match document sets to each other

2011-12-19 Thread Paul Allan Hill
I'm not sure I understand what your field arrangement would be when you say "[T]he items I'm pulling in from the web contain large bodies of text (descriptions) whereas the products in my catalog consist of shorter fields such as product name, manufacturer, product code, etc. So using the smaller

deprecated optimize()!

2012-01-27 Thread Paul Allan Hill
After reading all about the renaming of optimize() and updating my Lucene libraries to 3.4, I was surprised and confused by what I found. I have a one-segment index (all files are named _1*.*) that was created with 3.0.1 code and had been optimized many times (all with 3.0.1 code). The first

RE: deprecated optimize()!

2012-01-27 Thread Paul Allan Hill
Thanks for the reply, > > The first time my code used the 3.4 libraries with version level set > > to 3.4 and it tried > > to optimize() (still using this now deprecated old call), the new code > went wild! > > It took up more memory than the heap was limited to, so I believe it > > is taking > >

Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-01-31 Thread Paul Allan Hill
In Lucene 3.4, I recently implemented "Translating PhraseQuery to SpanNearQuery" (see Lucene in Action, page 220) because I wanted _order_ to matter. Here is the exact code I call from getFieldsQuery once I know I'm looking at a PhraseQuery; I think it is taken exactly from the book. static Q

RE: best query for one-box search string over multiple types & fields

2012-01-31 Thread Paul Allan Hill
> short of it: i want "queen bohemian rhapsody" to return that song named "Bohemian Rhapsody" by the artist named "Queen", rather than songs with titles like "Bohemian Rhapsody (Queen Cover)".
Have you looked at MultiFieldQueryParser and its use of extra boosts

RE: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-02-01 Thread Paul Allan Hill
Thanks for the discussion, I really appreciate you pointing out that the > Code here ignores PhraseQuery (PQ)'s positions: And by "here" you mean my original code, not your suggestion. > To accommodate for this, the overall extra gap can be added to the slop: > int gap = (pp[pp.length] -

RE: Phrase Queries vs. SpanTermQueries exact phrases vs. stop words

2012-02-01 Thread Paul Allan Hill
> Doron wrote:
> > int gap = (pp[pp.length] - pp[0]) - (pp.length - 1);
int gap = (pp[pp.length-1] - pp[0]) - (pp.length - 1);
Don't want to cause an IndexOutOfBoundsException.
-Paul
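The corrected gap computation can be tried outside Lucene with a minimal sketch. It assumes pp is an int[] of term positions, such as the array PhraseQuery.getPositions() returns; the class and method names (PhraseGap, extraGap) are hypothetical and exist only for illustration.

```java
// Sketch only: PhraseGap and extraGap are hypothetical names, not Lucene API.
final class PhraseGap {

    /**
     * Counts the unfilled positions ("holes") between the first and last term
     * of a phrase, e.g. where stop words were removed by the analyzer.
     * Assumes pp is sorted in ascending position order.
     */
    static int extraGap(int[] pp) {
        if (pp.length == 0) {
            return 0; // nothing to measure
        }
        // The last valid index is pp.length - 1; pp[pp.length] would throw.
        return (pp[pp.length - 1] - pp[0]) - (pp.length - 1);
    }

    public static void main(String[] args) {
        // Contiguous terms at positions 0, 1, 2
        System.out.println(extraGap(new int[] {0, 1, 2}));
        // One position skipped between the first and second term
        System.out.println(extraGap(new int[] {0, 2, 3}));
    }
}
```

The result is the amount by which the span slop would have to grow so that the skipped positions do not prevent a match.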

recording a universal ID from DocID in a CustomScoreQuery

2012-02-03 Thread Paul Allan Hill
My index does NOT have a simple UID; it uses the file PATH to the file as the unique key. I was implementing a CustomScoreQuery which not only tweaked the score but also wanted to write down which documents had passed through this part of the overall rebuilt query, so that I could further mess with t

RE: recording a universal ID from DocID in a CustomScoreQuery

2012-02-06 Thread Paul Allan Hill
hat > you needed for the > subsequent messing around. > > > -- > Ian. > > > On Sat, Feb 4, 2012 at 12:09 AM, Paul Allan Hill wrote: > > My Index does NOT have a simple UID, it uses the file PATH to the file as > > the unique key. > > I was implemen

Please explain DisjunctionMaxQuery JavaDoc.

2012-02-08 Thread Paul Allan Hill
What the heck is the JavaDoc for DisjunctionMaxQuery saying: "A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matchi
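One way to read the quoted sentence is as simple arithmetic: take the best subquery score for the document, then add a fraction of the scores of the other subqueries that also matched. The sketch below is plain Java, not Lucene code; DisMaxScore, combine, and the sample values are hypothetical, and it assumes non-negative subquery scores.

```java
// Sketch only: illustrates the "max plus tie breaker" arithmetic, not Lucene's scorer.
final class DisMaxScore {

    /**
     * Combines per-subquery scores for ONE document.
     * subScores holds the scores of the subqueries that matched the document.
     * tieBreaker is the multiplier applied to every score except the maximum.
     */
    static float combine(float[] subScores, float tieBreaker) {
        float max = 0f;
        float sum = 0f;
        for (float s : subScores) {
            sum += s;
            max = Math.max(max, s);
        }
        // Best score counts in full; the rest are scaled by the tie breaker.
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        float[] scores = {2.0f, 1.0f};             // e.g. a title match and a body match
        System.out.println(combine(scores, 0.0f)); // pure maximum
        System.out.println(combine(scores, 0.5f)); // maximum plus half of the other score
        System.out.println(combine(scores, 1.0f)); // plain sum of all scores
    }
}
```

With a tie breaker of 0 only the best-matching field counts; with 1 the result behaves like a plain sum, so values in between trade off the two behaviours.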

RE: Please explain DisjunctionMaxQuery JavaDoc.

2012-02-08 Thread Paul Allan Hill
> -Original Message- > From: Paul Allan Hill [mailto:p...@metajure.com] > Sent: Wednesday, February 08, 2012 2:42 PM > To: java-user@lucene.apache.org > Subject: Please explain DisjunctionMaxQuery JavaDoc. > > What the heck is the JavaDoc for Disju

norm for a document in a CustomScoreQuery

2012-02-10 Thread Paul Allan Hill
I was looking into the possibility that _some_ subqueries might discount (actually remove) field norms. I'm trying out the view that, in general, norm values seem appropriate when looking for terms, but when searching for phrases that my custom query parsing has added to the query, the document bo

RE: SweetSpotSimilarity

2012-02-15 Thread Paul Allan Hill
I'd love to hear what you find out. I have been working with this also. I only changed the sweet spot to a slightly larger range than the one in the original paper (but kept the same steepness) and I tweaked the sloppy freq to not score multiple occurrences of a phrase as strongly as they are i

RE: SweetSpotSimilarity

2012-02-17 Thread Paul Allan Hill
> -Original Message- > From: Chris Hostetter [mailto:hossman_luc...@fucit.org] > As for what hyperbolicTf is trying to do ... it creates a hyperbolic function > letting you specify a hard max > no matter how many terms there are. A picture -- or more precisely a graph -- would be worth a
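In place of a graph, the shape being described can be tabulated with a small sketch. It assumes a tanh-like curve that rises from a floor (min) toward a hard ceiling (max), centred on an offset; the class name, parameter names, and sample values are hypothetical and are not confirmed to match SweetSpotSimilarity's implementation or defaults.

```java
// Sketch only: a hyperbolic (tanh-shaped) tf curve with a hard maximum.
final class HyperbolicCurve {

    /**
     * Returns a value that approaches min for small freq, passes through the
     * midpoint of [min, max] at freq == xOffset, and approaches max for large freq.
     * base (> 1) controls how steeply the curve rises.
     */
    static double tf(double freq, double min, double max, double base, double xOffset) {
        double x = freq - xOffset;
        double up = Math.pow(base, x);
        double down = Math.pow(base, -x);
        double tanhLike = (up - down) / (up + down); // ranges over (-1, 1)
        return min + (max - min) / 2.0 * (tanhLike + 1.0);
    }

    public static void main(String[] args) {
        // Print a small table of freq versus curve value for assumed parameters.
        for (int freq = 0; freq <= 30; freq += 5) {
            System.out.printf("%2d  %.4f%n", freq, tf(freq, 0.0, 2.0, 1.3, 10.0));
        }
    }
}
```

However large the term frequency gets, the value never exceeds max, which is the "hard max" property mentioned in the quoted reply.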

Upgrade Path Lucene 3.0.2 to 3.4

2011-11-16 Thread Paul Allan Hill
As it says in the title, we are moving from 3.0.2 to 3.4. I am interested in issues about the need to build a new index or just keep changing the current one. My company has been busy building software and has not upgraded the Lucene and Tika libraries since last year, but I'm trying to

RE: Best document format / markup for text indexing?

2011-11-22 Thread Paul Allan Hill
> What is the best format/markup/ebook standard/document standard/other to use > for easiest and best text search support? The helpful Tika libraries can parse any number of formats and then index the text into Lucene, so I'm thinking the question is what is the better format when you want to d