Using Lucene to match document sets to each other

2011-12-15 Thread Josh Stone
I have a use case for which I'm trying to figure out the best way to use Lucene and could use some guidance. I have a set of documents representing products in a catalog (name, description, etc.). I then pull down data from different sources such as Ebay and Amazon and need to determine if the ite

Re: Trying to generate a list of DISTINCT field names from all documents in an index

2011-12-15 Thread todd.hunt
Thank you, Trejkaz. I was just about to post the fact that I /finally/ found that method by looking at the source code for LUKE. There is a night and day difference in performance.

RE: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Burton-West, Tom
Hi Mike, If you just need the IDF you can run HighFreqTerms.java in contrib against either your sample index or your index to get the N terms with the highest DF values (i.e., lowest IDF). If you have a large index, giving it lots of memory seems to help. Depending on your use case, you may inst

RE: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Mike O'Leary
Hi Simon, I guess in a sense we are interested in obtaining a list of the top N terms, but they would be the top terms in the sense that they have the lowest IDF values. These would be the terms that appear in all or almost all documents in the document set. This is not a count of the number of
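
A minimal plain-Java sketch of the idea discussed in this thread, not taken from any of the posts: count how many documents each term occurs in (document frequency, DF), then rank terms by descending DF, which is the same as ascending IDF. The names IdfSketch, docFreqs, idf and lowestIdfTerms are hypothetical, and documents are assumed to be already tokenized into lists of strings. The idf formula is the one used by DefaultSimilarity in Lucene 3.x, 1 + ln(numDocs / (docFreq + 1)). Against a real index the counts would come from the index itself rather than from in-memory lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class IdfSketch {
    // Document frequency: the number of documents that contain each term at least once.
    static Map<String, Integer> docFreqs(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // De-duplicate within a document so repeated terms count once.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Same shape as DefaultSimilarity.idf in Lucene 3.x.
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // The n terms with the highest DF (lowest IDF); ties are broken alphabetically.
    static List<String> lowestIdfTerms(List<List<String>> docs, int n) {
        Map<String, Integer> df = docFreqs(docs);
        List<String> terms = new ArrayList<>(df.keySet());
        terms.sort((a, b) -> {
            int byDf = Integer.compare(df.get(b), df.get(a));
            return byDf != 0 ? byDf : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    public static void main(String[] args) {
        // Toy, made-up documents purely for illustration.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("the", "cat"),
                Arrays.asList("the", "dog"),
                Arrays.asList("the", "cat", "sat"));
        Map<String, Integer> df = docFreqs(docs);
        for (String term : lowestIdfTerms(docs, 2)) {
            System.out.println(term + " df=" + df.get(term)
                    + " idf=" + idf(df.get(term), docs.size()));
        }
    }
}
```

Terms at the top of this ranking appear in all or almost all documents, which is what makes them candidates for a custom stopword list.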

Re: Obtaining IDF values for the terms in a document set

2011-12-15 Thread Simon Willnauer
On Thu, Dec 15, 2011 at 6:33 PM, Mike O'Leary wrote: > We have a large set of documents that we would like to index with a > customized stopword list. We have run tests by indexing a random set of about > 10% of the documents, and we'd like to generate a list of the terms in that > smaller set

Obtaining IDF values for the terms in a document set

2011-12-15 Thread Mike O'Leary
We have a large set of documents that we would like to index with a customized stopword list. We have run tests by indexing a random set of about 10% of the documents, and we'd like to generate a list of the terms in that smaller set and their IDF values as a way to create a starter set of stopw

NGramTokenFilter filters out small tokens?

2011-12-15 Thread Rob Hasselbaum
Hi. I'm trying to configure an analyzer to be somewhat forgiving of spelling mistakes in longer words of a search query. So, for example, if a word in the query matches at least five characters of an indexed word (token), I want that to be a hit. NGramTokenFilter with a minimum gram size of 5 seems
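
A plain-Java illustration of why a minimum gram size of 5 leaves short tokens with nothing to match on. This is a sketch of character n-gram generation in general, not Lucene's NGramTokenFilter source, and the order of the grams it produces is not claimed to match Lucene's. The names NGramSketch, ngrams and ngramsKeepingShort are hypothetical; the second method shows one possible workaround (emit a short token unchanged), which is an assumption and not something proposed in the post.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class NGramSketch {
    // All character n-grams of the token with length between minGram and maxGram.
    // A token shorter than minGram yields an empty list.
    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int size = minGram; size <= maxGram; size++) {
            for (int start = 0; start + size <= token.length(); start++) {
                grams.add(token.substring(start, start + size));
            }
        }
        return grams;
    }

    // Hypothetical workaround: keep tokens shorter than minGram as a single whole token.
    static List<String> ngramsKeepingShort(String token, int minGram, int maxGram) {
        if (token.length() < minGram) {
            return Collections.singletonList(token);
        }
        return ngrams(token, minGram, maxGram);
    }

    public static void main(String[] args) {
        System.out.println(ngrams("search", 5, 5));
        System.out.println(ngrams("cat", 5, 5));
        System.out.println(ngramsKeepingShort("cat", 5, 5));
    }
}
```

With this kind of generation, a query word of four characters or fewer produces no grams at all unless short tokens are handled separately.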

Trying to generate a list of DISTINCT field names from all documents in an index

2011-12-15 Thread todd.hunt
Hi, I have come across a problem with our code that is not scaling well and I'm hoping there is a way I can tweak our existing code to run faster. We are indexing on a Java object called "Node". A "Node" can have one or more "Attributes". The "Attributes" consist of a key / value pair and the

Re: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Shai Erera
I opened LUCENE-3649. Shai On Thu, Dec 15, 2011 at 2:50 PM, Shai Erera wrote: > Sure, as soon as I'll be in front of a computer. > > Shai > On Dec 15, 2011 2:48 PM, "Uwe Schindler" wrote: > >> Yes, I could attach the patch there! Will you open it? >> >> - >> Uwe Schindler >> H.-H.-Meier-Al

RE: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Shai Erera
Sure, as soon as I'll be in front of a computer. Shai On Dec 15, 2011 2:48 PM, "Uwe Schindler" wrote: > Yes, I could attach the patch there! Will you open it? > > - > Uwe Schindler > H.-H.-Meier-Allee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original M

RE: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Shai Erera
... issue for *it*, not 'other' :) Shai On Dec 15, 2011 2:47 PM, "Shai Erera" wrote: > If you already did it, then a patch will be great. Perhaps we should open > an issue for other? > > Shai > On Dec 15, 2011 11:44 AM, "Uwe Schindler" wrote: > >> Alternatively in overview.html (which fits bett

RE: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Uwe Schindler
Yes, I could attach the patch there! Will you open it? - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Shai Erera [mailto:ser...@gmail.com] > Sent: Thursday, December 15, 2011 1:47 PM > To: java-user@luce

RE: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Shai Erera
If you already did it, then a patch will be great. Perhaps we should open an issue for other? Shai On Dec 15, 2011 11:44 AM, "Uwe Schindler" wrote: > Alternatively in overview.html (which fits better). > > There is only one limitation according to docs: The first sentence is > copied over to the

RE: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Uwe Schindler
Alternatively in overview.html (which fits better). There is only one limitation according to docs: The first sentence is copied over to the package description and if the first sentence is formatted as or whatever, it kills the whole Javascript formatting. So to make it perfect (and it looks r

RE: Broken link in Lucene 3.5 JavaDoc?

2011-12-15 Thread Uwe Schindler
If you remove the useless CSS in the HTML it looks perfect in package.html! - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: Shai Erera [mailto:ser...@gmail.com] > Sent: Thursday, December 15, 2011 8:39 A