Re: Help: tweaking search - reducing IDF skew and implementing score cutoff

2006-02-09 Thread Chris Hostetter
: Sunday gets ranked highly due to idf. How do I reduce this skewness : due to the date-posted field? I saw a reference earlier to : ConstantScoreRangeQuery on JIRA - is it the solution? Yes. RangeQuery expands to a BooleanQuery containing all of the terms in the range. The number of terms (and the fr
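
A minimal sketch of why a rarely used date term can dominate the score, assuming the classic Lucene DefaultSimilarity formula idf = 1 + ln(numDocs / (docFreq + 1)). The class name and all document counts are hypothetical and only illustrate the effect described in the question; this is not code from the thread.

```java
public class IdfSkewSketch {

    // Classic Lucene DefaultSimilarity idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int numDocs = 10000;    // hypothetical index size
        int sundayDocs = 20;    // hypothetical: few stories posted on a Sunday
        int weekdayDocs = 400;  // hypothetical: many stories posted on a weekday

        // The rarer date term gets the larger idf, so documents matching it
        // are pushed up even though the date says nothing about relevance.
        System.out.printf("idf(sunday)  = %.3f%n", idf(sundayDocs, numDocs));
        System.out.printf("idf(weekday) = %.3f%n", idf(weekdayDocs, numDocs));
    }
}
```

A constant-scoring range query avoids this because every document in the date range contributes the same score, whatever the frequency of its date term.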

Help: tweaking search - reducing IDF skew and implementing score cutoff

2006-02-09 Thread Chun Wei Ho
Hi, I am running a search for something akin to a news site, where each news document has a date, title, keywords/bylines, summary fields and then the actual content. Using Lucene for this database of documents, it seems that: 1. The relevancy score is skewed drastically by the actual number of ne

Re: Custom filters and booleanquery (MUST_NOT)

2006-02-09 Thread Amrit Jassal
Chris Thanks. Appreciate your comment about using ConstantScoreQuery as well. Amrit On 2/9/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > : I am experimenting with using a custom filter with QueryParser and ran > into > : some unanticipated issues with using NOT terms. I narrowed down the

retrieving a documentNumber from a document

2006-02-09 Thread Paulo Silveira
Hello, IndexReader.delete receives a docNum. How do I find the docNum for a given document? Will I always need to get this number (sometimes called id in the javadocs) from Hits.id? Thanks -- Paulo Silveira http://www.paulo.com.br/

Re: Duplicates recods in index

2006-02-09 Thread Daniel Noll
Pasha Bizhan wrote: Hi, From: Daniel Noll [mailto:[EMAIL PROTECTED] I don't know how this will be for efficiency. If you did it that way, you would have to re-open the index for every single document you add, otherwise you might miss a duplicate which was added recently. You do not need

Re: Queries not derived from the text index

2006-02-09 Thread Daniel Noll
Chris Hostetter wrote: : I think that overriding getFieldQuery would work, yeah... you're right. :It's just a matter of comparing efficiency of this: : : BooleanQuery of (TermQuery, FilteredQuery of (AllDocsQuery, Filter)) : : to the efficiency of this: : : FilteredQuery of (TermQue

lucene & ejbs

2006-02-09 Thread zzzzz shalev
I am currently implementing Lucene using multiple RMI servers as index searchers. Has anyone done this using EJBs (any tips)? If so, are there any performance hits? Thanks in advance,

Re: Build vs. Buy?

2006-02-09 Thread P. Alex. Salamanca R.
On the other hand, if you want the cheapest option, why not give the Google Search Appliance a chance?

Re: Custom filters and booleanquery (MUST_NOT)

2006-02-09 Thread Chris Hostetter
: I am experimenting with using a custom filter with QueryParser and ran into : some unanticipated issues with using NOT terms. I narrowed down the issue ... : bquery = new BooleanQuery(); : bquery.add(new BooleanClause(fq, BooleanClause.Occur.MUST_NOT)); :

Custom filters and booleanquery (MUST_NOT)

2006-02-09 Thread Amrit Jassal
I am experimenting with using a custom filter with QueryParser and ran into some unanticipated issues with using NOT terms. I narrowed down the issue into the following test case. I am expecting a MUST_NOT booleanclause within a booleanquery to return a resultset that is the complement of a MUST cl
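
A rough model of the behaviour being asked about, written against plain java.util.BitSet rather than the Lucene API. It assumes the usual BooleanQuery semantics: prohibited clauses only remove documents from what the positive clauses matched, so a query made up solely of MUST_NOT clauses has nothing to subtract from. The class and method names are hypothetical.

```java
import java.util.BitSet;

public class MustNotSketch {

    // Start from the docs matched by the positive clauses, then remove
    // every doc matched by a prohibited (MUST_NOT) clause.
    static BitSet evaluate(BitSet positive, BitSet prohibited) {
        BitSet result = (BitSet) positive.clone();
        result.andNot(prohibited);
        return result;
    }

    // Stand-in for a "match every document" clause over doc ids 0..maxDoc-1.
    static BitSet allDocs(int maxDoc) {
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);
        return all;
    }

    public static void main(String[] args) {
        BitSet filtered = new BitSet();
        filtered.set(1);
        filtered.set(3);

        // MUST_NOT on its own: no positive clause, so the result is empty.
        System.out.println(evaluate(new BitSet(), filtered));

        // MUST_NOT paired with a match-all clause: the complement of the filter.
        System.out.println(evaluate(allDocs(5), filtered));
    }
}
```

In this model, getting the complement of a filter requires pairing the prohibited clause with some positive clause that matches every document.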

RE: How to get mapping of query terms to number of their occurrences in a doc?

2006-02-09 Thread Dmitry Goldenberg
It seems, from the javadoc, that the 10K default is enforced to avoid a possible OutOfMemoryError. I wonder how safe/unsafe it is to set the value to maximum possible, if we don't impose any limit on customers' document sizes. Perhaps, the best solution is to expose the value as configurable b

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Kevin Dutcher
Thanks Hoss... You're absolutely right! Kevin On 2/9/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: > > > : I need all the documents returned from the search and am manipulating > the > : results with a custom HitCollector, therefore I can't use filters. > > I don't understand this comment. The

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Chris Hostetter
: I need all the documents returned from the search and am manipulating the : results with a custom HitCollector, therefore I can't use filters. I don't understand this comment. There are certainly methods in the Searchable interface that allow you to use both a Filter and a HitCollector together

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Kevin Dutcher
> > One more thing: in case these queries are generated, you might > consider building the corresponding (nested) BooleanQuery yourself > instead of using the QueryParser. > > Regards, > Paul Elschot I'll give that a try. Thanks Paul.

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Paul Elschot
On Thursday 09 February 2006 00:52, Kevin Dutcher wrote: > Hey Everyone, > > I'm running into the "More than 32 required/prohibited clauses in query" > exception when running a query. I thought I understood the problem but the > following two scenarios confuse me. > > 1st - No Error > 33 required

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Paul Elschot
On Thursday 09 February 2006 15:25, Kevin Dutcher wrote: > > I don't know a lot about the error you're encountering (or not encountering > > as the case may be) but please for the love of all that is sane use a > > Filter instead of putting all those categories in your Query. > > > > Your search perf

Re: How to get mapping of query terms to number of their occurrences in a doc?

2006-02-09 Thread Otis Gospodnetic
There is a HighFreqTerms class in contrib/misc that may be interesting to you. I just modified it slightly locally last night to limit things to a specific field, and will commit it later. Otis - Original Message From: Dmitry Goldenberg <[EMAIL PROTECTED]> To: java-user@lucene.apa

Re: 1.9 lucene version

2006-02-09 Thread Otis Gospodnetic
Hi, Answers: 1) No date set yet 2) I've been happily using 1.9 in production - see http://www.simpy.com/ 3) Yes, there have been some memory improvements - see CHANGES.txt file in Subversion Otis Hello all, I have a couple of questions for the community about the 1.9 Lucene version. As I

Re: Queries not derived from the text index

2006-02-09 Thread Otis Gospodnetic
Daniel, If you end up trying all 3 options here, please report your findings (speed/memory). I'm about to rework some of the Lucene stuff behind Simpy.com, and am looking at Filters used this way (+ sort by date or some int) more and more. Thanks, Otis - Original Message From: Chris

Re: opening index readers and writers to keep indexes updated

2006-02-09 Thread Otis Gospodnetic
Definitely batch your adds/updates/deletes, and reuse the IndexReader as you described instead of opening a new one for every search. I _believe_ you can keep the same IndexWriter for adds, as long as you don't overlap it with an IndexReader that does deletes. If you have Lucene in Action, che

Re: Word files

2006-02-09 Thread Otis Gospodnetic
I'm not sure if it will work better than what you've got, but you can try the code from section 7.5 in Lucene in Action: http://www.lucenebook.com/search?query=word+document+microsoft The code is free, even if you don't have the book. Otis - Original Message From: [EMAIL PROTECTED] To

Boosting

2006-02-09 Thread Sebastian Menge
Hi, I don't know much about Lucene's scoring, but my intuition tells me that a boost of "2" tells Lucene to regard that field/document as "double-important", while a boost of "0.5" tells Lucene to regard the field/document as "half-important". Thus the boost is exponential, is that right? If not,

anyone interested in taking over textmining.org?

2006-02-09 Thread sackley
The TextMining.org website keeps getting hacked and I don't have the time to upgrade postnuke to a more secure version. Also, because of legal reasons I can't maintain the software. I am more than willing to "hand-off" the project to lucene or someone else. It's an apache 2 license so anyone ca

RE: Word files & Build vs. Buy?

2006-02-09 Thread Dmitry Goldenberg
Chris, Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? and, what do you see as the timeframe for adding WordPerfect support? Are you considering supporting any other sources such as MS Project, Framemaker, etc? Thanx, - Dmitry _

opening index readers and writers to keep indexes updated

2006-02-09 Thread Paulo Silveira
Hello everybody. I have a big index that will be stored in the FS. I have lots of updates, insertions and deletions in the index, and I would like to minimize the number of "phantom reads". I've seen this link in the wiki: http://wiki.apache.org/jakarta-lucene/UpdatingAnIndex So, what about my i

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread Kevin Dutcher
> I don't know a lot about the error you're encountering (or not encountering > as the case may be) but please for the love of all that is sane use a > Filter instead of putting all those categories in your Query. > > Your search performance and your scores will thank you. I need all the documents

Re: Word files & Build vs. Buy?

2006-02-09 Thread Christiaan Fluit
Nick Burch wrote: You could try using org.apache.poi.hwpf.HWPFDocument, and getting the range, then the paragraphs, and grab the text from each paragraph. If there's interest, I could probably commit an extractor that does this to poi. Yes, that's exactly what I'm doing. Having this in POI wo

Re: Word files & Build vs. Buy?

2006-02-09 Thread Nick Burch
On Thu, 9 Feb 2006, Christiaan Fluit wrote: My experience is that the WordDocument class crashes on about 25% of the documents, i.e. it throws some sort of Exception. I've tested POI 2.5.1-final as well as the current code in CVS, but both produce this result. I even suspect the output to be 10

Re: Word files & Build vs. Buy?

2006-02-09 Thread Christiaan Fluit
Hello all, I'm replying to two threads at once as what I have to say relates to both. My company recently started an open source project called Aperture (http://sourceforge.net/projects/aperture), together with the German DFKI institute. The project is still very much in alpha stage, but I do

Word files

2006-02-09 Thread arnaudbuffet
Hello, I use the POI API to parse MS Word files in order to index the content to enable Lucene search. For that I downloaded the latest jars from POI (including the scratchpad one) and use the parser from lucenebook called POIWordDocHandler. It works quite well, but for some files the parser does

RE: Build vs. Buy?

2006-02-09 Thread Gwyn Carwardine
Have you considered running the .net version (dotLucene)? The converters for Office and PDF are freely available and there is a cheap commercial IFilter available for wordperfect files (and many others). -Gwyn -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: 09

RE: Duplicates recods in index

2006-02-09 Thread Pasha Bizhan
Hi, > From: Daniel Noll [mailto:[EMAIL PROTECTED] > I don't know how this will be for efficiency. If you did it > that way, you would have to re-open the index for every > single document you add, otherwise you might miss a duplicate > which was added recently. You do not need to reopen in

Re: How to get mapping of query terms to number of their occurrences in a doc?

2006-02-09 Thread Erik Hatcher
This is a real gotcha with Lucene in its out-of-the-box configuration. In the several applications I've built to index documents I've always hit this and had to set the maxFieldLength to its maximum possible value. Is there still an argument to be made to keep the default at 10K or would
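
To make the gotcha concrete, here is a small simulation in plain Java, not Lucene code. It assumes, as the thread describes, that only the first maxFieldLength terms of a field are indexed (10,000 by default) and that anything after that point is silently dropped. The whitespace tokenizer and all names are hypothetical simplifications.

```java
import java.util.HashSet;
import java.util.Set;

public class MaxFieldLengthSketch {

    // Returns the distinct terms that would be indexed if only the first
    // maxFieldLength tokens of the field were kept.
    static Set<String> indexedTerms(String text, int maxFieldLength) {
        Set<String> terms = new HashSet<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return terms;
        }
        String[] tokens = trimmed.split("\\s+");
        int limit = Math.min(tokens.length, maxFieldLength);
        for (int i = 0; i < limit; i++) {
            terms.add(tokens[i].toLowerCase());
        }
        return terms;
    }

    public static void main(String[] args) {
        String doc = "alpha beta gamma needle";

        // With a limit of 3, "needle" falls past the cutoff and is not searchable.
        System.out.println(indexedTerms(doc, 3).contains("needle"));

        // With the limit raised to the maximum int, every term is kept.
        System.out.println(indexedTerms(doc, Integer.MAX_VALUE).contains("needle"));
    }
}
```

The trade-off raised in the thread still applies: a higher limit indexes more of each document at the cost of more memory while indexing.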

Re: Too many required clauses for a BooleanQuery

2006-02-09 Thread mark harwood
>for the love of all > that is sane use a > Filter instead of putting all those categories in > your Query. Try this one: package org.apache.lucene.search; import java.io.IOException; import java.util.ArrayList; import java.util.BitSet; import java.util.Iterator; import org.apache.lucene.ind