Re: Combining score from two or more hits

2007-03-23 Thread Antony Bowesman
Chris Hostetter wrote: if you are using a HitCollector, then any re-evaluation is going to happen in your code using whatever mechanism you want -- once your collect method is called on a docid, Lucene is done with that docid and no longer cares about it ... it's only whatever storage you may b

RE: Reverse search

2007-03-23 Thread Melanie Langlois
Well, I thought of using the PerFieldAnalyzerWrapper, which has as its default the snowballAnalyzer with English stopwords, and using a snowballAnalyzer with language-specific keywords for the fields which will be in different languages. But I'm seeing that in your MemoryIndexTest you commented out the use of

Re: How can I index Phrases in Lucene?

2007-03-23 Thread mark harwood
This may be of interest: http://issues.apache.org/jira/browse/LUCENE-474 Cheers Mark - Original Message From: Ryan McKinley <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 23 March, 2007 3:25:02 AM Subject: Re: How can I index Phrases in Lucene? Is there any way to fi

Re: MergeFactor and MaxBufferedDocs value should ...?

2007-03-23 Thread Michael McCandless
"SK R" <[EMAIL PROTECTED]> wrote: > If I set MergeFactor = 100 and MaxBufferedDocs=250 , then first 100 > segments will be merged in RAMDir when 100 docs arrived. At the end of > 350th > doc added to writer , RAMDir have 2 merged segment files + 50 seperate > segment files not merged together

Re: Reverse search

2007-03-23 Thread karl wettin
On 23 Mar 2007, at 09:57, Melanie Langlois wrote: Well, I thought of using the PerFieldAnalyzerWrapper, which has as its default the snowballAnalyzer with English stopwords, and using a snowballAnalyzer with language-specific keywords for the fields which will be in different languages. But I'm seeing t

Re: MergeFactor and MaxBufferedDocs value should ...?

2007-03-23 Thread SK R
Please clarify the following. 1. When will the segments in RAMDirectory be moved (flushed) into FSDirectory? 2. Segment creation by maxBufferedDocs occurs in RAMDir. Where does the merge by MergeFactor happen: in RAMDir or FSDir? Thanks in advance, RSK On 3/23/07, Michael McCandless <[EM

Re: MergeFactor and MaxBufferedDocs value should ...?

2007-03-23 Thread Erick Erickson
I haven't used it yet, but I've seen several references to IndexWriter.ramSizeInBytes() and using it to control when the writer flushes the RAM. This seems like a more deterministic way of making things efficient than trying various combinations of maxBufferedDocs, MergeFactor, etc., all of which

Re: Reverse search

2007-03-23 Thread mark harwood
Bear in mind that the million queries you run on the MemoryIndex can be shortlisted if you place those queries in a RAMIndex and use the source document's terms to "query the queries". The list of unique terms for your document is readily available in the MemoryIndex's TermEnum. You can take thi
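A minimal Java sketch of the "query the queries" idea above, using a plain in-memory map rather than any Lucene class. The QueryShortlist class, its method names, and the integer query ids are hypothetical. Each stored query is registered under its terms, and the document's unique terms are then used to look up candidate queries. Only the shortlisting step is shown; the shortlisted queries would still have to be run against the MemoryIndex.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy stand-in for "querying the queries"; hypothetical, not Lucene code. */
class QueryShortlist {
    // term -> ids of the stored queries that mention that term
    private final Map<String, Set<Integer>> termToQueryIds = new HashMap<>();

    /** Registers a stored query under each of its terms. */
    void addQuery(int queryId, Collection<String> queryTerms) {
        for (String term : queryTerms) {
            termToQueryIds.computeIfAbsent(term, t -> new TreeSet<>()).add(queryId);
        }
    }

    /** Returns the ids of queries sharing at least one term with the document. */
    Set<Integer> shortlist(Collection<String> documentTerms) {
        Set<Integer> candidates = new TreeSet<>();
        for (String term : documentTerms) {
            Set<Integer> ids = termToQueryIds.get(term);
            if (ids != null) {
                candidates.addAll(ids);
            }
        }
        return candidates;
    }
}
```

Sharing a term is only a necessary condition for a match, so this narrows the list without deciding anything. A query with no positive terms (for example, a purely negative one) would never be shortlisted this way and would need separate handling.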

Re: MergeFactor and MaxBufferedDocs value should ...?

2007-03-23 Thread Michael McCandless
"SK R" <[EMAIL PROTECTED]> wrote: > 1.When will be the segments in RAMDirectory moved (flushed) in to > FSDirectory? This is maxBufferedDocs. Right now, every added doc creates its own segment in the RAMDir. After maxBufferedDocs, all of these single documents are merged and flushed to a s

Re: MergeFactor and MaxBufferedDocs value should ...?

2007-03-23 Thread Michael McCandless
"Erick Erickson" <[EMAIL PROTECTED]> wrote: > I haven't used it yet, but I've seen several references to > IndexWriter.ramSizeInBytes() and using it to control when the writer > flushes the RAM. This seems like a more deterministic way of > making things efficient than trying various combinations

Lazy Field Loading in IndexSearcher

2007-03-23 Thread jafarim
Hi, I would like to make use of the latest lazy field loading in Lucene 2.1. I store the original bytes of a document, say a PDF file for example, in a special untokenized field in the index. Though there are enough facilities in the IndexReader class for lazy field loading, the search API in IndexSea

index word files ( doc )

2007-03-23 Thread e.j.w.vanbloem
Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an early stage. Is POI advisable? Or are there better alternatives? Please give some advice. Regards, Erik

Re: index word files ( doc )

2007-03-23 Thread jafarim
Hi, my experience has not been very satisfactory. It breaks very easily on many files. On 3/23/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote: Hello, I am planning to index Word 2003 files. I read I have to use Jakarta Apache POI, but I also read on the POI site that their work with doc's is in an

Re: Lazy Field Loading in IndexSearcher

2007-03-23 Thread Chris Hostetter
Please read the answer I gave you the last time you asked this question... http://www.nabble.com/Re%3A-Lazy-field-loading-in-p9604064.html : Hi, I would like to make use of the latest lazy field loading in Lucene 2.1. I store the original bytes of a document, say a PDF file for example, in

Re: Lazy Field Loading in IndexSearcher

2007-03-23 Thread jafarim
Sorry if the question is trivial but why not a Hits.doc(int,FieldSelector) method? On 3/23/07, Chris Hostetter <[EMAIL PROTECTED]> wrote: please read the answer i gave you the last time you asked this question... http://www.nabble.com/Re%3A-Lazy-field-loading-in-p9604064.html : Hi : I am se

Re: Lazy Field Loading in IndexSearcher

2007-03-23 Thread Chris Hostetter
: Sorry if the question is trivial but why not a Hits.doc(int,FieldSelector) method? As I said before... >> Lazy loading stored fields is really about performance tweaking ... if you are that concerned about performance, you shouldn't be using Hits at all. ...there is a lot of info in

Search Design Question

2007-03-23 Thread Michael J. Prichard
Hello All, We allow our users to search through our index with a simple text field. The search uses "content" as its default field. This allows them to search quickly through content, but when they type "to:blah AND from:foo AND content:boogie" it will know to parse, etc. What I wa

Re: Search Design Question

2007-03-23 Thread Erick Erickson
I don't believe there's anything built into Lucene that helps you out here because you're really saying "do special things for my problem space in these situations". So about the only thing you can do that I know of is to construct the query yourself by making a series of additions to BooleanQuer

RE: index word files ( doc )

2007-03-23 Thread e.j.w.vanbloem
Thank you, Are there other solutions? From: jafarim [mailto:[EMAIL PROTECTED] Sent: Fri 23-3-2007 18:55 To: java-user@lucene.apache.org Subject: Re: index word files ( doc ) Hi My experience is not much satisfactory. It breaks very easily on many files

Re: index word files ( doc )

2007-03-23 Thread Otis Gospodnetic
I think the code from Lucene in Action has examples that use POI and the Textmining.org API. Check manning.com/hatcher2 for the code. Otis Simpy -- http://www.simpy.com/ - Tag - Search - Share - Original Message From: "[

Re: Search Design Question

2007-03-23 Thread Chris Hostetter
: One final note, it may be much easier for you to throw all the fields into a single uber-field and search that rather than implement all four separate clauses, but it's a trade-off between simplicity and size. This would be a very simple way to get the behavior you describe straight f
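A small Java sketch of the "uber-field" suggestion quoted above, kept independent of Lucene. The CatchAllField class and the idea of passing the fields as a map are hypothetical; the sketch only shows how the values of the separate fields could be joined into one catch-all value that would then be indexed as its own field.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper that builds the text for a single catch-all field. */
class CatchAllField {
    /** Joins every non-empty field value into one space-separated string. */
    static String build(Map<String, String> fields) {
        StringJoiner joined = new StringJoiner(" ");
        for (String value : fields.values()) {
            if (value != null && !value.isEmpty()) {
                joined.add(value);
            }
        }
        return joined.toString();
    }
}
```

The size cost mentioned in the quote shows up directly here: every value is indexed once under its own field and once more inside the catch-all value.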

Re: index word files ( doc )

2007-03-23 Thread Antony Bowesman
www.textmining.org, but the site is no longer accessible. Check Nutch, which has a Word parser - it seems to be the original textmining.org Word6+POI parser. Pre-Word6 and "fast-saved" files will not work; I've not found a solution for those. Antony [EMAIL PROTECTED] wrote: Thank you, Are

Re: index word files ( doc )

2007-03-23 Thread Sami Siren
Antony Bowesman wrote: >> Are there other solutions? There's also antiword [1], which can convert your .doc to plain text or PS; not sure how good it is. -- Sami Siren [1] http://www.winfield.demon.nl/