How to design Lucene Document and Field for indexing and searching email messages and attachments

2010-04-27 Thread 刘庆志
Hi all: our business system generates some data whose structure resembles an email message: one message has some attachments, so our data can be thought of as email messages. I need to index and search the messages and their attachments, and when displaying hits, must display two kinds of links for every

Re: Right memory for search application

2010-04-27 Thread Samarendra Pratap
I have got a lot of valuable information in this thread so far. Thanks to all. In my last mail I mentioned only two fields because others' usage was negligible and I thought they were not important. But now after *Toke* explained the formulae, I think sorting on those fields would also be consuming

Re: Right memory for search application

2010-04-27 Thread Lance Norskog
Solr's timestamp representation (TrieDateField) is tuned for space and speed. It has a compressed representation, and sorts with far less space than Strings. Also you get something called a date facet, which lets you bucketize facet searches by time block. On Tue, Apr 27, 2010 at 1:02 PM, Toke Es

Re: Range Score in Lucene

2010-04-27 Thread Clara Vania
Thanks again for your help!! :) Regards, -Clara Vania- From: Uwe Schindler To: java-user@lucene.apache.org Sent: Wed, April 28, 2010 12:38:04 AM Subject: RE: Range Score in Lucene This has to do with combining multiple terms in a Boolean query. If you hav

Re: Trouble compiling JCC

2010-04-27 Thread Herbert Roitblat
I found the error of my ways. It was a typo. For linux2 I directed setup.py to use 'linux2': '/usr/lib/jdk/java-6-sun-1.6.0.17', <--WRONG rather than 'linux2': '/usr/lib/jvm/java-6-sun-1.6.0.17', There was not a /usr/lib/jdk folder on my Ubuntu 8.04 box. - Original Messag
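A mistake like this can be caught before the build starts. A minimal sketch, assuming the JDK locations live in a platform-to-path mapping like the `'linux2': ...` entry quoted in the message; the `missing_jdk_paths` helper and the `jdk_paths` name are hypothetical and not part of JCC:

```python
import os


def missing_jdk_paths(jdk_paths):
    """Return the platform keys whose configured JDK directory does not exist."""
    return sorted(
        platform
        for platform, path in jdk_paths.items()
        if not os.path.isdir(path)
    )


if __name__ == "__main__":
    # Path copied from the message; whether it exists depends on the machine.
    jdk_paths = {
        "linux2": "/usr/lib/jvm/java-6-sun-1.6.0.17",
    }
    for platform in missing_jdk_paths(jdk_paths):
        print("JDK directory not found for", platform, "->", jdk_paths[platform])
```

Running such a check first would have flagged the `/usr/lib/jdk` typo without a long compiler log to read through.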

Re: Base score to use for custom query?

2010-04-27 Thread Jeremy Volkman
Hi Hoss, I didn't end up writing my own query (well I did, but all it does is rewrite into another query). I found DisjunctionMaxQuery, which seemed a good fit for what I was trying to do. Instead of TermQuery, I used ConstantScoreQuery combined with TermsFilter to create queries that weren't depe
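For readers unfamiliar with DisjunctionMaxQuery: it scores a document by its best-matching subquery rather than by the sum of all of them, with an optional tie-breaker that adds a fraction of the remaining subquery scores. A rough sketch of that combination rule in plain Python (not Lucene code; `dismax_score` and the sample numbers are made up for illustration):

```python
def dismax_score(sub_scores, tie_breaker=0.0):
    """Combine subquery scores: the maximum plus tie_breaker times the rest."""
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    others = sum(sub_scores) - best
    return best + tie_breaker * others


if __name__ == "__main__":
    # Three constant-score clauses matching the same document.
    print(dismax_score([1.0, 1.0, 1.0]))
    # One strong and one weak clause, with a tie-breaker of 0.5.
    print(dismax_score([2.0, 1.0], tie_breaker=0.5))
```

With a tie-breaker of zero and constant-score clauses, the result does not grow with the number of clauses that match.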

Trouble compiling JCC

2010-04-27 Thread Herbert Roitblat
I'm trying to compile JCC, using python setup.py build This is what I get: ~/pylucene-2.9.2-1/jcc$ python setup.py build running build running build_py copying jcc/config.py -> build/lib.linux-x86_64-2.5/jcc running build_ext building 'jcc._jcc' extension gcc -pthread -fno-s

RE: Right memory for search application

2010-04-27 Thread Toke Eskildsen
Samarendra Pratap [samarz...@gmail.com] wrote: > 1. Our default option is sort by score, however almost 8% of searches use > sorting on a field (mmddHHMMSS). This field is indexed as string (not as > NumericField or DateField). Guessing that the timestamp is practically unique for each documen

Re: HTMLStripReader, HTMLStripCharFilter

2010-04-27 Thread Justin
Thanks for the explanation. The situation makes much more sense now. Fortunately, I did wrap the result of Analyzer.tokenStream(). I had contemplated adding it to the Analyzer as you described and warned not to. - Original Message From: Uwe Schindler To: java-user@lucene.apache.or

RE: HTMLStripReader, HTMLStripCharFilter

2010-04-27 Thread Uwe Schindler
A Reader can only be read once; that's the problem. Resetting a TokenStream cannot reset the Reader (see the java.io.Reader API). To replay the same tokens, you must wrap the stream with a caching filter. This is also done in the Highlighter's code. The general contract of reset() is not to reset
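The read-once behaviour is not specific to Java. As an analogy only, written in Python rather than against Lucene's API, a token generator can likewise be consumed just once, and replaying it needs a wrapper that caches what it has already produced. `CachingTokens` and `tokenize` below are hypothetical stand-ins, not Lucene classes:

```python
class CachingTokens:
    """Wrap a one-shot token iterator so it can be replayed after reset()."""

    def __init__(self, source):
        self._source = iter(source)
        self._cache = []
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Serve from the cache first; only then pull from the source.
        if self._pos < len(self._cache):
            token = self._cache[self._pos]
        else:
            token = next(self._source)  # raises StopIteration when exhausted
            self._cache.append(token)
        self._pos += 1
        return token

    def reset(self):
        """Rewind to the first cached token; the source is never re-read."""
        self._pos = 0


def tokenize(text):
    """A one-shot generator, standing in for a stream backed by a Reader."""
    for word in text.split():
        yield word.lower()
```

Here `reset()` only rewinds the cache position. The underlying source is still read a single time, which is the same division of labour described in the message.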

Re: HTMLStripReader, HTMLStripCharFilter

2010-04-27 Thread Herbert Roitblat
Oops. Sorry. replied to wrong message. - Original Message - From: "Herbert Roitblat" To: Sent: Tuesday, April 27, 2010 12:01 PM Subject: Re: HTMLStripReader, HTMLStripCharFilter Great, I will look forward to it. Thanks, Herb - Original Message - From: "Justin" To: Sent:

Re: HTMLStripReader, HTMLStripCharFilter

2010-04-27 Thread Herbert Roitblat
Great, I will look forward to it. Thanks, Herb - Original Message - From: "Justin" To: Sent: Tuesday, April 27, 2010 11:47 AM Subject: Re: HTMLStripReader, HTMLStripCharFilter Thanks for the help. No more exception. Seems odd that I need to add a filter to make reset apply to the

Re: HTMLStripReader, HTMLStripCharFilter

2010-04-27 Thread Justin
Thanks for the help. No more exception. Seems odd that I need to add a filter to make reset apply to the stream's underlying reader. - Original Message From: Uwe Schindler To: java-user@lucene.apache.org Sent: Tue, April 27, 2010 12:00:31 AM Subject: RE: HTMLStripReader, HTMLStripC

Re: Base score to use for custom query?

2010-04-27 Thread Chris Hostetter
First off: if you haven't already, make sure you OMIT_NORMS when indexing this field; that way you don't have to worry about docs with "lots" of numbers scoring low purely because of the fieldNorm. Second: I wouldn't bother with a custom query; I would stick with your BooleanQuery approach, but

RE: Range Score in Lucene

2010-04-27 Thread Uwe Schindler
This has to do with combining multiple terms in a Boolean query. If you have only one term and no boost factors involved, you will get 1. To repeat: the score numbers are on an arbitrary scale and only comparable within one query. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.t

Re: Range Score in Lucene

2010-04-27 Thread Clara Vania
Thanks a lot for the quick reply. I want to find documents similar to one document (let's call it document A) in my index. To do this I use the MoreLikeThis class to help create a query from document A. I also included document A in my index, so I assumed that I would have document A at the first

RE: Term offsets for highlighting

2010-04-27 Thread Stephen Greene
Thank you Koji. Everything is now working as desired. You have been an invaluable resource for helping to resolve this issue and I really appreciate the time you spent reviewing this issue. Best regards, Steve -Original Message- From: Koji Sekiguchi [mailto:k...@r.email.ne.jp] Sent: M

Re: Right memory for search application

2010-04-27 Thread Ian Lea
Sorting by score down to the second will use a lot of memory. Can you make it less granular? And I think that switching that field to a NumericField will give you some savings - this has come up before but I can't remember the details. I'm sure someone else will. -- Ian. On Tue, Apr 27, 2010
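To see why granularity matters: Lucene's field cache keeps the distinct terms of a string sort field, so fewer distinct values means less to hold in memory. A small sketch, assuming fixed-width digit timestamps whose last six characters are HHMMSS; the sample values and both helpers are invented for illustration:

```python
def coarsen(timestamp, drop_chars=6):
    """Drop the trailing time-of-day characters from a fixed-width timestamp."""
    return timestamp[:-drop_chars]


def distinct_terms(timestamps, drop_chars=0):
    """Count the distinct sort terms after optionally coarsening each value."""
    if drop_chars:
        timestamps = (coarsen(t, drop_chars) for t in timestamps)
    return len(set(timestamps))


# Hypothetical sample: four documents indexed over two days.
SAMPLE = [
    "20100427120001",
    "20100427120002",
    "20100427235959",
    "20100428000000",
]

if __name__ == "__main__":
    print("per-second terms:", distinct_terms(SAMPLE))
    print("per-day terms:", distinct_terms(SAMPLE, drop_chars=6))
```

The trade-off is that documents sharing a coarsened value no longer have a defined order among themselves, so a secondary sort key may be needed.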

RE: Right memory for search application

2010-04-27 Thread Fornoville, Tom
Samarendra, In regard to point #2, the GC should indeed handle the clean-up and that might explain the "idle" time with your original configuration during major collections. Have you checked that your machine is correctly identified as a server and has optimized GC settings? More info in the ex

Re: Right memory for search application

2010-04-27 Thread Samarendra Pratap
Hi Ian. Thanks for the points. Here are my answers: 1. Our default option is sort by score, however almost 8% of searches use sorting on a field (mmddHHMMSS). This field is indexed as string (not as NumericField or DateField). 2. We are opening readers at the time of starting the application

Re: Right memory for search application

2010-04-27 Thread Ian Lea
There is no simple answer. However, your app does sound like it is using rather a lot of memory for what you describe as simple searches. Are you using Lucene sorting? That can use lots of memory. How are you using/reusing searchers/readers? Having multiple ones open, or failing to close old ones, w

Right memory for search application

2010-04-27 Thread Samarendra Pratap
Hi. I am looking for some guidance on the right memory options for my Search Server application. How much memory should a Lucene-based application be given? Until a few days ago I was running my search server on Java 1.4 with memory options "-Xmx3600m", which was running quite fine. After upgrading

RE: Range Score in Lucene

2010-04-27 Thread Uwe Schindler
The score is an arbitrary number > 0. It's not normalized to anything, it should only be used to e.g. sort the results. You cannot even compare scores between two searches. They should only be used to compare hits *within* one result set (e.g. sort as done in top docs). - Uwe Schindler H.-H

Re: Range Score in Lucene

2010-04-27 Thread Anshum
Hi Clara, Any particular reason why you'd need the score? Perhaps this would be of help http://lucene.apache.org/java/2_9_1/scoring.html http://lucene.apache.org/java/2_3_2/scoring.pdf Hope this explains whatever you were looking for. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The

Re: IndexWriter and memory usage

2010-04-27 Thread Michael McCandless
Oooh -- I suspect you are hitting this issue: https://issues.apache.org/jira/browse/LUCENE-2283 Your 3rd image ("fdt") jogged my memory on this one. Can you try testing the trunk JAR from after that issue landed? (Or, apply that patch against 3.0.x -- let me know if it does not apply cleanl

Range Score in Lucene

2010-04-27 Thread Clara Vania
Hi all, I am new to Lucene and I want to ask about the range of scores that Lucene uses, because I got a score greater than 1. I'm using lucene-3.0.1, with MoreLikeThis to do document similarity and the ScoreDoc class to get the hits of my search. Thanks, -Clara Vania-