RE: Index not recreated

2006-08-14 Thread Ronald Wildenberg
> From: Erick Erickson [mailto:[EMAIL PROTECTED] > Sent: Monday 14 August 2006 16:52 > To: java-user@lucene.apache.org > Subject: Re: Index not recreated > > You have all my sympathy. Let me see if I can restate your > problem. > > "Hey Ron. The indexing process doesn't work. We c

Re: a "fair" similarity

2006-08-14 Thread Michael D. Curtin
Daniel Naber wrote: Hi, as some of you may have noticed, Lucene prefers shorter documents over longer ones, i.e. shorter documents get a higher ranking, even if the ratio "matched terms / total terms in document" is the same. For example, take these two artificial documents: doc1: x 2 3 4 5

Re: 7GB index taking forever to return hits

2006-08-14 Thread Erick Erickson
I actually suspect that your process isn't hung, it's just taking forever because it's swapping a lot. Like a really, really, really lot. Like more than you ever want to deal with. I think you're pretty much forced, as Lin said, to use a filter. I was pleasantly surprised at how quickly filters

Re: 7GB index taking forever to return hits

2006-08-14 Thread yueyu lin
To avoid "TooManyClauses", you can try a Filter instead of a Query. But that will be slower. From what I see, so many keys match your query that it will be tough for Lucene. On 8/14/06, Van Nguyen <[EMAIL PROTECTED]> wrote: It was how I was implementing the search. I am using a b

RE: 7GB index taking forever to return hits

2006-08-14 Thread Van Nguyen
It was how I was implementing the search. I am using a boolean query. Prior to the 7GB index, I was searching over a 150MB index that consisted of a very small part of the bigger index. I was able to set my BooleanQuery to BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) and that worked fine. B

Re: Index not recreated

2006-08-14 Thread Jason Polites
PS... The "intermittent" nature of your problem points to a concurrency issue. Does the production environment have a greater number of users? If so, this likely translates to a greater number of threads acting upon the index. I'd be looking for possible conflicts between different threads acce

Re: Index not recreated

2006-08-14 Thread Jason Polites
My advice would be the "back-to-basics" approach. Create a test case which creates a simple index with a few documents, verify the index is as you expect, then re-create the index and verify again. Run this test case on your production environment (if you are able). This will determine once and

a "fair" similarity

2006-08-14 Thread Daniel Naber
Hi, as some of you may have noticed, Lucene prefers shorter documents over longer ones, i.e. shorter documents get a higher ranking, even if the ratio "matched terms / total terms in document" is the same. For example, take these two artificial documents: doc1: x 2 3 4 5 6 7 8 9 10 doc2: x x 3

Re: Indexing existing email archives

2006-08-14 Thread Daniel Naber
On Monday 14 August 2006 17:50, Suba Suresh wrote: > I have some stored emails in folders in my local > disk and huge list of email archives in another system. Lucene can only index plain text, so if you can convert these mails to text you can index them without any problem. Regards Daniel --

Re: Indexing existing email archives

2006-08-14 Thread Suba Suresh
Hi! Can someone help me? suba suresh Suba Suresh wrote: I was looking at "http://www.tropo.com/techno/java/lucene/imap.html" and my understanding is it is used to retrieve and index the emails that are on the email server. I have some stored emails in folders in my local disk and hug

Re: stemmed search and exact match on "same" field

2006-08-14 Thread Chris Hostetter
There's a lot of information in your email, and a lot of questions that relate to similar topics and address different ways of accomplishing similar but different things ... too much for me to digest all at once, so lemme start by seeing if i can summarize your goal, and then give you my suggestio

Re: 7GB index taking forever to return hits

2006-08-14 Thread yueyu lin
The 2GB limitation only applies when you want to load the index into memory on a 32-bit box. Our index is larger than 13 gigabytes, and it works fine. I think there must be an error in your design. You can use Luke to see what happened in your index. On 8/14/06, Van Nguyen <[EMAIL PROTECTED]> wrote:

7GB index taking forever to return hits

2006-08-14 Thread Van Nguyen
Hi, I have a 7GB index (about 45 fields per document x roughly 5.5 million docs) running on a Windows 2003 32-bit machine (dual proc, 2GB memory). The index is optimized. Performing a search on this index will just "hang" when performing the search (wild card query with a sort). At fi

RE: Repeatable field names

2006-08-14 Thread Furash Gary
I have a couple of fields like this (e.g., a given case can have 1:many case numbers and 1:many defendant aliases). So there's no problem with adding the same field n times to a given document? If so, that's perfect and I'll add it to the FAQ. I was concatenating before and getting false matche

RE: 30 million+ docs on a single server

2006-08-14 Thread Dejan Nenov
The important detail here is what you mean by "single server". A high-end server will work just fine - you want 4GB+ of RAM and the fastest disk/IO you can get; CPU speed is far less important. A nice Linux software RAID and 5+ 15K SCSI disks will get you superb performance, at a reasonable price.

Indexing existing email archives

2006-08-14 Thread Suba Suresh
I was looking at "http://www.tropo.com/techno/java/lucene/imap.html" and my understanding is it is used to retrieve and index the emails that are on the email server. I have some stored emails in folders in my local disk and a huge list of email archives in another system. Is there a way I could

Re: Index not recreated

2006-08-14 Thread Erick Erickson
You have all my sympathy. Let me see if I can restate your problem. "Hey Ron. The indexing process doesn't work. We can't/won't let you look at the process or the results. We can't/won't let you look at the finished product. We can't/won't let you on the machine where it fails. Now fix it" ..

stemmed search and exact match on "same" field

2006-08-14 Thread Robert Watkins
I've been puzzling this one for a while now, and can't figure it out. The idea is to allow stemmed searches and exact matches (tokenized, but unstemmed phrase searches) on the same field. The subject of this email had "same" in quotes, because it's from the search-client perspective that the sa

RE: Index not recreated

2006-08-14 Thread Ronald Wildenberg
Thanks for your response, comments are below. I'm using Lucene 1.9.1. > From: Erick Erickson [mailto:[EMAIL PROTECTED] > Sent: Monday 14 August 2006 16:20 > Subject: Re: Index not recreated > > My first suspicion is that you have duplicate documents on > the *input* side, or are some

Re: Index not recreated

2006-08-14 Thread Erick Erickson
My first suspicion is that you have duplicate documents on the *input* side, or are somehow adding documents more than once. I use code similar to yours and it works just fine for me. How big is the index before and after you re-create it? Twice the size and you're appending, not twice then..

Index not recreated

2006-08-14 Thread Ronald Wildenberg
Hi, I'm experiencing the problem that my index does not seem to be recreated, despite using the correct flags. The result is that documents that represent equal database rows occur multiple times in the index. I recreate my entire index each night. My IndexDirectory/IndexWriter construction cod

Re: Handling OR, NOT, AND operators in search query

2006-08-14 Thread Nina Khosravi
Thanks! I did not notice that the code was lower-casing the query string! Regards, Nina > I am refactoring our search code that was written prior to 1.4.3. I am > using Lucene 2.0 now. The search string entered by users was actually > parsed by our custom code to generate the query.

SV: 30 million+ docs on a single server

2006-08-14 Thread Marcus Falck
You will run into problems with the sorting if you can't hold the fieldcache for long intervals. I'm working on a system containing 300 million docs. And I ran into sorting issues after only 5 million docs, but then again I can't hold my IndexSearcher open for such long intervals since I'm dealing

Re: Handling OR, NOT, AND operators in search query

2006-08-14 Thread Michael McCandless
I am refactoring our search code that was written prior to 1.4.3. I am using Lucene 2.0 now. The search string entered by users was actually parsed by our custom code to generate the query. This code was getting fairly big and messy and I'm changing the code to use Lucene's query parsers t

Handling OR, NOT, AND operators in search query

2006-08-14 Thread Nina Khosravi
Hello, I am refactoring our search code that was written prior to 1.4.3. I am using Lucene 2.0 now. The search string entered by users was actually parsed by our custom code to generate the query. This code was getting fairly big and messy and I'm changing the code to use Lucene's query par

Re: Special characters

2006-08-14 Thread Adrian Pillinger
Thanks for the replies on my question. In the end I've taken the StandardAnalyzer grammar, modified it and generated a new analyser with JavaCC. Seems to be working a treat! Adrian On 11 Aug 2006, at 14:32, Erik Hatcher wrote: On Aug 11, 2006, at 1:23 AM, Martin Braun wrote: Hello Adrian