Re: wheres the word

2009-06-24 Thread Mark Miller
Timon Roth wrote: Hello list, I'm trying to figure out the following problem. In my index I can't find the word BE, but it exists in two documents. I'm using Lucene 2.4 with the StandardAnalyzer. Other queries with words like de, et or de la work fine. Any ideas? Regards, Timon "be" is a stopword. Do
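To illustrate the reply above, here is a minimal, self-contained sketch of stop-word removal. It is a toy model, not Lucene code: the class name StopwordSketch, the tokenizing regex, and the stop list are assumptions made for illustration (the list is what I believe the default English stop list looked like in the 2.x line; check it against your own Lucene version).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model of stop-word removal at analysis time; hypothetical, not Lucene's implementation. */
final class StopwordSketch {
    // Assumed copy of a default English stop list; verify against your Lucene version.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
        "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
        "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"));

    /** Lowercases, splits on non-alphanumeric characters, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("BE"));          // prints []
        System.out.println(analyze("de la BE et")); // prints [de, la, et]
    }
}
```

Because the same filtering is applied at index time and at query time, the term never reaches the index and the query for it becomes empty, which matches the symptom described. With the real library, the usual remedy is to construct the analyzer with a custom or empty stop-word list and then reindex.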

wheres the word

2009-06-24 Thread Timon Roth
Hello list, I'm trying to figure out the following problem. In my index I can't find the word BE, but it exists in two documents. I'm using Lucene 2.4 with the StandardAnalyzer. Other queries with words like de, et or de la work fine. Any ideas? Regards, Timon

Re: updateDocument and high Memory Usage

2009-06-24 Thread Michael McCandless
Likely this is because under the hood when IndexWriter flushes your deletes, it opens readers. It closes the readers as soon as the deletes are done, thus creating a fair amount of garbage (which looks like memory used by the JVM). How are you measuring the memory usage? Likely it's mostly garba
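One way to tell live memory from collectable garbage, as the question above asks, is to compare used heap before and after requesting a collection. The sketch below uses only the standard Runtime API; the class name HeapProbe and the 16 MB of throwaway allocations are assumptions for illustration, and System.gc() is only a hint that the JVM may ignore.

```java
/** Hypothetical helper for comparing used heap before and after a requested GC. */
final class HeapProbe {
    /** Heap currently in use, including garbage that has not been collected yet. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Requests a collection first, so the result is closer to the live set. */
    static long usedHeapAfterGcBytes() {
        System.gc(); // a hint only; the JVM is free to ignore it
        return usedHeapBytes();
    }

    public static void main(String[] args) {
        // Create some short-lived garbage, then drop the only reference to it.
        byte[][] garbage = new byte[16][];
        for (int i = 0; i < garbage.length; i++) {
            garbage[i] = new byte[1 << 20];
        }
        long before = usedHeapBytes();
        garbage = null;
        long after = usedHeapAfterGcBytes();
        System.out.printf("used before GC: %d bytes, after GC: %d bytes%n", before, after);
    }
}
```

If the number reported by an operating-system monitor stays high while the after-GC figure is small, the difference is mostly garbage or unused heap rather than memory the index writer is holding on to.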

Re: Setting swappiness

2009-06-24 Thread Michael McCandless
You can also run vmstat or iostat and watch if the high latency queries correspond to lots of swap-ins. Mike On Wed, Jun 24, 2009 at 3:54 PM, Nigel wrote: > This is interesting, and counter-intuitive: more queries could actually > improve overall performance. > > The big-index-and-slow-query-rate

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Michael McCandless
On Wed, Jun 24, 2009 at 3:38 PM, Nigel wrote: > Yes, we're indexing on a separate server, and rsyncing from index snapshots > there to the search servers. Usually rsync has to copy just a few small > .cfs files, but every once in a while merging will produce a big one. I'm > going to try to limi

RE: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Uwe Schindler
Have you checked whether GC is affecting you? A first step would be to turn on GC logging with the flags -verbosegc -XX:+PrintGCDetails. If you see some relation between query time and GC messages, you should try a better-parallelized GC, change the perm size, and so on (see the docs on GC tuning). - Uw
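As a complement to GC logging, the correlation between query time and GC activity can also be checked from inside the JVM with the standard java.lang.management API. This is a rough sketch under stated assumptions: the class name GcTimer and the Runnable standing in for a query are hypothetical, and the GC counters are JVM-wide, so concurrent work in other threads is included in the figure.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that reports how much GC time elapsed while a task ran. */
final class GcTimer {
    /** Sum of accumulated collection time over all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /** Runs the task and returns {elapsedMillis, gcMillisDuringTask}. */
    static long[] timeWithGc(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new long[] { elapsedMillis, totalGcMillis() - gcBefore };
    }

    public static void main(String[] args) {
        // Stand-in for a search; allocates heavily so some collections are likely.
        long[] result = timeWithGc(() -> {
            for (int i = 0; i < 200; i++) {
                byte[] scratch = new byte[1 << 20];
                scratch[0] = 1;
            }
        });
        System.out.printf("elapsed=%d ms, gc=%d ms%n", result[0], result[1]);
    }
}
```

Logging both numbers per query makes it easy to see whether the slow queries are the ones that overlap with long collections.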

Re: Setting swappiness

2009-06-24 Thread Nigel
This is interesting, and counter-intuitive: more queries could actually improve overall performance. The big-index-and-slow-query-rate does describe our situation. I'll try running some tests that run queries at various rates concurrent with occasional big I/O operations that use the disk cache.

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Hi Mike, Yes, we're indexing on a separate server, and rsyncing from index snapshots there to the search servers. Usually rsync has to copy just a few small .cfs files, but every once in a while merging will produce a big one. I'm going to try to limit this by setting maxMergeMB, but of course t

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Hi Uwe, Good points, thank you. The obvious place where GC really has to work hard is when index changes are rsync'd over and we have to open the new index and close the old one. Our slow performance times don't seem to be directly correlated with the index rotation, but maybe it just appears th

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Thanks Otis -- I'll give that a try. I think this relates to the first question in my original message, which was what (if any) of the inverted index structure is explicitly cached by Lucene in the JVM. Clearly there's something, since a large JVM heap is required to avoid running out of memory,

Re: Searching for a special character

2009-06-24 Thread Simon Willnauer
Besides choosing the right analyzer, you could run into problems if you use a query parser, as it will interpret your parentheses. simon On Wed, Jun 24, 2009 at 8:11 PM, Erick Erickson wrote: > First, I highly, highly recommend you get a copy of Luke to examine your > index. It'll also help you underst

Re: Searching for a special character

2009-06-24 Thread Erick Erickson
First, I highly, highly recommend you get a copy of Luke to examine your index. It'll also help you understand the role of Analyzers. Your first problem is that StandardAnalyzer probably removes the open and close parens. See: http://lucene.apache.org/java/2_4_1/api/index.html so you can't search o

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Nigel
Hi Ken, Thanks for your reply. I agree that your overall diagnosis (GC problems and/or swapping) sounds likely. To follow up on some of the specific things you mentioned: 2. 250M/4 = 60M docs/index. The old rule of thumb was 10M docs/index as a > reasonable size. You might just need more hardware.

Searching for a special character

2009-06-24 Thread Radha Sreedharan
Hi all, I am using a Standard analyzer on both my search field and my query. I use a SpanNearQuery to search on the search field. One of the query terms has special characters like ( - round open bracket and ) - round close bracket : How does Lucene handle this? Also, the search field has ( and

updateDocument and high Memory Usage

2009-06-24 Thread Kris Leite
I was wondering if anybody else who has been using updateDocument has noticed that it uses a large amount of memory when updating an existing document. For example, when using updateDocument on an empty Lucene directory, the resulting 12K documents create a 3MB index, the amount of memory the program

Re: OutOfMemoryError using IndexWriter

2009-06-24 Thread Michael McCandless
On Wed, Jun 24, 2009 at 10:23 AM, stefan wrote: > does Lucene keep the complete index in memory ? No. Certain things (deleted docs, norms, field cache, terms index) are loaded into memory, but these are tiny compared to what's not loaded into memory (postings, stored docs, term vectors). > As s

Re: OutOfMemoryError using IndexWriter

2009-06-24 Thread Michael McCandless
On Wed, Jun 24, 2009 at 10:18 AM, stefan wrote: > > Hi, > > >>OK so this means it's not a leak, and instead it's just that stuff is >>consuming more RAM than expected. > Or that my test db is smaller than the production db which is indeed the case. But a "leak" would keep leaking over time, right?

Re: Setting swappiness

2009-06-24 Thread Michael McCandless
My opinion is swappiness should generally be set to zero, thus turning off "swap core out in favor of IO cache". I don't think the OS's simplistic LRU policy is smart enough to know which RAM (that Lucene had allocated & filled) is OK to move to disk. EG you see the OS evict stuff because Lucene

RE: OutOfMemoryError using IndexWriter

2009-06-24 Thread Sudarsan, Sithu D.
Hi Stefan, Generally, I run the memory monitor and see how much of it is being used. The total memory usage by the application is what we see on the monitor. So long as it is less than 1.8GB things are fine. In your case, you are assuming that the program takes only 50MB, which may be actually

AW: OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, does Lucene keep the complete index in memory ? As stated before the result index is 50MB, this would correlate with the memory footprint required by Lucene as seen in my app: jvm 120MB - 50MB(Lucene) - 50MB(my App) = something left jvm 100MB - 50MB(Lucene) - 50MB(my App) = OOError though s

RE: OutOfMemoryError using IndexWriter

2009-06-24 Thread Sudarsan, Sithu D.
When the segments are merged, but not optimized. It happened at 1.8GB to our program, and now we develop and test in Win32 but run the code on Linux, which seems to be handling at least up to 3GB of index. In fact, if the index size is even beyond 1.8GB, Luke throws Java Heap Error, if I try to

AW: OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, >OK so this means it's not a leak, and instead it's just that stuff is >consuming more RAM than expected. Or that my test db is smaller than the production db which is indeed the case. >Hmm -- there are quite a few buffered deletes pending. It could be we >are under-accounting for RAM used

AW: OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, I do use Win32. What do you mean by "the index file before optimizations crosses your jvm memory usage settings (if say 512MB)"? Could you please explain this further? Stefan -Original Message- From: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov] Sent: Wed 24.06

AW: OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, there seems to be a little misunderstanding. The index will only be optimized if the IndexWriter is to be closed and then only with a probability of 2% (meaning occasionally). In other words, I only close the IndexWriter (and thus optimize) to avoid the OOMError. When I keep the same Index

Setting swappiness

2009-06-24 Thread Otis Gospodnetic
Stealing this thread/idea, but changing subject, so we can branch and I don't look like a thread thief. I never played with /proc/sys/vm/swappiness, but I wonder if there are points in the lifetime of an index where this number should be changed. For example, does it make sense to in/decreas
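For anyone who wants to log the current setting next to query latencies, the value can be read straight from the /proc file named above. A minimal sketch, assuming a Linux host: the class name Swappiness is hypothetical, and on systems without that file the read simply returns an empty result. Changing the value requires root and is done outside the JVM.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

/** Hypothetical reader for the Linux vm.swappiness setting. */
final class Swappiness {
    static final Path PROC_FILE = Paths.get("/proc/sys/vm/swappiness");

    /** Parses the file content (a single integer plus newline); empty if malformed. */
    static OptionalInt parse(String raw) {
        try {
            return OptionalInt.of(Integer.parseInt(raw.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    /** Reads the current value, or empty when the file is missing or unreadable. */
    static OptionalInt read() {
        if (!Files.isReadable(PROC_FILE)) {
            return OptionalInt.empty();
        }
        try {
            byte[] bytes = Files.readAllBytes(PROC_FILE);
            return parse(new String(bytes, StandardCharsets.US_ASCII));
        } catch (IOException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        OptionalInt value = read();
        System.out.println(value.isPresent()
            ? "vm.swappiness = " + value.getAsInt()
            : "vm.swappiness not available on this system");
    }
}
```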

RE: OutOfMemoryError using IndexWriter

2009-06-24 Thread Sudarsan, Sithu D.
Hi Stefan, Are you using Windows 32 bit? If so, sometimes, if the index file before optimizations crosses your jvm memory usage settings (if say 512MB), there is a possibility of this happening. Increase JVM memory settings if that is the case. Sincerely, Sithu D Sudarsan Off: 301-796-2587

Re: OutOfMemoryError using IndexWriter

2009-06-24 Thread Michael McCandless
On Wed, Jun 24, 2009 at 7:43 AM, stefan wrote: > I tried with 100MB heap size and got the Error as well, it runs fine with > 120MB. OK so this means it's not a leak, and instead it's just that stuff is consuming more RAM than expected. > Here is the histogram (application classes marked with --

Re: OutOfMemoryError using IndexWriter

2009-06-24 Thread Otis Gospodnetic
Hi Stefan, While not directly the source of your problem, I have a feeling you are optimizing too frequently (and wasting time/CPU by doing so). Is there a reason you optimize so often? Try optimizing only at the end, when you know you won't be adding any more documents to the index for a whi

AW: OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, I tried with 100MB heap size and got the Error as well, it runs fine with 120MB. Here is the histogram (application classes marked with --) Heap Histogram All Classes (excluding platform) Class Instance Count Total Size class [C 234200 30245722 class [B 1087565 25

Re: OutOfMemoryError using IndexWriter

2009-06-24 Thread Michael McCandless
Hmm -- I think your test env (80 MB heap, 50 MB used by app + 16 MB IndexWriter RAM buffer) is a bit too tight. The 16 MB buffer for IW is not a hard upper bound on how much RAM it may use. EG when merges are running, more RAM will be required, if a large doc brought it over the 16 MB limit it wi
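The arithmetic behind "a bit too tight" can be written out as a small sketch. The figures (80 MB heap, 50 MB used by the application, 16 MB RAM buffer) come from the message above; the class and method names are hypothetical.

```java
/** Hypothetical helper that works out the heap left over for merges and oversized documents. */
final class HeapBudget {
    /** Heap remaining once the application baseline and the RAM buffer are subtracted. */
    static long headroomMb(long heapMb, long appMb, long ramBufferMb) {
        return heapMb - appMb - ramBufferMb;
    }

    public static void main(String[] args) {
        // 80 MB heap, 50 MB application, 16 MB IndexWriter RAM buffer.
        System.out.println(headroomMb(80, 50, 16) + " MB headroom"); // prints "14 MB headroom"
    }
}
```

Since the 16 MB buffer is not a hard upper bound, those 14 MB have to absorb running merges and any single large document, which is why a small increase in heap size can make the error disappear.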

AW: OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, I do not set a RAM Buffer size, I assume default is 16MB. My server runs with 80MB heap size, before starting lucene about 50MB is used. In a production environment I ran into this problem with heap size set to 750MB with no other activity on the server (nighttime), though since then I diagnos

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Michael McCandless
Is it possible the occasional large merge is clearing out the IO cache (thus "unwarming" your searcher)? (Though since you're rsync'ing your updates in, it sounds like a separate machine is building the index). Or... linux will happily swap out a process's core in favor of IO cache (though I'd ex

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread eks dev
Another performance tip: what helps "a lot" is sorting the collection before you index. If you can somehow logically partition your index, you can improve locality of reference by sorting. What I mean by this: imagine an index with the following fields: zip, user_group, some text if typical query
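The sorting tip can be sketched as follows. The field names zip, user_group and text come from the message above; the Doc class, the choice of zip as the primary key with user_group as the tie-breaker, and the sample data are assumptions for illustration. The sorted list would then be fed to the indexer in order, so that documents sharing a zip end up with neighbouring document IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical pre-indexing step: order documents by their partitioning fields. */
final class SortBeforeIndexing {
    /** Minimal stand-in for a record that is about to be indexed. */
    static final class Doc {
        final String zip;
        final String userGroup;
        final String text;

        Doc(String zip, String userGroup, String text) {
            this.zip = zip;
            this.userGroup = userGroup;
            this.text = text;
        }
    }

    /** Returns a copy sorted by zip, then user group; the input list is left untouched. */
    static List<Doc> sortedForIndexing(List<Doc> docs) {
        List<Doc> copy = new ArrayList<>(docs);
        copy.sort((a, b) -> {
            int byZip = a.zip.compareTo(b.zip);
            return byZip != 0 ? byZip : a.userGroup.compareTo(b.userGroup);
        });
        return copy;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("8000", "staff", "third"),
            new Doc("1000", "staff", "second"),
            new Doc("1000", "admin", "first"));
        for (Doc d : sortedForIndexing(docs)) {
            System.out.println(d.zip + " " + d.userGroup + " " + d.text);
        }
    }
}
```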

Re: OutOfMemoryError using IndexWriter

2009-06-24 Thread Michael McCandless
How large is the RAM buffer that you're giving IndexWriter? How large a heap size do you give to JVM? Can you post one of the OOM exceptions you're hitting? Mike On Wed, Jun 24, 2009 at 4:08 AM, stefan wrote: > Hi, > > I am using Lucene 2.4.1 to index a database with less than a million records

Re: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread eks dev
We've also had the same problem on a 150 million doc setup (Win 2003, java 1.6). After monitoring response time distribution over time for a couple of weeks, it was clear that such long running response times were due to bad warming-up. There were peaks shortly after index reload (even comprehensive warmi

OutOfMemoryError using IndexWriter

2009-06-24 Thread stefan
Hi, I am using Lucene 2.4.1 to index a database with less than a million records. The resulting index is about 50MB in size. I keep getting an OutOfMemory Error if I re-use the same IndexWriter to index the complete database. Re-using it is, however, recommended in the performance hints. What I now do is

RE: Analyzing performance and memory consumption for boolean queries

2009-06-24 Thread Uwe Schindler
> 1. For search time to vary from < 1 second => 20 seconds, the only > two things I've seen are: > > * Serious JVM garbage collection problems. > * You're in Linux swap hell. > > We tracked similar issues down by creating a testbed that let us run > a set of real-world queries, such that we could