Timon Roth wrote:
hello list
I'm puzzling over the following problem: in my index I can't find the word BE,
but it exists in two documents. I'm using Lucene 2.4 with the StandardAnalyzer.
Other queries with words like de, et or de la work fine. Any ideas?
regards,
Timon
"be" is a stopword in StandardAnalyzer's default stop set. Do
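If you need "be" to be searchable, one option is to construct the analyzer
with an empty stop set, at index time and at query time alike. A minimal
sketch against the Lucene 2.4 API (adapt to however you build your analyzer):

import java.util.Collections;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Sketch: a StandardAnalyzer with no stop words, so terms like "be" survive.
// The same analyzer must be used both for indexing and for query parsing.
StandardAnalyzer analyzer = new StandardAnalyzer(Collections.EMPTY_SET);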
Likely this is because under the hood when IndexWriter flushes your
deletes, it opens readers. It closes the readers as soon as the
deletes are done, thus creating a fair amount of garbage (which looks
like memory used by the JVM).
How are you measuring the memory usage? Likely it's mostly garbage
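For what it's worth, a rough way to tell live heap from collectable garbage
is to compare used memory after suggesting a collection. A small sketch
(System.gc() is only a hint to the JVM, so treat the numbers as approximate):

// Sketch: report heap actually in use after a suggested GC, rather than
// the raw process size, which also counts not-yet-collected garbage.
Runtime rt = Runtime.getRuntime();
System.gc();
long usedBytes = rt.totalMemory() - rt.freeMemory();
System.out.println("Used heap after GC: " + usedBytes / (1024 * 1024) + " MB");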
You can also run vmstat or iostat and watch if the high latency
queries correspond to lots of swap-ins.
Mike
On Wed, Jun 24, 2009 at 3:54 PM, Nigel wrote:
> This is interesting, and counter-intuitive: more queries could actually
> improve overall performance.
>
> The big-index-and-slow-query-rate
On Wed, Jun 24, 2009 at 3:38 PM, Nigel wrote:
> Yes, we're indexing on a separate server, and rsyncing from index snapshots
> there to the search servers. Usually rsync has to copy just a few small
> .cfs files, but every once in a while merging will produce a big one. I'm
> going to try to limi
Have you checked whether GC affects you? A first step would be to turn on GC
logging with -verbose:gc -XX:+PrintGCDetails
If you see some relation between query time and gc messages, you should try
to use a better parallelized GC and change the perm size and so on (see docs
about GC tuning).
-
Uwe
This is interesting, and counter-intuitive: more queries could actually
improve overall performance.
The big-index-and-slow-query-rate scenario does describe our situation. I'll try
running some tests that run queries at various rates concurrent with
occasional big I/O operations that use the disk cache.
Hi Mike,
Yes, we're indexing on a separate server, and rsyncing from index snapshots
there to the search servers. Usually rsync has to copy just a few small
.cfs files, but every once in a while merging will produce a big one. I'm
going to try to limit this by setting maxMergeMB, but of course t
Hi Uwe,
Good points, thank you. The obvious place where GC really has to work hard
is when index changes are rsync'd over and we have to open the new index and
close the old one. Our slow performance times don't seem to be directly
correlated with the index rotation, but maybe it just appears th
Thanks Otis -- I'll give that a try. I think this relates to the first
question in my original message, which was what (if any) part of the inverted
index structure is explicitly cached by Lucene in the JVM. Clearly there's
something, since a large JVM heap is required to avoid running out of
memory,
Besides choosing the right analyzer, you could run into problems if you
use a query parser, as it will interpret your parentheses.
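One way around that is to escape the parser's metacharacters in the raw term,
or to build the query programmatically and skip the parser entirely. A sketch
(assuming you do go through QueryParser; the input string is hypothetical):

import org.apache.lucene.queryParser.QueryParser;

// Sketch: escape parser metacharacters such as ( and ) before parsing.
String raw = "foo(bar)";                  // hypothetical user input
String escaped = QueryParser.escape(raw); // -> foo\(bar\)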
simon
On Wed, Jun 24, 2009 at 8:11 PM, Erick Erickson wrote:
> First, I highly, highly recommend you get a copy of Luke to examine your
> index. It'll also help you underst
First, I highly, highly recommend you get a copy of Luke to examine your
index. It'll also help you understand the role of Analyzers.
Your first problem is that StandardAnalyzer probably removes
the open and close parens. See:
http://lucene.apache.org/java/2_4_1/api/index.html
so you can't search o
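To see exactly what an analyzer does to input like "(be)", it can also help to
dump its token stream. A minimal sketch against the 2.4 TokenStream API (which
still exposes next()):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDump {
    public static void main(String[] args) throws Exception {
        // Sketch: print the tokens StandardAnalyzer actually produces.
        StandardAnalyzer analyzer = new StandardAnalyzer();
        TokenStream ts = analyzer.tokenStream("field", new StringReader("(some text)"));
        for (Token t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t.term());
        }
    }
}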
Hi Ken,
Thanks for your reply. I agree that your overall diagnosis (GC problems
and/or swapping) sounds likely. To follow up on some of the specific things
you mentioned:
2. 250M/4 = 60M docs/index. The old rule of thumb was 10M docs/index as a
> reasonable size. You might just need more hardware.
Hi all,
I am using a Standard analyzer on both my search field and my query.
I use a SpanNearQuery to search on the search field.
One of the query terms has special characters, ( round open bracket
and ) round close bracket. How does Lucene handle this?
Also, the search field has ( and
I was wondering if anybody else that has been using updateDocument
noticed it uses a large amount of memory when updating an existing document.
For example, when using updateDocument on an empty Lucene directory, the
resulting 12K documents creates a 3MB index, the amount of memory the
program
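For reference, the pattern under discussion looks roughly like this (a sketch;
the writer, the field names and the id variable are all assumed):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;

// Sketch: updateDocument is delete-by-term plus add, so each document
// needs a unique key field to update against.
Document doc = new Document();
doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));
writer.updateDocument(new Term("id", id), doc);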
On Wed, Jun 24, 2009 at 10:23 AM, stefan wrote:
> does Lucene keep the complete index in memory ?
No.
Certain things (deleted docs, norms, field cache, terms index) are
loaded into memory, but these are tiny compared to what's not loaded
into memory (postings, stored docs, term vectors).
> As s
On Wed, Jun 24, 2009 at 10:18 AM, stefan wrote:
>
> Hi,
>
>
>>OK so this means it's not a leak, and instead it's just that stuff is
>>consuming more RAM than expected.
> Or that my test db is smaller than the production db which is indeed the case.
But a "leak" would keep leaking over time, right?
My opinion is swappiness should generally be set to zero, thus turning
off "swap core out in favor of IO cache".
I don't think the OS's simplistic LRU policy is smart enough to know
which RAM (that Lucene had allocated & filled) is OK to move to
disk. EG you see the OS evict stuff because Lucene
Hi Stefan,
Generally, I run the memory monitor and see how much of it is being used. The
total memory usage by the application is what we see on the monitor. So long as
it is less than 1.8GB, things are fine.
In your case, you are assuming that the program takes only 50MB, which may be
actually
Hi,
does Lucene keep the complete index in memory ?
As stated before, the resulting index is 50MB; this would correlate with the
memory footprint required by Lucene as seen in my app:
JVM 120MB - 50MB (Lucene) - 50MB (my app) = something left
JVM 100MB - 50MB (Lucene) - 50MB (my app) = OutOfMemoryError
though s
When the segments are merged, but not optimized. It happened at 1.8GB to our
program, and now we develop and test in Win32 but run the code on Linux, which
seems to be handling at least up to 3GB of index.
In fact, if the index size is beyond 1.8GB, even Luke throws a Java heap error
if I try to
Hi,
>OK so this means it's not a leak, and instead it's just that stuff is
>consuming more RAM than expected.
Or that my test db is smaller than the production db which is indeed the case.
>Hmm -- there are quite a few buffered deletes pending. It could be we
>are under-accounting for RAM used
Hi,
I do use Win32.
What do you mean by "the index file before
optimizations crosses your jvm memory usage settings (if say 512MB)" ?
Could you please further explain this ?
Stefan
-----Original Message-----
From: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov]
Sent: Wed 24.06
Hi,
there seems to be a little misunderstanding. The index will only be optimized
if the IndexWriter is to be closed and then only with a probability of 2%
(meaning occasionally).
In other words, I only close the IndexWriter (and thus optimize) to avoid the
OOMError.
When I keep the same Index
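In code, the approach described sounds roughly like this (a sketch; the random
generator and the writer are assumed):

// Sketch: optimize only occasionally (about 2% of the time) when closing.
if (random.nextDouble() < 0.02) {
    writer.optimize();
}
writer.close();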
Stealing this thread/idea, but changing subject, so we can branch and I don't
look like a thread thief.
I never played with /proc/sys/vm/swappiness, but I wonder if there are points
in the lifetime of an index where this number should be changed. For example,
does it make sense to in/decrease
Hi Stefan,
Are you using Windows 32 bit? If so, sometimes, if the index file before
optimizations crosses your jvm memory usage settings (if say 512MB),
there is a possibility of this happening.
Increase JVM memory settings if that is the case.
Sincerely,
Sithu D Sudarsan
Off: 301-796-2587
On Wed, Jun 24, 2009 at 7:43 AM, stefan wrote:
> I tried with a 100MB heap size and got the error as well; it runs fine with
> 120MB.
OK so this means it's not a leak, and instead it's just that stuff is
consuming more RAM than expected.
> Here is the histogram (application classes marked with --
Hi Stefan,
While not directly the source of your problem, I have a feeling you are
optimizing too frequently (and wasting time/CPU by doing so). Is there a
reason you optimize so often? Try optimizing only at the end, when you know
you won't be adding any more documents to the index for a whi
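In other words, something like this (a sketch; dir and docs are assumed):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch: add everything with one writer, optimize once at the very end.
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
        IndexWriter.MaxFieldLength.UNLIMITED);
for (Document doc : docs) {
    writer.addDocument(doc);
}
writer.optimize();   // a single optimize once indexing is done
writer.close();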
Hi,
I tried with a 100MB heap size and got the error as well; it runs fine with 120MB.
Here is the histogram (application classes marked with --)
Heap Histogram
All Classes (excluding platform)
Class       Instance Count    Total Size
class [C    234200            30245722
class [B    1087565           25
Hmm -- I think your test env (80 MB heap, 50 MB used by app + 16 MB
IndexWriter RAM buffer) is a bit too tight. The 16 MB buffer for IW
is not a hard upper bound on how much RAM it may use. EG when merges
are running, more RAM will be required, if a large doc brought it over
the 16 MB limit it wi
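For reference, that buffer is the flush trigger you set on the writer; a sketch
(16 MB is the 2.4 default, and the writer variable is assumed):

// Sketch: the RAM buffer is a flush threshold, not a hard cap on memory.
writer.setRAMBufferSizeMB(16.0);  // IndexWriter.DEFAULT_RAM_BUFFER_SIZE_MB
writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);  // flush by RAM only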
Hi,
I do not set a RAM Buffer size, I assume default is 16MB.
My server runs with 80MB heap size, before starting lucene about 50MB is used.
In a production environment I run into this problem with heap size set to 750MB
with no other activity on the server (nighttime), though since then I diagnos
Is it possible the occasional large merge is clearing out the IO cache
(thus "unwarming" your searcher)? (Though since you're rsync'ing your
updates in, it sounds like a separate machine is building the index).
Or... linux will happily swap out a process's core in favor of IO
cache (though I'd ex
Another performance tip: what helps "a lot" is collection sorting before you
index.
if you can somehow logically partition your index, you can improve locality of
reference by sorting.
What I mean by this:
imagine an index with the following fields: zip, user_group, some text
if typical query
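As a sketch of that idea (Record, its fields, toDocument() and the writer are
all hypothetical here):

// Sketch: sort the collection by the fields you typically filter on, so
// postings for the same zip/user_group end up close together on disk.
Collections.sort(records, new Comparator<Record>() {
    public int compare(Record a, Record b) {
        int c = a.zip.compareTo(b.zip);
        return c != 0 ? c : a.userGroup.compareTo(b.userGroup);
    }
});
for (Record r : records) {
    writer.addDocument(toDocument(r));   // toDocument() maps a Record to a Document
}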
How large is the RAM buffer that you're giving IndexWriter? How large
a heap size do you give to JVM?
Can you post one of the OOM exceptions you're hitting?
Mike
On Wed, Jun 24, 2009 at 4:08 AM, stefan wrote:
> Hi,
>
> I am using Lucene 2.4.1 to index a database with less than a million records
We've also had the same problem on a 150M-doc setup (Win 2003, Java 1.6). After
monitoring the response time distribution over a couple of weeks, it was
clear that such long-running response times were due to bad warm-up. There
were peaks shortly after index reload (even comprehensive warmi
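A sketch of the warm-up step (warmupQueries stands in for whatever queries are
representative of your real traffic):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch: open the new index, run representative queries against it,
// and only then swap it in for the live searcher.
IndexSearcher fresh = new IndexSearcher(IndexReader.open(directory));
for (Query q : warmupQueries) {
    fresh.search(q, 10);   // touches postings, norms and caches while still offline
}
// swap 'fresh' in for the old searcher here, then close the old one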
Hi,
I am using Lucene 2.4.1 to index a database with less than a million records.
The resulting index is about 50MB in size.
I keep getting an OutOfMemoryError if I re-use the same IndexWriter to index
the complete database, even though re-using the writer is
recommended in the performance hints.
What I now do is
> 1. For search time to vary from < 1 second => 20 seconds, the only
> two things I've seen are:
>
> * Serious JVM garbage collection problems.
> * You're in Linux swap hell.
>
> We tracked similar issues down by creating a testbed that let us run
> a set of real-world queries, such that we could