100MB of text for a single Lucene document, into a single analyzed field. The
analyzer is basically the StandardAnalyzer, with minor changes (a sketch follows
below):
1. UAX29URLEmailTokenizer instead of the StandardTokenizer. This doesn't
split URLs and email addresses (so we can do it ourselves in the next step).
2. Spli
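A minimal sketch of what item 1 might look like, assuming the Lucene 5.x-style Analyzer API (createComponents without a Reader) and a lowercase filter standing in for the rest of the truncated chain; the class name is illustrative:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;

// StandardAnalyzer-like chain, except the tokenizer keeps URLs and email
// addresses as single tokens so later filters can split them deliberately.
public class UrlEmailAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer();
        TokenStream stream = new LowerCaseFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, stream);
    }
}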
I've had success limiting the number of documents by size, and doing them 1
at a time works OK with 2G heap. I'm also hoping to understand why memory
usage would be so high to begin with, or maybe this is expected?
I agree that indexing 100+M of text is a bit silly, but the use case is a
legal con
On Wed, Nov 26, 2014 at 2:09 PM, Erick Erickson wrote:
> Well
> 2> seriously consider the utility of indexing a 100+M file. Assuming
> it's mostly text, lots and lots and lots of queries will match it, and
> it'll score pretty low due to length normalization. And you probably
> can't return it to
Is that 100MB for a single Lucene document? And is that 100MB for a single
field? Is that field analyzed text? How complex is the analyzer? Like, does
it do ngrams or something else that is token or memory intensive? Posting
the analyzer might help us see what the issue might be.
Try indexing
Well
1> don't send 20 docs at once. Or send docs over some size N by themselves.
2> seriously consider the utility of indexing a 100+M file. Assuming
it's mostly text, lots and lots and lots of queries will match it, and
it'll score pretty low due to length normalization. And you probably
can't re
o disable norms, do we need to rebuild our index entirely?
> By the way, We have 8 million documents and our jvm heap is 5G.
>
>
> Thanks & Best Regards!
> -- Original --
> From: "Michael McCandless"
We have 8 million documents and our jvm heap is 5G.
>
>
> Thanks & Best Regards!
> -- Original --
> From: "Michael McCandless";;
> Date: Sat, Sep 13, 2014 06:29 PM
> To: "Lucene Users";
&
e norms, do we need to rebuild our index entirely? By
the way, We have 8 million documents and our jvm heap is 5G.
Thanks & Best Regards!
-- Original --
From: "Michael McCandless";;
Date: Sat, Sep 13, 2014 06:29 PM
To: "Lucene Us
documents and our jvm heap is 5G.
Thanks & Best Regards!
-- Original --
From: "Michael McCandless";;
Date: Sat, Sep 13, 2014 06:29 PM
To: "Lucene Users";
Subject: Re: OutOfMemoryError throwed by SimpleMergedSegmentWarmer
The w
The warmer just tries to load norms/docValues/etc. for all fields that
have them enabled ... so this is likely telling you an IndexReader
would also hit OOME.
You either need to reduce the number of fields you have indexed, or at
least disable norms (takes 1 byte per doc per indexed field regardle
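As an illustration of the "disable norms" suggestion, here is a minimal sketch assuming the Lucene 4+ FieldType API (the field name and text are placeholders). Norms are written at index time, so documents indexed before the change keep theirs until they are re-indexed.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;

public class NoNormsExample {
    // Builds a document whose "body" field is indexed without norms.
    static Document makeDoc(String text) {
        FieldType noNorms = new FieldType(TextField.TYPE_NOT_STORED);
        noNorms.setOmitNorms(true);
        noNorms.freeze();
        Document doc = new Document();
        doc.add(new Field("body", text, noNorms));
        return doc;
    }
}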
ok, found it:
we are using Cloudera CDHu3u; they change the ulimit for child jobs,
but I still don't know how to change their default settings yet.
On Wed, Jun 13, 2012 at 2:15 PM, Yang wrote:
> I got the OutOfMemoryError when I tried to open an Lucene index.
>
> it's very weird since this is o
,
Tamara
- Original Message -
> From: "Otis Gospodnetic"
> To: java-user@lucene.apache.org
> Sent: Tuesday, October 18, 2011 11:14:12 PM
> Subject: Re: OutOfMemoryError
>
> Bok Tamara,
>
> You didn't say what -Xmx value you are using. Try a little higher
&
Hi,
> ...I get around 3
> million hits. Each of the hits is processed and information from a certain
> field is
> used.
That's of course fine, but:
> After certain number of hits, somewhere around 1 million (not always the same
> number) I get OutOfMemory exception that looks like this:
You did
Tamara,
You may use a StringBuffer instead of String docText =
hits.doc(j).getField("DOCUMENT").stringValue();
after that you may use StringBuffer.delete() to release memory.
Another way is to use a 64-bit machine.
Regards,
Mead
On Wed, Oct 19, 2011 at 5:14 AM, Otis Gospodnetic <
otis_gospodne...@
Bok Tamara,
You didn't say what -Xmx value you are using. Try a little higher value. Note
that loading field values (and it looks like this one may be big because it is
compressed) from a lot of hits is not recommended.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene e
Try 1) reducing the RAM buffer of your IndexWriter
(IndexWriter.setRAMBufferSizeMB), 2) using a term divisor when opening
your reader (pass 2 or 3 or 4 as termInfosIndexDivisor when opening
IndexReader), and 3) disabling norms or not indexing as many fields as
possible.
70Mb is not that much RAM t
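A sketch of suggestions 1 and 2 against the Lucene 3.x-era API this thread is about; the path, buffer size, and divisor value are placeholders, and the exact overloads vary by release.

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class LowMemoryConfig {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/path/to/index"));

        // 1) A smaller RAM buffer bounds indexing-time heap use.
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36,
                new StandardAnalyzer(Version.LUCENE_36));
        iwc.setRAMBufferSizeMB(16.0);
        IndexWriter writer = new IndexWriter(dir, iwc);
        writer.close();

        // 2) A termInfosIndexDivisor of 2-4 loads proportionally fewer
        //    indexed terms into RAM when the reader is opened.
        IndexReader reader = IndexReader.open(dir, null, true, 2);
        reader.close();
    }
}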
Ok Erick,
Thanks for your quick answer.
FSDirectory will, indeed, store the index on disk. However,
when *using* that index, lots of stuff happens. Specifically:
When indexing, there is a buffer that accumulates documents
until it's flushed to disk. Are you indexing?
When searching (and this
FSDirectory will, indeed, store the index on disk. However,
when *using* that index, lots of stuff happens. Specifically:
When indexing, there is a buffer that accumulates documents
until it's flushed to disk. Are you indexing?
When searching (and this is the more important part), various
caches a
Hi Otis,
no, I don't use sort. But I use TopFieldCollector and I have to
instantiate a Sort object with new Sort(). The data are returned unsorted.
On Fri, Mar 5, 2010 at 7:38 PM, Otis Gospodnetic wrote:
> Maybe it's not a leak, Monique. :)
> If you use sorting in Lucene, then the FieldCache
Maybe it's not a leak, Monique. :)
If you use sorting in Lucene, then the FieldCache object will keep some data
permanently in memory, for example.
Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/
- Original Message -
ww.thetaphi.de
eMail: u...@thetaphi.de
-Original Message-
From: Nuno Seco [mailto:ns...@dei.uc.pt]
Sent: Thursday, November 12, 2009 6:08 PM
To: java-user@lucene.apache.org
Subject: Re: OutOfMemoryError when using Sort
Ok. Thanks.
The doc. says:
"Finds the top |n| hits for |que
should be enough for
5-grams).
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Nuno Seco [mailto:ns...@dei.uc.pt]
> Sent: Thursday, November 12, 2009 6:08 PM
> To: java-user@lucene.apache.org
&
It is only sorting the top 50 hits, yes, but to do that, it needs to look at the
*value* of the field for each and every one of the billions of documents. You
can
do this without using memory if you're willing to deal with disk seeks, but
doing billions of those is going to mean that this query most
Ok. Thanks.
The doc. says:
"Finds the top |n| hits for |query|, applying |filter| if non-null, and
sorting the hits by the criteria in |sort|."
I understood that only the hits (50 in this) for the current search
would be sorted...
I'll just do the ordering afterwards. Thank you for clarifyin
Sorting utilizes a FieldCache: the forward lookup - the value a document has
for a
particular field (as opposed to the usual "inverted" way of looking up all
documents
which contain a given term) - which lives in memory, and takes up as much
space
as 4 bytes * numDocs.
If you've indexed the en
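For scale (an illustrative calculation, not a figure from the thread): at 4 bytes per document, sorting on one field of an index with 50 million documents pins roughly 50,000,000 x 4 bytes, about 190 MB, in the FieldCache for that single field, regardless of how many hits the query returns.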
To sort on the count field, it must be indexed (but not tokenized); it does not
need to be stored. But in any case, sort needs lots of memory. How many
documents do you have?
Uwe
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original M
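A minimal sketch of that suggestion, using the Lucene 2.9/3.x-era Field API from this thread (the field name and value are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class SortKeyField {
    // The sort key is indexed as a single untokenized term, without norms,
    // and is not stored.
    static void addCount(Document doc, int count) {
        doc.add(new Field("count", String.valueOf(count),
                Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
    }
}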
Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Thu 25.06.2009 13:13
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError using IndexWriter
>
> Can you post your test code? If you can make it a standalone test,
> then I can repro and dig down faster.
OK it looks like no merging was done.
I think the next step is to call
IndexWriter.setMaxBufferedDeleteTerms(1000) and see if that prevents
the OOM.
Mike
On Thu, Jun 25, 2009 at 7:16 AM, stefan wrote:
> Hi,
>
> Here are the result of CheckIndex. I ran this just after I got the OOError.
>
> OK [4
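A sketch of that next step against the Lucene 2.4-era IndexWriter API used in this thread (the setter later moved to IndexWriterConfig and was eventually removed); the RAM buffer value is only an illustrative companion setting.

import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class BoundedDeleteBuffer {
    static IndexWriter openWriter(Directory dir) throws IOException {
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(),
                IndexWriter.MaxFieldLength.UNLIMITED);
        // Flush buffered delete terms once 1000 accumulate instead of
        // holding them all in RAM until commit.
        writer.setMaxBufferedDeleteTerms(1000);
        writer.setRAMBufferSizeMB(32);
        return writer;
    }
}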
rays, I will need some
>> more time for this.
>>
>> Stefan
>>
>> -Original Message-
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Wed 24.06.2009 17:50
>> To: java-user@lucene.apache.org
>> Subject: Re: OutO
time for this.
>
> Stefan
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Wed 24.06.2009 17:50
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError using IndexWriter
>
> On Wed, Jun 24, 2009 at 10:1
On Thu, Jun 25, 2009 at 3:02 AM, stefan wrote:
>>But a "leak" would keep leaking over time, right? Ie even a 1 GB heap
>>on your test db should eventually throw OOME if there's really a leak.
> No, not necessarily, since I stop indexing once everything is indexed - I
> shall try repeated runs wit
On Wed, Jun 24, 2009 at 10:23 AM, stefan wrote:
> does Lucene keep the complete index in memory ?
No.
Certain things (deleted docs, norms, field cache, terms index) are
loaded into memory, but these are tiny compared to what's not loaded
into memory (postings, stored docs, term vectors).
> As s
On Wed, Jun 24, 2009 at 10:18 AM, stefan wrote:
>
> Hi,
>
>
>>OK so this means it's not a leak, and instead it's just that stuff is
>>consuming more RAM than expected.
> Or that my test db is smaller than the production db which is indeed the case.
But a "leak" would keep leaking over time, right?
: Sudarsan, Sithu D. [mailto:sithu.sudar...@fda.hhs.gov]
Sent: Wed 24.06.2009 16:18
To: java-user@lucene.apache.org
Subject: RE: OutOfMemoryError using IndexWriter
When the segments are merged, but not optimized. It happened at 1.8GB to our
program, and now we develop and test in Win32 but run the
apache.org
Subject: RE: OutOfMemoryError using IndexWriter
Hi Stefan,
Are you using 32-bit Windows? If so, sometimes, if the index file size before
optimization exceeds your JVM memory setting (say 512MB),
there is a possibility of this happening.
Increase JVM memory settings if that i
Hi Stefan,
Are you using 32-bit Windows? If so, sometimes, if the index file size before
optimization exceeds your JVM memory setting (say 512MB),
there is a possibility of this happening.
Increase JVM memory settings if that is the case.
Sincerely,
Sithu D Sudarsan
Off: 301-796-2587
On Wed, Jun 24, 2009 at 7:43 AM, stefan wrote:
> I tried with 100MB heap size and got the Error as well, it runs fine with
> 120MB.
OK so this means it's not a leak, and instead it's just that stuff is
consuming more RAM than expected.
> Here is the histogram (application classes marked with --
Hi Stefan,
While not directly the source of your problem, I have a feeling you are
optimizing too frequently (and wasting time/CPU by doing so). Is there a
reason you optimize so often? Try optimizing only at the end, when you know
you won't be adding any more documents to the index for a whi
) 3268608 (size)
>
> Well, something I should do differently ?
>
> Stefan
>
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Wed 24.06.2009 10:48
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemory
How large is the RAM buffer that you're giving IndexWriter? How large
a heap size do you give to JVM?
Can you post one of the OOM exceptions you're hitting?
Mike
On Wed, Jun 24, 2009 at 4:08 AM, stefan wrote:
> Hi,
>
> I am using Lucene 2.4.1 to index a database with less than a million records
I am very interested indeed; do I understand correctly that the tweak
you made reduces the memory used when searching if you have many docs in
the index? I am omitting norms too.
If that is the case, can someone point me to the required
change that should be done? I understand from Yonik's comm
On Mon, 2008-01-07 at 14:20 -0800, Otis Gospodnetic wrote:
> Please post your results, Lars!
Tried the patch, and it failed to compile (plain Lucene compiled fine).
In the process, I looked at TermQuery and found that it'd be easier to
copy that code and just hardcode 1.0f for all norms. Did tha
On Jan 7, 2008 5:00 AM, Lars Clausen <[EMAIL PROTECTED]> wrote:
> Doesn't appear to be the case in our test. We had two fields with
> norms, omitting saved only about 4MB for 50 million entries.
It should be 50MB. If you are measuring with an external tool, then
that tool is probably in error.
Please post your results, Lars!
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Lars Clausen <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Monday, January 7, 2008 5:00:54 AM
Subject: Re: OutOfMemoryError on small sea
On Tue, 2008-01-01 at 23:38 -0800, Chris Hostetter wrote:
> : On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote:
>
> : Seems there's a reason we still use all this memory:
> : SegmentReader.fakeNorms() creates the full-size array for us anyway, so
> : the memory usage cannot be avoided as lon
: On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote:
: Seems there's a reason we still use all this memory:
: SegmentReader.fakeNorms() creates the full-size array for us anyway, so
: the memory usage cannot be avoided as long as somebody asks for the
: norms array at any point. The solution
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote:
> I've now made trial runs with no norms on the two indexed fields, and
> also tried with varying TermIndexIntervals. Omitting the norms saves
> about 4MB on 50 million entries, much less than I expected.
Seems there's a reason we still use
On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote:
> Increasing
> the TermIndexInterval by a factor of 4 gave no measurable savings.
Following up on myself because I'm not 100% sure that the indexes have
the term index intervals I expect, and I'd like to check. Where can I
see what term ind
On Tue, 2007-11-13 at 07:26 -0800, Chris Hostetter wrote:
> : > Can it be right that memory usage depends on size of the index rather
> : > than size of the result?
> :
> : Yes, see IndexWriter.setTermIndexInterval(). How much RAM are you giving to
> : the JVM now?
>
> and in general: yes. Luc
: > Can it be right that memory usage depends on size of the index rather
: > than size of the result?
:
: Yes, see IndexWriter.setTermIndexInterval(). How much RAM are you giving to
: the JVM now?
and in general: yes. Lucene is using memory so that *lots* of searches
can be fast ... if you r
On Dienstag, 13. November 2007, Lars Clausen wrote:
> Can it be right that memory usage depends on size of the index rather
> than size of the result?
Yes, see IndexWriter.setTermIndexInterval(). How much RAM are you giving to
the JVM now?
Regards
Daniel
--
http://www.danielnaber.de
---
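A sketch of the setter Daniel names, against the Lucene 2.x-era API of this thread (the interval value is illustrative; the default is 128):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public class TermIndexIntervalExample {
    // A larger interval keeps a smaller slice of the terms dictionary in RAM
    // (roughly 1/interval of all terms), at the cost of slower term lookups.
    static IndexWriter openWriter(Directory dir, Analyzer analyzer) throws IOException {
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        writer.setTermIndexInterval(512);
        return writer;
    }
}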
I believe the problem is that the text value is not the only data
associated with a token; there is, for instance, the position offset.
Depending on your JVM, each instance reference consumes 64 bits or so,
so even if the text value is flyweighted by String.intern() there is
a cost. I doubt tha
I have indexed around 100 M of data with 512M to the JVM heap. So that gives
you an idea. If every token is the same word in one file, shouldn't the
tokenizer recognize that ?
Try using Luke. That helps solving lots of issues.
-
AZ
On 9/1/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> I can't
I can't answer the question of why the same token
takes up memory, but I've indexed far more than
20M of data in a single document field. As in on the
order of 150M. Of course I allocated 1G or so to the
JVM, so you might try that
Best
Erick
On 8/31/07, Per Lindberg <[EMAIL PROTECTED]> wrote:
Thx for ur quick reply.
I will go through it.
Rgds,
Jelda
> -Original Message-
> From: mark harwood [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 02, 2006 5:03 PM
> To: java-user@lucene.apache.org
> Subject: RE: OutOfMemoryError while enumerating through
> read
"Category counts" should really be a FAQ entry.
There is no one right solution to prescribe because it
depends on the shape of your data.
For previous discussions/code samples see here:
http://www.mail-archive.com/java-user@lucene.apache.org/msg05123.html
and here for more space-efficient repre
mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 02, 2006 4:41 PM
> To: java-user@lucene.apache.org
> Subject: RE: OutOfMemoryError while enumerating through
> reader.terms(fieldName)
>
> I am trying to implement category count almost similar to
> CNET approach.
> At the initia
PM
> To: java-user@lucene.apache.org
> Subject: RE: OutOfMemoryError while enumerating through
> reader.terms(fieldName)
>
> >>Any advice is really welcome.
>
> Don't cache all that data.
> You need a minimum of (numUniqueTerms*numDocs)/8 bytes to
> hold that info.
>
>>Any advice is really welcome.
Don't cache all that data.
You need a minimum of (numUniqueTerms*numDocs)/8 bytes
to hold that info.
Assuming 10,000 unique terms and 1 million docs you'd
need over 1 Gig of RAM.
I suppose the question is what are you trying to
achieve and why can't you use the exis
Hi,
I just debugged it closely. Sorry, I am getting the OutOfMemoryError not because
of reader.terms()
but because of invoking the QueryFilter.bits() method for each unique term.
I will try to explain with pseudo code.
while (term != null) {
    if (term.field().equals(name)) {
        String termText
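For contrast, a hypothetical sketch (Lucene 1.9/2.0-era API from this thread) that walks the same terms but reads docFreq() instead of building a QueryFilter BitSet (maxDoc/8 bytes each) per term. Note this yields whole-index counts, not counts restricted to a query's results, and docFreq() still includes deleted documents.

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class TermCounts {
    static void printCounts(IndexReader reader, String fieldName) throws IOException {
        TermEnum terms = reader.terms(new Term(fieldName, ""));
        try {
            do {
                Term term = terms.term();
                if (term == null || !term.field().equals(fieldName)) {
                    break;
                }
                // Per-term document count without allocating a per-term BitSet.
                System.out.println(term.text() + " -> " + terms.docFreq());
            } while (terms.next());
        } finally {
            terms.close();
        }
    }
}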
Aha, it's not initially clear, but after looking at it more closely, I see how
it works
now. This is very good to know.
Tony Schwartz
[EMAIL PROTECTED]
> Tony Schwartz wrote:
>> What about the TermInfosReader class? It appears to read the entire term
>> set for the
>> segment into 3 arrays
Tony Schwartz wrote:
What about the TermInfosReader class? It appears to read the entire term set
for the
segment into 3 arrays. Am I seeing double on this one?
p.s. I am looking at the current sources.
see TermInfosReader.ensureIndexIsRead();
The index only has 1/128 of the terms, by def
Tony Schwartz wrote:
I think you're jumping into the conversation too late. What you have said here
does not
address the problem at hand. That is, in TermInfosReader, all terms in the
segment get
loaded into three very large arrays.
That's not true. Only 1/128th of the terms are loaded by
On Thursday 18 August 2005 14:32, Tony Schwartz wrote:
> Is this a viable solution?
> Doesn't this make sorting and filtering much more complex and much more
> expensive as well?
Sorting would have to be done on more than one field.
I would expect that to be possible.
As for filtering: would you
ww.aviransplace.com
>
> -Original Message-
> From: Tony Schwartz [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 18, 2005 8:32 AM
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError on addIndexes()
>
> Is this a viable solution?
> Doesn't this m
-user@lucene.apache.org
Subject: Re: OutOfMemoryError on addIndexes()
Is this a viable solution?
Doesn't this make sorting and filtering much more complex and much more
expensive as well?
Tony Schwartz
[EMAIL PROTECTED]
> On Wednesday 17 August 2005 22:49, Paul Elschot wrote:
>> > the i
Is this a viable solution?
Doesn't this make sorting and filtering much more complex and much more
expensive as well?
Tony Schwartz
[EMAIL PROTECTED]
> On Wednesday 17 August 2005 22:49, Paul Elschot wrote:
>> > the index could potentially be huge.
>> >
>> > So if this is indeed the case, it is
On Wednesday 17 August 2005 22:49, Paul Elschot wrote:
> > the index could potentially be huge.
> >
> > So if this is indeed the case, it is a potential scalability
> > bottleneck in lucene index size.
>
> Splitting the date field into century, year in century, month, day, hour,
> seconds, and
>
hanks,
>
> Tony Schwartz
> [EMAIL PROTECTED]
> From: John Wang <[EMAIL PROTECTED]>
> Subject: Re: OutOfMemoryError on addIndexes()
> Under many us
urned me in the past. I am going to start
working on and
testing a solution to this, but was wondering if anyone had already messed with
it or
had any ideas up front?
Thanks,
Tony Schwartz
[EMAIL PROTECTED]
From: John Wang <[EMAIL PROTECTED]>
Subject: Re: OutOfMemoryError on
the nature of your indexes?
>
>
> : Date: Fri, 12 Aug 2005 09:45:40 +0200
> : From: Trezzi Michael <[EMAIL PROTECTED]>
> : Reply-To: java-user@lucene.apache.org
> : To: java-user@lucene.apache.org
> : Subject: RE: OutOfMemoryError on addIndexes()
> :
> : I did some m
aps some binary data is mistakenly getting treated as
strings?
can you tell us more about the nature of your indexes?
: Date: Fri, 12 Aug 2005 09:45:40 +0200
: From: Trezzi Michael <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subjec
To: java-user@lucene.apache.org
Subject: Re: OutOfMemoryError on addIndexes()
How much memory are you giving your programs?
java -Xmx<size>    set maximum Java heap size
--
Ian.
On 10/08/05, Trezzi Michael <[EMAIL PROTECTED]> wrote:
> Hello,
> I have a problem and I tried everythin
m: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Thursday, August 11, 2005 11:15 AM
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError on addIndexes()
>
> > > Is -Xmx case sensitive? Should it be 1000m instead of 1000M?
> Not
> > > sure.
>
would shrink the java memory pool back down to the min?
Thanks,
Tom
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 11, 2005 11:15 AM
To: java-user@lucene.apache.org
Subject: Re: OutOfMemoryError on addIndexes()
> > Is -Xmx case sensitive?
> > Is -Xmx case sensitive? Should it be 1000m instead of 1000M? Not
> > sure.
> >
>
> I'm starting with:
> java -Xms256M -Xmx512M -jar Suchmaschine.jar
And if you look at the size of your JVM, does it really use all 512 MB?
If it does not, maybe you can try this:
java -Xms256m -Xmx512m -j
Otis Gospodnetic schrieb:
> Is -Xmx case sensitive? Should it be 1000m instead of 1000M? Not
> sure.
>
I'm starting with:
java -Xms256M -Xmx512M -jar Suchmaschine.jar
--
The Analytical Engine (the computer) can only carry out what we are able
to program. (Ada Lovelace)
.
>
> Michael
>
>
>
> From: Ian Lea [mailto:[EMAIL PROTECTED]
> Sent: Wed 10.8.2005 12:34
> To: java-user@lucene.apache.org
> Subject: Re: OutOfMemoryError on addIndexes()
>
>
>
> How much memory are you giving your programs
Subject: Re: OutOfMemoryError on addIndexes()
How much memory are you giving your programs?
java -Xmx<size>    set maximum Java heap size
--
Ian.
On 10/08/05, Trezzi Michael <[EMAIL PROTECTED]> wrote:
> Hello,
> I have a problem and I tried everything I could think of to solve it. To
How much memory are you giving your programs?
java -Xmx<size>    set maximum Java heap size
--
Ian.
On 10/08/05, Trezzi Michael <[EMAIL PROTECTED]> wrote:
> Hello,
> I have a problem and I tried everything I could think of to solve it. To
> understand my situation, I create indexes on several
Hi,
If I replace my lucene wrapper with a dummy one the problem goes away.
If I close my index-thread every 30 minutes and start a new thread it
also goes away.
If I exit the thread on OutOfMemory errors it regains all memory.
I do not use static variables. If I did they wouldn't get garbage
colle
Might be interesting to know if it crashed on 2 docs if you ran it
with heap size of 512Mb. I guess you've already tried with default
merge values. Shouldn't need to optimize after every 100 docs. jdk
1.3 is pretty ancient - can you use 1.5?
I'd try it with a larger heap size, and then look