Re: Lucene Indexing out of memory

2010-03-03 Thread ajay_gupta
Erick, w_context and context_str are local to this method and are used only for 2500K documents, not the entire 70K. I am clearing the hashmap after each 2500K-doc chunk is processed, and I also printed the memory consumed by the hashmap, which is roughly constant for each chunk. For each invocation of up

Re: Old Lucene src archive corrupt?

2010-03-03 Thread Scott Ribe
Or perhaps your download process is treating the archive file as text and translating "line endings" for you? -- Scott Ribe scott_r...@killerbytes.com http://www.killerbytes.com/ (303) 722-0567 voice

RE: Old Lucene src archive corrupt?

2010-03-03 Thread Uwe Schindler
Are you sure that you have no virus modifying the zip files after download? Have you compared the byte size, too? By the way, as this is Windows, can you set md5sum explicitly to binary mode when creating the sum (but the "*" before the filename is a sign of binary mode)? Also the md5 files fr

Re: Phrase search on NOT_ANALYZED content

2010-03-03 Thread Erick Erickson
NOT_ANALYZED is probably not what you want. NOT_ANALYZED stores the entire input as a *single* token, so you can never match on anything except the entire input. What did you hope to accomplish by indexing NOT_ANALYZED? That's actually a pretty specialized thing to do, perhaps there's a better way
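To make the single-token behaviour concrete, here is a minimal sketch against the Lucene 2.9 API (field name and text are made up, not from the thread): only a TermQuery on the entire original string hits a NOT_ANALYZED field.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class NotAnalyzedDemo {
      public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // NOT_ANALYZED: the whole string is indexed as one token.
        doc.add(new Field("title", "quick brown fox",
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir, true);
        // Hits: the query term equals the entire indexed value.
        System.out.println(searcher.search(
            new TermQuery(new Term("title", "quick brown fox")), 10).totalHits); // 1
        // No hits: "brown" was never indexed as a separate token.
        System.out.println(searcher.search(
            new TermQuery(new Term("title", "brown")), 10).totalHits);           // 0
        searcher.close();
      }
    }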

RE: Old Lucene src archive corrupt?

2010-03-03 Thread An Hong
I'm on WinXP. Somehow the MD5s don't match.

C:\download>md5sum lucene-2.9.1-src.zip
6135854e793302274c4e2384ae54bfde *lucene-2.9.1-src.zip

C:\download>cat lucene-2.9.1-src.zip.md5
e10b3833b8d324caec9d6f62aae6497c

C:\download>md5sum -c lucene-2.9.1-src.zip.md5
md5sum: lucene-2.9.1-src.zip.md5: no
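If any tool in the chain is translating line endings, the digest can also be computed from the raw bytes directly in Java; this is only an illustrative sketch using the standard MessageDigest API (the default file path is just an example), for comparing against the published .md5 value.

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.security.MessageDigest;

    public class Md5Check {
      public static void main(String[] args) throws Exception {
        // Example path; point this at the downloaded archive.
        String path = args.length > 0 ? args[0] : "lucene-2.9.1-src.zip";
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
          md.update(buf, 0, n);   // raw bytes, so no line-ending translation
        }
        in.close();
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
          hex.append(String.format("%02x", b));
        }
        System.out.println(hex + "  " + path); // compare with the .md5 file
      }
    }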

Phrase search on NOT_ANALYZED content

2010-03-03 Thread Murdoch, Paul
If I have indexed some content that contains some words and a single whitespace between each word as NOT_ANALYZED, is it possible to perform a phrase search on a portion of that content? I'm indexing and searching with the StandardAnalyzer 2.9. Using the KeywordAnalyzer works, but I have to
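For reference, a phrase over part of the content does match once the field is indexed as ANALYZED, which is the direction the reply above points; a minimal sketch under that assumption against the Lucene 2.9 API (field name and text are illustrative):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class PhraseSearchDemo {
      public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, analyzer,
            IndexWriter.MaxFieldLength.UNLIMITED);

        Document doc = new Document();
        // ANALYZED: each word becomes its own token, with positions.
        doc.add(new Field("body", "one two three four",
            Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_29, "body", analyzer);
        Query q = parser.parse("\"two three\"");          // phrase over part of the content
        System.out.println(searcher.search(q, 10).totalHits);  // 1
        searcher.close();
      }
    }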

RE: FastVectorHighlighter truncated queries

2010-03-03 Thread Digy
Before query.Rewrite, if query is a MultiTermQuery, then ((MultiTermQuery)query).setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE); solved my problem. DIGY
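In the Java API the same fix looks roughly like this (a sketch, not DIGY's exact Lucene.Net code): multi-term queries such as wildcard or prefix queries may rewrite to a constant-score form whose terms the highlighter cannot see, so switch them to a scoring BooleanQuery rewrite before calling rewrite().

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.MultiTermQuery;
    import org.apache.lucene.search.Query;

    public class HighlightRewrite {
      // Rewrite a query so that its expanded terms are visible to the
      // highlighter (e.g. FastVectorHighlighter's FieldQuery).
      public static Query prepareForHighlighting(Query query, IndexReader reader)
          throws IOException {
        if (query instanceof MultiTermQuery) {
          ((MultiTermQuery) query).setRewriteMethod(
              MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
        }
        return query.rewrite(reader);
      }
    }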

Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more

2010-03-03 Thread Otis Gospodnetic
Hello folks, Those of you in or near New York and using Lucene or Solr should come to "Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more" on March 24th: http://www.meetup.com/NYC-Search-and-Discovery/calendar/12720960/ The presenter will be the hyperactive Lucene committer R

2-pass scoring of top docs

2010-03-03 Thread Justin
I've looked at this for a couple days and hope someone can offer suggestions... In the past, we overrode Scorer::score(Collector), called super.score(Collector), called Collector.topDocs(), adjusted the scores for a portion of the top docs, then ran Collector.collect(int) to collect based on th
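Not the Collector-override approach described above, but a minimal alternative sketch of 2-pass scoring using only the public search API (adjust() is a placeholder for whatever the second pass computes): search once, adjust the scores of the leading candidates, then re-sort them.

    import java.util.Arrays;
    import java.util.Comparator;

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class TwoPassRescore {
      // First pass: normal search for the top `candidates` docs.
      // Second pass: adjust each candidate's score and re-sort.
      public static ScoreDoc[] rescore(IndexSearcher searcher, Query query,
                                       int candidates) throws Exception {
        TopDocs first = searcher.search(query, candidates);
        ScoreDoc[] hits = first.scoreDocs;
        for (ScoreDoc hit : hits) {
          hit.score = adjust(searcher, hit.doc, hit.score);
        }
        Arrays.sort(hits, new Comparator<ScoreDoc>() {
          public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(b.score, a.score);   // descending score
          }
        });
        return hits;
      }

      private static float adjust(IndexSearcher searcher, int doc, float score) {
        return score;   // placeholder: e.g. multiply by a stored per-doc factor
      }
    }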

Re: Lucene Indexing out of memory

2010-03-03 Thread Erick Erickson
The first place I'd look is how big your strings got. w_context and context_str come to mind. My first suspicion is that you're building ever-longer strings and around 70K documents your strings are large enough to produce OOMs. FWIW Erick

Re: Lucene Indexing out of memory

2010-03-03 Thread ajay_gupta
Mike, actually my documents are very small. We have CSV files where each record represents a document, which is not very large, so I don't think document size is an issue. I tokenize each record, and for each token I keep its 3 neighbouring tokens in a Hashtable. After X number
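A minimal sketch of the chunked pattern being described, with illustrative names and chunk size (the tokenizing and neighbour-recording logic is elided): the per-chunk neighbour map is cleared and the writer committed after every chunk, so nothing should accumulate across chunks.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class ChunkedIndexer {
      private static final int CHUNK_SIZE = 2500;   // example chunk size

      // records: one CSV record per entry; writer: an already-open IndexWriter.
      public static void index(List<String> records, IndexWriter writer)
          throws Exception {
        Map<String, List<String>> context = new HashMap<String, List<String>>();
        int inChunk = 0;
        for (String record : records) {
          // ... tokenize the record and put token -> neighbouring tokens
          //     into `context` here ...
          Document doc = new Document();
          doc.add(new Field("body", record, Field.Store.NO, Field.Index.ANALYZED));
          writer.addDocument(doc);

          if (++inChunk == CHUNK_SIZE) {
            context.clear();   // drop per-chunk state so it cannot accumulate
            writer.commit();   // let Lucene flush its in-memory buffer
            inChunk = 0;
          }
        }
      }
    }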

Re: Lucene Indexing out of memory

2010-03-03 Thread Michael McCandless
The worst case RAM usage for Lucene is a single doc with many unique terms. Lucene allocates ~60 bytes per unique term (plus space to hold that term's characters = 2 bytes per char). And, Lucene cannot flush within one document -- it must flush after the doc has been fully indexed. This past thr
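If the writer's in-memory buffer is a factor, it can be bounded explicitly; a minimal sketch against the 2.9 API (the directory path and the 32 MB figure are just examples, to be tuned against the JVM heap):

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class WriterSetup {
      public static IndexWriter open(File path) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.open(path),
            new StandardAnalyzer(Version.LUCENE_29),
            IndexWriter.MaxFieldLength.UNLIMITED);
        // Flush the in-memory buffer once it reaches ~32 MB instead of
        // letting it grow unbounded.
        writer.setRAMBufferSizeMB(32.0);
        return writer;
      }
    }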

Re: Lucene Indexing out of memory

2010-03-03 Thread Erick Erickson
Interpolating from your data (and, by the way, some code examples would help a lot), if you're reopening the index reader to pick up recent additions but not closing it if a different one is returned from reopen, you'll consume resources. From the JavaDocs... IndexReader new = r.reopen(); if (ne
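The JavaDoc snippet being quoted uses new as a variable name, which won't compile; the same pattern as compilable Java (the helper name is illustrative):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;

    public class ReopenHelper {
      // Reopen a reader and close the old instance if reopen() returned a
      // different one; otherwise keep using the original.
      public static IndexReader refresh(IndexReader reader) throws IOException {
        IndexReader reopened = reader.reopen();
        if (reopened != reader) {
          reader.close();   // without this, stale readers pile up and leak resources
        }
        return reopened;
      }
    }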

Re: Lucene Indexing out of memory

2010-03-03 Thread Ian Lea
Lucene doesn't load everything into memory and can carry on running consecutive searches or loading documents for ever without hitting OOM exceptions. So if it isn't failing on a specific document the most likely cause is that your program is hanging on to something it shouldn't. Previous docs? Fi

Re: Lucene Indexing out of memory

2010-03-03 Thread ajay_gupta
Ian, the OOM exception point varies; it isn't fixed. It can happen anywhere once memory exceeds a certain point. I have allocated 1 GB of memory to the JVM. I haven't used a profiler. When I said it fails after 70K docs I meant approximately 70K documents, but if I reduce memory it will OOM before 70K, so it's not sp

[ANN] Carrot2 3.2.0 released

2010-03-03 Thread Stanislaw Osinski
Dear All, I'm happy to announce three releases from the Carrot Search team: Carrot2 v3.2.0, Lingo3G v1.3.1 and Carrot Search Labs. Carrot2 is an open source search results clustering engine. Version v3.2.0 introduces: * experimental support for clustering Korean and Arabic content, * a