Erick,
w_context and context_str are local to this method and are used only for
2,500 documents at a time, not the entire 70 k. I am clearing the hashmap after
each 2,500-doc chunk, and I also printed the memory consumed by the hashmap,
which is roughly constant for each chunk. For each invocation of
up
Or perhaps your download process is treating the archive file as text and
translating "line endings" for you?
--
Scott Ribe
scott_r...@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Are you sure that you have no virus modifying the zip files after download?
Have you compared the byte size, too?
By the way, as this is Windows, can you set md5sum explicitly to binary mode
when creating the sum (though the "*" before the filename is already a sign of
binary mode)? Also the md5 files fr
NOT_ANALYZED is probably not what you want.
NOT_ANALYZED stores the entire input as
a *single* token, so you can never match on
anything except the entire input.
What did you hope to accomplish by indexing
NOT_ANALYZED? That's actually a pretty
specialized thing to do; perhaps there's a better
way
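For illustration (this snippet is not from the thread; the field names and
text are made up), the difference looks roughly like this with the 2.9 Field
API:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document doc = new Document();
    // NOT_ANALYZED: the whole string is indexed as one token, so only a query
    // on the exact string "quick brown fox" can ever match this field.
    doc.add(new Field("exact", "quick brown fox",
                      Field.Store.YES, Field.Index.NOT_ANALYZED));
    // ANALYZED: tokenized into "quick", "brown", "fox", so term and phrase
    // queries on parts of the text can match.
    doc.add(new Field("text", "quick brown fox",
                      Field.Store.YES, Field.Index.ANALYZED));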
I'm on WinXP. Somehow the md5 sums don't match.
C:\download>md5sum lucene-2.9.1-src.zip
6135854e793302274c4e2384ae54bfde *lucene-2.9.1-src.zip
C:\download>cat lucene-2.9.1-src.zip.md5
e10b3833b8d324caec9d6f62aae6497c
C:\download>md5sum -c lucene-2.9.1-src.zip.md5
md5sum: lucene-2.9.1-src.zip.md5: no
If I have indexed some content consisting of words separated by single
spaces as NOT_ANALYZED, is it possible to perform a phrase search on a
portion of that content? I'm indexing and searching with the
StandardAnalyzer in 2.9. Using the KeywordAnalyzer works, but I have to
Before query.rewrite, adding

    if (query instanceof MultiTermQuery)
        ((MultiTermQuery) query).setRewriteMethod(
            MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);

solved my problem.
DIGY
-Original Message-
From: halbtuerderschwarze [mailto:halbtuerderschwa...@web.de]
Sent: Wednesday, Fe
Hello folks,
Those of you in or near New York and using Lucene or Solr should come to
"Lucene: Finite-State Queries, Flexible Indexing, Scoring, and more" on March
24th:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12720960/
The presenter will be the hyperactive Lucene committer R
I've looked at this for a couple days and hope someone can offer suggestions...
In the past, we overrode Scorer::score(Collector), called
super.score(Collector), called Collector.topDocs(), adjusted the scores for a
portion of the top docs, then ran Collector.collect(int) to collect based on
th
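One possible sketch of that kind of post-collection score adjustment against
the 2.9 API (this is only an illustration, not the poster's code; the cutoff
of 10 and the 1.5f boost are invented, and searcher/query are assumed to
exist already):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopScoreDocCollector;

    TopScoreDocCollector collector = TopScoreDocCollector.create(100, true);
    searcher.search(query, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;
    for (int i = 0; i < hits.length && i < 10; i++) {
        hits[i].score *= 1.5f;       // adjust scores for a portion of the top docs
    }
    Arrays.sort(hits, new Comparator<ScoreDoc>() {  // re-sort by the adjusted scores
        public int compare(ScoreDoc a, ScoreDoc b) {
            return Float.compare(b.score, a.score);
        }
    });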
The first place I'd look is how big your strings
got. w_context and context_str come to mind. My
first suspicion is that you're building ever-longer
strings and around 70K documents your strings
are large enough to produce OOMs.
FWIW
Erick
On Wed, Mar 3, 2010 at 1:09 PM, ajay_gupta wrote:
>
Mike,
Actually my documents are very small. We have CSV files where each
record represents a document, and the documents are not very large, so I
don't think document size is the issue.
I am tokenizing each record, and for each token I keep its 3
neighbouring tokens in a Hashtable. After X number
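A hypothetical reconstruction of that accumulation (the real code isn't shown
in the thread; the names and the exact neighbour window are guesses):

    import java.util.ArrayList;
    import java.util.Hashtable;
    import java.util.List;

    // If this table lives for the whole run instead of being cleared per chunk,
    // it grows with every record and will eventually exhaust a 1 GB heap.
    Hashtable<String, List<String>> neighbours =
            new Hashtable<String, List<String>>();

    void addRecord(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            List<String> context = neighbours.get(tokens[i]);
            if (context == null) {
                context = new ArrayList<String>();
                neighbours.put(tokens[i], context);
            }
            for (int j = Math.max(0, i - 3); j < i; j++) {
                context.add(tokens[j]);   // keep up to 3 preceding tokens
            }
        }
    }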
The worst case RAM usage for Lucene is a single doc with many unique
terms. Lucene allocates ~60 bytes per unique term (plus space to hold
that term's characters = 2 bytes per char). And, Lucene cannot flush
within one document -- it must flush after the doc has been fully
indexed.
This past thr
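As a back-of-the-envelope illustration of those figures (the term count and
average term length below are invented, not from this thread):

    // ~60 bytes of per-term bookkeeping + 2 bytes per character of the term.
    long uniqueTerms  = 1000000L;  // hypothetical: one doc with 1 M unique terms
    long avgTermChars = 10L;       // hypothetical average term length
    long bytes = uniqueTerms * (60L + 2L * avgTermChars);
    System.out.println((bytes / (1024L * 1024L)) + " MB");  // ~76 MB, none flushable mid-document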
Interpolating from your data (and, by the way, some code
examples would help a lot), if you're reopening the index
reader to pick up recent additions but not closing it if a
different one is returned from reopen, you'll consume
resources. From the JavaDocs...
IndexReader newReader = r.reopen();
if (newReader != r) r.close();   // the old reader must be closed if a new one was returned
r = newReader;
Lucene doesn't load everything into memory and can carry on running
consecutive searches or loading documents for ever without hitting OOM
exceptions. So if it isn't failing on a specific document, the most
likely cause is that your program is hanging on to something it
shouldn't. Previous docs? Fi
Ian,
The OOM exception point varies; it isn't fixed. It could come anywhere once
memory exceeds a certain point.
I have allocated 1 GB of memory to the JVM. I haven't used a profiler.
When I said it fails after 70 K docs I meant approximately 70 k documents, but
if I reduce the memory it will OOM before 70 K, so it's not sp
Dear All,
I'm happy to announce three releases from the Carrot Search team: Carrot2
v3.2.0, Lingo3G v1.3.1 and Carrot Search Labs.
Carrot2 is an open source search results clustering engine. Version v3.2.0
introduces:
* experimental support for clustering Korean and Arabic content,
* a