IndexWriter is thread-safe and has been for a while
(http://www.mail-archive.com/[EMAIL PROTECTED]/msg00157.html)
so you don't have to worry about that.
As reported in my blog in April
(http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html)
but perhaps not explicitly enough
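For illustration, here is a minimal sketch (not from the thread) of several threads
sharing one IndexWriter; the index path, field names and pool size are all made up:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class ParallelIndexing {
    public static void main(String[] args) throws Exception {
        // One writer, shared by all indexing threads.
        final IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
                new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 1000; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        Document doc = new Document();
                        doc.add(new Field("id", Integer.toString(id),
                                Field.Store.YES, Field.Index.NOT_ANALYZED));
                        writer.addDocument(doc);   // addDocument() is safe to call concurrently
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        writer.close();
    }
}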
Hi gang,
Wondering how folks have addressed scaling up indexing. I saw old threads
about using a clustered webapp with a JNDI singleton index writer due to the
Lucene single-writer limitation. Is this limitation lifted in 3.x maybe?
Is there a best strategy for parallel writing to an index by many
threads?
Actually looking at this a little deeper maybe Lucene could/should
automatically be doing this "short" optimisation here?
Given a comparatively small set of unique terms (as in your example) it seems
feasible that FieldCacheImpl could allocate a short[reader.maxDoc] array rather
than an int[reader.maxDoc] one.
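A rough sketch of what that could save, using the public FieldCache.StringIndex
(this is not FieldCacheImpl internals, and it still builds the int[] first; it just
illustrates the target footprint):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Down-cast the per-document ordinal array to short[] when the field has
// fewer than ~32k unique terms (e.g. ~1460 distinct dates).
static short[] compactOrds(IndexReader reader, String field) throws Exception {
    FieldCache.StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, field);
    if (idx.lookup.length > Short.MAX_VALUE) {
        throw new IllegalArgumentException("too many unique terms for short ordinals");
    }
    short[] ords = new short[idx.order.length];   // 2 bytes per doc instead of 4
    for (int doc = 0; doc < ords.length; doc++) {
        ords[doc] = (short) idx.order[doc];
    }
    return ords;   // roughly 60 MB for 30M docs vs ~120 MB for an int[]
}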
I have had a similar problem. What I do is load all the date field values at
index startup and convert the dates (timestamps) to a numeric value (# of seconds
since 1970/1/1). Then I pre-sort that array using a very fast O(n) distribution
sort, and then keep an array of integers which is the pre-sorted p
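A minimal sketch of that idea, assuming the per-document day numbers (days since
1970-01-01) have already been loaded into an int[]; the names are made up:

// Counting sort (O(n)) that returns the doc ids in chronological order.
// dayOfDoc[doc] = days since 1970-01-01 for that document.
static int[] docIdsInDateOrder(int[] dayOfDoc, int minDay, int maxDay) {
    int[] counts = new int[maxDay - minDay + 2];
    for (int doc = 0; doc < dayOfDoc.length; doc++) {
        counts[dayOfDoc[doc] - minDay + 1]++;           // histogram of days
    }
    for (int d = 1; d < counts.length; d++) {
        counts[d] += counts[d - 1];                     // prefix sums = start offsets
    }
    int[] sorted = new int[dayOfDoc.length];
    for (int doc = 0; doc < dayOfDoc.length; doc++) {
        sorted[counts[dayOfDoc[doc] - minDay]++] = doc; // place doc in its day's slot
    }
    return sorted;
}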
That's a really good idea Mark! :)
Thanks! I'll try to see if I can make a quick change with your suggestion.
(Too bad "quick" isn't really a word in my vocabulary when it's 6 o'clock on
a Friday.) :(
Guess it'll be a looong night.. :(
Cheers,
Aleks
On Fri, 10 Oct 2008 17:07:31 +0200, mark harwood wrote:
Update: The statement "...cost is field size (10 bytes ?) times number of
documents" is wrong.
What you actually have is the cost of the unique strings (estimated at 10 *
1460 - effectively nothing) BUT you have to add the cost of the array of object
references to those strings, so 30 million references.
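As a back-of-the-envelope check (the 10-byte string size and 4-byte references
are assumptions):

long maxDoc = 30000000L;                 // docs in one index
long uniqueTerms = 4 * 365;              // ~1460 distinct date strings
long stringCost = uniqueTerms * 10;      // ~15 KB - effectively nothing
long referenceCost = maxDoc * 4;         // one reference per doc = ~115 MB
System.out.println((stringCost + referenceCost) / (1024 * 1024) + " MB per index");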
Yes, I understand that, and I did mean the number of documents, but I read
in the javadoc that:
"For String fields, the cache is larger: in addition to the above array,
the value of every term in the field is kept in memory. If there are many
unique terms in the field, this could be quite large."
I think you have your memory cost calculation wrong.
The cost is field size (10 bytes ?) times number of documents, NOT number of
unique terms.
The cache is essentially an array of size reader.maxDoc() which is indexed
into directly by docId to retrieve field values.
You are right in needing to
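That array-indexed-by-docId shape looks roughly like this with the public
FieldCache API (field name and index path assumed):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

IndexReader reader = IndexReader.open("/path/to/index");
// One entry per document, sized reader.maxDoc(); built on first use, then cached.
String[] dates = FieldCache.DEFAULT.getStrings(reader, "date");
String dateOfDoc42 = dates[42];          // direct lookup by internal doc id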
Unfortunately no, since the documents that are added may come from a new
"source" containing old documents as well... :/
I tried deploying our web application without any searcher objects and it
consumes basically ~200 MB of memory in Tomcat.
With 6 searchers the same application manages to consume
Assuming content is added in chronological order and with no updates to
existing docs, couldn't you rely on the internal Lucene document id to give a
chronological sort order?
That would require no memory cache at all when sorting.
Querying across multiple indexes simultaneously, however, may present a
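In Lucene terms that would just be an index-order sort, along these lines
(the index path and query are placeholders):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

IndexSearcher searcher = new IndexSearcher("/path/to/index");
Query query = new MatchAllDocsQuery();   // stand-in for the real query
// Doc ids are assigned in insertion order, so this is chronological order
// if documents were added chronologically and never updated or deleted.
TopDocs docs = searcher.search(query, null, 10, new Sort(SortField.FIELD_DOC));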
I'll follow up on my own question...
Let's say that we have 4 years of data, meaning that there will be roughly
4 * 365 = 1460 unique terms for our sort field.
For one index, let's say with 30 million docs, the cache should use approx
100 MB, or am I wrong? And thus for 6 indexes we would need a
True, I guess I was thinking of things from a search-only
perspective when I claimed they were identical... But
you're absolutely right in that you can retrieve them in
order (assuming you stored them) by getFields.
Best
Erick
On Thu, Oct 9, 2008 at 10:29 PM, John Griffin <[EMAIL PROTECTED]> wrote:
The release bits are indeed propagating through all mirrors, but I'm
going to wait until tomorrow to do the announcement, to make sure all
mirrors catch up.
Mike
Hi there,
I've just noticed that there is already a 2.4 release available on ibiblio
(http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/lucene/lucene-core/2.4.0/),
but there is no official release notification yet. What's the status of
these artifacts? When will 2.4 be officially released?
Hello, I've read a lot of threads now on memory consumption and sorting,
and I think I have a pretty good understanding of how things work, but I
could still use some input here...
We currently have a system consisting of 6 different Lucene indexes (all
have the same structure, so you could
Ah, sorry. Just saw the bit about the free text query too.
A FieldCache is the answer here, I suspect, in order to quickly retrieve the
date values for arbitrary queries.
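For example, a sketch along those lines (the field name "date", the analyzer, and
the parsed query are all assumptions):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Count hits per date for an arbitrary free-text query, pulling the date of
// each hit from the field cache instead of loading stored fields.
IndexReader reader = IndexReader.open("/path/to/index");
final String[] dates = FieldCache.DEFAULT.getStrings(reader, "date");
final Map<String, Integer> countsPerDate = new HashMap<String, Integer>();
Query query = new QueryParser("text", new StandardAnalyzer()).parse("your free text");
IndexSearcher searcher = new IndexSearcher(reader);
searcher.search(query, new HitCollector() {
    public void collect(int doc, float score) {
        String date = dates[doc];
        Integer count = countsPerDate.get(date);
        countsPerDate.put(date, count == null ? 1 : count + 1);
    }
});
// countsPerDate now maps each date value to its hit count for the trend graph.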
Assuming your date data is held as YYYYMMDD and you want daily totals:

Term startTerm = new Term("date", "20080101");
TermEnum termEnum = indexReader.terms(startTerm);
do
{
    Term currentTerm = termEnum.term();
    if (currentTerm == null || currentTerm.field() != startTerm.field())
        break;
    // docFreq() is the number of documents containing this date term
    System.out.println(currentTerm.text() + "\t" + termEnum.docFreq());
} while (termEnum.next());
termEnum.close();
Hi.
Anyone have an idea of how I would create a query which finds the data
backing a trend graph where the date is on the X axis and num(docs) is on the Y axis?
This is quite a common use case in "buzz" analysis and currently I'm doing a
stupid query which iterates over the date range and queries Lucene for every
date.
From what I can understand, you want to insert the word "history" and then
get proposed "related" terms in combination with your input query.
In essence this would be to do a "look-up" on the top terms in the subset of
documents matching the initial query "history". Exactly how you could do
this
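One possible sketch (it assumes term vectors were stored for a "text" field and
that `query` is the already-parsed initial query for "history"; the names are made
up): run the query, tally the terms of the top hits, and propose the most frequent
ones that aren't already in the query.

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Aggregate term frequencies over the top hits of the initial query.
// 'query' is assumed to be the parsed Query for "history".
IndexReader reader = IndexReader.open("/path/to/index");
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs top = searcher.search(query, null, 50);            // top 50 docs for "history"
Map<String, Integer> termCounts = new HashMap<String, Integer>();
for (ScoreDoc sd : top.scoreDocs) {
    TermFreqVector tfv = reader.getTermFreqVector(sd.doc, "text");
    if (tfv == null) continue;                             // doc had no term vector
    String[] terms = tfv.getTerms();
    int[] freqs = tfv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
        Integer count = termCounts.get(terms[i]);
        termCounts.put(terms[i], count == null ? freqs[i] : count + freqs[i]);
    }
}
// Sort termCounts by value (descending), drop the original query terms, and
// present the top few as "related" suggestions.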