Re: Indexing Scalability, Multiwriter?

2008-10-10 Thread Glen Newton
IndexWriter is thread-safe and has been for a while (http://www.mail-archive.com/[EMAIL PROTECTED]/msg00157.html) so you don't have to worry about that. As reported in my blog in April (http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html) but perhaps not explicitly enoug

Indexing Scalability, Multiwriter?

2008-10-10 Thread Darren Govoni
Hi gang, Wondering how folks have addressed scaled-up indexing. I saw old threads about using a clustered webapp with a JNDI singleton index writer due to the Lucene single-writer limitation. Is this limitation lifted in 3 maybe? Is there a best strategy for parallel writing to an index by many threads

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
Actually looking at this a little deeper maybe Lucene could/should automatically be doing this "short" optimisation here? Given a comparatively small set of unique terms (as in your example) it seems feasible that FieldCacheImpl could allocate a short[reader.maxDoc] array rather than an int[rea
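
A minimal sketch of the idea described above, in plain Java rather than inside Lucene itself. The class `OrdinalPacker` and its `pack` method are hypothetical names invented for illustration and are not part of FieldCacheImpl. The sketch assumes every document has a non-null value and stores one ordinal (the rank of the term among the sorted unique terms) per document, using a short[] when the number of unique terms is small enough and an int[] otherwise.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical helper (not a Lucene class): stores per-document term ordinals in the narrowest array. */
final class OrdinalPacker {

    /** Returns a short[] when every ordinal fits in a short, otherwise an int[]. */
    static Object pack(String[] valuePerDoc) {
        // Sorted unique terms; a term's index in this array is its ordinal.
        String[] terms = new TreeSet<>(Arrays.asList(valuePerDoc)).toArray(new String[0]);

        if (terms.length <= Short.MAX_VALUE) {
            // Few unique terms: 2 bytes per document.
            short[] ords = new short[valuePerDoc.length];
            for (int doc = 0; doc < ords.length; doc++) {
                ords[doc] = (short) Arrays.binarySearch(terms, valuePerDoc[doc]);
            }
            return ords;
        }

        // Many unique terms: fall back to 4 bytes per document.
        int[] ords = new int[valuePerDoc.length];
        for (int doc = 0; doc < ords.length; doc++) {
            ords[doc] = Arrays.binarySearch(terms, valuePerDoc[doc]);
        }
        return ords;
    }
}
```

Comparing two documents by ordinal gives the same order as comparing their terms, so the per-document array is all a sort needs. With a short[] the array takes about half the space of an int[] of the same length.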

RE: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread Robert Stewart
I have had a similar problem. What I do is load all the date field values at index startup, convert dates (timestamps) to a Julian date (# of seconds since 1970/1/1). Then I pre-sort that array using a very fast O(n) distribution sort, and then keep an array of integers which is the pre-sorted p
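
One way such a pre-sort could look is a counting sort, which is one kind of O(n) distribution sort. This is only a sketch under assumptions: the message does not say which distribution sort is used, the class `PresortedPositions` and its `rank` method are hypothetical names, and the keys are assumed to be small non-negative integers (for example day numbers) with a known maximum.

```java
/** Hypothetical sketch (not from the original message): counting sort that yields each doc's sorted position. */
final class PresortedPositions {

    /**
     * Returns rank[doc] = position of doc in ascending key order.
     * Ties keep document-id order. Keys must lie in the range 0..maxKey.
     */
    static int[] rank(int[] keyPerDoc, int maxKey) {
        // start[k + 1] first counts how many docs have key k.
        int[] start = new int[maxKey + 2];
        for (int key : keyPerDoc) {
            start[key + 1]++;
        }
        // Prefix sums: start[k] becomes the number of docs with a key smaller than k.
        for (int k = 0; k <= maxKey; k++) {
            start[k + 1] += start[k];
        }
        // Hand out consecutive positions within each key bucket.
        int[] rank = new int[keyPerDoc.length];
        for (int doc = 0; doc < keyPerDoc.length; doc++) {
            rank[doc] = start[keyPerDoc[doc]]++;
        }
        return rank;
    }
}
```

Sorting search hits then reduces to comparing `rank[docA]` with `rank[docB]`, so only one int per document has to stay in memory.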

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread Aleksander M. Stensby
That's a really good idea Mark! :) Thanks! Will try to see if I can make a quick change with your suggestion. (Too bad quick isn't really a word in my vocabulary when it's 6 o'clock on a Friday :( Guess it'll be a looong night.. :( Cheers, Aleks On Fri, 10 Oct 2008 17:07:31 +0200, mark harwo

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
Update: The statement "...cost is field size (10 bytes ?) times number of documents" is wrong. What you actually have is the cost of the unique strings (estimated at 10 * 1460 -effectively nothing) BUT you have to add the cost of the array of object references to those strings so 30m
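
The two cost components named in this update can be put into a rough back-of-the-envelope calculation. The sketch below is an illustration only: `SortCacheEstimate` is a hypothetical name, the 4 bytes per reference is an assumption (it depends on the JVM), and object headers and the real in-memory size of Java strings are ignored.

```java
/** Hypothetical back-of-the-envelope estimate; the per-entry sizes are assumptions, not measurements. */
final class SortCacheEstimate {

    /** Bytes for the unique term values plus one reference per document. */
    static long bytes(long maxDoc, long uniqueTerms, int bytesPerTerm, int bytesPerReference) {
        long termBytes = uniqueTerms * bytesPerTerm;       // e.g. 1460 terms * 10 bytes
        long referenceBytes = maxDoc * bytesPerReference;  // one array slot per document
        return termBytes + referenceBytes;
    }

    public static void main(String[] args) {
        // 30 million docs, 1460 unique terms of about 10 bytes, assumed 4-byte references.
        System.out.println(bytes(30_000_000L, 1460L, 10, 4) + " bytes");
    }
}
```

Under these assumptions the per-document array dominates: the unique strings contribute 14,600 bytes while the reference array contributes 120,000,000 bytes.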

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread Aleksander M. Stensby
Yes, I understand that, and I did mean the number of documents, but I read in the javadoc that: "For String fields, the cache is larger: in addition to the above array, the value of every term in the field is kept in memory. If there are many unique terms in the field, this could be quite l

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
I think you have your memory cost calculation wrong. The cost is field size (10 bytes ?) times number of documents NOT number of unique terms. The cache is essentially an array of size reader.maxDoc() which is indexed directly by docId to retrieve field values. You are right in needing to

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread Aleksander M. Stensby
Unfortunately no, since the documents that are added may come from a new "source" containing old documents as well..:/ I tried deploying our web application without any searcher objects and it consumes basically ~200mb of memory in tomcat. With 6 searchers the same application manages to consume

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread mark harwood
Assuming content is added in chronological order and with no updates to existing docs couldn't you rely on internal Lucene document id to give a chronological sort order? That would require no memory cache at all when sorting. Querying across multiple indexes simultaneously however may present a

Re: Question regarding sorting and memory consumption in lucene

2008-10-10 Thread Aleksander M. Stensby
I'll follow up on my own question... Let's say that we have 4 years of data, meaning that there will be roughly 4 * 365 = 1460 unique terms for our sort field. For one index, let's say with 30 million docs, the cache should use approx 100mb, or am I wrong? And thus for 6 indexes we would need a

Re: Only last field indexed

2008-10-10 Thread Erick Erickson
True, I guess I was thinking of things from a search-only perspective when I claimed they were identical... But you're absolutely right in that you can retrieve them in order (assuming you stored them) by getFields. Best Erick On Thu, Oct 9, 2008 at 10:29 PM, John Griffin <[EMAIL PROTECTED]> wrote

Re: Release 2.4 on ibiblio

2008-10-10 Thread Michael McCandless
The release bits are indeed propagating through all mirrors, but I'm going to wait until tomorrow to do the announcement, to make sure all mirrors catch up. Mike Hardy Ferentschik wrote: Hi there, I've just noticed that there is already a 2.4 release available on ibiblio (http://mirro

Release 2.4 on ibiblio

2008-10-10 Thread Hardy Ferentschik
Hi there, I've just noticed that there is already a 2.4 release available on ibiblio (http://mirrors.ibiblio.org/pub/mirrors/maven2/org/apache/lucene/lucene-core/2.4.0/), but there is no official release notification yet. What's the status of these artifacts? When will 2.4 be officially rele

Question regarding sorting and memory consumption in lucene

2008-10-10 Thread Aleksander M. Stensby
Hello, I've read a lot of threads now on memory consumption and sorting, and I think I have a pretty good understanding of how things work, but I could still use some input here.. We currently have a system consisting of 6 different lucene indexes (all have the same structure, so you could

Re: Buzz measurement - Aggregate functions

2008-10-10 Thread mark harwood
Ah, sorry. Just saw the bit about the free text query too. A FieldCache is the answer here I suspect in order to quickly retrieve the date values for arbitrary queries. - Original Message From: mark harwood <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Friday, 10 October,

Re: Buzz measurement - Aggregate functions

2008-10-10 Thread mark harwood
Assuming your date data is held as YYYYMMDD and you want daily totals Term startTerm=new Term("date","20080101"); TermEnum termEnum = indexReader.terms(startTerm); do { Term currentTerm = termEnum.term(); if(currentTerm.field()!=startTerm

Buzz measurement - Aggregate functions

2008-10-10 Thread Marcus Herou
Hi. Anyone have an idea of how I would create a query which finds the data backing a trend graph where date is X and num(docs) is on Y axis ? This is quite a common use case in "buzz" analysis and currently I'm doing a stupid query which iterates over the date range and queries lucene for every d

Re: wizard for search in Lucene

2008-10-10 Thread Aleksander M. Stensby
From what I can understand, you want to insert the word "history" and then get proposed "related" terms in combination with your input query. In essence this would be to do a "look-up" on top-terms in the subset of documents matching the initial query "history". Exactly how you could do this