Re: Solid State Drives vs. RAMDirectory

2008-04-14 Thread Toke Eskildsen
On Mon, 2008-04-14 at 21:26 -0700, Otis Gospodnetic wrote: > Toke, this is *super* juicy information, very useful and educational. > Please do put this on the Wiki. There doesn't seem to be a benchmarking > page on the Wiki yet, so I suggest you go to > http://wiki.apache.org/lucene-java/LuceneBe

Re: Search for phrases

2008-04-14 Thread palexv
I have not tokenized phrases in index. What query should I use? Simple TermQuery does not work. If I try to use QueryParser, what analyzer should I use? Daniel Naber-10 wrote: > > On Monday, 14 April 2008, palexv wrote: > >> For example I need to search for "java de*" and receive "java >> d

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Chris Hostetter wrote: you can't ... that's why i said you'd need to rebuild the smaller index completely on a periodic basis (going in the same order as the docs in the Mmm, the annotations would only be stored in the index. It would be possible to store them elsewhere, so I can investigate

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Chris Hostetter
: would then have to make a join using mailId against the core. However, if I : want to use PR, I could have a single Document with multiple field, and using : stored fields can 'modify' that Document. However, what happens to the DocId : when the delete+add occurs and how do I ensure it stays t

Re: Solid State Drives vs. RAMDirectory

2008-04-14 Thread Otis Gospodnetic
Toke, this is *super* juicy information, very useful and educational. Please do put this on the Wiki. There doesn't seem to be a benchmarking page on the Wiki yet, so I suggest you go to http://wiki.apache.org/lucene-java/LuceneBenchmarks, create that page, and put everything you want and can s

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Antony Bowesman
Thanks all for the suggestions - there was also another thread "Lucene index on relational data" which had crossover here. That's an interesting idea about using ParallelReader for the changeable index. I had thought to just have a triplet indexed 'owner:mailId:label' in each Doc and have multi

Re: Using Lucene partly as DB and 'joining' search results.

2008-04-14 Thread Chris Hostetter
: The archive is read only apart from bulk deletes, but one of the requirements : is for users to be able to label their own mail. Given that a Lucene Document : cannot be updated, I have thought about having a separate Lucene index that : has just the 3 terms (or some combination of) userId + ma

Re: Sorting consumes hundreds of MBytes RAM

2008-04-14 Thread Chris Hostetter
: How does this work internally? It seems as if all data for this field found in : the entire index is read into memory (?). You can think of it as an "inverted-inverted index" Lucene needs a data structure it can use for fast lookups where the key is the docId and the value is something "com
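The "inverted-inverted index" Chris describes corresponds to Lucene's FieldCache: the first time you sort on a field, every term for that field is loaded into an array indexed by docId. A minimal sketch of what this looks like, assuming the Lucene 2.x API of the era (the index path and field name "date" are hypothetical):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

public class SortMemoryDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
        // One String per document in the index -- this array is what
        // consumes hundreds of MB when the index (or the field values) are large.
        String[] values = FieldCache.DEFAULT.getStrings(reader, "date");
        // values[docId] is the indexed term for that document, giving
        // the O(1) docId -> value lookups that sorting needs.
        System.out.println("cache entries: " + values.length);
        reader.close();
    }
}
```

The array lives as long as the IndexReader does, which is why the memory is not released between searches.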

Re: Document ids in Lucene index

2008-04-14 Thread Chris Hostetter
: - check maxDoc() : - iterate from 0 to maxDoc() and process doc if it is not deleted For the record: that is exactly what MatchAllDocsQuery does ... except that you have an off-by-one error (maxDoc returns 1 more than the largest possible document number). Even if you don't want the Query AP
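Chris's correction can be shown concretely. A sketch of the iteration, assuming the Lucene 2.x API (the index path is hypothetical); since maxDoc() is an exclusive upper bound, the loop must use `<`, not `<=`:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class AllDocsLoop {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
        // maxDoc() returns one more than the largest document number,
        // so iterating up to (but not including) it avoids the off-by-one error.
        for (int docId = 0; docId < reader.maxDoc(); docId++) {
            if (reader.isDeleted(docId)) {
                continue; // deleted docs keep their slot until segments are merged
            }
            Document doc = reader.document(docId);
            // ... process doc ...
        }
        reader.close();
    }
}
```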

Re: How to improve performance of large numbers of successive searches?

2008-04-14 Thread Erick Erickson
OK, if you're going after simple terms without any logic (or with very simple logic), why search at all? Why not just use TermDocs and/or TermEnum to flip through the index noticing documents that match? I'd only recommend this if you are NOT trying to parse complex queries. That is, say, you are
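Erick's TermDocs suggestion might look like the sketch below, assuming the Lucene 2.x API (the index path, field name "category", and term value are made up for illustration). It walks one term's posting list directly, with no Query object and no scoring overhead:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class TermDocsScan {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
        // Iterate the posting list for a single term: just matching docIds,
        // none of the machinery a full search would bring along.
        TermDocs td = reader.termDocs(new Term("category", "news"));
        while (td.next()) {
            int docId = td.doc();
            // ... note that this document matched ...
        }
        td.close();
        reader.close();
    }
}
```

TermEnum can be combined with this in the same way to walk ranges of terms rather than a single known term.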

Re: Search for phrases

2008-04-14 Thread Daniel Naber
On Monday, 14 April 2008, palexv wrote: > For example I need to search for "java de*" and receive "java > developers", "java development", "developed by java" etc. If your text is tokenized, this is not supported by QueryParser but you can create such queries using MultiPhraseQuery. If you don'
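Daniel's MultiPhraseQuery suggestion can be sketched as follows, assuming a tokenized field (the field name "body" and index path are hypothetical) on the Lucene 2.x API: the first phrase position holds the exact term "java", and the second position is filled with every indexed term that starts with "de", expanded via the term dictionary.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.MultiPhraseQuery;

public class PhrasePrefixDemo {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
        MultiPhraseQuery query = new MultiPhraseQuery();
        query.add(new Term("body", "java")); // first position: the exact term
        // Second position: expand the prefix "de" against the term dictionary.
        List terms = new ArrayList();
        TermEnum te = reader.terms(new Term("body", "de"));
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals("body") || !t.text().startsWith("de")) {
                    break; // past the prefix range
                }
                terms.add(t);
            } while (te.next());
        } finally {
            te.close();
        }
        query.add((Term[]) terms.toArray(new Term[terms.size()]));
        // query now matches "java developers", "java development", etc.
        reader.close();
    }
}
```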

RE: WildCardQuery and TooManyClauses

2008-04-14 Thread Beard, Brian
You can use your approach w/ or w/o the filter. >td = indexSearcher.search(query, filter, maxnumhits); You need to use a filter for the wildcards which is built in to the query. 1) Extend QueryParser to override the getWildcardQuery method. (Or even if you don't use QueryParser, j
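Brian's approach of building the wildcard expansion into a filter can be sketched like this, assuming the Lucene 2.x API (analyzer choice and class names are illustrative). Because the filter sets bits directly instead of rewriting to a BooleanQuery, it cannot hit the TooManyClauses limit:

```java
import java.util.BitSet;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardTermEnum;

public class FilteredWildcardParser extends QueryParser {
    public FilteredWildcardParser(String field) {
        super(field, new StandardAnalyzer());
    }

    // Step 1 from the mail: override getWildcardQuery so every wildcard
    // in a parsed query becomes a constant-score, filter-backed query.
    protected Query getWildcardQuery(String field, String termStr) {
        final Term term = new Term(field, termStr);
        return new ConstantScoreQuery(new Filter() {
            public BitSet bits(IndexReader reader) throws java.io.IOException {
                BitSet result = new BitSet(reader.maxDoc());
                WildcardTermEnum matching = new WildcardTermEnum(reader, term);
                TermDocs termDocs = reader.termDocs();
                try {
                    do {
                        Term t = matching.term();
                        if (t == null) break;
                        termDocs.seek(t); // mark every doc of each matching term
                        while (termDocs.next()) {
                            result.set(termDocs.doc());
                        }
                    } while (matching.next());
                } finally {
                    termDocs.close();
                    matching.close();
                }
                return result;
            }
        });
    }
}
```

The resulting query can still be passed to `indexSearcher.search(query, filter, maxnumhits)` alongside any additional filter, as in the quoted line.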

Re: Lucene index on relational data

2008-04-14 Thread Rajesh parab
Hi Everyone, Any help around this topic will be very useful. Is anyone partitioning the data into 2 or more indexes and using parallelReader to search these indexes? If yes, how do you handle updates to the indexes and make sure the doc ids for all indexes are in same order? Regards, Rajesh ---

Re: How to improve performance of large numbers of successive searches?

2008-04-14 Thread Chris McGee
Hi Erick, Here is a quick overview of what I hope to accomplish with lucene. I am using a lucene database to store condensed information about a collection of data that I have. The data has to be constantly updated for correctness so that when one part changes certain other parts can be changed

Re: How to improve performance of large numbers of successive searches?

2008-04-14 Thread Erick Erickson
As I stated in my original reply, a Hits object re-executes the search every 100 or so objects you examine. So some loop like Hits hits = search for (int idx = 0; idx < hits.length; ++idx ) { Document doc = hits.get(idx); } really does something like for (int idx = 0; idx < hits.length; +
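The HitCollector alternative Erick is steering toward can be sketched as below, assuming the Lucene 2.x API (index path, field, and term are hypothetical). Unlike Hits, a HitCollector is handed every matching docId exactly once, with no re-execution of the search as results are consumed:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CollectAll {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index"); // hypothetical path
        Query query = new TermQuery(new Term("body", "lucene")); // hypothetical term
        final int[] count = new int[1];
        searcher.search(query, new HitCollector() {
            // Called once per matching document -- no windows of 100,
            // no repeated searches as you walk past hit 100, 200, ...
            public void collect(int docId, float score) {
                count[0]++;
                // ... record docId; fetch stored fields later, ideally in docId order ...
            }
        });
        System.out.println(count[0] + " hits");
        searcher.close();
    }
}
```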

Search for phrases

2008-04-14 Thread palexv
Hi all. I have an index with a set of phrases (one or several words). I need to make search for these phrases. I am confused as I can not find a good way to search for phrases. For example I need to search for "java de*" and receive "java developers", "java development", "developed by java" etc.

Re: How to improve performance of large numbers of successive searches?

2008-04-14 Thread Chris McGee
Hi Erick, Thanks for the information. I tried using a HitCollector and a FieldSelector. I'm getting some dramatic improvements gathering large result sets using the FieldSelector. As it turned out I was able to assume in many cases that I could break out after a specific field in each document
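The "break out after a specific field" trick Chris describes maps to FieldSelectorResult.LOAD_AND_BREAK, which stops reading a stored document as soon as the wanted field has been loaded. A sketch assuming the Lucene 2.x API (index path and field name "id" are hypothetical):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.FieldSelectorResult;
import org.apache.lucene.index.IndexReader;

public class LoadOneField {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical path
        FieldSelector idOnly = new FieldSelector() {
            public FieldSelectorResult accept(String fieldName) {
                // LOAD_AND_BREAK aborts the read of the stored document as
                // soon as this field is loaded; other fields are skipped.
                return "id".equals(fieldName)
                        ? FieldSelectorResult.LOAD_AND_BREAK
                        : FieldSelectorResult.NO_LOAD;
            }
        };
        Document doc = reader.document(0, idOnly); // only "id" is populated
        System.out.println(doc.get("id"));
        reader.close();
    }
}
```

The dramatic speedup comes from skipping the deserialization of large stored fields that the caller never needed.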