Re: index architectures

2006-10-17 Thread Paul Waite
Many thanks to Erick and Ollie for responding - a lot of ideas, and I'll have my work cut out grokking them properly and thinking about what to do. I'll respond further as that develops. One quick thing though - Erick wrote: > So, I wonder if your out of memory issue is really related to the number

Re: Oracle Text 10g... or NOT

2006-10-17 Thread Kuassi Mensah
Right, as described in my book, the Oracle database furnishes an embedded Java run time, which can be used by database components such as XDB, *inter*Media, Spatial, Text, XQuery, and so on. Oracle Text leverages the XML DB framework, which includes a protocol server and a

Re: Preventing merging by IndexWriter

2006-10-17 Thread Erick Erickson
True. But is it enough faster than TermDocs.seek(new Term("unique id", id)).doc() to be worth the complication for this situation? ... Erick On 10/17/06, Daniel Noll <[EMAIL PROTECTED]> wrote: Erick Erickson wrote: > Why go through all this effort when it's easy to make your own unique ID? I

RE: index architectures

2006-10-17 Thread Oliver Hutchison
I can certainly vouch for the benefits of partitioning: we've seen a very big improvement in searcher refresh times (our main pain point) since we implemented such an architecture. Our application has thousands of indexes, ranging in size from a few megabytes up to several gigabytes; updates occur very freque

Re: Oracle Text 10g... or NOT

2006-10-17 Thread Greg Colvin
Another option is to run Lucene inside your Oracle instance using its JVM. This might help with combining Lucene and Oracle search results. On Oct 17, 2006, at 12:39 PM, Chris Lu wrote: Several additional reasons I can think of: 1) Being able to control the algorithms, for example, 1.1)

Parameterized IndexModifier

2006-10-17 Thread vasu shah
Hi, The IndexModifier class always opens an IndexWriter in the init method. If we need to update a document, it closes the IndexWriter and opens an IndexReader to delete the desired document. Then it opens an IndexWriter again to add the document to the index. Instead, can't we pass one extra

Re: Preventing merging by IndexWriter

2006-10-17 Thread Daniel Noll
Erick Erickson wrote: Why go through all this effort when it's easy to make your own unique ID? I can think of one reason: hits.id() is orders of magnitude faster than hits.doc(). Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 W

Re: PrefixFilter Memory Consumption

2006-10-17 Thread vasu shah
Thanks for the explanation. I am using ChainedFilter and it takes somewhat longer than using just one Filter. I read somewhere on the Lucene forums that Filters can be sped up if we build one large bitset and then work on it. Is that possible, and if so, how? I would like to kno
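The bitset idea raised in this message can be sketched with plain `java.util.BitSet`, which is the type that `Filter.bits(IndexReader)` returns in Lucene 2.0. This is only a minimal sketch and not ChainedFilter's actual code: the class `FilterBits`, its helper methods, and the document IDs are all hypothetical. It shows two filters' bit sets being intersected once, so that the combined result could be kept and reused across searches.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: combines per-document bit sets.
class FilterBits {
    // Build a bit set with one bit turned on per matching document ID.
    static BitSet of(int... docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Intersect two filters' bit sets without mutating either input.
    static BitSet and(BitSet a, BitSet b) {
        BitSet result = (BitSet) a.clone();
        result.and(b);
        return result;
    }

    public static void main(String[] args) {
        BitSet prefixMatches = of(1, 3, 5, 8); // made-up document IDs
        BitSet otherMatches = of(3, 5, 7);     // made-up document IDs
        System.out.println(and(prefixMatches, otherMatches));
    }
}
```

Cloning before the AND keeps each input reusable, which matters if the individual bit sets are themselves cached.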

Re: PrefixFilter Memory Consumption

2006-10-17 Thread Yonik Seeley
On 10/17/06, vasu shah <[EMAIL PROTECTED]> wrote: Can anyone please tell me what the difference is between PrefixFilter and WildcardQuery as far as memory is concerned? I saw the code of PrefixFilter and it gets a TermEnum for all the terms in the index. Won't this consume memory? It t

PrefixFilter Memory Consumption

2006-10-17 Thread vasu shah
Hi, Can anyone please tell me what the difference is between PrefixFilter and WildcardQuery as far as memory is concerned? I saw the code of PrefixFilter and it gets a TermEnum for all the terms in the index. Won't this consume memory? I started using PrefixFilter, ConstantSc

Re: index architectures

2006-10-17 Thread Erick Erickson
I've been curious for a while about this scheme, and I'm hoping you implement it and tell me if it works. In truth, my data is pretty static so I haven't had to worry about it much. That said... Would it do (and, perhaps, be less complex) to have an FSDirectory and a RAMDirectory that you search?

Re: index architectures

2006-10-17 Thread Paul Waite
Hi chaps, Just looking for some ideas/experience as to how to improve our current architecture. We have a single-index system containing approx. 2.5 million docs of about 1-3k each. The Lucene implementation is a daemon that services requests on a port in a multi-threaded manner, and it runs on

Re: Oracle Text 10g... or NOT

2006-10-17 Thread Chris Lu
Several additional reasons I can think of: 1) Being able to control the algorithms, for example, 1.1) applying your own analyzer to a field. 1.2) controlling your own way of ranking. 2) De-coupling your data model from the searching. Searching directly on your data model may not be ideal. You may wan

RE: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Steven Parkes
It all has to do with the total focus on strings in an inverted index, as opposed to the more general model in an RDBMS. Lucene doesn't need to track the max length. It sees each date as a string and understands all string intervals lexicographically. That means 20060401 is less than 20060401HHMMSS f
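The lexicographic ordering described in this message can be seen with plain Java strings. This is a minimal sketch, assuming that terms compare the way `String.compareTo` does; the class `DateTermOrder` and the sample timestamps are made up for illustration and are not Lucene code.

```java
// Hypothetical illustration: plain strings standing in for indexed date terms.
class DateTermOrder {
    // True when term lies in [lower, upper] under String.compareTo ordering.
    static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // A day-resolution term sorts before a longer term that shares its prefix.
        System.out.println("20060401".compareTo("20060401093000") < 0);
        // A second-resolution term on the lower-bound day is inside the range.
        System.out.println(inRange("20060401093000", "20060401", "20060402"));
        // A second-resolution term on the upper-bound day sorts after the bound.
        System.out.println(inRange("20060402093000", "20060401", "20060402"));
    }
}
```

One consequence worth noting: when the index holds second-resolution terms, an upper bound written at day resolution sorts before every timestamp on that day, so the bounds and the indexed terms generally need to share one resolution.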

Re: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Erick Erickson
Under the covers, as I understand it, a BooleanQuery is assembled for each unique term in the range. So, if you store your dates with milliseconds, there can be, what, 86,000,000+ unique terms per day. If you stored your times as strings to millisecond resolution, you can have a lot of clauses in
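The arithmetic behind the "86,000,000+" figure can be worked out directly. The sketch below is plain Java with a hypothetical class name, `TermsPerDay`; it only computes the maximum number of distinct timestamp terms a single day can contribute at a given resolution, and makes no claim about how any particular index is actually populated.

```java
// Hypothetical illustration of how timestamp resolution bounds the term count.
class TermsPerDay {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Maximum distinct terms per day when each term covers unitMillis milliseconds.
    static long maxTermsPerDay(long unitMillis) {
        return MILLIS_PER_DAY / unitMillis;
    }

    public static void main(String[] args) {
        System.out.println("millisecond: " + maxTermsPerDay(1L));
        System.out.println("second:      " + maxTermsPerDay(1000L));
        System.out.println("minute:      " + maxTermsPerDay(60L * 1000));
        System.out.println("hour:        " + maxTermsPerDay(60L * 60 * 1000));
        System.out.println("day:         " + maxTermsPerDay(MILLIS_PER_DAY));
    }
}
```

These are upper bounds; the real clause count depends on how many distinct values the corpus actually contains within the queried range.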

Re: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Peter W .
Another solution is to work with plain Java Date and Calendar objects, convert them into Lucene strings using DateTools (resolution day), then query this field with two RangeFilters using ChainedFilter. You will never get the BooleanQuery error. Peter On Oct 17, 2006, at 10:57 AM, Bushey, John wrote

RE: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Bushey, John
Thanks. That's the explanation that I was looking for. The wiki does not cover this in much detail. The architectural reason for this sounds strange to me, since my background is in relational databases where this is not an issue, so I still have a question. How does reducing the precision really h

RE: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Doron Cohen
See also relevant FAQ entry & Wiki page: http://wiki.apache.org/jakarta-lucene/LuceneFAQ#head-06fafb5d19e786a50fb3dfb8821a6af9f37aa831 http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing "Steven Parkes" <[EMAIL PROTECTED]> wrote on 17/10/2006 09:12:55: > Lucene takes your date

Re: near duplicates

2006-10-17 Thread Andrzej Bialecki
karl wettin wrote: 17 okt 2006 kl. 17.54 skrev Find Me: How to eliminate near duplicates from the index? I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms i

Re: near duplicates

2006-10-17 Thread karl wettin
17 okt 2006 kl. 17.54 skrev Find Me: How to eliminate near duplicates from the index? Oh, one more thing. You should probably look at the norms in order to avoid comparing all documents to each other.

RE: Oracle Text 10g... or NOT

2006-10-17 Thread Bryzek.Michael
We used Oracle interMedia/Text for search within the RDBMS beginning with Oracle 8i through Oracle 10g. Two primary reasons we switched to Solr/Lucene: * We saw random errors (< 0.1% of the time) when users ran full text search. We believe the source of this error occurred during index update as

Re: near duplicates

2006-10-17 Thread karl wettin
17 okt 2006 kl. 17.54 skrev Find Me: How to eliminate near duplicates from the index? I would probably try to measure the Euclidean distance between all documents, computed on terms and their positions. Or perhaps use standard deviation to find the distribution of terms in a document. On
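The Euclidean distance suggested in this message can be sketched on term-frequency vectors. This is a simplified illustration and not the poster's implementation: it ignores term positions, tokenizes by splitting on non-word characters rather than with a Lucene analyzer, and the class `TermDistance` and its method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: Euclidean distance between term-frequency vectors.
class TermDistance {
    // Count how often each lower-cased token occurs in the text.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Distance over the union of both vocabularies; a missing term counts as 0.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        Set<String> vocabulary = new HashSet<>(a.keySet());
        vocabulary.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String term : vocabulary) {
            int diff = a.getOrDefault(term, 0) - b.getOrDefault(term, 0);
            sumOfSquares += (double) diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        Map<String, Integer> first = termFrequencies("the cat sat");
        Map<String, Integer> second = termFrequencies("the cat sat down");
        System.out.println(distance(first, second));
    }
}
```

A small distance flags a candidate pair; the threshold for calling two documents near duplicates is a tuning choice that this sketch does not address.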

Re: Big problem with big indexes

2006-10-17 Thread karl wettin
17 okt 2006 kl. 15.55 skrev Ariel Isaac Romero Cartaya: Here are pieces of my source code: public Hits search(String query) throws IOException { for (int i = 0; i < IndexCount; i++) { searchables[i] = new IndexSearcher(RAMIndexsManager.getInstance().getDir

Re: Preventing merging by IndexWriter

2006-10-17 Thread Erick Erickson
Why go through all this effort when it's easy to make your own unique ID? Add a new field to each document "myuniqueid" and fill it in yourself. It'll never change then. The complex coordination way. To coordinate things, you could keep the last ID used (and maybe other information) in a unique

RE: BooleanQuery.TooManyClauses exception

2006-10-17 Thread Steven Parkes
Lucene takes your date range, enumerates all the unique date/time values in your corpus within that range, and then executes that query. So the number of terms in your query is going to be equal to the number of unique date/time values in the range. The most common way of handling this is to not i
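The enumeration described in this message can be modelled with a sorted set of strings standing in for the term dictionary. This is a hypothetical model and not Lucene code: the class `RangeExpansion`, its helpers, and the sample dates are assumptions for illustration, and the limit of 1024 is BooleanQuery's default maximum clause count in Lucene 2.0 as far as I know.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of a term dictionary; not Lucene code.
class RangeExpansion {
    // Assumed default clause limit of BooleanQuery.
    static final int MAX_CLAUSES = 1024;

    // One term per second of the given day, formatted as yyyyMMddHHmmss.
    static NavigableSet<String> secondTerms(String day) {
        NavigableSet<String> terms = new TreeSet<>();
        for (int h = 0; h < 24; h++) {
            for (int m = 0; m < 60; m++) {
                for (int s = 0; s < 60; s++) {
                    terms.add(String.format(Locale.ROOT, "%s%02d%02d%02d", day, h, m, s));
                }
            }
        }
        return terms;
    }

    // Day-resolution terms, formatted as yyyyMMdd.
    static NavigableSet<String> dayTerms(String... days) {
        return new TreeSet<>(Arrays.asList(days));
    }

    // Distinct indexed terms in [lower, upper]: one clause per term.
    static int clausesFor(NavigableSet<String> terms, String lower, String upper) {
        return terms.subSet(lower, true, upper, true).size();
    }

    public static void main(String[] args) {
        int fine = clausesFor(secondTerms("20060401"), "20060401000000", "20060401235959");
        int coarse = clausesFor(dayTerms("20060401", "20060402"), "20060401", "20060402");
        System.out.println("second resolution: " + fine + ", over limit: " + (fine > MAX_CLAUSES));
        System.out.println("day resolution: " + coarse + ", over limit: " + (coarse > MAX_CLAUSES));
    }
}
```

The clause count depends on the distinct terms present in the range, not on the number of matching documents, which is why coarser resolution helps.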

Oracle Text 10g... or NOT

2006-10-17 Thread Rene Pineda
Hi - I'm currently looking into adding full text search capabilities to our site. While some threads in this list had the same basic question (RDBMS full-text versus Lucene), their configurations and concerns were different. Here's my configuration: * RDBMS is Enterprise Oracle 10g * RAC-enabled

near duplicates

2006-10-17 Thread Find Me
How to eliminate near duplicates from the index? Someone suggested that I could look at the TermVectors and do a comparison to remove the duplicates. One major problem with this is that the structure of the document is no longer important. Are there any obvious pitfalls? For example: Document A being

RE: Lucene 2.0.1 release date

2006-10-17 Thread Steven Parkes
I think the idea is that 2.0.1 would be a patch-fix release from the branch created at 2.0 release. This release would incorporate only back-ported high-impact patches, where "high-impact" is defined by the community. Certainly security vulnerabilities would be included. As Otis said, to date, nobo

Re: Preventing merging by IndexWriter

2006-10-17 Thread Yonik Seeley
On 10/17/06, Johan Stuyts <[EMAIL PROTECTED]> wrote: So my questions are: is there a way to prevent the IndexWriter from merging, forcing it to create a new segment for each indexing batch? Already done in the Lucene trunk: http://issues.apache.org/jira/browse/LUCENE-672 Background: http://www

Help with design

2006-10-17 Thread Patrick Turcotte
Hi, I'm trying to come up with the best design for a problem. I want to search texts for expressions that shouldn't be found in them. My bad expressions list is quite stable. But the texts that I want to scan change often. Design I: Index my texts, and then loop over my expressions list to see i

Re: Preventing merging by IndexWriter

2006-10-17 Thread Erick Erickson
Ignore the bit about keeping the mappings, it's too tricky unless really really necessary, since by virtue of updating the meta-data document, you'll delete a document, thus perhaps changing the Lucene IDs. I should proofread before hitting the "send" button ... Erick On 10/17/06, Erick Erickso

Preventing merging by IndexWriter

2006-10-17 Thread Johan Stuyts
Hi, (I am using Lucene 2.0.0) I have been looking at a way to use stable IDs with Lucene. The reason I want this is so I can efficiently store and retrieve information outside of Lucene for filtering search results. It looks like this is going to require most of Lucene to be rewritten, so I gave

Re: Big problem with big indexes

2006-10-17 Thread Ariel Isaac Romero Cartaya
Here are pieces of my source code: First of all, I search in all the indexes given a query String with a parallel searcher. As you can see, I make a multi-field query. Then you can see the index format I use; I store all the fields in the index. My index is optimized. public Hits search

Re: PrefixFilter and WildcardQuery

2006-10-17 Thread vasu shah
Thanks for all your help. I used PrefixFilter, ChainedFilter, CachingWrapperFilter, ConstantScoreQuery and the search speed has improved dramatically. I am just doing wildcard searches like abc*. It used to give me an OOM problem with WildcardQuery. Will I get the same problem with

Re: Which field cause a hit in multifield query

2006-10-17 Thread Grant Ingersoll
Take a look at the explain functionality on the Searcher. On Oct 17, 2006, at 5:43 AM, Mukesh Bhardwaj wrote: Hi, If I do a search such as "field1:jim OR field2:bob" is there any way to determine for each document that was a hit, which field caused the hit? Or rather, since they both migh

Which field cause a hit in multifield query

2006-10-17 Thread Mukesh Bhardwaj
Hi, If I do a search such as "field1:jim OR field2:bob" is there any way to determine for each document that was a hit, which field caused the hit? Or rather, since they both might, is there any easy way to find out which fields definitely cause a hit? Regards, --Mukesh