Can I use Lucene to solve this problem?

2011-08-16 Thread Josh Rehman
My organization is looking to solve a difficult problem, and I believe that Lucene is a close fit (although perhaps it is not). However, I'm not sure exactly how to approach it. The problem is this: given a small set of fixed noun phrases and a much larger set of human-generated short sentences…

Overriding default handling of '/' and '-'

2011-08-16 Thread SBS
Our document base includes terms which are in fact codes that may contain dashes and slashes, such as "M1234/5" and "12345-00". Presently Lucene appears to be breaking up these codes at the slashes and dashes, and searches are therefore not working properly. Instead of matching an exact code…
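
One common fix is to analyze the code field differently from the rest of the document. A minimal sketch with the 3.x API current at the time of this thread, assuming a field named "code" (the field name is illustrative; pass the same wrapper to both the IndexWriter and the QueryParser):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    // KeywordAnalyzer emits the whole field value as a single token,
    // so "M1234/5" is never split at the slash
    Analyzer defaultAnalyzer = new StandardAnalyzer(Version.LUCENE_33);
    PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(defaultAnalyzer);
    analyzer.addAnalyzer("code", new KeywordAnalyzer());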

Re: Searching for words containing accents or umlauts?

2011-08-16 Thread SBS
Thanks, ASCIIFoldingFilter works well.
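
For reference, a minimal sketch of an analyzer that applies the filter, using the 3.x API current at the time (the class name is illustrative); the same analyzer must be used at index and query time so "über" and "uber" meet in the middle:

    import java.io.Reader;
    import org.apache.lucene.analysis.ASCIIFoldingFilter;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.util.Version;

    public final class FoldingAnalyzer extends Analyzer {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_33, reader);
        stream = new LowerCaseFilter(Version.LUCENE_33, stream);
        // folds accented characters to their ASCII equivalents,
        // e.g. "café" -> "cafe"
        return new ASCIIFoldingFilter(stream);
      }
    }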

RE: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Uwe Schindler
Hi Erick, this is only true if you have string fields. Once you have the long values in the FieldCache they will always use exactly the same space. Having more fields will, in contrast, blow up your IndexReader, as it needs much more RAM to hold an even larger term index (because you have an even larger…
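
For scale, a back-of-the-envelope estimate (mine, not a figure from the thread): the FieldCache entry for a long field is a flat array of one long per document, so at the row count from the original question:

    long maxDoc = 625000000L;   // rows in the table being indexed
    long bytes  = maxDoc * 8L;  // one long per document
    // = 5,000,000,000 bytes, roughly 4.7 GB of heap,
    // regardless of how many distinct timestamp values exist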

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Erick Erickson
Using a new field with coarser granularity will work fine; this is a common thing to do for this kind of issue. Lucene is trying to load 625M longs into memory, in addition to any other stuff. Ouch! If you want to get really clever, you can index several fields, say year, month, and day for each date…
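
A sketch of the coarser companion field with the 3.x API (field names, the day granularity, and the timestampMicros variable are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;

    long micros = timestampMicros;            // full-resolution microsecond stamp
    long day = micros / (86400L * 1000000L);  // truncated to whole days

    Document doc = new Document();
    // keep the full-resolution value stored for display, but unindexed
    doc.add(new NumericField("timestamp", Field.Store.YES, false)
        .setLongValue(micros));
    // sort on the coarse field instead: far fewer unique terms
    doc.add(new NumericField("day", Field.Store.NO, true)
        .setLongValue(day));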

RE: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Bennett, Tony
Thanks for the suggestion. Yes, we are using "no_norms".

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Mark Harwood
Check that "norms" are disabled on your fields, because they'll cost you 1 byte x NumberOfDocs x NumberOfFieldsWithNormsEnabled.
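
At 625M documents that is roughly 600 MB of heap per normed field. A sketch of disabling norms at indexing time (3.x API; the field name and value variable are illustrative):

    import org.apache.lucene.document.Field;

    // the NO_NORMS variants skip the 1-byte-per-document norms array
    Field f = new Field("code", value, Field.Store.NO,
                        Field.Index.NOT_ANALYZED_NO_NORMS);
    doc.add(f);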

RE: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Bennett, Tony
Thank you for your response. You are correct, we are sorting on timestamp. The timestamp has microsecond granularity, and we are storing it as a "NumericField". We are sorting on timestamp so that we can give our users the most "current" matches, since we are limiting the number of responses to about…
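
A sketch of the newest-first, capped search described here (3.x API; the searcher and query variables and the cap of 10 are illustrative). This sort call is also exactly what pulls the whole timestamp column into the FieldCache:

    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    // sort descending on the numeric timestamp, return only the top hits
    Sort newestFirst = new Sort(new SortField("timestamp", SortField.LONG, true));
    TopDocs hits = searcher.search(query, null, 10, newestFirst);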

Re: What kind of System Resources are required to index 625 million row table...???

2011-08-16 Thread Erick Erickson
About your OOM: Grant asked a question that's pretty important, namely how many unique terms are in the field(s) you sorted on? At a guess, you tried sorting on your timestamp, and your timestamp has millisecond or finer granularity, so there are 625M of them. Memory requirements for sorting grow as the number of unique terms grows…
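
A sketch of one way to answer that question with the 3.x API, assuming an open IndexReader named reader and the "timestamp" field from the thread (note that a NumericField also indexes extra lower-precision trie terms, which this count includes):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // walk the term dictionary and count terms in the sort field
    TermEnum terms = reader.terms(new Term("timestamp", ""));
    long count = 0;
    try {
      do {
        Term t = terms.term();
        if (t == null || !"timestamp".equals(t.field())) break;
        count++;
      } while (terms.next());
    } finally {
      terms.close();
    }
    System.out.println("unique terms: " + count);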

[SPATIAL] Spatial search runs forever

2011-08-16 Thread drazen.nis
Hello, recently we introduced distance searching/sorting into an existing Lucene index, using the Spatial contrib for Lucene 2.9.4. There are 100K+ documents in the index, of which only 20K had latitude/longitude and _tier_* fields. Spatial queries ran quite OK. After enriching the index…