My organization is looking to solve a difficult problem, and I believe that
Lucene is a close fit (although perhaps it is not). However, I'm not sure
exactly how to approach this problem.
The problem is this: given a small set of fixed noun phrases and a much
larger set of human-generated short sentences
Our document base includes terms that are in fact codes, which may contain
dashes and slashes, such as "M1234/5" and "12345-00". Presently Lucene
appears to be breaking up these codes at the slashes and dashes, and
searches are therefore not working properly. Instead of matching an exact
code
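The usual fix for this is to give the code field an analyzer that emits the
whole value as a single token, e.g. KeywordAnalyzer via
PerFieldAnalyzerWrapper. A sketch against the 3.x API; the field name
"code" is made up, not from this thread:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

class CodeAnalyzers {
    // StandardAnalyzer for normal text fields, but the hypothetical
    // "code" field is kept as one token, so "M1234/5" is indexed
    // and searched verbatim instead of being split at '/' and '-'.
    static Analyzer build() {
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_33));
        wrapper.addAnalyzer("code", new KeywordAnalyzer());
        // Pass the same wrapper to IndexWriter and to the QueryParser
        // so index-time and query-time tokenization agree.
        return wrapper;
    }
}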
Thanks, ASCIIFoldingFilter works well.
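For anyone finding this thread later, a minimal analyzer chain using
ASCIIFoldingFilter on the 3.x TokenStream API might look like this (the
class name is mine, not the original poster's):

import java.io.Reader;
import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Folds accented characters to their ASCII equivalents ("über" -> "uber").
// Use it at both index and query time so both spellings match.
public class FoldingAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream stream = new StandardTokenizer(Version.LUCENE_33, reader);
        stream = new LowerCaseFilter(Version.LUCENE_33, stream);
        stream = new ASCIIFoldingFilter(stream);
        return stream;
    }
}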
Hi Erick,
This is only true if you have string fields. Once you have the long values
in the FieldCache, they will always use exactly the same space. Having more
fields will, in contrast, blow up your IndexReader, as it needs much more RAM
to hold an even larger term index (because you have an even larger term
dictionary).
Using a new field with coarser granularity will work fine;
this is a common thing to do for this kind of issue.
Lucene is trying to load 625M longs into memory, in
addition to any other stuff. Ouch!
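Back of the envelope: one long per document is 625,000,000 x 8 bytes =
5,000,000,000 bytes, i.e. roughly 4.7 GiB of heap for that single
FieldCache array, before anything else is loaded.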
If you want to get really clever, you can index several
fields, say year, month, and day for each date.
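A sketch of the coarser-granularity variant against the 3.x NumericField
API (field names invented, not from this thread): keep the exact
microsecond value for display, and sort on a truncated copy.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;

class TimestampFields {
    // Truncating microseconds to seconds (or indexing year/month/day
    // as separate fields) collapses many distinct values into few,
    // shrinking the term index for the sort field.
    static void add(Document doc, long micros) {
        doc.add(new NumericField("tstamp_micros", Field.Store.YES, true)
                .setLongValue(micros));              // exact, for display
        doc.add(new NumericField("tstamp_secs", Field.Store.NO, true)
                .setLongValue(micros / 1000000L));   // coarse, for sorting
    }
}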
Thanks for the suggestion.
Yes, we are using "no_norms".
-----Original Message-----
From: Mark Harwood [mailto:markharw...@yahoo.co.uk]
Sent: Tuesday, August 16, 2011 10:12 AM
To: java-user@lucene.apache.org
Subject: Re: What kind of System Resources are required to index 625 million
row table.
Check "norms" are disabled on your fields because they'll cost you1byte x
NumberOfDocs x numberOfFieldsWithNormsEnabled.
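In the 3.x field API that corresponds to something like the following
(a sketch, not the poster's actual code):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

class NoNormsFields {
    // The NO_NORMS variants of Field.Index skip the 1-byte-per-document
    // norm, saving NumberOfDocs bytes for each field indexed this way.
    static void add(Document doc, String name, String value) {
        doc.add(new Field(name, value,
                Field.Store.NO, Field.Index.NOT_ANALYZED_NO_NORMS));
    }
}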
On 16 Aug 2011, at 15:11, Bennett, Tony wrote:
> Thank you for your response.
>
> You are correct, we are sorting on timestamp.
> Timestamp has microsecond granularity, and we are
Thank you for your response.
You are correct, we are sorting on timestamp.
Timestamp has microsecond granularity, and we are
storing it as "NumericField".
We are sorting on timestamp, so that we can give our
users the most "current" matches, since we are limiting
the number of responses to about
About your OOM: Grant asked a question that's pretty important:
how many unique terms are in the field(s) you sorted on? At a guess,
you tried sorting on your timestamp, and your timestamp has
millisecond or less granularity, so there are 625M of them.
Memory requirements for sorting grow as the number of unique terms grows.
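Concretely, a sorted search like the sketch below (3.x API, with an
assumed field name) is what forces the FieldCache to load one value per
document in the index, no matter how few hits the query returns:

import java.io.IOException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

class SortedSearch {
    // Sorting by a long field materializes a long[] with one slot per
    // document on first use -- ~5 GB for 625M docs -- independent of
    // the result-set size. "tstamp_micros" is a hypothetical field.
    static TopDocs newestFirst(IndexSearcher searcher, Query query)
            throws IOException {
        Sort byTime = new Sort(new SortField("tstamp_micros", SortField.LONG, true));
        return searcher.search(query, null, 100, byTime);
    }
}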
Hello,
Recently we introduced distance searching/sorting into our existing
Lucene index, using the Spatial contrib for Lucene 2.9.4. There are 100K+
documents in the index, of which only 20K had latitude/longitude and
_tier_* fields. Spatial queries ran quite OK.
After enriching the index
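For context, indexing the per-document coordinates in 2.9.x looks roughly
like this (a sketch with invented field names; the _tier_* fields produced
by the contrib's CartesianTierPlotter are left out for brevity):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;

class LocationFields {
    // Only documents that actually have coordinates get lat/lng fields;
    // docs without them are simply skipped, as with the original 20K/100K split.
    static void add(Document doc, Double lat, Double lng) {
        if (lat == null || lng == null) return;
        doc.add(new NumericField("lat", Field.Store.YES, true).setDoubleValue(lat));
        doc.add(new NumericField("lng", Field.Store.YES, true).setDoubleValue(lng));
    }
}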