Hi -
I've used Lucene on a previous project, so I am somewhat familiar with the API.
However, I've never had to do anything "fancy" (where "fancy" means things like
using filters, different analyzers, boosting, payloads, etc).
I'm about to embark on implementing the full-text search feature of
Hi AlexElba,
Did you completely re-index?
If you did, then there is some other problem - can you share (more of) your
code?
Do you know about Luke? It's an essential tool for Lucene index debugging:
http://www.getopt.org/luke/
Steve
On 01/13/2010 at 8:34 PM, AlexElba wrote:
>
> Hello,
>
Hello,
I changed the filter to the following:
RangeFilter rangeFilter = new RangeFilter("rank",
        NumberTools.longToString(rating),
        NumberTools.longToString(10), true, true);
and changed the index to store rank the same way.
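A minimal sketch of what the matching indexing side could look like (the example value and the lower bound of 3 are illustrative; exception handling omitted):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumberTools;
import org.apache.lucene.search.RangeFilter;

// Index rank with the same NumberTools padding the filter will use.
long rank = 7; // example value
Document doc = new Document();
doc.add(new Field("rank", NumberTools.longToString(rank),
        Field.Store.YES, Field.Index.NOT_ANALYZED));

// Build the filter from identically padded bounds, or the terms will not line up.
RangeFilter rangeFilter = new RangeFilter("rank",
        NumberTools.longToString(3),
        NumberTools.longToString(10), true, true);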
Actually I meant to say indexes... However when optimize(numsegments)
is used they're segments...
On Wed, Jan 13, 2010 at 3:05 PM, Otis Gospodnetic
wrote:
> I think Jason meant "15-20GB segments"?
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
Right... It all blends together, I need an NLP analyzer for my emails
On Wed, Jan 13, 2010 at 3:05 PM, Otis Gospodnetic
wrote:
> I think Jason meant "15-20GB segments"?
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> From: Jaso
I think Jason meant "15-20GB segments"?
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
From: Jason Rutherglen
To: java-user@lucene.apache.org
Sent: Wed, January 13, 2010 5:54:38 PM
Subject: Re: Max Segmentation Size when Optimizing Index
Ye
Yes... You could hack LogMergePolicy to do something else.
I use optimise(numsegments:5) regularly on 80GB indexes that, if
optimized down to 1 segment, would thrash the IO excessively. This works
fine because 15-20GB indexes are plenty large and fast.
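For anyone following along, a partial optimize is just the one-argument call on the writer. A minimal sketch, assuming a 2.9-era API (path, analyzer and the target of 5 segments are illustrative; exception handling omitted):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

Directory dir = FSDirectory.open(new java.io.File("/path/to/index"));
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);
// Merge down to at most 5 segments instead of a single huge one,
// so the whole index does not get rewritten in one pass.
writer.optimize(5);
writer.close();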
On Wed, Jan 13, 2010 at 2:44 PM, Trin Chavalittu
Seems like optimize() only cares about the final number of segments rather than
the size of each segment. Is that so?
On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:
> There's a different method in LogMergePolicy that performs the
> optimize... Right, so normal mer
There's a different method in LogMergePolicy that performs the
optimize... Right, so normal merging uses the findMerges method, then
there's a findMergeOptimize (method names could be inaccurate).
On Wed, Jan 13, 2010 at 2:29 PM, Trin Chavalittumrong wrote:
> Do you mean MergePolicy is only used
Do you mean MergePolicy is only used at index time and will be ignored
by the optimize() process?
On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:
> Oh ok, you're asking about optimizing... I think that's a different
> algorithm inside LogMergePolicy.
Oh ok, you're asking about optimizing... I think that's a different
algorithm inside LogMergePolicy. I think it ignores the maxMergeMB
param.
On Wed, Jan 13, 2010 at 1:49 PM, Trin Chavalittumrong wrote:
> Thanks, Jason.
>
> Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(10
Ooooh, isn't that easier! You just prompted me to think
that you don't even have to do that: just index the pairs as single
tokens (KeywordAnalyzer? but watch out for no case folding)...
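A minimal sketch of that pair-as-a-single-token idea; the field name "experience" and the "language:years" term format are assumptions, and the terms are lower-cased by hand because an untokenized field does no case folding:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Each (language, years) pair becomes one untokenized term, e.g. "java:5".
Document ra = new Document();
ra.add(new Field("experience", "java:5", Field.Store.NO, Field.Index.NOT_ANALYZED));
ra.add(new Field("experience", "c:2", Field.Store.NO, Field.Index.NOT_ANALYZED));
ra.add(new Field("experience", "php:3", Field.Store.NO, Field.Index.NOT_ANALYZED));

// Query "Java:5 AND C:2", lower-cased to match the indexed terms.
BooleanQuery q = new BooleanQuery();
q.add(new TermQuery(new Term("experience", "java:5")), BooleanClause.Occur.MUST);
q.add(new TermQuery(new Term("experience", "c:2")), BooleanClause.Occur.MUST);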
On Wed, Jan 13, 2010 at 4:30 PM, Digy wrote:
> How about using languages as fieldnames?
> Doc1(Ra):
>
Thanks, Jason.
Is my understanding correct that LogByteSizeMergePolicy.setMaxMergeMB(100)
will prevent merging of two segments that are each larger than 100 MB at
optimize time?
If so, why do you think I would still see a segment that is larger than 200 MB?
On Wed, Jan 13, 2010 at 1:43 PM, Jaso
Hi Trin,
There was recently a discussion about this: the max size applies
to the before-merge segments, rather than to the resultant merged
segment (if that makes sense). It'd be great if we had a merge
policy that limited the resultant merged segment, though that'd
be a rough approximation at best.
Jas
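To make the distinction concrete, a minimal sketch of wiring up the policy, assuming a 2.9-era API; as described above, maxMergeMB only keeps large segments out of normal merges, while optimize() goes through a separate code path. The no-argument constructor is an assumption (some 2.9-era releases pass the IndexWriter to the MergePolicy constructor instead):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LogByteSizeMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

Directory dir = FSDirectory.open(new java.io.File("/path/to/index"));
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_29),
        IndexWriter.MaxFieldLength.UNLIMITED);

LogByteSizeMergePolicy policy = new LogByteSizeMergePolicy();
policy.setMaxMergeMB(100.0);   // segments over ~100 MB are not picked for normal merges
writer.setMergePolicy(policy);

// optimize()/optimize(n) can still produce merged segments far larger than 100 MB,
// because the limit applies to the inputs of normal merging, not to the output.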
Hi,
I am trying to optimize the index, which merges different segments
together. Let's say the index folder is 1 GB in total; I need each segment
to be no larger than 200 MB. I tried to use *LogByteSizeMergePolicy* and
setMaxMergeMB(100) to ensure no segment after merging would exceed 200 MB.
How
How about using languages as fieldnames?
Doc1(Ra):
Java:5
C:2
PHP:3
Doc2(Rb):
Java:2
C:5
VB:1
Query:Java:5 AND C:2
DIGY
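A minimal sketch of that layout (the untokenized field setup is an assumption; the years are indexed as plain terms):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Doc1 (Ra): Java:5, C:2, PHP:3
Document ra = new Document();
ra.add(new Field("Java", "5", Field.Store.NO, Field.Index.NOT_ANALYZED));
ra.add(new Field("C", "2", Field.Store.NO, Field.Index.NOT_ANALYZED));
ra.add(new Field("PHP", "3", Field.Store.NO, Field.Index.NOT_ANALYZED));

// Query: Java:5 AND C:2 -- matches Ra but not Rb.
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("Java", "5")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("C", "2")), BooleanClause.Occur.MUST);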
-Original Message-
From: TJ Kolev [mailto:tjko...@gmail.com]
Sent: Wednesday, January 13, 2010 11:00 PM
To: jav
One approach would be to do this with multi-valued fields. The
idea here is to index all your E fields in the *same* Lucene
field with an increment gap (see getPositionIncrementGap) > 1.
For this example, assume getPositionIncrementGap returns 100.
Then, for your documents you have something like
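The example is cut off here in the digest; a minimal sketch of the approach described above might look like the following (the field name "experience", the analyzer choice, and the slop value are illustrative assumptions):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.util.Version;

// Analyzer that leaves a gap of 100 positions between successive values of a field.
class ExperienceAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer(Version.LUCENE_29);

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return delegate.tokenStream(fieldName, reader);
    }

    public int getPositionIncrementGap(String fieldName) {
        return 100;
    }
}

// Index every Experience group as another value of the *same* field.
Document ra = new Document();
ra.add(new Field("experience", "java 5", Field.Store.NO, Field.Index.ANALYZED));
ra.add(new Field("experience", "c 2", Field.Store.NO, Field.Index.ANALYZED));
ra.add(new Field("experience", "php 3", Field.Store.NO, Field.Index.ANALYZED));

// A proximity query with slop well below the gap can only match terms that
// came from the same value, so "java" and "5" cannot pair up across groups.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("experience", "java"));
pq.add(new Term("experience", "5"));
pq.setSlop(10);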
Greetings,
Let's assume I have to index and search "resume" documents. Two fields are
defined: Language and Years. The fields are associated together in a group
called Experience. A resume document may have 0 or more Experience groups:
Ra{ E1(Java,5); E2(C,2); E3(PHP,3);}
Rb{ E1(Java,2); E2(C,5);
Thanks Steve.
Mike, for now I cannot upgrade...
Grant Ingersoll wrote:
On Jan 5, 2010, at 7:44 AM, Paul Taylor wrote:
So currently in my index I index and store a number of small fields. I need
both so I can search on the fields and then use the stored versions to generate
the output document (which is either an XML or JSON representatio
Thanks for the answer Mike
indeed it is possible, but practically...
I start the loop immediately after searcher.search(), and with my index size
of 3 MB, the whole operation takes max 100 ms. Given the rate of like 50
updates - addDocument()/expungeDeletes()/IR.reopen() per day, the
probability
Actually, as of Lucene 2.9 (if you can upgrade), you should use
NumericField to index numerics and NumericRangeQuery to do range
search/filter -- it all just works -- no more padding.
Mike
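A minimal sketch of that 2.9 approach for the "rank" field from this thread (the int value and the stored flag are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeFilter;

// Index rank as a real numeric field (stored, indexed for range queries).
Document doc = new Document();
doc.add(new NumericField("rank", Field.Store.YES, true).setIntValue(7));

// Range 3..10 inclusive -- no padding tricks needed.
NumericRangeFilter<Integer> filter =
        NumericRangeFilter.newIntRange("rank", 3, 10, true, true);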
On Wed, Jan 13, 2010 at 1:17 PM, Steven A Rowe wrote:
> Hi AlexElba,
>
> The problem is that Lucene only kn
Hi AlexElba,
The problem is that Lucene only knows how to handle character strings, not
numbers. Lexicographically, "3" > "10", so the range is empty and you get
the expected result: nothing.
The standard thing to do is transform your numbers into strings that sort as
you want them to. E.g., you can left-pad the
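The message is truncated above; a minimal sketch of the left-padding it describes, where the width of 10 digits is an arbitrary assumption (it just has to be wide enough and used consistently at index and query time):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.RangeFilter;

// Zero-pad every rank to a fixed width so string order equals numeric order.
String padded = String.format("%010d", 7);   // "0000000007"
Document doc = new Document();
doc.add(new Field("rank", padded, Field.Store.YES, Field.Index.NOT_ANALYZED));

// Pad the bounds the same way, otherwise "3" still sorts after "10".
RangeFilter filter = new RangeFilter("rank",
        String.format("%010d", 3), String.format("%010d", 10), true, true);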
Hello,
I am currently using Lucene 2.4 and have documents with 3 fields:
id
name
rank
I have a query and a filter; when I try to use a range filter on rank I am
not getting any results back:
RangeFilter rangeFilter = new RangeFilter("rank", "3", "10", true, true);
I have documents which are in
Before answering, how do you measure "proximity"? You can make
Lucene work with locations readily enough, though (there's an example
in Lucene in Action).
HTH
Erick
On Wed, Jan 13, 2010 at 11:39 AM, Ortelli, Gian Luca <
gianluca.orte...@truvo.com> wrote:
> Hi community,
>
>
>
> I have a genera
Lucene will probably only be helpful if you know what you are looking
for, e.g. that you search for a given person, a given street and given
time intervals.
Is this what you want to do?
If you instead are looking for a way to really extract any person,
street and time interval that a docum
Is it possible you are closing the searcher before / while running
that for loop?
Mike
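For context, a minimal sketch of the ordering Mike is hinting at, written in Java rather than the Groovy of the original post (names are illustrative): fetch the stored documents while the searcher is still open, and close it only afterwards.

import org.apache.lucene.document.Document;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

Searcher s = getSearcher();                        // however the searcher is obtained
TopDocs hits = s.search(query, filter, offset + max, sort);
for (ScoreDoc sd : hits.scoreDocs) {
    Document d = s.doc(sd.doc);                    // needs the underlying reader open
    // ... build the page of results ...
}
s.close();                                         // only once the loop is done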
On Wed, Jan 13, 2010 at 9:26 AM, Konstantyn Smirnov wrote:
>
> Hi all
>
> Consider the following piece of code:
>
> Searcher s = this.getSearcher()
> def hits = s.search( query, filter, params.offset + params.
Hi community,
I have a general understanding of Lucene concepts, and I'm wondering if
it's the right tool for my job:
- I need to extract data like e.g. time intervals ("8am - 12pm") and street
addresses from a set of files. The common issue with these data units is
that they contain spaces and
On 2010-01-13 15:29, Benjamin Heilbrunn wrote:
Thanks!
Didn't know that it's so easy ;)
2010/1/13 Uwe Schindler:
Why not simply add the field twice, one time with TokenStream, one time stored
only? Internally stored/indexed fields are handled like that.
Actually, you can implement your own F
Thanks!
Didn't know that it's so easy ;)
2010/1/13 Uwe Schindler :
> Why not simply add the field twice, one time with TokenStream, one time
> stored only? Internally stored/indexed fields are handled like that.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
Hi all
Consider the following piece of code:
Searcher s = this.getSearcher()
def hits = s.search( query, filter, params.offset + params.max, sort )
for( hit in hits.scoreDocs[ lower..
http://www.poiradar.ru
http://www.poiradar.com.ua
http://www.poiradar.com
Why not simply add the field twice, one time with TokenStream, one time stored
only? Internally stored/indexed fields are handled like that.
-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
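A minimal sketch of what this looks like in practice: the same field name added twice, once indexed from a TokenStream and once stored-only (the TokenStream source and the variable names are illustrative):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();

// Indexed-only: driven entirely by the custom TokenStream.
TokenStream tokens = myCustomTokenStream();   // hypothetical helper
doc.add(new Field("content", tokens));

// Stored-only: carries the original value, not indexed at all.
doc.add(new Field("content", originalText, Field.Store.YES, Field.Index.NO));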
> -Original Message-
> From: Benjamin Heilb
Sorry for pushing this again.
Would it be possible to add the requested constructor, or would it break
any of Lucene's logic?
2010/1/11 Benjamin Heilbrunn :
> Hey out there,
>
> in lucene it's not possible to create a Field based on a TokenStream
> AND supply a stored value.
>
> Is there a rea
So not much help here (I wonder if it's because I posted 3 questions in
one day), but I've made some progress in my understanding.
I understand there is only one norm per field, and I think Lucene does not
differentiate between adding the same field a number of times and
adding multiple text to th
On Sun, Jan 10, 2010 at 7:33 AM, Dvora wrote:
>
> I'm storing and reading the documents using Compass, not Lucene directly. I
> didn't touch those parameters, so I guess the default values are being used
> (I do see cfs files in the index).
OK. If your index directory has *.cfs files, then you a
We could also fix WhitespaceAnalyzer to filter that character out?
(Or you could make your own analyzer to do so...).
You could also try asking on the tika-user list whether Tika has a
solution for mapping "extended" whitespace characters...
Mike
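A minimal sketch of the "make your own analyzer" route, assuming the 2.9 CharFilter API and assuming the troublesome character is the non-breaking space (U+00A0); it is mapped to a plain space before WhitespaceTokenizer sees the text:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

class ExtendedWhitespaceAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("\u00A0", " ");   // non-breaking space -> plain space
        return new WhitespaceTokenizer(
                new MappingCharFilter(map, CharReader.get(reader)));
    }
}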
On Mon, Jan 11, 2010 at 3:04 PM, maxSchlein wrot
Indeed, getReader is an expensive way to get the segment count (it
flushes the current RAM buffer to disk as a new segment).
Since SegmentInfos is now public, you could use SegmentInfos.read to
read the current segments_N file, and then call its .size() method?
But, this will only count as of the
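The message is cut off above, but a minimal sketch of the SegmentInfos route (2.9+, where SegmentInfos is public; the index path is illustrative and exception handling is omitted):

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory dir = FSDirectory.open(new java.io.File("/path/to/index"));

// Read the current segments_N file and count the segments it lists.
SegmentInfos infos = new SegmentInfos();
infos.read(dir);
int segmentCount = infos.size();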
If you follow the rules Otis listed, you should never hit index
corruption, unless something is wrong with your hardware.
Or, if you hit an as-yet-undiscovered bug in Lucene ;)
Mike
On Wed, Jan 13, 2010 at 1:11 AM, zhang99 wrote:
>
> what is the longest time you ever keep index file without req