It's LUCENE-1487.
Tim
On 12/10/08 1:13 PM, "Tim Sturge" <[EMAIL PROTECTED]> wrote:
> Yes (mostly). It turns those terms into an OpenBitSet on the term array.
> Then it does a fastGet() in the next() and skipTo() loops to see if the term
> for that document is
et this into Lucene.
>
> Does FieldCacheTermsFilter let you specify a set of arbitrary terms to
> filter for, like TermsFilter in contrib/queries? And it's space/time
> efficient once FieldCache is populated?
>
> Mike
>
> Tim Sturge wrote:
>
>> Mike,
Mike, Mike,
I have an implementation of FieldCacheTermsFilter (which uses the field cache
to filter for a predefined set of terms) around if either of you is
interested. Roughly speaking, it is faster than materializing the filter when
the filter matches more than 1% of the documents.
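If it helps to see the shape of it, here is a minimal sketch of the idea
(not the actual LUCENE-1487 patch; it assumes the Lucene 2.4 APIs, i.e.
FieldCache.StringIndex, OpenBitSet, and the next()/skipTo()/doc() style of
DocIdSetIterator):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Filter;
import org.apache.lucene.util.OpenBitSet;

// Accepts documents whose (single-valued, indexed) field value is one of
// the given terms, using FieldCache ords rather than TermDocs.
public class FieldCacheTermsFilter extends Filter {
  private final String field;
  private final String[] terms;

  public FieldCacheTermsFilter(String field, String[] terms) {
    this.field = field;
    this.terms = terms;
  }

  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    // order[doc] is the ord of doc's term; lookup[ord] is the term itself.
    final FieldCache.StringIndex fcsi =
        FieldCache.DEFAULT.getStringIndex(reader, field);

    // The "OpenBitSet on the term array": one bit per term ord.
    final OpenBitSet termOrds = new OpenBitSet(fcsi.lookup.length);
    for (int i = 0; i < terms.length; i++) {
      int ord = findOrd(fcsi.lookup, terms[i]);
      if (ord > 0) {
        termOrds.fastSet(ord);
      }
    }

    return new DocIdSet() {
      public DocIdSetIterator iterator() {
        return new DocIdSetIterator() {
          private int doc = -1;

          public int doc() { return doc; }

          public boolean next() {
            // The fastGet() in the next()/skipTo() loop described above.
            while (++doc < fcsi.order.length) {
              if (termOrds.fastGet(fcsi.order[doc])) return true;
            }
            return false;
          }

          public boolean skipTo(int target) {
            doc = target - 1;
            return next();
          }
        };
      }
    };
  }

  // lookup[] is sorted; lookup[0] is the null entry for docs with no value.
  private static int findOrd(String[] lookup, String term) {
    int lo = 1, hi = lookup.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = lookup[mid].compareTo(term);
      if (cmp < 0) lo = mid + 1;
      else if (cmp > 0) hi = mid - 1;
      else return mid;
    }
    return -1;
  }
}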
So it's not better for a
Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
----- Original Message -----
>> From: Tim Sturge <[EMAIL PROTECTED]>
>> To: "java-user@lucene.apache.org"
>> Sent: Thursday, December 4, 2008 3:27:30 PM
>> Subject: Slow queri
to sort
> those N matches, leaving out all the rest of the matches. Note
> this assumes that when you say "sorting" you mean sorting
> by something other than relevance.
>
> Hope this helps
> Erick
>
> On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <[EMAIL PROTECTED]> wrote:
Hi all,
I have an interesting problem with my query traffic. Most of the queries run
in a fairly short amount of time (< 100ms) but a few take over 1000ms. These
queries are predominantly those with a huge number of hits (>1 million hits
in a >100 million document index). The time taken (as far as
> With "Allow Filter as clause to BooleanQuery":
> https://issues.apache.org/jira/browse/LUCENE-1345
> one could even skip the ConstantScoreQuery with this.
> Unfortunately 1345 is unfinished for now.
>
That would be interesting; I'd like to see how much performance improves.
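In the meantime the wrapping step looks something like this (a sketch;
userQuery, filter and searcher are assumed to exist already):

BooleanQuery bq = new BooleanQuery();
bq.add(userQuery, BooleanClause.Occur.MUST);
// Today the filter has to be disguised as a constant-scoring query;
// LUCENE-1345 would allow adding it as a clause directly.
bq.add(new ConstantScoreQuery(filter), BooleanClause.Occur.MUST);
TopDocs top = searcher.search(bq, null, 10);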
>> startup: 2811
I've finished a query time implementation of a column stride filter, which
implements DocIdSetIterator. This just builds the filter at process start
and uses it for each subsequent query. The index itself is unchanged.
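(For comparison, the stock way to get that build-once, reuse-per-query
behavior without custom code is CachingWrapperFilter, which caches the
DocIdSet per reader; the field name and dates here are just illustrative:

Filter dateFilter = new CachingWrapperFilter(
    new RangeFilter("date", "20080101", "20081231", true, true));
TopDocs top = searcher.search(query, dateFilter, 10);

The column stride approach instead keeps per-document values around and
tests membership on the fly.)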
The results are very impressive. Here are the results on a 45M document
index:
ne
had tried it before.
Part of me assumes that someone must have done this already; so either
there's an implementation out there, or there's a good reason (which I don't
see) that this is entirely impractical. So I'm interested in any feedback.
Tim
On 11/10/08 2:26 PM,
with this issue. In particular date ranges
seem to be something that lots of people use but Lucene implements fairly
poorly.
Tim
On 11/10/08 1:58 PM, "Paul Elschot" <[EMAIL PROTECTED]> wrote:
> Op Monday 10 November 2008 22:21:20 schreef Tim Sturge:
>> Hmmm -- I ha
> Paul Elschot
>
>
> Op Monday 10 November 2008 19:18:38 schreef Tim Sturge:
>> Yes, that is a significant issue. What I'm coming to realize is that
>> either I will end up with something like
>>
>> class MultiFilter {
>>String field;
>>
ld
> MultiSegmentReader compute this for its terms? Or maybe you'd just do
> this per-segment?
>
> Mike
>
> Tim Sturge wrote:
>
>> Hi,
>>
>> I'm wondering if there is any easy technique to number the terms in
>> an index
>> (By num
Hi,
I'm wondering if there is any easy technique to number the terms in an index
(By number I mean map a sequence of terms to a contiguous range of integers
and map terms to these numbers efficiently)
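One way to get at what I mean, using only the public TermEnum API (so
probably not as efficient as direct .tis/.tii access; "topic" is an
illustrative field name, and dir is an already-open Directory):

IndexReader reader = IndexReader.open(dir);
Map<Term, Integer> termToOrd = new HashMap<Term, Integer>();
TermEnum te = reader.terms(new Term("topic", ""));
try {
  int ord = 0;
  // TermEnum returns terms in sorted order, so ords come out dense
  // and in term order.
  while (te.term() != null && te.term().field().equals("topic")) {
    termToOrd.put(te.term(), Integer.valueOf(ord++));
    if (!te.next()) break;
  }
} finally {
  te.close();
}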
Looking at the Term class and the .tis/.tii index format it appears that the
terms are stored in
Hi,
I have an index which contains two very distinct types of fields:
- Some fields are large (many term documents) and change fairly slowly.
- Some fields are small (mostly titles, names, anchor text) and change fairly
rapidly.
Right now I keep around the large fields in raw form and when the
the function that relates and . I feel like
there's a correct information-theoretic answer and I'd like to know
what it is.
Tim
Karl Wettin wrote:
On 29 Aug 2007, at 19:13, Tim Sturge wrote:
I'm looking for a boost when the anchor text is more commonly
associated with
them both with "USA" once, they will rank equally. I
want the United States of America to rank higher.
Tim
Karl Wettin wrote:
On 28 Aug 2007, at 21:41, Tim Sturge wrote:
Hi,
I have fields which have high multiplicity; for example I have a
topic with 1000 names, 500 of which are "USA"
Hi,
I have fields which have high multiplicity; for example I have a topic
with 1000 names, 500 of which are "USA" and 200 are "United States of
America".
Previously I was indexing "USA USA .(500x).. USA United States of
America .(200x).. United States of America" as a single field. The
pr
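One workaround sketch (hypothetical, and only one axis of the problem) is
to flatten tf so that 500 copies of "USA" don't dominate the score:

import org.apache.lucene.search.DefaultSimilarity;

// Log-damped tf: 500 occurrences contribute about 1 + ln(500) = 7.2
// instead of the default sqrt(500) = 22.4.
public class DampedTfSimilarity extends DefaultSimilarity {
  public float tf(float freq) {
    return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0.0f;
  }
}

(tf() is applied at search time, so setting it on the IndexSearcher with
setSimilarity() is enough.)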
Can anyone explain to me why commit() on IndexReader is a protected method?
I want to do periodic deletes from my main index. I don't want to reopen
the index (all that is changing are things are being deleted), so I
don't want to call close(), but I can't call commit() from outside the
class
a world of slow.
On 7/30/07, Mark Miller <[EMAIL PROTECTED]> wrote:
I believe there is an issue in JIRA that handles reopening an IndexReader
without reopening segments that have not changed.
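(I believe that issue became IndexReader.reopen(), which shipped in Lucene
2.4; the usage pattern is roughly:

IndexReader newReader = reader.reopen();
if (newReader != reader) {
  reader.close();   // segments shared with newReader remain open
  reader = newReader;
}

Only the segments that actually changed get reloaded.)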
On 7/30/07, Tim Sturge <[EMAIL PROTECTED]> wrote:
Thanks for the reply Erick,
I be
have to cope with data
integrity if your process barfs before you've closed your FSDir
Or, you could ask whether 5 seconds is really necessary. I've seen a lot
of times when "real time" could be 5 minutes and nobody would really
complain, and other times when it really
Hi,
I am indexing a set of constantly changing documents. The change rate is
moderate (about 10 docs/sec over a 10M document collection with a 6G
total size) but I want to be right up to date (ideally within a second
but within 5 seconds is acceptable) with the index.
Right now I have code
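The code is shaped roughly like this (a sketch with hypothetical names;
currentSearcher is assumed to be a volatile field, and the full
IndexReader.open() on every cycle is exactly the expensive part):

private void refreshLoop() throws Exception {
  while (running) {
    IndexReader newReader = IndexReader.open(dir);    // full open each cycle
    IndexSearcher newSearcher = new IndexSearcher(newReader);
    IndexSearcher old = currentSearcher;
    currentSearcher = newSearcher;
    if (old != null) {
      old.getIndexReader().close(); // unsafe if a search still uses it
    }
    Thread.sleep(5000);             // the 5-second freshness budget
  }
}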
:-) The use of wikipedia data here is no secret; it's all over
www.freebase.com. I just hoped to avoid being sucked into a "what is the best
way to index wikipedia with Lucene?" discussion, which I believe several other
groups are already tackling.
At index time, I used a per document boost (o
mplementation, to give a bigger boost to documents that have more
matching terms? The point of coord is to give a little bump to those
docs that have more terms from the query in a given document. Sounds
like you want a bigger bump once you have multiple query terms in a
document. Would this w
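A sketch of that idea (the curve is made up; coord(int, int) is the
standard Similarity hook):

import org.apache.lucene.search.DefaultSimilarity;

// Squaring the overlap ratio punishes partial matches harder, which gives
// a bigger relative bump to docs that match more of the query terms.
public class SteepCoordSimilarity extends DefaultSimilarity {
  public float coord(int overlap, int maxOverlap) {
    float ratio = (float) overlap / (float) maxOverlap;
    return ratio * ratio;
  }
}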
( (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
or maybe
( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND
( (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) )
Tim Sturge wrote:
I'm following myself up here to ask if anyone has experie
at I will discover once I implement it and look at the
results?) or does it make searching a lot slower?
Thanks,
Tim
Tim Sturge wrote:
I have an index with two different sources of information, one small
but of high quality (call it "title"), and one large, but of lower
quality (call it
I have an index with two different sources of information, one small but
of high quality (call it "title"), and one large, but of lower quality
(call it "body"). I give boosts to certain documents related to their
popularity (this is very similar to what one would do indexing the web).
The pr
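At index time the popularity boost mentioned above looks roughly like this
(a sketch; the field flags and the boost formula are illustrative):

Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
// Fold popularity in as a per-document boost rather than a query factor.
doc.setBoost((float) (1.0 + Math.log(1.0 + popularity)));
writer.addDocument(doc);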
I'm asking for a use-case scenario here. Something like
"I want the docs to score equally no matter how many
links with 'United States' exist in them". Or
"A document with 100 links mentioning 'United States' should
score way higher than a document with only o
Hi,
I'm trying to index some fairly standard HTML documents. For each of the
documents, there is a unique <title> (which I believe is generally of
high quality), some <body> content, and some anchor text from the
linking documents (which is of good but more variable quality).
I'm indexing them in "title"