Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Tim Sturge
It's LUCENE-1487. Tim On 12/10/08 1:13 PM, "Tim Sturge" <[EMAIL PROTECTED]> wrote: > Yes (mostly). It turns those terms into an OpenBitSet on the term array. > Then it does a fastGet() in the next() and skipTo() loops to see if the term > for that document is
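A rough sketch of the mechanism described above, assuming Lucene 2.4's Filter/DocIdSet API and FieldCache.StringIndex; the class name and structure are illustrative, not the actual patch attached to LUCENE-1487:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.*;
    import org.apache.lucene.util.OpenBitSet;

    // Filters for documents whose single-valued field matches one of a fixed set
    // of terms, using FieldCache term ordinals instead of materializing a
    // per-document bit set.
    public class FieldCacheTermsFilterSketch extends Filter {
      private final String field;
      private final String[] terms;

      public FieldCacheTermsFilterSketch(String field, String[] terms) {
        this.field = field;
        this.terms = terms;
      }

      public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        final FieldCache.StringIndex fcsi =
            FieldCache.DEFAULT.getStringIndex(reader, field);
        // One bit per term ordinal. lookup[] is sorted, so a binary search would
        // be the obvious improvement; a linear scan keeps the sketch short.
        final OpenBitSet termBits = new OpenBitSet(fcsi.lookup.length);
        for (int i = 0; i < terms.length; i++) {
          for (int ord = 1; ord < fcsi.lookup.length; ord++) { // ordinal 0 means "no term"
            if (terms[i].equals(fcsi.lookup[ord])) { termBits.fastSet(ord); break; }
          }
        }
        return new DocIdSet() {
          public DocIdSetIterator iterator() {
            return new DocIdSetIterator() {
              private int doc = -1;
              public int doc() { return doc; }
              public boolean next() {
                // fastGet() on the term ordinal of each candidate document,
                // as in the next()/skipTo() loops described above.
                while (++doc < fcsi.order.length) {
                  if (termBits.fastGet(fcsi.order[doc])) return true;
                }
                return false;
              }
              public boolean skipTo(int target) {
                doc = target - 1;
                return next();
              }
            };
          }
        };
      }
    }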

Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Tim Sturge
et this into Lucene. > > Does FieldCacheTermsFilter let you specify a set of arbitrary terms to > filter for, like TermsFilter in contrib/queries? And it's space/time > efficient once FieldCache is populated? > > Mike > > Tim Sturge wrote: > >> Mike,

Re: Issue upgrading from lucene 2.3.2 to 2.4 (moving from bitset to docidset)

2008-12-10 Thread Tim Sturge
Mike, I have an implementation of FieldCacheTermsFilter (which uses field cache to filter for a predefined set of terms) around if either of you are interested. It is faster than materializing the filter roughly when the filter matches more than 1% of the documents. So it's not better for a

Re: Slow queries with lots of hits

2008-12-05 Thread Tim Sturge
Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: Tim Sturge <[EMAIL PROTECTED]> >> To: "java-user@lucene.apache.org" >> Sent: Thursday, December 4, 2008 3:27:30 PM >> Subject: Slow queri

Re: Slow queries with lots of hits

2008-12-04 Thread Tim Sturge
to sort > those N matches, leaving out all the rest of the matches. Note > this assumes that when you say "sorting" you mean sorting > by something other than relevance. > > Hope this helps > Erick > > On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge <[EMAIL PROTECTE
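For reference, the bounded, sorted search Erick describes looks roughly like this with the 2.4 Searcher API (the "date" field, the page size and the searcher variable are placeholders):

    // Keep only the best 50 hits ordered by an indexed field, instead of
    // collecting and sorting every one of the million-plus matches.
    Sort sort = new Sort(new SortField("date", SortField.STRING, true));
    TopFieldDocs top = searcher.search(query, null, 50, sort);
    for (int i = 0; i < top.scoreDocs.length; i++) {
      Document hit = searcher.doc(top.scoreDocs[i].doc);
      // ... render hit ...
    }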

Slow queries with lots of hits

2008-12-04 Thread Tim Sturge
Hi all, I have an interesting problem with my query traffic. Most of the queries run in a fairly short amount of time (< 100ms) but a few take over 1000ms. These queries are predominantly those with a huge number of hits (>1 million hits in a >100 million document index). The time taken (as far as

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
> With "Allow Filter as clause to BooleanQuery": > https://issues.apache.org/jira/browse/LUCENE-1345 > one could even skip the ConstantScoreQuery with this. > Unfortunately 1345 is unfinished for now. > That would be interesting; I'd like to see how much performance improves. >> startup: 2811

Re: Term numbering and range filtering

2008-11-18 Thread Tim Sturge
I've finished a query time implementation of a column stride filter, which implements DocIdSetIterator. This just builds the filter at process start and uses it for each subsequent query. The index itself is unchanged. The results are very impressive. Here are the results on a 45M document index:
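Tim's actual column stride implementation isn't shown in the archive; as a hedged sketch of the general idea, a range filter over FieldCache term ordinals, built per reader and reusing the same iterator shape as the FieldCacheTermsFilter sketch earlier on this page, might look like:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.*;

    // Accepts documents whose term ordinal for `field` falls in [lowerOrd, upperOrd].
    // order[doc] is the ordinal of the document's (single) term and lookup[] holds
    // the terms in sorted order, so a value range maps to an ordinal range.
    public class OrdinalRangeFilterSketch extends Filter {
      private final String field;
      private final String lowerVal, upperVal;

      public OrdinalRangeFilterSketch(String field, String lowerVal, String upperVal) {
        this.field = field; this.lowerVal = lowerVal; this.upperVal = upperVal;
      }

      public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        final FieldCache.StringIndex fcsi =
            FieldCache.DEFAULT.getStringIndex(reader, field);
        // Find the ordinal bounds (linear here for brevity; binary search in practice).
        int lo = 1, hi = fcsi.lookup.length - 1;               // ordinal 0 means "no term"
        while (lo <= hi && fcsi.lookup[lo].compareTo(lowerVal) < 0) lo++;
        while (hi >= lo && fcsi.lookup[hi].compareTo(upperVal) > 0) hi--;
        final int lowerOrd = lo, upperOrd = hi;
        return new DocIdSet() {
          public DocIdSetIterator iterator() {
            return new DocIdSetIterator() {
              private int doc = -1;
              public int doc() { return doc; }
              public boolean next() {
                while (++doc < fcsi.order.length) {
                  int ord = fcsi.order[doc];
                  if (ord >= lowerOrd && ord <= upperOrd) return true;
                }
                return false;
              }
              public boolean skipTo(int target) { doc = target - 1; return next(); }
            };
          }
        };
      }
    }

Until something like LUCENE-1345 lands, such a filter would typically be combined with the main query by wrapping it in a ConstantScoreQuery (field name and range values below are made up):

    BooleanQuery combined = new BooleanQuery();
    combined.add(mainQuery, BooleanClause.Occur.MUST);
    combined.add(new ConstantScoreQuery(
        new OrdinalRangeFilterSketch("date", "20080101", "20081231")),
        BooleanClause.Occur.MUST);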

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
ne had tried it before. Part of me assumes that someone must have done this already; so either there's an implementation out there already or there's a good reason I don't see that this is entirely impractical. So I'm interested to get feedback. Tim On 11/10/08 2:26 PM, &

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
with this issue. In particular date ranges seem to be something that lots of people use but Lucene implements fairly poorly. Tim On 11/10/08 1:58 PM, "Paul Elschot" <[EMAIL PROTECTED]> wrote: > On Monday 10 November 2008 22:21:20, Tim Sturge wrote: >> Hmmm -- I ha

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
> Paul Elschot > > > On Monday 10 November 2008 19:18:38, Tim Sturge wrote: >> Yes, that is a significant issue. What I'm coming to realize is that >> either I will end up with something like >> >> class MultiFilter { >> String field; >>

Re: Term numbering and range filtering

2008-11-10 Thread Tim Sturge
ld > MultiSegmentReader compute this for its terms? Or maybe you'd just do > this per-segment? > > Mike > > Tim Sturge wrote: > >> Hi, >> >> I'm wondering if there is any easy technique to number the terms in >> an index >> (By num

Term numbering and range filtering

2008-11-07 Thread Tim Sturge
Hi, I'm wondering if there is any easy technique to number the terms in an index (By number I mean map a sequence of terms to a contiguous range of integers and map terms to these numbers efficiently). Looking at the Term class and the .tis/.tii index format it appears that the terms are stored in
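For what it's worth, FieldCache already exposes a numbering of this shape at the reader level for single-valued fields, which is what the later messages in this thread build on ("date" is a made-up field name):

    // lookup[] maps ordinal -> term (in sorted order); order[] maps docid -> ordinal.
    FieldCache.StringIndex idx = FieldCache.DEFAULT.getStringIndex(reader, "date");
    String termOfDoc5 = idx.lookup[idx.order[5]];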

Almost parallel indexes

2007-09-27 Thread Tim Sturge
Hi, I have an index which contains two very distinct types of fields: - Some fields are large (many term documents) and change fairly slowly. - Some fields are small (mostly titles, names, anchor text) and change fairly rapidly. Right now I keep around the large fields in raw form and when the

Re: indexing fields with multiplicity

2007-08-29 Thread Tim Sturge
the function that relates and . I feel like there's a correct information-theoretical answer and I'd like to know what it is. Tim Karl Wettin wrote: On 29 Aug 2007, at 19.13, Tim Sturge wrote: I'm looking for a boost when the anchor text is more commonly associated with

Re: indexing fields with multiplicity

2007-08-29 Thread Tim Sturge
them both with "USA" once, they will rank equally. I want the United States of America to rank higher. Tim Karl Wettin wrote: On 28 Aug 2007, at 21.41, Tim Sturge wrote: Hi, I have fields which have high multiplicity; for example I have a topic with 1000 names, 500 of which are &

indexing fields with multiplicity

2007-08-28 Thread Tim Sturge
Hi, I have fields which have high multiplicity; for example I have a topic with 1000 names, 500 of which are "USA" and 200 are "United States of America". Previously I was indexing "USA USA .(500x).. USA United States of America .(200x).. United States of America" as a single field. The pr

calling commit() on IndexReader

2007-07-31 Thread Tim Sturge
Can anyone explain to me why commit() on IndexReader is a protected method? I want to do periodic deletes from my main index. I don't want to reopen the index (all that is changing are things are being deleted), so I don't want to call close(), but I can't call commit() from outside the class
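A sketch of the workaround this usually forces (the "id" field and the path are made up): do the deletes through the reader, which sees them immediately, and periodically close it so the protected commit() runs, reopening afterwards.

    // Deletes made through an IndexReader are visible to that reader at once,
    // but are only persisted when commit() runs -- effectively on close().
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.deleteDocuments(new Term("id", "doc-42"));   // searches on this reader stop seeing it

    // Periodically persist the buffered deletes:
    reader.close();
    reader = IndexReader.open("/path/to/index");

Later releases add a public IndexReader.flush() (2.3, if memory serves) that persists buffered deletes without the close/reopen cycle.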

Re: java gc with a frequently changing index?

2007-07-30 Thread Tim Sturge
a world of slow. On 7/30/07, Mark Miller <[EMAIL PROTECTED]> wrote: I believe there is an issue in JIRA that handles reopening an IndexReader without reopening segments that have not changed. On 7/30/07, Tim Sturge < [EMAIL PROTECTED]> wrote: Thanks for the reply Erick, I be
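The JIRA issue Mark mentions shipped later as IndexReader.reopen() (LUCENE-743, released in Lucene 2.4): it returns a reader that shares the segment readers that have not changed, so a refresh every few seconds becomes much cheaper than a full open. A small usage sketch, assuming a single shared reader:

    IndexReader newReader = reader.reopen();
    if (newReader != reader) {        // reopen() returns the same instance if nothing changed
      IndexReader old = reader;
      reader = newReader;             // publish the fresh reader to searches
      old.close();                    // unchanged segments stay alive via reference counting
    }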

Re: java gc with a frequently changing index?

2007-07-30 Thread Tim Sturge
have to cope with data integrity if your process barfs before you've closed your FSDir Or, you could ask whether 5 seconds is really necessary. I've seen a lot of times when "real time" could be 5 minutes and nobody would really complain, and other times when it really

java gc with a frequently changing index?

2007-07-25 Thread Tim Sturge
Hi, I am indexing a set of constantly changing documents. The change rate is moderate (about 10 docs/sec over a 10M document collection with a 6G total size) but I want to be right up to date (ideally within a second but within 5 seconds is acceptable) with the index. Right now I have code

Re: product based term combination for BooleanQuery?

2007-07-04 Thread Tim Sturge
:-) The use of wikipedia data here is no secret; it's all over www.freebase.com. I just hoped to avoid being sucked into a "what is the best way to index wikipedia with Lucene?" discussion, which I believe several other groups are already tackling. At index time, I used a per document boost (o

Re: product based term combination for BooleanQuery?

2007-07-03 Thread Tim Sturge
mplementation, to give a bigger boost to documents that have more matching terms? The point of coord is to give a little bump to those docs that have more terms from the query in a given document. Sounds like you want a bigger bump once you have multiple query terms in a document. Would this w

Re: product based term combination for BooleanQuery?

2007-07-03 Thread Tim Sturge
( (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) ) or maybe ( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND ( (title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush) ) Tim Sturge wrote: I'm following myself up here to ask if anyone has experie
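For concreteness, the first of those forms can be built programmatically along these lines (boosts as in the quoted queries; note the terms must match whatever the analyzer produced at index time):

    // ( (title:John^4.0 OR body:John) AND (title:Bush^4.0 OR body:Bush) )
    BooleanQuery query = new BooleanQuery();
    String[] words = { "John", "Bush" };
    for (int i = 0; i < words.length; i++) {
      BooleanQuery perWord = new BooleanQuery();
      TermQuery title = new TermQuery(new Term("title", words[i]));
      title.setBoost(4.0f);
      perWord.add(title, BooleanClause.Occur.SHOULD);
      perWord.add(new TermQuery(new Term("body", words[i])), BooleanClause.Occur.SHOULD);
      query.add(perWord, BooleanClause.Occur.MUST);   // every word must match in some field
    }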

product based term combination for BooleanQuery?

2007-07-03 Thread Tim Sturge
at I will discover once I implement it and look at the results?) or does it make searching a lot slower? Thanks, Tim Tim Sturge wrote: I have an index with two different sources of information, one small but of high quality (call it "title"), and one large, but of lower quality (call it

multi-term query weighting

2007-07-02 Thread Tim Sturge
I have an index with two different sources of information, one small but of high quality (call it "title"), and one large, but of lower quality (call it "body"). I give boosts to certain documents related to their popularity (this is very similar to what one would do indexing the web). The pr

Re: indexing anchor text

2007-06-27 Thread Tim Sturge
I'm asking for a use-case scenario here. Something like "I want the docs to score equally no matter how many links with 'United States' exist in them". Or "A document with 100 links mentioning 'United States' should score way higher than a document with only o

indexing anchor text

2007-06-27 Thread Tim Sturge
Hi, I'm trying to index some fairly standard html documents. For each of the documents, there is a unique title (which I believe is generally of high quality), some content, and some anchor text from the linking documents (which is of good but more variable quality). I'm indexing them in "title"
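A bare-bones sketch of indexing a page along those lines ("anchor" as a field name, the input variables and the popularity formula are all assumptions, not something stated in the thread):

    // One Document per page: title, body and accumulated anchor text as separate
    // fields, plus a document-level boost for popularity.
    Document doc = new Document();
    doc.add(new Field("title",  pageTitle,   Field.Store.YES, Field.Index.TOKENIZED));
    doc.add(new Field("body",   pageContent, Field.Store.NO,  Field.Index.TOKENIZED));
    doc.add(new Field("anchor", anchorText,  Field.Store.NO,  Field.Index.TOKENIZED));
    doc.setBoost(1.0f + (float) Math.log(1 + inboundLinkCount));  // hypothetical popularity boost
    writer.addDocument(doc);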