Any CommonGrams-inspired tricks to speed up other proximity query types?

2012-06-21 Thread Chris Harris
CommonGrams provides a neat trick for optimizing slow phrase queries that contain common words. (E.g. Hathi Trust has some data showing how effective this can be.) Unfortunately, it does nothing for other positi

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-25 Thread Chris Harris
ortions of the query, and considered only binary versions of and/or.) a not/n b means "a, not within n words of b". I don't think it can be implemented directly using existing SpanQueries, but I think it's probably easy to extend SpanQuery to do the job. On Wed, May 16, 201
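A minimal sketch of the "a, not within n words of b" semantics described above, written in plain Python over a token list rather than against Lucene. The function name `not_within` and the choice to measure distance as the absolute difference of token positions are assumptions made for illustration; this is not a SpanQuery implementation.

```python
def not_within(tokens, a, b, n):
    """Return positions of term `a` that have no occurrence of `b`
    within `n` token positions on either side (hypothetical helper)."""
    b_positions = [i for i, t in enumerate(tokens) if t == b]
    return [
        i
        for i, t in enumerate(tokens)
        if t == a and all(abs(i - j) > n for j in b_positions)
    ]


tokens = "a x x b y a".split()
# `a` occurs at positions 0 and 5, `b` at position 3.
matches = not_within(tokens, "a", "b", 2)
```

Under these assumptions a document would match `a not/n b` when the returned list is non-empty.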

Re: Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-17 Thread Chris Harris
First impression is, that's a reasonably clever way to get the user intent basically right without having to add a new SpanQuery. Have you come up with any edge cases where it could do something unexpected? So far I've thought of one, though you could argue it has more to do with the "minimum/lazy

Approches/semantics for arbitrarily combining boolean and proximity search operators?

2012-05-16 Thread Chris Harris
I'm working on a product for librarians and similar people, who apparently expect to be able to combine classic boolean operators (i.e. AND, OR, NOT) with proximity operators (especially w/n and pre/n -- which basically map to unordered and ordered SpanQueries with slop n, respectively) in unrestri
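To make the w/n versus pre/n distinction concrete, here is a small sketch in plain Python. It assumes "within n" means the two terms' token positions differ by at most n, which is a simplification and not Lucene's exact slop accounting; the helper name `near` is hypothetical.

```python
def near(tokens, a, b, n, ordered=False):
    """True if some `a` and some `b` lie within `n` positions of each other.
    With ordered=True, `a` must come before `b` (pre/n style);
    otherwise either order is accepted (w/n style)."""
    a_positions = [i for i, t in enumerate(tokens) if t == a]
    b_positions = [i for i, t in enumerate(tokens) if t == b]
    for pa in a_positions:
        for pb in b_positions:
            gap = pb - pa
            if ordered and 0 < gap <= n:
                return True
            if not ordered and 0 < abs(gap) <= n:
                return True
    return False


tokens = "the quick brown fox".split()
unordered_hit = near(tokens, "fox", "quick", 2)               # w/2
ordered_hit = near(tokens, "fox", "quick", 2, ordered=True)   # pre/2
```

In this sketch the unordered check accepts "fox" and "quick" in either order, while the ordered check requires the first term to precede the second.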

Re: Will doc ids ever change if nothing is deleted?

2010-05-14 Thread Chris Harris
Could you address your needs by assigning each document a unique identifier (maybe you have a natural key, or maybe you could generate a new GUID or something for each doc), and using those identifiers, rather than internal Lucene docids, to track documents between the search stage and the loading

Can you use reduced sized test indexes to predict performance gains for a larger index?

2010-02-12 Thread Chris Harris
I'd like to try some experiments to see if I can improve search performance by changing analysis (e.g. adding/removing word bigrams or commongrams), or by changing how I map my source records into Lucene documents. The problem is that my index currently is about 1TB in size and takes about 2-3 week

Where to download Mark Miller's Qsol Parser?

2010-02-03 Thread Chris Harris
The QSol query parser (brief overview here: http://www.lucidimagination.com/blog/2009/02/22/exploring-query-parsers/) used to be available at http://myhardshadow.com/qsol.php (there was documentation as well as a link to an SVN server) but it looks like myhardshadow.com has been relinquished t

Re: Tag Index patch (LUCENE-1292) status?

2010-01-21 Thread Chris Harris
on it?
>
> Jason
>
> On Tue, Jan 19, 2010 at 4:42 PM, Chris Harris wrote:
>> I'm interested in the Tag Index patch (LUCENE-1292), in particular
>> because of how it enables you to modify certain fields without
>> reindexing a whole document. However, that issu

Re: Lucene as a primary datastore

2010-01-20 Thread Chris Harris
I don't do a lot of work with straight Lucene right now, but I do use Solr, and from time to time the Lucene index inside my master Solr server gets corrupted; in particular, some of the Lucene segment files that are still in use somehow get deleted, resulting in Lucene throwing FileNotFoundExcepti

Tag Index patch (LUCENE-1292) status?

2010-01-19 Thread Chris Harris
I'm interested in the Tag Index patch (LUCENE-1292), in particular because of how it enables you to modify certain fields without reindexing a whole document. However, that issue is marked Lucene 2.3.1 and hasn't been updated since July 2008. Can anyone provide any status updates on this patch? Que