RE: Wanting batch update to avoid high disk usage

2010-08-24 Thread Beard, Brian
entire segment files are rewritten every time. So it looks like our only option is to bail out when there's not enough space to duplicate the existing index. - Original Message ---- From: "Beard, Brian" To: java-user@lucene.apache.org Sent: Tue, August 24, 2010 8:19:52 AM Sub

RE: Wanting batch update to avoid high disk usage

2010-08-24 Thread Beard, Brian
We had a situation where our index size was inflated to roughly double. It took a couple of months, but the size eventually dropped back down, so it does seem to eventually get rid of the deleted documents. With that said, in the future expungeDeletes will get called once a day to better man
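A minimal sketch of the daily expunge, assuming a Lucene 2.9-era IndexWriter (the dir and analyzer variables stand in for the application's own):

    IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
    writer.expungeDeletes();   // rewrites only the segments that contain deletions
    writer.close();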

RE: Tokenization / Analyzer question

2010-08-20 Thread Beard, Brian
e this metaData information while inside the TokenFilter. I guess this would be similar to adding column stride fields, but have multiple ones at different positions in the document. -Original Message----- From: Beard, Brian [mailto:brian.be...@mybir.com] Sent: Thursday, August 19, 2010 2:02 P

Tokenization / Analyzer question

2010-08-19 Thread Beard, Brian
I'm using lucene 2.9.1. I'm indexing documents which correspond to an ID. Each field in the ID document is made up of data from all subId's. (It's a requirement that searches must work across all subId's within an ID). They will be indexed and stored in some format similar to: subId0Value0 subId0

FieldValueHitQueue question - migration to 3.0

2010-01-21 Thread Beard, Brian
Since FieldSortedHitQueue was deprecated in 3.0, I'm converting to the new FieldValueHitQueue. The trouble I'm having is coming up with a way to use FieldValueHitQueue in a Collector so it is decoupled from a TopDocsCollector. What I'd like to do is have a custom Collector that can add objects ex
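A sketch of one way to get that decoupling, assuming Lucene 3.0. Rather than reaching into FieldValueHitQueue (whose comparator accessors are not clearly usable outside the org.apache.lucene.search package in this era), it builds the same mechanics from the public pieces: SortField#getComparator plus a custom PriorityQueue. The names SlotEntry and SortedCollector are made up, and only a single, non-reversed sort field is handled:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.FieldComparator;
    import org.apache.lucene.search.Scorer;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.util.PriorityQueue;

    // SlotEntry pairs a comparator slot with a global doc id.
    class SlotEntry {
        int slot;
        int doc;
        SlotEntry(int slot, int doc) { this.slot = slot; this.doc = doc; }
    }

    public class SortedCollector extends Collector {
        private final FieldComparator comparator;
        private final PriorityQueue<SlotEntry> queue;
        private final int size;
        private SlotEntry bottom;
        private int hits;
        private int docBase;

        public SortedCollector(SortField sortField, final int size) throws IOException {
            this.size = size;
            this.comparator = sortField.getComparator(size, 0);
            this.queue = new PriorityQueue<SlotEntry>() {
                { initialize(size); }
                @Override
                protected boolean lessThan(SlotEntry a, SlotEntry b) {
                    // "lessThan" here means "sorts later": the queue's top is the worst hit
                    return comparator.compare(a.slot, b.slot) > 0;
                }
            };
        }

        @Override
        public void collect(int doc) throws IOException {
            if (hits < size) {
                comparator.copy(hits, doc);                  // fill the next free slot
                bottom = queue.add(new SlotEntry(hits, docBase + doc));
                hits++;
                if (hits == size) {
                    comparator.setBottom(bottom.slot);
                }
            } else if (comparator.compareBottom(doc) > 0) {  // competitive (non-reversed sort)
                comparator.copy(bottom.slot, doc);           // reuse the evicted hit's slot
                bottom.doc = docBase + doc;
                bottom = queue.updateTop();
                comparator.setBottom(bottom.slot);
            }
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            this.docBase = docBase;
            comparator.setNextReader(reader, docBase);
        }

        @Override
        public void setScorer(Scorer scorer) {
            comparator.setScorer(scorer);
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return false;
        }

        // Drain worst-first; comparator.value(slot) recovers each hit's sort key.
        public SlotEntry pop() {
            return queue.pop();
        }
    }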

termDocs / termEnums performance increase for 2.4.0

2009-02-05 Thread Beard, Brian
Thought I would report a performance increase noticed in migrating from 2.3.2 to 2.4.0. Performing an iterated loop using termDocs & termEnums like below is about 30% faster. The example test set I'm running has about 70K documents to go through and process (on a dual processor windows machine) w
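The loop itself was cut off by the archive; a representative version of that kind of iteration, assuming the Lucene 2.4 API, with an illustrative field name and a process() placeholder for the per-document work:

    TermEnum termEnum = reader.terms(new Term("field", ""));
    TermDocs termDocs = reader.termDocs();
    try {
        do {
            Term term = termEnum.term();
            if (term == null || !"field".equals(term.field())) {
                break;                       // ran past the last term in this field
            }
            termDocs.seek(termEnum);         // position the doc iterator at this term
            while (termDocs.next()) {
                process(termDocs.doc());     // hypothetical per-document processing
            }
        } while (termEnum.next());
    } finally {
        termDocs.close();
        termEnum.close();
    }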

RE: Poor QPS with highlighting

2009-02-05 Thread Beard, Brian
A while ago someone posted a link to a project called XTF which does this: http://xtf.wiki.sourceforge.net/ The one problem with this approach still lurking for me (or maybe I don't understand how to get around it) is how to handle multiple terms which "must" appear in the query, but are in non-overl

highlighter / fragmenter performance for large fields

2008-10-13 Thread Beard, Brian
We index some documents which have an "all" field containing all of the data which can be searched on. One of the problems we're having is when this field is say 10Mbytes the highlighter takes about a second to calculate the best fragments. The search only takes 30 milliseconds. I've accommodated t
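One mitigation, assuming the Lucene 2.x contrib highlighter: cap how much of the huge field gets analyzed, trading fragment coverage at the tail of the text for latency. The field name and the cap are illustrative:

    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    highlighter.setMaxDocBytesToAnalyze(100 * 1024);    // only scan the first ~100 KB
    TokenStream tokens = analyzer.tokenStream("all", new StringReader(text));
    String[] fragments = highlighter.getBestFragments(tokens, text, 3);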

RE: Memory eaten up by String, Term and TermInfo?

2008-10-06 Thread Beard, Brian
I played around with GC quite a bit in our app and found the following java settings to help a lot (Used with jboss, but should be good for any jvm). set JAVA_OPTS=%JAVA_OPTS% -XX:MaxPermSize=512M -XX:+UseConcMarkSweepGC -XX:+CMSPermGenSweepingEnabled -XX:+CMSClassUnloadingEnabled While these set

RE: performance feedback

2008-07-10 Thread Beard, Brian
35 AM, Beard, Brian <[EMAIL PROTECTED]> wrote: > I will try tweaking RAM, and check about autoCommit=false. It's on the > future agenda to multi-thread through the index writer. The indexing > time I quoted includes the document creation time which would definitely > improve w

RE: performance feedback

2008-07-09 Thread Beard, Brian
performance feedback This is great to hear! If you tweak things a bit (increase RAM buffer size, use autoCommit=false, use threads, etc) you should be able to eke out some more gains... Are you storing fields & using term vectors on any of your fields? Mike Beard, Brian wrote: > > I
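A sketch of those tweaks against the Lucene 2.3 API (the autoCommit constructor flag existed in this era; the 64 MB figure is just an example):

    IndexWriter writer = new IndexWriter(dir, false /* autoCommit */, analyzer);
    writer.setRAMBufferSizeMB(64);   // flush by RAM consumption instead of doc count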

performance feedback

2008-07-09 Thread Beard, Brian
I just did an update from lucene 2.2.0 to 2.3.2 and thought I'd give some kudos for the indexing performance enhancements. The lucene indexing portion is about 6-8 times faster. Previously we were doing ~60-120 documents per second, now we're between 400-1000, depending on the type of document, s

search performance & caching

2008-04-28 Thread Beard, Brian
I'm using lucene 2.2.0 & have two questions: 1) Should search times be linear wrt number of queries hitting a single searcher? I've run multiple search threads against a single searcher, and the search times are very linear - 10x slower for 10 threads vs 1 thread, etc. I'm using a parallel multi-

RE: WildCardQuery and TooManyClauses

2008-04-14 Thread Beard, Brian
You can use your approach w/ or w/o the filter. >td = indexSearcher.search(query, filter, maxnumhits); You need to use a filter for the wildcards which is built into the query. 1) Extend QueryParser to override the getWildcardQuery method. (Or even if you don't use QueryParser, j
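A sketch of step 1, assuming Lucene 2.x. WildcardFilter here is a hypothetical custom filter (one reconstruction appears under the "Wildcard & filters" thread below) that walks matching terms with WildcardTermEnum instead of rewriting into boolean clauses:

    public class FilteredQueryParser extends QueryParser {
        public FilteredQueryParser(String field, Analyzer analyzer) {
            super(field, analyzer);
        }

        @Override
        protected Query getWildcardQuery(String field, String termStr) {
            // Constant score over a filter: no term expansion, so TooManyClauses can't trigger.
            return new ConstantScoreQuery(new WildcardFilter(new Term(field, termStr)));
        }
    }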

RE: Boolean Query search performance

2008-03-10 Thread Beard, Brian
AHA! That is consistent with what is happening now, and explains the discrepancy. The parens around each term in the original post were there because I was adding them as separate boolean queries; now, using just the clause, the parens are around the entire clause with the boost. -Original Message

RE: Boolean Query search performance

2008-03-06 Thread Beard, Brian
Thanks for all replies. Today when I printed out the query that's generated it does not have the extra paren's. And query.rewrite(reader).toString() now gives the same result as query.toString(). All I can figure is I must have changed something between starting the email and sending it out. The o

Boolean Query search performance

2008-03-05 Thread Beard, Brian
I'm using lucene 2.2.0. I'm in the process of re-writing some queries to build BooleanQueries instead of using query parser. Bypassing query parser provides almost an order of magnitude improvement for very large queries, but then the searches take 20-30% longer. I'm adding boost valu
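A sketch of the kind of programmatic construction described, assuming the Lucene 2.2 API; the field names and boost value are illustrative:

    BooleanQuery query = new BooleanQuery();
    TermQuery name = new TermQuery(new Term("name", "brian"));
    name.setBoost(4.0f);                                  // boost lives on the clause's query
    query.add(name, BooleanClause.Occur.MUST);
    query.add(new TermQuery(new Term("type", "record")), BooleanClause.Occur.MUST);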

RE: how do I get my own TopDocHitCollector?

2008-01-11 Thread Beard, Brian
- From: Antony Bowesman [mailto:[EMAIL PROTECTED] Sent: Thursday, January 10, 2008 3:19 PM To: java-user@lucene.apache.org Subject: Re: how do I get my own TopDocHitCollector? Beard, Brian wrote: > Ok, I've been thinking about this some more. Is the cache mechanism > pulling from the cac

RE: how do I get my own TopDocHitCollector?

2008-01-10 Thread Beard, Brian
----- From: Beard, Brian [mailto:[EMAIL PROTECTED] Sent: Thursday, January 10, 2008 10:08 AM To: java-user@lucene.apache.org Subject: RE: how do I get my own TopDocHitCollector? Thanks for the post. So you're using the doc id as the key into the cache to retrieve the external id. Then what mechan

RE: how do I get my own TopDocHitCollector?

2008-01-10 Thread Beard, Brian
Wednesday, January 09, 2008 7:19 PM To: java-user@lucene.apache.org Subject: Re: how do I get my own TopDocHitCollector? Beard, Brian wrote: > Question: > > The documents that I index have two id's - a unique document id and a > record_id that can link multiple documents together that

how do I get my own TopDocHitCollector?

2008-01-09 Thread Beard, Brian
Question: The documents that I index have two id's - a unique document id and a record_id that can link multiple documents together that belong to a common record. I'd like to use something like TopDocs to return the first 1024 results that have unique record_id's, but I will want to skip some o
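A sketch of a collector for this, assuming the Lucene 2.x HitCollector API and that record_id is indexed untokenized (so FieldCache can hold one value per document). Names are illustrative, and note that a HitCollector sees documents in index order, not score order:

    public class UniqueRecordCollector extends HitCollector {
        private final String[] recordIds;                 // record_id per doc, from FieldCache
        private final Set<String> seen = new HashSet<String>();
        private final List<Integer> docs = new ArrayList<Integer>();
        private final int limit;

        public UniqueRecordCollector(IndexReader reader, int limit) throws IOException {
            this.recordIds = FieldCache.DEFAULT.getStrings(reader, "record_id");
            this.limit = limit;
        }

        public void collect(int doc, float score) {
            // keep a doc only if its record_id hasn't been seen and we're under the cap
            if (docs.size() < limit && seen.add(recordIds[doc])) {
                docs.add(Integer.valueOf(doc));
            }
        }

        public List<Integer> getDocs() { return docs; }
    }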

RE: Post processing to get around TooManyClauses?

2007-12-11 Thread Beard, Brian
I had a similar problem (I think). Look at using a WildcardFilter (below), possibly wrapped in a CachingWrapperFilter, depending on whether you want to re-use it. I overrode the method QueryParser.getWildcardQuery to customize it. In your case you would probably have to specifically detect for the presenc

RE: Wildcard & filters

2007-10-12 Thread Beard, Brian
ew BitSet(); with = new BitSet(reader.maxDocs()); Beard, Brian wrote: > Mark, > > Thanks so much. > > -Original Message- > From: Mark Miller [mailto:[EMAIL PROTECTED] > Sent: Friday, October 12, 2007 1:54 PM > To: java-user@lucene.apache.org > Subject: Re: Wildcard

RE: Wildcard & filters

2007-10-12 Thread Beard, Brian
.doc()); } } else { break; } } while (enumerator.next()); } finally { termDocs.close(); enumerator.close(); } return bits; } } - Mark Beard, Brian wrote: > I'm trying to over-ride QueryParser.getW
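The filter code in this thread is truncated by the archive; a plausible reconstruction against the Lucene 2.2-era Filter contract (bits() predates getDocIdSet), folding in the BitSet-sizing fix from the reply above (using reader.maxDoc(), the actual method name):

    public class WildcardFilter extends Filter {
        private final Term term;

        public WildcardFilter(Term term) { this.term = term; }

        @Override
        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = new BitSet(reader.maxDoc());    // sized up front, per the fix above
            WildcardTermEnum enumerator = new WildcardTermEnum(reader, term);
            TermDocs termDocs = reader.termDocs();
            try {
                do {
                    Term t = enumerator.term();
                    if (t == null) {
                        break;                            // no more terms match the pattern
                    }
                    termDocs.seek(t);
                    while (termDocs.next()) {
                        bits.set(termDocs.doc());
                    }
                } while (enumerator.next());
            } finally {
                termDocs.close();
                enumerator.close();
            }
            return bits;
        }
    }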

Wildcard & filters

2007-10-12 Thread Beard, Brian
I'm trying to override QueryParser.getWildcardQuery to use filtering. I'm missing something, because the following still hits the maxBooleanClauses limit. I guess the terms are still expanded even though the query is wrapped in a filter. How do I avoid the term expansion altogether? Is there a b

RE: combining Filter's & Query's to search

2007-10-09 Thread Beard, Brian
idable to provide a hook for you to return a Query object of your choosing (e.g. ConstantScoreQuery wrapping your choice of filter) Cheers Mark - Original Message From: "Beard, Brian" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Tuesday, 9 October, 2007 3:2

combining Filter's & Query's to search

2007-10-09 Thread Beard, Brian
I'm currently using rangeFilter's and queryWrapperFilter's to get around the max boolean clause limit. A couple of questions concerning this: 1) Is it good design practice to substitute every term containing a wildcard with a queryWrapperFilter, and a rangeQuery with a RangeFilter and ChainedFilt
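A sketch of what question 1 describes, assuming the contrib ChainedFilter from this era; field names and terms are illustrative, and WildcardFilter is a custom filter like the one reconstructed in the "Wildcard & filters" thread above:

    Filter dateRange = new RangeFilter("date", "20070101", "20071231", true, true);
    Filter wildcard  = new WildcardFilter(new Term("name", "bri*"));
    Filter combined  = new ChainedFilter(new Filter[] { dateRange, wildcard }, ChainedFilter.AND);
    Hits hits = searcher.search(query, combined);    // query holds the remaining plain clauses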

RE: nfs mount problem

2007-06-25 Thread Beard, Brian
l now work fine, this has not been heavily tested yet. Also note that performance over NFS is generally not great. If you do go down this route please report back on any success or failure! Thanks. Mike "Beard, Brian" <[EMAIL PROTECTED]> wrote: > > http://issues.apache.org/j

nfs mount problem

2007-06-22 Thread Beard, Brian
http://issues.apache.org/jira/browse/LUCENE-673 This says the NFS mount problem is still open; is this the case? Has anyone been able to deal with this adequately?

RE: All keys for a field

2007-06-21 Thread Beard, Brian
parser = new QueryParser(); parser.setAllowLeadingWildcard(true); -Original Message- From: Martin Spamer [mailto:[EMAIL PROTECTED] Sent: Thursday, June 21, 2007 7:06 AM To: java-user@lucene.apache.org Subject: All keys for a field I need to return all of the keys for a
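An alternative to the leading-wildcard route, assuming plain Lucene 2.x: enumerate the field's terms directly, which never builds a query and so can't hit clause limits. The field name is illustrative:

    TermEnum terms = reader.terms(new Term("keyfield", ""));
    try {
        while (terms.term() != null && "keyfield".equals(terms.term().field())) {
            System.out.println(terms.term().text());    // each distinct key in the field
            if (!terms.next()) {
                break;
            }
        }
    } finally {
        terms.close();
    }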

RE: MultiSearcher holds on to index - optimization not one segment

2007-06-19 Thread Beard, Brian
That works, thanks. -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley Sent: Tuesday, June 19, 2007 9:57 AM To: java-user@lucene.apache.org Subject: Re: MultiSearcher holds on to index - optimization not one segment On 6/19/07, Beard, Brian

RE: MultiSearcher holds on to index - optimization not one segment

2007-06-19 Thread Beard, Brian
: Tuesday, June 19, 2007 9:06 AM To: java-user@lucene.apache.org Subject: Re: MultiSearcher holds on to index - optimization not one segment On 6/19/07, Beard, Brian <[EMAIL PROTECTED]> wrote: > The problem I'm having is once the MultiSearcher is open, it holds on to > t

MultiSearcher holds on to index - optimization not one segment

2007-06-19 Thread Beard, Brian
We're using a MultiSearcher to search against multiple lucene indexes which runs inside of a web application in jboss 4.0.4. We're also using a standalone app running in a different jboss server which gets periodic updates from an oracle database and updates the lucene index. Both the searcher a
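The replies above are truncated, so the exact fix is lost; a common pattern from this era (an assumption, not necessarily the suggestion from the thread) is to swap in freshly opened searchers after each update and close the old ones so deleted segment files get released. The paths and the current field are illustrative:

    IndexSearcher freshA = new IndexSearcher("/path/to/indexA");
    IndexSearcher freshB = new IndexSearcher("/path/to/indexB");
    MultiSearcher fresh = new MultiSearcher(new Searchable[] { freshA, freshB });
    MultiSearcher old = current;
    current = fresh;    // searches started after this point use the new readers
    old.close();        // releases file handles on segments the indexer has deleted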

index integrity detection in lucene 2.1.0?

2007-05-16 Thread Beard, Brian
I noticed some previous discussion about index integrity detection classes that were around in version 1.4 (NoOpDirectory or NullDirectory). Does anyone know if these are in the 2.1.0 release? I didn't see them in 2.1.0 or the contrib folders. Brian Beard