Re: FieldCache Question

2009-02-04 Thread Mark Miller
Todd Benge wrote: Hi, I've been looking into the FieldCache API because of memory problems we've been seeing in our production environment. We use various different sorts so over time the cache builds up and servers stop responding. I decided to apply the patch for JIRA 831: https://issues.ap

Re: FieldCache Question

2009-02-04 Thread Mark Miller
Todd Benge wrote: The intent is to reduce the amount of memory that is held in cache. As it is now, it looks like there is an array of comparators for each index reader. Most of the data in the array appears to be the same for each cache so there is duplication for each type ( string, float).

Re: Fragment Highlighter Phrase?

2009-02-14 Thread Mark Miller
se it as a base class for my own. Do you have a simple example on how, in Java, to use the SpanScorer to get a highlighter to return only fragments that are part of the phrase in the Query? Ian On Mon, Dec 8, 2008 at 8:28 AM, Mark Miller wrote: Ian Vink wrote: Is there a way to get

Re: search(Query query, HitCollector results)

2009-02-15 Thread Mark Miller
spr...@gmx.eu wrote: Hi, in what order does search(Query query, HitCollector results) return the results? By relevance? Thank you. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-

Re: search(Query query, HitCollector results)

2009-02-15 Thread Mark Miller
So HitCollector#collect(int doc, float score) is not called in a special (default) order and must order the docs itself by score if one needs the hits sorted by relevance? Presumably there is no score ordering to the hit id's lucene delivers to a HitCollector? i.e. they are delivered in th
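
If hits do not reach the collector in relevance order, as the question above supposes, a collector that needs them sorted by score has to do the ordering itself. Below is a minimal sketch in plain Java, not Lucene code: TopScoreBuffer and Hit are hypothetical stand-ins that only mirror the collect(int doc, float score) shape. It keeps the best N hits in a min-heap and sorts them once at the end.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical stand-in for a collector; it does not extend any Lucene class.
public class TopScoreBuffer {
    public static final class Hit {
        public final int doc;
        public final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int maxHits;
    // Min-heap on score: the weakest retained hit sits at the head.
    private final PriorityQueue<Hit> heap;

    public TopScoreBuffer(int maxHits) {
        this.maxHits = maxHits;
        this.heap = new PriorityQueue<>(Math.max(1, maxHits),
                Comparator.comparingDouble((Hit h) -> h.score));
    }

    // Called once per matching document, in whatever order the search delivers them.
    public void collect(int doc, float score) {
        if (heap.size() < maxHits) {
            heap.add(new Hit(doc, score));
        } else if (maxHits > 0 && score > heap.peek().score) {
            heap.poll();                       // drop the current weakest hit
            heap.add(new Hit(doc, score));
        }
    }

    // Returns the retained hits, best score first.
    public List<Hit> sortedByScore() {
        List<Hit> hits = new ArrayList<>(heap);
        hits.sort((a, b) -> Float.compare(b.score, a.score));
        return hits;
    }

    public static void main(String[] args) {
        TopScoreBuffer buffer = new TopScoreBuffer(2);
        float[] scores = {0.3f, 0.9f, 0.1f, 0.5f};   // made-up scores, fed in doc id order
        for (int doc = 0; doc < scores.length; doc++) {
            buffer.collect(doc, scores[doc]);
        }
        for (Hit h : buffer.sortedByScore()) {
            System.out.println("doc=" + h.doc + " score=" + h.score);
        }
    }
}
```

Bounding the heap keeps memory proportional to the number of hits wanted rather than to the number of matches.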

Re: search(Query query, HitCollector results)

2009-02-15 Thread Mark Miller
Michael McCandless wrote: Mark Miller wrote: So HitCollector#collect(int doc, float score) is not called in a special (default) order and must order the docs itself by score if one needs the hits sorted by relevance? Presumably there is no score ordering to the hit id's l

Re: Upper limit on number of Fields

2009-02-15 Thread Mark Miller
In my experience, the main issue to be concerned about with tons of fields is norms. You'll likely have to turn them off for most of the fields unless you have plenty of RAM to burn. They are stored in byte arrays of size maxdoc for each field (eg non sparse). Other than that, I don't think the
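
The norms cost described above is easy to estimate: one byte per document per field that has norms enabled, regardless of how many documents actually contain the field. The following back-of-the-envelope sketch is plain Java; the class name and the field and document counts are made up for illustration.

```java
// Rough estimate only: ignores JVM array headers and any other per-field overhead.
public class NormsMemoryEstimate {

    // Non-sparse norms: one byte for every document, for every field with norms.
    public static long estimateBytes(long fieldsWithNorms, long maxDoc) {
        return fieldsWithNorms * maxDoc;
    }

    public static void main(String[] args) {
        long maxDoc = 3_000_000L;   // hypothetical index size
        long fields = 500L;         // hypothetical number of fields with norms enabled
        long bytes = estimateBytes(fields, maxDoc);
        System.out.printf("%d fields x %d docs = %.1f MB of norms%n",
                fields, maxDoc, bytes / (1024.0 * 1024.0));
    }
}
```

Because the arrays are not sparse, a field that appears in only a handful of documents costs as much as one that appears in all of them, which is why turning norms off for most fields is the usual remedy.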

Re: Fragment Highlighter Phrase?

2009-02-16 Thread Mark Miller
Ian Vink wrote: Thanks Mark, I got the latest Contrib bits for Highlighter.net (Jan 28/2008 Version 2.3.2) but it looks similar to the older 2.0.0 There is a QueryScorer only. Any ideas? (Really important to me :) Ian I'll send out an email and see if I can get my hands on the C# port a

Re: index large size file

2009-03-10 Thread Mark Miller
Amy Zhou wrote: Hi, I'm having a couple of questions about indexing large size file. As my understanding, the default MaxFieldLength 100,000. In Lucene 2.4, we can set the MaxFieldLength during constructor. My questions are: The default is 10,000. 1) How's the performance if MaxFieldLengt

Re: A model for predicting indexing memory costs?

2009-03-11 Thread Mark Miller
Michael McCandless wrote: Ie, it's still not clear if you are running out of memory vs hitting some weird "it's too hard for GC to deal" kind of massive heap fragmentation situation or something. It reminds me of the special ("I cannot be played on record player X") record (your application)

Re: Lucene 2.9

2009-03-11 Thread Mark Miller
Hmmm - you can probably get qsol to do it: http://myhardshadow.com/qsol. I think you can setup any token to expand to anything with a regex matcher and use group capturing in the replacement (I don't fully remember though, been a while since I've used it). So you could do a regex of something

Re: Search using MultiSearcher generates OOM on a 1GB total Partitioned indeces

2009-04-02 Thread Mark Miller
You might try a constant score wildcard query (similar to a filter) - I think you'd have to grab it from solr's codebase until 2.9 comes out though. No clause limit, and reportedly *much* faster on large indexes. -- - Mark http://www.lucidimagination.com Lebiram wrote: Hi Erick The query

Re: Speed of fuzzy searches

2009-04-02 Thread Mark Miller
Matt Schraeder wrote: I've got a simple Lucene index and search built for testing purposes. So far everything seems great. Most searches take 0.02 seconds or less. Searches with 4-5 terms take 0.25 seconds or less. However, once I began playing with fuzzy searches everything seemed to really sl

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
When I did some profiling I saw that the slow down came from tons of extra seeks (single segment vs multisegment). What was happening was, the first couple segments would have thousands of terms for the field, but as the segments logarithmically shrank in size, the number of terms for the segme

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Michael McCandless wrote: On Fri, Apr 10, 2009 at 2:32 PM, Mark Miller wrote: I had thought we would also see the advantage with multi-term queries - you rewrite against each segment and avoid extra seeks (though not nearly as many as when enumerating every term). As Mike pointed out to me

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Michael McCandless wrote: which is why I'm baffled that Raf didn't see a speedup on upgrading. Mike Another point is that he may not have such a nasty set of segments - Raf says he has 24 indexes, which sounds like he may not have the logarithmic sizing you normally see. If you have somewh

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Mark Miller wrote: Michael McCandless wrote: which is why I'm baffled that Raf didn't see a speedup on upgrading. Mike Another point is that he may not have such a nasty set of segments - Raf says he has 24 indexes, which sounds like he may not have the logarithmic sizing yo

Re: RangeFilter performance problem using MultiReader

2009-04-10 Thread Mark Miller
Raf wrote: We have more or less 3M documents in 24 indexes and we read all of them using a MultiReader. Is this a multireader containing multireaders? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail

Is anybody using setNorm in Production?

2009-04-19 Thread Mark Miller
Just a curiosity poll. This is a question on the java-dev list that came up. Anyone taking advantage of setNorm out there? Care to share how/why? -- - Mark http://www.lucidimagination.com - To unsubscribe, e-mail: java-user

Re: Getting matched words for PhraseQuery or SpanNearQuery

2009-04-28 Thread Mark Miller
The Span Highlighter gets positions by attempting to convert a standard Lucene Query to an approximate SpanQuery, and then calling getSpans on the span query to find start and end positions (getSpans is called against a fast single document MemoryIndex). You might check out WeightedSpanTermExtractor

Re: RegexQuery Incomplete Results

2009-05-12 Thread Mark Miller
Use JavaUtilRegexCapabilities or put the Jakarta RegEx jar on your classpath: http://jakarta.apache.org/regexp/index.html -- - Mark http://www.lucidimagination.com Seid Mohammed wrote: I need it similar functionality, but while running the above code it breaks after outputing the following

Re: Apache Lucene Crawler search

2009-05-27 Thread Mark Miller
Lucene is more like a search utility library than a full blown Search Engine like FAST. The Lucene sub project, Solr is more comparable to FAST, but Solr does not have a built in crawler available either (though its easy enough to do basic crawls). There are many open source crawlers you could

Re: Phrase Highlighting

2009-06-03 Thread Mark Miller
Max Lynch wrote: Well what happens is if I use a SpanScorer instead, and allocate it like such: analyzer = StandardAnalyzer([]) tokenStream = analyzer.tokenStream("contents", lucene.StringReader(text)) ctokenStream = lucene.CachingTokenFilter(tokenStre

Re: Phrase Highlighting

2009-06-04 Thread Mark Miller
doing all the work to properly locate the full span for the phrase (I think?), so it's a shame that there's no way for it to "communicate" this information to the formatter. The strong decoupling of fragmenting from highlighting is hurting us here... Mike On Wed, Jun 3,

Lucene 2.9 Release

2009-06-10 Thread Mark Miller
So... how about we try to wrap up 2.9/3.0 and ship with what we have, now? It's been 8 months since 2.4.0 was released, and 2.9's got plenty of new stuff, and we are all itching to remove these deprecated APIs, switch to Java 1.5, etc. We should try to finish the issues that are open and under

Re: Lucene 2.9 Release

2009-06-11 Thread Mark Miller
rough, I'm going to concentrate on pushing back on those issues that have yet to find an assignee. Please assign yourself if you plan on fishing an unassigned issue off for 2.9. Ill wait a few days at least. - Mark Mark Miller wrote: So... how about we try to wrap up 2.9/3.0 and ship wi

Re: wheres the word

2009-06-24 Thread Mark Miller
Timon Roth wrote: hello list im figgering about the following problem. in my index i cant find the word BE, but it exists in two documents. im usinglucene 2.4 with the standardanalyzer. other querys with words like de, et or de la works good. any ideas? gruess, timon be is a stopword. Do

Re: Lucene 2.9

2009-06-30 Thread Mark Miller
I hope July. Could easily be August though. I'm kicking and screaming to get it out soon though. Its been hurting my high brow reputation. On Tue, Jun 30, 2009 at 2:41 PM, Siraj Haider wrote: > is there an ETA for Lucene 2.9 release? > > -siraj > > ---

Re: Order of fields within a Document in Lucene 2.4+

2009-06-30 Thread Mark Miller
Yeah, I've heard rumblings about this issue before. I can't remember what patch changed it though - one of Mike M's I think? On Tue, Jun 30, 2009 at 8:40 PM, Chris Hostetter wrote: > > Hmmm... i'm not an expert on the internals of indexing, and i don't use > FieldSelectors much, but this seems li

Re: Modifying score based on tf and slop

2009-07-06 Thread Mark Miller
tf() is used, just not with the term freq - the length of the matching Spans is used instead. The terms from nested Spans will still affect the score (you still get IDF), but term freq is substituted with matching Span length. Also, boosts of nested Spans are ignored - only the top level boos

Re: CompareBottom and setBottom in TopFieldCollector and FieldComparator

2009-07-10 Thread Mark Miller
There are a lot of calls to compare that only compare to the bottom (think of the common case when the queue fills quickly). Set and compare bottom cache that value. So you can pre cache the bottom ord and save derefing into the array. It could just as easily be a call to compare, but it would be s
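
The caching described above can be sketched as follows. This is plain Java, not Lucene's FieldComparator: IntSlotComparator is a hypothetical stand-in, and docValues plays the role of the per-document values a field cache would supply. setBottom looks up the weakest queue entry once, so each compareBottom call only has to read the candidate document's value.

```java
// Hypothetical stand-in for a slot-based sort comparator over int values.
public class IntSlotComparator {
    private final int[] slotValues;   // values currently held in the queue, one per slot
    private final int[] docValues;    // per-document sort values
    private int bottom;               // cached value of the weakest queue entry

    public IntSlotComparator(int numSlots, int[] docValues) {
        this.slotValues = new int[numSlots];
        this.docValues = docValues;
    }

    // Store the sort value of a competitive document in a queue slot.
    public void copy(int slot, int doc) {
        slotValues[slot] = docValues[doc];
    }

    // General comparison: two array lookups per call.
    public int compare(int slot1, int slot2) {
        return Integer.compare(slotValues[slot1], slotValues[slot2]);
    }

    // Called when the bottom of the queue changes: cache its value once.
    public void setBottom(int slot) {
        bottom = slotValues[slot];
    }

    // Hot path: compare the cached bottom against a candidate document.
    public int compareBottom(int doc) {
        return Integer.compare(bottom, docValues[doc]);
    }

    public static void main(String[] args) {
        int[] values = {5, 2, 9};                 // made-up per-document values
        IntSlotComparator cmp = new IntSlotComparator(2, values);
        cmp.copy(0, 0);
        cmp.copy(1, 2);
        cmp.setBottom(1);                         // slot 1 (value 9) is the current bottom
        System.out.println(cmp.compareBottom(1)); // positive: bottom is larger than doc 1's value
    }
}
```

Once the queue is full, most candidates are rejected after a single comparison against the bottom, so that is the call worth making cheap.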

Re: SpanScorer problem?

2009-07-17 Thread Mark Miller
Thanks Koji - I just made a patch for the fix if you want to pop open a JIRA issue. Two query types were making their own terms map and passing them to extract, rather than using the top level term map - but extract would use the term map to see if it saw the term before. The result was, for the t

Re: Sorting field containing NULL values consumes field cache memory

2009-07-20 Thread Mark Miller
Right now, you can't really do anything about it. In the future, with the new FieldCache API that may go in, you could plug in a custom implementation that makes tradeoffs for a sparse array of some kind. The docid is currently the index into the array, but with a custom impl you may be able to use

Re: Multiline Regex with Lucene

2009-07-29 Thread Mark Miller
>>I came across qsol where in the paragraphseperator and sentence seperator >>can be specified and string can be searched within the paragraph. Qsol does this by using SpanQuerys. First you inject special marker tokens as your paragraph/sentence markers, then you use a SpanNotQuery that looks for a

Re: score from spans

2009-08-10 Thread Mark Miller
Hey Eran, I've started work on this in the past - you are right, it gets complicated quick! Its also likely to bring with it a sizable performance cost. We already have an issue in JIRA for this that is quite old: https://issues.apache.org/jira/browse/LUCENE-533 If you get any work going,

Re: IndexSearcher.search Behavior

2009-08-17 Thread Mark Miller
Unfortunately, many Query's toString output is not actually parsable by QueryParser (though some are). If you look at the result Query object that gets built from the toString output, its likely different than the BooleanQuery you are putting together. -- - Mark http://www.lucidimagination.

Re: Lucene-Core test failures

2009-08-18 Thread Mark Miller
Bryan Swift wrote: I was running the tests which lucene-core version 2.4 and I noticed a failure in org.apache.lucene.index.TestIndexInput for testRead at line 89. The assertions in question have to do with reading "Modified UTF-8 null bytes" according to the comments in the file. It seems thes

[ANNOUNCEMENT] LucidGaze for Lucene released

2009-08-24 Thread Mark Miller
Hey all, Just wanted to alert you to a new free offering we just released. Combine Lucene with a little aspect programming and you can do some pretty cool things :) Announce: Lucid Imagination has released LucidGaze for Lucene, a performance monitoring and analysis utility for Open Source Apache

Lucene release 2.9

2009-08-25 Thread Mark Miller
Hello all Lucene users, I just wanted to let you in on the current release schedule for Lucene 2.9 (still subject to change): Currently, we plan to go into official feature freeze tomorrow (Wednesday, August 26 2009). That means we will try and keep the 2.9 code as stable as possible, only commit

Re: Lucene release 2.9

2009-08-26 Thread Mark Miller
Looks like we are going to pull back a day and start the freeze sometime tomorrow (Thursday, August 27 2009). There is still a lot of documentation to catch up - wouldn't make sense to have anyone look over what we know is still wrong. Thanks, -- - Mark http://www.lucidimagination.com

Lucene 2.9 RC1 now available for testing

2009-08-27 Thread Mark Miller
er/staging-area/lucene2.9changes/contrib/CHANGES.txt Download release candidate 1 here: http://people.apache.org/~markrmiller/staging-area/lucene2.9rc1/ Be sure to report back with any issues you find! Thanks, Mark Miller -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with

Re: Lucene 2.9 RC1 now available for testing

2009-08-28 Thread Mark Miller
ase... > > I have one issue so far - I cannot find the contrib/analyzers jars, > only the sources are present. > > Bogdan > > On Fri, Aug 28, 2009 at 1:17 AM, Mark Miller wrote: > Hello Lucene users, > > On behalf of the Lucene dev community (a growing community far lar

Re: Lucene 2.9 RC1 now available for testing

2009-08-28 Thread Mark Miller
. -- - Mark http://www.lucidimagination.com Mark Miller wrote: > Apologies - you are correct - contrib/analyzers is in src but not the > jar distrib. I will address whatever is up with the build process and > put up another RC when apache servers are back up. > > Thanks for po

Lucene 2.9 RC2 now available for testing

2009-08-28 Thread Mark Miller
http://people.apache.org/~markrmiller/staging-area/lucene2.9rc2/ Be sure to report back with any issues you find! Thanks, Mark Miller -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkqYKcIACgkQ0DU3IV

Re: Lucene 2.9 RC2 now available for testing

2009-08-28 Thread Mark Miller
Mark Miller wrote: > > Download release candidate 1 here: > http://people.apache.org/~markrmiller/staging-area/lucene2.9rc2/ > In case anyone catches - yes that is a cut and paste typo - should read release candidate 2 (obvious, but just to cross my t's).

Re: JVM bug?

2009-08-28 Thread Mark Miller
Could be this issue in Lucene https://issues.apache.org/jira/browse/LUCENE-1342 ? -- - Mark http://www.lucidimagination.com Jason Rutherglen wrote: > While indexing with the latest nightly build of Solr on Amazon EC2 the > following JVM bug has occurred twice on two different servers. > > Pos

Re: Unintelligent implementation of IndexWriter locking?

2009-08-30 Thread Mark Miller
Jan Peter Stotz wrote: > Mark Miller wrote: > > >> Have you tried using a native lock factory? >> > > No - I did not even know of it's existence as it is nowhere "visible" from > the IndexWriter class (not directly used and nowhere mentioned

Re: Unintelligent implementation of IndexWriter locking?

2009-08-30 Thread Mark Miller
Jan Peter Stotz wrote: > Hi Lucene users, > > at the moment I have some problems with the locking mechanism of > IndexWriter. Some times my application quits/terminates before I can close > the IndexWriter. Then the "write.lock" file remains and prohibits every > write access to my index. Of course

Re: New "Stream closed" exception with Java 6

2009-09-08 Thread Mark Miller
Chris Hostetter wrote: > : I'm coming to the same conclusion - there must be >1 threads accessing this > index at the same time. Better go figure it out ... :-) > > careful about your assumptions ... you could get this same type of > exception even with only one thread, the stream that's being

Re: Lucene 2.9 RC2 now available for testing

2009-09-09 Thread Mark Miller
How about the new score inorder/out of order stuff? It was an option before, but I think now it uses whats best by default? And pairs with the collector? I didn't follow any of that closely though. - Mark Peter Keegan wrote: > IndexSearcher.search is calling my custom scorer's 'next' and 'doc' me

Lucene 2.9 RC3 now available for testing

2009-09-09 Thread Mark Miller
cene2.9rc3/ Be sure to report back with any issues you find! Thanks, Mark Miller -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkqn3PYACgkQ0DU3IV7ywDnY9gCgrRhUaD3NoXbeSg8+VfqQH399 fDsAn1HFIUMSRfsiOyaiZK+

Re: NumberFormatException when creating field cache

2009-09-09 Thread Mark Miller
Antony Bowesman wrote: > I'm using Lucene 2.3.2 and have a date field used for sorting, which > is MMDDHHMM. I get an exception when the FieldCache is being > generated as follows: > > java.lang.NumberFormatException: For input string: "190400-412317" > java.lang.NumberFormatException.forInput

Lucene 2.9 RC4 now available for testing

2009-09-13 Thread Mark Miller
ople.apache.org/~markrmiller/staging-area/lucene2.9changes/CONTRIB-CHANGES.txt Download release candidate 4 here: http://people.apache.org/~markrmiller/staging-area/lucene2.9rc4/ Be sure to report back with any issues you find! Thanks, Mark Miller -BEGIN PGP SIGNATURE- Version: GnuPG v1

Re: Enumerating NumericField using TermEnum?

2009-09-13 Thread Mark Miller
>> NumericField uses a spezial encoding of terms for fast NumericRangeQueries. >> It indexes more than one term per value. How many terms depends on the >> precisionStep ctor parameter. If you set it to infinity (or something ge the >> bit size of your value, 32 for ints, it indexes exactly one va

Re: Lucene 2.9 RC4 now available for testing

2009-09-13 Thread Mark Miller
Mark Miller wrote: > Hello Lucene users, > > ... > > We let out a bug in the lock factory changes we made in RC3 - > making a new SimpleFSDirectory with a String param would throw > an illegal state exception - a fix for this is in RC4. My apologies - not S

Re: Enumerating NumericField using TermEnum?

2009-09-13 Thread Mark Miller
Uwe Schindler wrote: Maybe I add this to the javadocs. >> +1 - intuition might be to use it for anything numeric. >> > > If we do not need a new RC for that I can do it tomorrow! I am not yet sure > what to write: I tend to say: "Use NumericField, but with infinite > pre

Re: Enumerating NumericField using TermEnum?

2009-09-14 Thread Mark Miller
Uwe Schindler wrote: > >> My personal opinion is that we can make javadoc changes for the final >> without doing an RC, as long as no code/build/scripts at all is touched. >> Not sure how others feel though. >> > > I just wanted to ask for confirmation. > > Uwe > > I know - we always should

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
Hey Thomas - any chance you can do some quick profiling and grab the hotspots from the 3 configurations? Are your custom sorts doing anything tricky? -- - Mark http://www.lucidimagination.com Thomas Becker wrote: > Urm and uploaded here: > http://ankeschwarzer.de/tmp/graph.jpg > > Sorry. > >

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
Thomas Becker wrote: > Hey Mark, > > thanks for your reply. Will do. Results will follow in a couple of minutes. > > > Thanks, awesome. Also, how many segments (approx) are in your index? If there are a lot, have you/can you try the same tests on an optimized index? Don't want to get ahead of t

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
dating the index every 30 min. at the moment and it gets optimized > after > each update. > So this profiling is on an optimized index (eg a single segment) ? That would be odd indeed, and possibly point to some of the scoring changes. > > Mark Miller wrote: > >> Thomas

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
ys will come in with some ideas as well. Do confirm that those profiling results are on a single segment though. - Mark Mark Miller wrote: > Thomas Becker wrote: > >> Here's the results of profiling 10 different search requests: >> >> http://ankes

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
to release 2.9, and its been such a long haul, I'd hate to see a release with an unknown performance trap. -- - Mark http://www.lucidimagination.com > Thanks a lot for your support! > > Cheers, > Thomas > > Mark Miller wrote: > >> A few quick notes - >>

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
Can you run the following test on your RAMDISK? http://people.apache.org/~markrmiller/FileReadTest.java I've taken it from the following issue (in which NIOFSDirectory was developed): https://issues.apache.org/jira/browse/LUCENE-753 -- - Mark http://www.lucidimagination.com ---

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
en > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Mark Miller [mailto:markrmil...@gmail.com] >> Sent: Tuesday, September 15, 2009 5:30 PM >> To: java-user@lucene.apache.org >> Subject: Re: lucene 2.9.0RC4 sl

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
Mark Miller wrote: > Indeed - I just ran the FileReaderTest on a Linux tmpfs ramdisk - with > SeparateFile all 4 of my cores are immediately pinned and remain so. > With ChannelFile, all 4 cores hover 20-30%. > > It would appear it may not be a good idea to use NIOFSDirectory on ram

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
llee 63, D-28213 Bremen > http://www.thetaphi.de > eMail: u...@thetaphi.de > > >> -Original Message- >> From: Mark Miller [mailto:markrmil...@gmail.com] >> Sent: Tuesday, September 15, 2009 7:15 PM >> To: java-user@lucene.apache.org >> Subject: Re: lucene

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
you use the same test file? > > -Yonik > http://www.lucidimagination.com > > > > On Tue, Sep 15, 2009 at 2:18 PM, Mark Miller wrote: > >> The results: >> >> config: impl=SeparateFile serial=false nThreads=4 iterations=100 >> bufsize=1024 pool

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
ley > wrote: > >> It's been a while since I wrote that benchmarker... is it OK that the >> answer is different? Did you use the same test file? >> >> -Yonik >> http://www.lucidimagination.com >> >> >> >> On Tue, Sep 15, 2009 at

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
4 poolsize=2 filelen=730554368 answer=-282295361, ms=766340, MB/sec=381.3212767179059 Mark Miller wrote: > Michael McCandless wrote: > >> I don't like that the answer is different... but it's really really >> odd that it's different-yet-almost-the-same. >>

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
I'm jealous of your 4 3.0Ghz to my 2.0Ghz. I was on dynamic scaling frequency and switched to 2.0Ghz hard. On ramdisk, my puny 2.0's almost catch you and get a bit over 1800MB/s with SeparateFile. I'm smoked on PooledPread and ChannelPread though. Still sub 500 for both, even on the ramdisk. It

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
nchmarking... some >> things with IO cause the freq to drop, and when it's CPU bound again >> it takes a while for Linux to scale up the freq again. >> >> For example, on my ubuntu box, ChannelFile went from 100MB/sec to >> 388MB/sec. This effect probably won't

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-15 Thread Mark Miller
=SeparateFile serial=false nThreads=4 iterations=100 bufsize=1024 poolsize=2 filelen=164956707 answer=-31115729, ms=45691, MB/sec=1444.106778140115 Mark Miller wrote: > I'm jealous of your 4 3.0Ghz to my 2.0Ghz. > > I was on dynamic scaling frequency and switched to 2.0Ghz hard. >

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Mark Miller
With the new Collector API in Lucene 2.9, you no longer have to compute the score. Now a Collector is passed a Scorer if they want to use it, but you can just ignore it. -- - Mark http://www.lucidimagination.com Benjamin Pasero wrote: > Hi, > > I am using Lucene not only for smart fulltext s

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
tests now with SimpleFSDirectory and MMapDirectory. Both are > faster than NIOFS and the response times improved. But it's still slower than > 2.4. > > I'll do some profiling now again and let you know the results. > > Thanks again for all the great support to all who

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
//www.thetaphi.de >> eMail: u...@thetaphi.de >> >> >> >>> -Original Message- >>> From: Mark Miller [mailto:markrmil...@gmail.com] >>> Sent: Wednesday, September 16, 2009 6:23 PM >>> To: java-user@lucene.apache.org >>> Subject: Re

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
> and even worse: > http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png > > Have to verify that the last one is not by accident more than one request. > Will > do the run again and then post the required info. > > Mark Miller wrote: > >> bq. I'll do

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Ah - that explains a bit. Though if you divide by 2, the new one still appears to overcall each method in comparison to 2.4. - Mark Uwe Schindler wrote: >> http://ankeschwarzer.de/tmp/lucene_29_newapi_mmap_singlereq.png >> >> Have to verify that the last one is not by accident more than one reque

Re: What would be the fastest BooleanQuery possible?

2009-09-16 Thread Mark Miller
. >> >> Mike >> >> On Wed, Sep 16, 2009 at 9:14 AM, Benjamin Pasero >> wrote: >> >>> Ah wow that sounds great. I am using 2.3.2 though (and have to use it >>> for now). Anything >>> in that version that could speed things up? >>&

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Something is very odd about this if they both cover the same search and the environ for both is identical. Even if one search was done twice, and we divide the numbers for the new api by 2 - its still *very* odd. With 2.4, ScorerDocQueue.topDoc is called half a million times. With 2.9, its called

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
Notice that while DisjunctionScorer.advance and DisjunctionScorer.advanceAfterCurrent appear to be called in 2.9, in 2.4, I am only seeing DisjunctionScorer.advanceAfterCurrent called. Can someone explain that? Mark Miller wrote: > Something is very odd about this if they both cover the s

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
times That just doesn't jibe. Mark Miller wrote: > Notice that while DisjunctionScorer.advance and > DisjunctionScorer.advanceAfterCurrent appear to be called > in 2.9, in 2.4, I am only seeing DisjunctionScorer.advanceAfterCurrent > called. > > Can someone explain

Re: lucene 2.9.0RC4 slower than 2.4.1?

2009-09-16 Thread Mark Miller
round in 2.4. This is part of the >> DocIdSetIterator changes. >> >> Anyway - either these are just not comparable runs, or there is a major >> bug (which seems unlikely). >> >> Just to keep pointing out the obvious: >> >> 2.4 cal

Lucene 2.9 RC5 now available for testing

2009-09-19 Thread Mark Miller
ucene2.9changes/CONTRIB-CHANGES.txt Download release candidate 5 here: http://people.apache.org/~markrmiller/staging-area/lucene2.9rc5/ Be sure to report back with any issues you find! Thanks, Mark Miller -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.9 (GNU/Linux) Comment: Using GnuPG with

Re: Getting Payload data from BooleanQuery results

2009-09-24 Thread Mark Miller
I should beef up that spans extractor - it can actually work on the constantscore multi term queries (the base ones that now have a constant score mode in 2.9), just like the Highlighter does. That class really belongs in contrib probably. You can use the filter and the spanquery to get the result

The Release of Lucene 2.9

2009-09-25 Thread Mark Miller
cene/ The Next Release: The next release will be Lucene 3.0. This should come along shortly, and will remove all of the deprecated code in Lucene 2.9. Lucene 3.0 will also be the first release to move from Java 1.4 to Java 1.5 as a requirement. Thanks, Mark Miller -BEGIN PGP SIGNATURE-

Re: PrefixQuery vs wildcardquery

2009-09-28 Thread Mark Miller
John Seer wrote: > Hello, > > Is there any benefit of using one or other for "start with query"? > > Which one is faster? > > > Regards > Prefix query is a bit more efficient - not sure what that amounts to in the real world, but prefix just checks if the terms start with the prefix - wildcard has a bi

Re: PrefixQuery vs wildcardquery

2009-09-28 Thread Mark Miller
Though in 2.9 this is not much of a concern - the multi term queries are smart - if it matches few enough terms it will rewrite to a constant score booleanquery - if it matches a lot of terms it will rewrite to a constantscore query - using a filter underneath. So maxclause issues should no

Re: TopDocCollector limits

2009-09-29 Thread Mark Miller
Max Lynch wrote: > Hi, > I am developing a search system that doesn't do pagination (searches are run > in the background and machine analyzed). However, TopDocCollector makes me > put a limit on how many results I want back. For my system, each result > found is important. How can I make it col

Re: TSDC, TopFieldCollector & co

2009-09-30 Thread Mark Miller
If you want relevance sorting (Sort.Score not Sort.Relevance right?), I'd think you want to use TopScoreDocCollector, not TopFieldCollector. The only reason to use relevance with TopFieldCollector is if you are doing a nth sort with a field sort as well. You don't really need to worry about th

Re: TopDocCollector limits

2009-09-30 Thread Mark Miller
the deprecated Hits class? > > On Tue, Sep 29, 2009 at 7:40 PM, Mark Miller wrote: > > >> Max Lynch wrote: >> >>> Hi, >>> I am developing a search system that doesn't do pagination (searches are >>> >> run >> >&g

Re: Implement SpanScorer on 2.9 lucene lib!

2009-09-30 Thread Mark Miller
Felipe Lobo wrote: > Hi, i updated my lucene lib to 2.9.0 and i'm trying to instanciate the > spanscorer but the constructor is protected. > I looked in the javadoc of lucene and saw 2 subclasses of it > (PayloadNearQuery.PayloadNearSpanScorer, > PayloadTermQuery.PayloadTermWeight.PayloadTermSpanSc

Re: Highlighting phrases in 2.9

2009-09-30 Thread Mark Miller
Scott Smith wrote: > I've been looking at the changes I have to make in my code to go from > 2.4.1 to 2.9. One of the features I have is to highlight query hits in > documents which meet the search criteria. If the query has a phrase, > then I need to highlight the phrase, but not isolated words

Re: Lucene 2.9 and performance of readers per segment.

2009-10-01 Thread Mark Miller
Per segment over many segments is actually a bit faster for non-sort cases and many sort cases - but an optimized index will still be fastest - the speed benefit of many segments comes when reopening - so say for realtime search - in that case you may want to sacrifice the optimized-index performance for a segment

Re: Implement SpanScorer on 2.9 lucene lib!

2009-10-01 Thread Mark Miller
e package as the QueryScorer, in the Highlighter contrib. > Thanks! > > On Wed, Sep 30, 2009 at 6:38 PM, Mark Miller wrote: > > >> Felipe Lobo wrote: >> >>> Hi, i updated my lucene lib to 2.9.0 and i'm trying to insta

Re: Implement SpanScorer on 2.9 lucene lib!

2009-10-01 Thread Mark Miller
" it don't. > Thanks a lot - I'll check it out and get back to you. > the name is realy TermQueryScorer or is QueryTermScorer(i found that in the > package)?? > Sorry! Thats what happens when I trust my memory ;) Its QueryTermScorer. > Thanks. > > > On Th

Re: Error using multireader searcher in Lucene 2.9

2009-10-02 Thread Mark Miller
Sorry Raf - technically you're not allowed to use internal Lucene ids that way. It happened to work in the past if you didn't use MultiSearcher, but it's not promised by the API, and no longer works as you'd expect in 2.9. You have to figure out another approach that doesn't use the internal ids (eg

Re: TimeLimitedCollector hang on, VM process doesn't die (TOMCAT)

2009-10-02 Thread Mark Miller
That thread will only be stopped if it's interrupted. So it would appear there is not a path that leads to it being interrupted ... why that is would be the next question ... -- - Mark http://www.lucidimagination.com Mani EZZAT wrote: > Hello everyone. > I'm using solrJ for an application de

Re: TimeLimitedCollector hang on, VM process doesn't die (TOMCAT)

2009-10-02 Thread Mark Miller
Mani EZZAT wrote: > Mark Miller wrote: >> That thread will only be stopped if it's interrupted. So it would appear >> there is not a path that leads to it being interrupted ... why that is >> would be the next question ... >> >> > I found someone (a japanes

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-05 Thread Mark Miller
I keep considering a full response to this, but I just can't get over the hump and spend the time writing something up. Figured someone else would get to it - perhaps they still will. I will make a comment here though: >Before Lucene 2.9, I don't think this made any difference, as (I think) the

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-07 Thread Mark Miller
y, and > closing the old one. We don't use IndexReader.reopen() because the updated > index is in a different directory (as opposed to being updated in-place). > > (Reading about some of the 2.9 changes motivated me to look into actually > using reopen(). And Michael Busch and Mark Mi
