> caching them (as OpenBitSet)

How do you handle stop words in phrase queries?

On Thu, Jul 16, 2009 at 11:30 AM, eks dev <eks...@yahoo.co.uk> wrote:
>
> Sure, if you have enough memory to do postings caching, with or without P4... 
> I see P4 as a generally faster postings format, with stopwords or not.
>
> I wouldn't blow up the Term dictionary, that just moves the problem to another place.
>
> What I am thinking of is quite simple, probably not the most elegant 
> solution, but I am almost sure it would work:
> - get the top N terms from the index, N depends on your available memory
> - create a Filter from each of them, stick them into ConstantScoreQuery, calculate 
> idf() and set boost() to this value, cache it
> - implement a QueryOptimizer that loops over all Terms in your Query and replaces 
> Terms with the cached ConstantScoreQuery
>
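The three steps above could be sketched roughly as follows. This is a JDK-only illustration so that it runs without Lucene on the classpath: `java.util.BitSet` stands in for OpenBitSet, and `CachedTerm`, `HotTermCache`, and the exact idf formula are hypothetical names and assumptions, not Lucene API. In real code the cached bits would presumably sit behind a Filter wrapped in a ConstantScoreQuery with its boost set to the idf value.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a cached ConstantScoreQuery: doc-id bits plus a fixed boost. */
final class CachedTerm {
    final BitSet docs;
    final float boost;
    CachedTerm(BitSet docs, float boost) { this.docs = docs; this.boost = boost; }
}

/** Hypothetical cache and "QueryOptimizer" for the top N most frequent terms. */
final class HotTermCache {
    private final Map<String, CachedTerm> cache = new HashMap<>();

    /** Assumed idf: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(int docFreq, int numDocs) {
        return (float) (1.0 + Math.log((double) numDocs / (docFreq + 1)));
    }

    /** Steps 1 and 2: materialize one hot term's postings as a bit set and fix its score. */
    void put(String term, int[] postings, int numDocs) {
        BitSet bits = new BitSet(numDocs);
        for (int doc : postings) bits.set(doc);
        cache.put(term, new CachedTerm(bits, idf(postings.length, numDocs)));
    }

    /** Step 3: split query terms into cache hits and terms left to the normal index path. */
    Map<String, CachedTerm> optimize(List<String> queryTerms, List<String> uncached) {
        Map<String, CachedTerm> hits = new HashMap<>();
        for (String t : queryTerms) {
            CachedTerm c = cache.get(t);
            if (c != null) hits.put(t, c); else uncached.add(t);
        }
        return hits;
    }
}
```

Terms that miss the cache are handed back untouched, so only the hot terms lose their tf() component.
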
> and voila, you have made search perfectly fast... but
>
> BAD:
> a) you reduce the quality of your score value, as there is no tf() component. But 
> for stop words, I am not sure that makes any significant difference. 
> Also, if you are lucky like me, you omitTf()... so no loss there
> b) if you load RAMIndex/MMap, you duplicate RAM needs for these postings...
>
> COOL:
> - Math on our index: the Zipfian distribution does magic, the top 30 terms make up 36% 
> of our corpus! For caching them (as OpenBitSet) on 100 million documents I need 
> ~0.35G
> My terms distribution follows the collection terms distribution ... so I get a 
> cache hit rate of 36% for only 0.38Gb RAM... You save a lot of VInt decoding 
> (which brings a lot, even if we ignore the benefit of reducing disk access... these hot 
> terms must be OS cached anyhow). If you use some other filter, you need 
> even less memory... it is only important to use a filter that is measurably 
> faster than VInt decoding with skip lists.
> - This speeds up the slowest queries, fast queries are fast anyhow :)
>
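The memory figure can be checked with back-of-the-envelope arithmetic, assuming one bit per document per cached term and ignoring object overhead. `PostingsCacheSizing` is a hypothetical helper written only for this check.

```java
/** Back-of-the-envelope sizing for caching hot terms as one bit per document. */
final class PostingsCacheSizing {
    /** Bytes needed for `terms` bit sets over `numDocs` documents (no object overhead). */
    static long bitSetBytes(long numDocs, int terms) {
        long bytesPerTerm = (numDocs + 7) / 8; // round up to whole bytes
        return bytesPerTerm * terms;
    }

    public static void main(String[] args) {
        long bytes = bitSetBytes(100_000_000L, 30);
        System.out.printf("%d bytes = %.3f GB = %.3f GiB%n",
                bytes, bytes / 1e9, bytes / (double) (1L << 30));
    }
}
```

For 30 terms over 100 million documents this gives 375,000,000 bytes, which is 0.375 GB in decimal units or roughly 0.35 GiB, so both figures quoted above fall in the expected range.
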
>
> I think it will work just fine
>
> Would be great if Lucene could do all this for me, I just say "here, I give 
> you 500Mb free for postings cache, do your magic for me"... but nothing 
> prevents me from providing a patch :)
>
> I will try it, to see if the theory works. We have cases where free memory is not 
> a problem, we are hitting CPU there (VInt decoding on our last profiled run). 
> To be honest, I do not know if anyone today runs high volume search from disk 
> (maybe SSD), even then, a significant portion has to be in RAM...
>
> One day we could throw many CPUs at Query... but this is not an easy one...
>
>
>
>
>
> ----- Original Message ----
>> From: Jason Rutherglen <jason.rutherg...@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, 16 July, 2009 19:22:28
>> Subject: Re: speed of BooleanQueries on 2.9
>>
>> Do we think that we'll be able to support indexing stop words
>> using PFOR (with relaxation on the compression to gain
>> performance?) Today it seems like the best approach to indexing
>> stop words is to use shingles? However this blows up the term
>> dict because shingles concatenates phrases together.
>>
>> On Thu, Jul 16, 2009 at 8:26 AM, eks dev wrote:
>> >
>> > We did it for us, gave something back to the community... all happy... open
>> > source works just fine here in Lucene land :)
>> >
>> > Re, 10%
>> > I did not expect that much, but our index is quite dense, a lot of
>> > documents and not too many unique terms, omitTf ... so it puts really hard
>> > pressure on DocIdSetIterator and Scorers.
>> >
>> > I cannot wait to see P4, pulsing index... in action...
>> > We are also going to try to cache postings for the top N high-frequency terms
>> > in plain old ConstantScoreQuery via OpenBitSet ... with a Zipfian distribution
>> > this should reduce VInt decoding to 50% with just a few hundred terms...
>> > having a TF-independent score, we just need to adjust the constant score
>> > value based on idf()... so no loss in quality! Expected huge performance
>> > benefit (said the optimist without numbers to prove it).
>> >
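For readers unfamiliar with what "VInt decoding" costs per posting, here is a simplified sketch. It assumes doc ids are stored as ascending deltas, each written as a variable-length int with 7 data bits per byte and the high bit marking that more bytes follow. Term frequencies and skip lists are not modeled, and `VIntPostings` is a hypothetical class, not Lucene code.

```java
import java.io.ByteArrayOutputStream;

/** Sketch of delta + VInt postings decoding, the per-document work a cached bit set avoids. */
final class VIntPostings {
    /** Writes a non-negative int as a VInt: 7 data bits per byte, high bit set if more follow. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Encodes ascending doc ids as VInt deltas. */
    static byte[] encode(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : docIds) {
            writeVInt(out, doc - prev);
            prev = doc;
        }
        return out.toByteArray();
    }

    /** Decodes `count` doc ids; the data-dependent inner loop runs for every posting. */
    static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            byte b = bytes[pos++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
            }
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```

A bit set answers "is this doc in the postings" with a single word lookup, whereas the loop above has to walk every byte of a frequent term's postings, which is the work the cache is meant to skip.
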
>> > Cheers, Eks
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Michael McCandless
>> >> To: java-user@lucene.apache.org
>> >> Sent: Thursday, 16 July, 2009 16:23:57
>> >> Subject: Re: speed of BooleanQueries on 2.9
>> >>
>> >> Super, thanks for testing!
>> >>
>> >> And, the 10% speedup overall is good progress...
>> >>
>> >> Mike
>> >>
>> >> On Thu, Jul 16, 2009 at 9:16 AM, eks dev wrote:
>> >> >
>> >> > and one final touch, the 4X slowdown does not exist with the new Lucene...
>> >> > I did not verify it again on the old one, but hey, who cares. Trunk is
>> >> > clean and, at least so far, our favourite QA team has nothing to complain
>> >> > about ...
>> >> >
>> >> > They will keep it under stress for a while... so if something comes up
>> >> > you will hear from me...
>> >> > Thanks again to all.
>> >> >
>> >> > Cheers, Eks
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message ----
>> >> >> From: eks dev
>> >> >> To: java-user@lucene.apache.org
>> >> >> Sent: Thursday, 16 July, 2009 14:40:26
>> >> >> Subject: Re: speed of BooleanQueries on 2.9
>> >> >>
>> >> >>
>> >> >> ok new facts, less chaos :)
>> >> >>
>> >> >> - LUCENE-1744 fixed it definitely; I have it confirmed
>> >> >> Also, we found another example of a Query that was stuck, (t1 t2 t3)~2 ...
>> >> >> this is also fixed with LUCENE-1744
>> >> >>
>> >> >>
>> >> >> Re:  "some queries are 4X slower than before".  Was that a different
>> >> >> issue? (Because this issue is "the query runs forever").
>> >> >>
>> >> >> Maybe :) I do not know.
>> >> >> When I wrote this email about "the query runs forever" I did not know
>> >> >> if this slowdown was the same or a different issue... I had just reported
>> >> >> an unusual observation (4 times slower) and was later convinced that this
>> >> >> stuck Query confirmed the same problem ....
>> >> >>
>> >> >> Now, I do not know if that was the same effect, or a wrong measurement,
>> >> >> or something else lurking ... Good point, will try to repeat the test on
>> >> >> this slowdown...
>> >> >>
>> >> >> Just a reminder: this 4_times_slower Query is different:
>> >> >> +(a b c) +(x y z)
>> >> >>
>> >> >> +((NAME:hans NAME:hahns^0.23232001 NAME:hams^0.27648002 NAME:hamz^0.25392
>> >> >> NAME:hanas^0.18722998 NAME:hanbs^0.18722998 NAME:hanfs^0.18722998
>> >> >> NAME:hangs^0.18722998 NAME:hanhs^0.24030754 NAME:hanis^0.18722998
>> >> >> NAME:hanjs^0.18722998 NAME:hanks^0.18722998 NAME:hanms^0.18722998
>> >> >> NAME:hanos^0.18722998 NAME:hanrs^0.18722998 NAME:hansb^0.20172001
>> >> >> NAME:hansd^0.20172001 NAME:hansf^0.20172001 NAME:hansg^0.20172001
>> >> >> NAME:hansi^0.20172001 NAME:hansj^0.20172001 NAME:hansk^0.20172001
>> >> >> NAME:hansl^0.20172001 NAME:hansn^0.20172001 NAME:hanso^0.20172001
>> >> >> NAME:hansp^0.20172001 NAME:hanst^0.20172001 NAME:hansu^0.20172001
>> >> >> NAME:hansw^0.20172001 NAME:hansy^0.20172001 NAME:hansz^0.20172001
>> >> >> NAME:hants^0.18722998 NAME:hanus^0.18722998 NAME:hanws^0.18722998
>> >> >> NAME:hehns^0.20172001 NAME:hens^0.2736075 NAME:hins^0.24843
>> >> >> NAME:hons^0.24843 NAME:huhns^0.1801875 NAME:huns^0.24843)^2.0)
>> >> >> +(((ZIPS:berlin ZIPS:barlin^0.28227 ZIPS:berien^0.25947002
>> >> >> ZIPS:berling^0.23232001 ZIPS:perlin^0.26133335))^1.2)
>> >> >>
>> >> >>
>> >> >> Thanks!
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> ----- Original Message ----
>> >> >> > From: Michael McCandless
>> >> >> > To: java-user@lucene.apache.org
>> >> >> > Sent: Thursday, 16 July, 2009 13:52:06
>> >> >> > Subject: Re: speed of BooleanQueries on 2.9
>> >> >> >
>> >> >> > On Thu, Jul 16, 2009 at 6:38 AM, eks dev wrote:
>> >> >> >
>> >> >> > > and this String has exactly that form
>> >> >> > > (x OR y OR z) OR (a OR b OR c),
>> >> >> > > That is exactly how I construct the Query, have a look at the brackets
>> >> >> > > on this toString result.
>> >> >> >
>> >> >> > Duh!  OK, I had missed that your large query actually had 2 clauses
>> >> >> > at the top!  Sigh.
>> >> >> >
>> >> >> > OK, that part of the puzzle now at least makes sense.  The rewrite()
>> >> >> > of your query will not reduce to a single OR query (as I previously
>> >> >> > thought).
>> >> >> >
>> >> >> > So in fact you have a BS at the top (because you called
>> >> >> > setAllowDocsOutOfOrder(true)), with 2 clauses, and each of those
>> >> >> > clauses uses BS2 to score.
>> >> >> >
>> >> >> > I think advance() is not involved, but LUCENE-1744 could very well
>> >> >> > have fixed this, because BS calls sub.scorer.docID() when interacting
>> >> >> > with its sub-scorers, and due to the bug tracked in LUCENE-1744, that
>> >> >> > would always return -1 from a BS2, so BS could enter an infinite loop.
>> >> >> >
>> >> >> > If you run w/o the fix for LUCENE-1744, with my instrumentation, I
>> >> >> > can confirm this.  But I think likely this is it.
>> >> >> >
>> >> >> > Also: you started this thread by saying "some queries are 4X slower
>> >> >> > than before".  Was that a different issue?  (Because this issue is
>> >> >> > "the query runs forever").
>> >> >> >
>> >> >> > Mike
>> >> >> >
>> >> >> > ---------------------------------------------------------------------
>> >> >> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> >> >> > For additional commands, e-mail: java-user-h...@lucene.apache.org