[jira] [Comment Edited] (LUCENE-4396) BooleanScorer should sometimes be used for MUST clauses

Da Huang (JIRA) Mon, 14 Jul 2014 06:08:20 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-4396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060615#comment-14060615 ]


Da Huang edited comment on LUCENE-4396 at 7/14/14 1:06 PM:
-----------------------------------------------------------

I have run tests with different values of SIZE for the bucketTable.
The file 'SIZE.perf' contains the raw test results.

'stat.cpp' is a C++ program that computes statistics from *.perf files.
You can compile it with 'g++ stat.cpp -std=c++0x -o stat'
and run it with './stat < SIZE.perf'

The statistics for SIZE.perf should be as follows.
{code}
                Task   size10    size11     size5     size6     size7     size8     size9
  HighAndSomeHighNot    -14.5       4.0       6.6      -3.0       5.2      10.0*      3.4
   HighAndSomeHighOr      2.4      10.9      17.3      17.4      12.9      18.3      21.3*
   HighAndSomeLowNot    -36.8     -37.3     -47.8     -47.8     -40.2     -42.2     -41.5
    HighAndSomeLowOr    -45.1     -46.4     -47.9     -46.2     -38.7     -39.7     -44.9
  HighAndTonsHighNot    162.4*    145.1     149.1     130.1     142.9     144.7     143.7
   HighAndTonsHighOr    154.8*    146.5     154.0     137.8     144.9     150.0     149.1
   HighAndTonsLowNot    -27.0     -17.4     -73.7     -49.6     -40.1     -28.6     -15.6
    HighAndTonsLowOr    -28.7     -14.3     -63.8     -44.8     -33.0     -24.4     -13.9
   LowAndSomeHighNot      3.0       0.2       4.5       6.2*      5.7       6.2*      4.7
    LowAndSomeHighOr      5.3       1.4       6.8*      6.7       7.7       5.8       6.6
    LowAndSomeLowNot     -6.3     -24.4       3.7*      0.8       1.7      -2.3      -4.0
     LowAndSomeLowOr    -10.3     -22.7       2.2*      2.0       1.7      -2.3      -8.8
   LowAndTonsHighNot     27.3*     21.4      22.5      21.5      21.0      23.8      26.5
    LowAndTonsHighOr     23.1      28.2      24.2      23.9      29.1*     27.5      28.2
    LowAndTonsLowNot     33.0      46.5      39.1      33.4      30.0      47.2*     44.3
     LowAndTonsLowOr     45.7*     34.6      29.9      36.8      45.3      40.9      38.1
{code}

sizeN means the bucketTable's size is 1 << N; for example, size7 means 1 << 7.
The character '*', which was added manually, marks the best value in each row.
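
To make the column labels concrete, here is a minimal sketch of how a label relates to a table size. The class and method names (BucketTableSize, slots, slotFor) are hypothetical and only for illustration. The masking step is an assumption: it supposes doc IDs are mapped to slots by keeping their low bits, which is the usual reason for picking a power-of-two size.

```java
public class BucketTableSize {
    // Number of slots for a given label, e.g. size7 -> 1 << 7 = 128.
    static int slots(int sizeBits) {
        return 1 << sizeBits;
    }

    // Assumed mapping of a doc ID to a slot: keep the low sizeBits bits.
    static int slotFor(int docId, int sizeBits) {
        return docId & (slots(sizeBits) - 1);
    }

    public static void main(String[] args) {
        // Print the slot count for every size that appears in the table.
        for (int bits = 5; bits <= 11; bits++) {
            System.out.println("size" + bits + " -> " + slots(bits) + " slots");
        }
    }
}
```

Under this assumption, a smaller table means more passes over the doc ID space but a smaller working set per pass, which may be part of why the results vary so much by task.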

It seems that we can get a better result on \*Some\* tasks if we combine size9 
with size5.
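
One possible reading of "combine" is to pick, per task, whichever of the two sizes scored higher. The sketch below does only that, using the size5 and size9 values copied from the table above for the \*Some\* tasks. The class and method names are hypothetical, and how a real scorer would choose the size at query time is not covered here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CombineSizes {
    // {size5, size9} values copied from the table for the *Some* tasks.
    static final Map<String, double[]> SOME_TASKS = new LinkedHashMap<>();
    static {
        SOME_TASKS.put("HighAndSomeHighNot", new double[] {6.6, 3.4});
        SOME_TASKS.put("HighAndSomeHighOr", new double[] {17.3, 21.3});
        SOME_TASKS.put("HighAndSomeLowNot", new double[] {-47.8, -41.5});
        SOME_TASKS.put("HighAndSomeLowOr", new double[] {-47.9, -44.9});
        SOME_TASKS.put("LowAndSomeHighNot", new double[] {4.5, 4.7});
        SOME_TASKS.put("LowAndSomeHighOr", new double[] {6.8, 6.6});
        SOME_TASKS.put("LowAndSomeLowNot", new double[] {3.7, -4.0});
        SOME_TASKS.put("LowAndSomeLowOr", new double[] {2.2, -8.8});
    }

    // Returns "size5" or "size9", whichever has the larger table value.
    static String better(String task) {
        double[] v = SOME_TASKS.get(task);
        return v[0] >= v[1] ? "size5" : "size9";
    }

    // Returns the larger of the two table values for the task.
    static double bestValue(String task) {
        double[] v = SOME_TASKS.get(task);
        return Math.max(v[0], v[1]);
    }

    public static void main(String[] args) {
        for (String task : SOME_TASKS.keySet()) {
            System.out.println(task + ": " + better(task) + " (" + bestValue(task) + ")");
        }
    }
}
```

With these numbers, size9 wins on the High\* tasks except HighAndSomeHighNot, and size5 wins on the Low\* tasks except LowAndSomeHighNot.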



> BooleanScorer should sometimes be used for MUST clauses
> -------------------------------------------------------
>
>                 Key: LUCENE-4396
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4396
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: And.tasks, AndOr.tasks, AndOr.tasks, LUCENE-4396.patch, 
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, 
> LUCENE-4396.patch, LUCENE-4396.patch, LUCENE-4396.patch, SIZE.perf, 
> luceneutil-score-equal.patch, luceneutil-score-equal.patch, stat.cpp
>
>
> Today we only use BooleanScorer if the query consists of SHOULD and MUST_NOT.
> If there is one or more MUST clauses we always use BooleanScorer2.
> But I suspect that unless the MUST clauses have very low hit count compared 
> to the other clauses, that BooleanScorer would perform better than 
> BooleanScorer2.  BooleanScorer still has some vestiges from when it used to 
> handle MUST so it shouldn't be hard to bring back this capability ... I think 
> the challenging part might be the heuristics on when to use which (likely we 
> would have to use firstDocID as proxy for total hit count).
> Likely we should also have BooleanScorer sometimes use .advance() on the subs 
> in this case, eg if suddenly the MUST clause skips 1000000 docs then you want 
> to .advance() all the SHOULD clauses.
> I won't have near term time to work on this so feel free to take it if you 
> are inspired!
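
The two ideas in the description (choosing a scorer by relative hit count, and advancing the SHOULD clauses across large gaps) could be sketched roughly as below. This is not Lucene code: the class, the enum, and both thresholds (RARE_FACTOR, ADVANCE_GAP) are hypothetical placeholders, and the hit estimates are assumed to come from some proxy such as the one the description mentions.

```java
public class ScorerChoice {
    enum Choice { BOOLEAN_SCORER, BOOLEAN_SCORER2 }

    // Hypothetical threshold: how many times rarer the MUST clauses must be
    // than the other clauses before BooleanScorer2 is preferred.
    static final long RARE_FACTOR = 10;

    // Hypothetical gap (in doc IDs) beyond which the SHOULD clauses are
    // advanced to the MUST clause's doc instead of being stepped one by one.
    static final int ADVANCE_GAP = 1 << 11;

    // Prefer BooleanScorer unless the MUST clauses are estimated to match
    // far fewer docs than the other clauses.
    static Choice choose(long mustHitEstimate, long otherHitEstimate) {
        return mustHitEstimate * RARE_FACTOR < otherHitEstimate
                ? Choice.BOOLEAN_SCORER2
                : Choice.BOOLEAN_SCORER;
    }

    // True when the MUST clause has skipped far enough ahead that calling
    // advance() on the SHOULD clauses is likely cheaper than stepping.
    static boolean shouldAdvance(int currentDoc, int nextMustDoc) {
        return nextMustDoc - currentDoc > ADVANCE_GAP;
    }

    public static void main(String[] args) {
        System.out.println(choose(100, 1_000_000));
        System.out.println(choose(500_000, 1_000_000));
        System.out.println(shouldAdvance(0, 1_000_000));
    }
}
```

Suitable values for both thresholds would have to come from benchmarks like the ones above; the constants here are only stand-ins.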


