[ 
https://issues.apache.org/jira/browse/LUCENE-7423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ferenczi Jim updated LUCENE-7423:
---------------------------------
    Attachment: LUCENE-7423.patch

I fixed another bug in the prefixes creation. Most of the prefixes were missing 
in my last patch so the latest result shows a completely different trend. Sorry 
for the noise [~rcmuir] , you were right, the 2-5 edge ngram is competitive and 
it beats the autoprefix by a large margin.
Here are the raw results actualized with the latest patch:

{panel:title=AutoPrefix 
minPrefixTerms=2|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
{noformat}

Indexed 12600000: 60.369 sec
Final Indexed 12696047: 60.589 sec
Optimize...
After force merge: 87.115 sec
Close...
After close: 87.121 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=2jb0oyddk8jizhc2jin5e5vwf
  1 of 1: name=_j maxDoc=12696047
    version=7.0.0
    id=2jb0oyddk8jizhc2jin5e5vwe
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=300.525
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472114132102}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.172 sec]
    test: check live docs.....OK [took 0.002 sec]
    test: field infos.........OK [2 fields] [took 0.001 sec]
    test: field norms.........OK [0 fields] [took 0.001 sec]
    test: terms, freq, prox...OK [3928482 terms; 222231646 terms/docs pairs; 0 
tokens] [took 5.478 sec]
      field "field-autoprefix":
        index FST:
          401257 bytes
        terms:
          1414516 terms
          10333460 bytes (7.3 bytes/term)
        blocks:
          45642 blocks
          33484 terms-only blocks
          4 sub-block-only blocks
          12154 mixed blocks
          10382 floor blocks
          14302 non-floor blocks
          31340 floor sub-blocks
          6305776 term suffix bytes (138.2 suffix-bytes/block)
          1484556 term stats bytes (32.5 stats-bytes/block)
          3661060 other bytes (80.2 other-bytes/block)
          by prefix length:
             0: 6
             1: 366
             2: 3205
             3: 13054
             4: 17849
             5: 7613
             6: 2466
             7: 752
             8: 202
             9: 62
            10: 58
            11: 7
            13: 1
            14: 1
      
      field "field":
        index FST:
          699971 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294381 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8839046 other bytes (109.2 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] 
[took 0.304 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.002 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.001 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 1.1 MB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 1.1 MB
    |-- format 'AutoPrefix_0' 
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 1.1 MB
        |-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- field 'field-autoprefix' 
[BlockTreeTerms(terms=1414516,postings=187518426,positions=-1,docs=12682564)]: 
392 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 391.9 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
61.9 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
61.9 KB
        |-- doc base deltas: 30.5 KB
        |-- start pointer deltas: 29.1 KB

No problems were detected with this index.

Took 5.986 sec total.


Total index size: 315123344 bytes
{noformat}
{panel}
 
The index size is 301M which is bigger than the 225M of the 2-5 edge ngram 
index and it is not faster to build.
Bottom line is that this PostingsFormat could help only for cases with big 
prefixes that shares a lot of terms, since this is a corner case I'll close 
this issue and move the patch into my own repo. I'll leave the last patch on 
this issue in order to let people reiterate from the latest status if they have 
a use case that could be optimized with this format.
I ran a last test to see if selecting the prefixes based on the number of terms 
could help. I tried with different size, the following panel is the result with 
minAutoPrefixTerms=24:

{panel:title=AutoPrefix 
minPrefixTerms=24|borderStyle=dashed|borderColor=#ccc|titleBGColor=#F7D6C1|bgColor=#FFFFCE}
{noformat}

Final Indexed 12696047: 53.821 sec
Optimize...
After force merge: 75.871 sec
Close...
After close: 75.876 sec
Done CheckIndex:
Segments file=segments_1 numSegments=1 version=7.0.0 
id=4cf35tmfowkmzqs14df1f10te
  1 of 1: name=_j maxDoc=12696047
    version=7.0.0
    id=4cf35tmfowkmzqs14df1f10td
    codec=Lucene62
    compound=false
    numFiles=7
    size (MB)=231.003
    diagnostics = {os=Mac OS X, java.vendor=Oracle Corporation, 
java.version=1.8.0_77, java.vm.version=25.77-b03, lucene.version=7.0.0, 
mergeMaxNumSegments=1, os.arch=x86_64, java.runtime.version=1.8.0_77-b03, 
source=merge, mergeFactor=9, os.version=10.11.4, timestamp=1472114958183}
    no deletions
    test: open reader.........OK [took 0.002 sec]
    test: check integrity.....OK [took 0.165 sec]
    test: check live docs.....OK [took 0.000 sec]
    test: field infos.........OK [2 fields] [took 0.000 sec]
    test: field norms.........OK [0 fields] [took 0.000 sec]
    test: terms, freq, prox...OK [2564273 terms; 187695193 terms/docs pairs; 0 
tokens] [took 3.388 sec]
      field "field-autoprefix":
        index FST:
          13343 bytes
        terms:
          50307 terms
          223075 bytes (4.4 bytes/term)
        blocks:
          1626 blocks
          1250 terms-only blocks
          1 sub-block-only blocks
          375 mixed blocks
          355 floor blocks
          417 non-floor blocks
          1209 floor sub-blocks
          158541 term suffix bytes (97.5 suffix-bytes/block)
          87397 term stats bytes (53.7 stats-bytes/block)
          179907 other bytes (110.6 other-bytes/block)
          by prefix length:
             0: 6
             1: 189
             2: 773
             3: 595
             4: 49
             5: 10
             6: 2
             7: 1
             9: 1
      
      field "field":
        index FST:
          699971 bytes
        terms:
          2513966 terms
          20843092 bytes (8.3 bytes/term)
        blocks:
          80953 blocks
          59384 terms-only blocks
          10 sub-block-only blocks
          21559 mixed blocks
          18273 floor blocks
          25611 non-floor blocks
          55342 floor sub-blocks
          13294381 term suffix bytes (164.2 suffix-bytes/block)
          2538232 term stats bytes (31.4 stats-bytes/block)
          8839046 other bytes (109.2 other-bytes/block)
          by prefix length:
             0: 5
             1: 421
             2: 5620
             3: 18794
             4: 31598
             5: 16630
             6: 5322
             7: 1709
             8: 443
             9: 138
            10: 249
            11: 14
            12: 2
            13: 6
            14: 2
      
    test: stored fields.......OK [0 total field count; avg 0.0 fields per doc] 
[took 0.275 sec]
    test: term vectors........OK [0 total term vector count; avg 0.0 term/freq 
vector fields per doc] [took 0.000 sec]
    test: docvalues...........OK [0 docvalues fields; 0 BINARY; 0 NUMERIC; 0 
SORTED; 0 SORTED_NUMERIC; 0 SORTED_SET] [took 0.000 sec]
    test: points..............OK [0 fields, 0 points] [took 0.000 sec]

detailed segment RAM usage: 
_j(7.0.0):C12696047: 758.9 KB
|-- postings [PerFieldPostings(segment=_j formats=1)]: 697 KB
    |-- format 'AutoPrefix_0' 
[BlockTreeTermsReader(fields=2,delegate=Lucene50PostingsReader(positions=false,payloads=false))]:
 696.9 KB
        |-- field 'field' 
[BlockTreeTerms(terms=2513966,postings=34713220,positions=-1,docs=12682564)]: 
683.7 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 683.6 KB
        |-- field 'field-autoprefix' 
[BlockTreeTerms(terms=50307,postings=152981973,positions=-1,docs=12682435)]: 
13.2 KB
            |-- term index 
[FST(input=BYTE1,output=ByteSequenceOutputs,packed=false]: 13 KB
        |-- delegate [Lucene50PostingsReader(positions=false,payloads=false)]: 
32 bytes
|-- stored fields [CompressingStoredFieldsReader(mode=FAST,chunksize=16384)]: 
61.9 KB
    |-- stored field index [CompressingStoredFieldsIndexReader(blocks=97)]: 
61.9 KB
        |-- doc base deltas: 30.5 KB
        |-- start pointer deltas: 29.1 KB

No problems were detected with this index.

Took 3.852 sec total.


Total index size: 242224063 bytes
{noformat}
{panel}

The index size is 231M which is very close to the edge 2-5 index size and it's 
a little bit faster to build than the edge ngram index but the gain are 
negligible compared to the cost of maintenance of a PostingsFormat like this.

> AutoPrefixPostingsFormat: a PostingsFormat optimized for prefix queries on 
> text fields.
> ---------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7423
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7423
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/sandbox
>            Reporter: Ferenczi Jim
>            Priority: Minor
>         Attachments: LUCENE-7423.patch
>
>
> The autoprefix terms dict added in 
> https://issues.apache.org/jira/browse/LUCENE-5879 has been removed with 
> https://issues.apache.org/jira/browse/LUCENE-7317.
> The new points API is now used to do efficient range queries but the 
> replacement for prefix string queries is unclear. The edge ngrams could be 
> used instead but they have a lot of drawbacks and are hard to configure 
> correctly. The completion postings format is also a good replacement but it 
> requires to have a big FST in RAM and it cannot be intersected with other 
> fields. 
> This patch is a proposal for a new PostingsFormat optimized for prefix query 
> on string fields. It detects prefixes that match "enough" terms and writes 
> auto-prefix terms into their own virtual field.
>  At search time the virtual field is used to speed up prefix queries that 
> match "enough" terms.
> The auto-prefix terms are built in two pass:
> * The first pass builds a compact prefix tree. Since the terms enum is sorted 
> the prefixes are flushed on the fly depending on the input. For each prefix 
> we build its corresponding inverted lists using a DocIdSetBuilder. The first 
> pass visits each term of the field TermsEnum only once. When a prefix is 
> flushed from the prefix tree its inverted lists is dumped into a temporary 
> file for further use. This is necessary since the prefixes are not sorted 
> when they are removed from the tree. The selected auto prefixes are sorted at 
> the end of the first pass.
> * The second pass is a sorted scan of the prefixes and the temporary file is 
> used to read the corresponding inverted lists.
> The patch is just a POC and there are rooms for optimizations but the first 
> results are promising:
> I tested the patch with the geonames dataset. I indexed all the titles with 
> the KeywordAnalyzer and compared the index/merge time and the size of the 
> indices. 
> The edge ngram index (with a min edge ngram size of 2 and a max of 20) takes 
> 572M on disk and it took 130s to index and optimize the 11M titles. 
> The auto prefix index takes 287M on disk and took 70s to index and optimize 
> the same 11M titles. Among the 287M, only 170M are used for the auto prefix 
> fields and the rest is for the regular keyword field. All the auto prefixes 
> were generated for this test (at least 2 terms per auto-prefix).  
> The queries have similar performance since we are sure on both sides that one 
> inverted list can answer any prefix query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to