[ https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498799#comment-14498799 ]

ASF subversion and git services commented on LUCENE-6400:
---------------------------------------------------------

Commit 1674159 from [~mikemccand] in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1674159 ]

LUCENE-6400: preserve original token when possible in SolrSynonymParser

> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
>                 Key: LUCENE-6400
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6400
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Robert Muir
>             Fix For: Trunk, 5.2
>
>         Attachments: LUCENE-6400.patch, LUCENE-6400.patch, LUCENE-6400.patch, 
> LUCENE-6400.patch, PositionLenghtAndType-unittests.patch, 
> unittests-expand-and-parse.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true' 
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives buggy output (SynonymFilter sees every rule as a replacement, so 
> all the terms get type SYNONYM, positionLength isn't supported, etc.) 
> and it wastes space in the FST (includeOrig is just one bit). 
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM because 
> spider and man got deleted and totally replaced by new terms (which happen 
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
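
The difference between the two encodings can be sketched in plain Java. This is only an illustration, not the parser's code: Rule, encodeAsReplacement and encodeKeepingOriginal are hypothetical names, and the second encoding (other terms only, includeOrig=true) is an assumption about what "preserve original token" amounts to, not something taken from the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of one synonym rule; this is not a Lucene class.
final class Rule {
    final String input;
    final List<String> outputs;
    final boolean includeOrig;

    Rule(String input, List<String> outputs, boolean includeOrig) {
        this.input = input;
        this.outputs = outputs;
        this.includeOrig = includeOrig;
    }

    @Override
    public String toString() {
        return input + " -> " + String.join(", ", outputs)
                + " (includeOrig=" + includeOrig + ")";
    }
}

final class ExpandEncodingSketch {

    // Encoding described in the issue: every term is replaced by itself
    // followed by the other terms, with includeOrig=false.
    static List<Rule> encodeAsReplacement(List<String> terms) {
        List<Rule> rules = new ArrayList<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            outputs.add(input);
            for (String other : terms) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.add(new Rule(input, outputs, false));
        }
        return rules;
    }

    // Assumed alternative: keep the original token via the includeOrig bit
    // and store only the other terms as outputs.
    static List<Rule> encodeKeepingOriginal(List<String> terms) {
        List<Rule> rules = new ArrayList<>();
        for (String input : terms) {
            List<String> outputs = new ArrayList<>();
            for (String other : terms) {
                if (!other.equals(input)) {
                    outputs.add(other);
                }
            }
            rules.add(new Rule(input, outputs, true));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("A", "B", "C");
        for (Rule r : encodeAsReplacement(terms)) {
            System.out.println(r);
        }
        for (Rule r : encodeKeepingOriginal(terms)) {
            System.out.println(r);
        }
    }
}
```

In this model the second form stores one fewer output per rule, and the original token is never deleted and re-added, which is why it could keep its own type.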



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

