[
https://issues.apache.org/jira/browse/LUCENE-6400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14487436#comment-14487436
]
Robert Muir commented on LUCENE-6400:
-------------------------------------
Yeah, that problem is disappointing, but it is a difficult one, and it
definitely needs to be fixed. I get the impression from Mike (who is the
expert on it) that fixing it safely requires changes to the TokenStream API.
On the other hand, we should look at your tests and try to integrate the ones
for parsing that show we do the right thing. Maybe we can find or add an
assert method that just compares against the SynonymMap directly. Something
like assertEntryEquals(String word, boolean includeOrig, String... synonyms)
as a start, and build from there. It could verify synonyms.length against the
count from the header, includeOrig against the flag from the header, and then
the set of strings (an empty string means a hole).
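
A minimal sketch of what such an assert could look like, written against a
stand-in model rather than the real SynonymMap so that it runs on its own. The
class names, the Entry type and the map lookup are all hypothetical, and the
header layout (count in the upper bits, a low bit set when the original is not
kept) is an assumption about how the entry is packed, to be checked against
the builder code:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class SynonymEntryAssertSketch {

    /** Hypothetical decoded entry; stands in for what a test would read out of the SynonymMap. */
    static final class Entry {
        final int header;        // assumed layout: (count << 1) | (includeOrig ? 0 : 1)
        final String[] outputs;  // "" marks a hole

        Entry(boolean includeOrig, String... outputs) {
            this.header = (outputs.length << 1) | (includeOrig ? 0 : 1);
            this.outputs = outputs;
        }
    }

    /** Checks count and includeOrig from the header, then the outputs as a set. */
    static void assertEntryEquals(Map<String, Entry> map, String word,
                                  boolean includeOrig, String... synonyms) {
        Entry entry = map.get(word);
        if (entry == null) {
            throw new AssertionError("no entry for '" + word + "'");
        }
        int count = entry.header >>> 1;
        boolean keepOrig = (entry.header & 1) == 0;
        if (count != synonyms.length) {
            throw new AssertionError("count: expected " + synonyms.length + " but was " + count);
        }
        if (keepOrig != includeOrig) {
            throw new AssertionError("includeOrig: expected " + includeOrig + " but was " + keepOrig);
        }
        Set<String> expected = new HashSet<>(Arrays.asList(synonyms));
        Set<String> actual = new HashSet<>(Arrays.asList(entry.outputs));
        if (!expected.equals(actual)) {
            throw new AssertionError("outputs: expected " + expected + " but was " + actual);
        }
    }

    public static void main(String[] args) {
        Map<String, Entry> map = new HashMap<>();
        map.put("spiderman", new Entry(true, "spider man"));
        assertEntryEquals(map, "spiderman", true, "spider man");
    }
}
```

A real version would look the word up in the FST and decode the output bytes
instead of using a HashMap, but the three checks would stay the same.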
> SynonymParser should encode 'expand' correctly.
> -----------------------------------------------
>
> Key: LUCENE-6400
> URL: https://issues.apache.org/jira/browse/LUCENE-6400
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-6400.patch, PositionLenghtAndType-unittests.patch
>
>
> Today SolrSynonymParser encodes something like A, B, C with 'expand=true'
> like this:
> A -> A, B, C (includeOrig=false)
> B -> B, A, C (includeOrig=false)
> C -> C, A, B (includeOrig=false)
> This gives kinda buggy output (synfilter sees it all as replacements and
> gives all the terms type SYNONYM, positionLength isn't supported, etc.)
> and it wastes space in the FST (includeOrig is just one bit).
> Example with "spiderman, spider man" and analysis on 'spider man'
> Trunk:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=SYNONYM*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=1*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=SYNONYM*
> You can see this is confusing: all the words have type SYNONYM, because
> spider and man got deleted and totally replaced by new terms (which happen
> to have the same text).
> Patch:
> term=spider,startOffset=0,endOffset=6,positionIncrement=1,positionLength=1,*type=word*
> term=spiderman,startOffset=0,endOffset=10,positionIncrement=0,*positionLength=2*,type=SYNONYM
> term=man,startOffset=7,endOffset=10,positionIncrement=1,positionLength=1,*type=word*
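
To make the two encodings in the quoted description concrete, here is a small
self-contained sketch. It does not use the Lucene API: the class and method
names are hypothetical, and the second encoding (each term maps only to the
other terms, with includeOrig=true) is my reading of what encoding 'expand'
correctly would mean, not something taken from the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExpandEncodingSketch {

    /** Encoding described for trunk: term -> itself plus the others, includeOrig=false. */
    static Map<String, List<String>> asReplacements(List<String> group) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (String term : group) {
            List<String> outputs = new ArrayList<>();
            outputs.add(term); // the original is stored again as an output
            for (String other : group) {
                if (!other.equals(term)) {
                    outputs.add(other);
                }
            }
            entries.put(term, outputs);
        }
        return entries;
    }

    /** Assumed alternative: term -> only the others, with includeOrig=true carried by the header bit. */
    static Map<String, List<String>> withIncludeOrig(List<String> group) {
        Map<String, List<String>> entries = new LinkedHashMap<>();
        for (String term : group) {
            List<String> outputs = new ArrayList<>();
            for (String other : group) {
                if (!other.equals(term)) {
                    outputs.add(other);
                }
            }
            entries.put(term, outputs);
        }
        return entries;
    }

    /** Total number of output strings stored across all entries. */
    static int storedOutputs(Map<String, List<String>> entries) {
        int total = 0;
        for (List<String> outputs : entries.values()) {
            total += outputs.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> group = List.of("A", "B", "C");
        System.out.println(asReplacements(group));
        System.out.println(withIncludeOrig(group));
    }
}
```

For a group of n terms the first form stores n outputs per entry and the
second n - 1, which is the space the description says is wasted when the
original could be kept via the single includeOrig bit.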