[
https://issues.apache.org/jira/browse/SOLR-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13902824#comment-13902824
]
Greg Pendlebury commented on SOLR-5722:
---------------------------------------
I don't think it does. It has been a while since we looked into it, and that
link is currently returning 503 for me, but my understanding was that the
HyphenatedWordsFilter put two tokens back together when a hyphen was found on
the end of the first token. The catenateShingles option we are using addresses
the scenario where multiple hyphens are found internal to a single token.
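
To make the distinction concrete, here is a minimal sketch in plain Java. It is
not the code from the attached patch and does not use the Lucene/Solr APIs. The
class and method names (HyphenSketch, joinTrailingHyphens, catenatedShingles)
are hypothetical, and it assumes that a "shingle" here means the catenation of
any contiguous run of two or more hyphen-separated parts. The token set the
actual patch emits may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HyphenSketch {

    // Roughly the behaviour described for HyphenatedWordsFilter (an
    // illustration only): a token ending in a hyphen is glued onto the
    // token that follows it.
    static List<String> joinTrailingHyphens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String t : tokens) {
            if (t.endsWith("-")) {
                // Keep the text before the hyphen and wait for the next token.
                pending.append(t, 0, t.length() - 1);
            } else {
                out.add(pending.append(t).toString());
                pending.setLength(0);
            }
        }
        if (pending.length() > 0) {
            out.add(pending.toString()); // dangling fragment at end of input
        }
        return out;
    }

    // Assumed meaning of "catenated shingles": split one token on its internal
    // hyphens, then catenate every contiguous run of two or more parts.
    static List<String> catenatedShingles(String token) {
        List<String> parts = new ArrayList<>();
        for (String p : token.split("-")) {
            if (!p.isEmpty()) {
                parts.add(p);
            }
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < parts.size(); start++) {
            StringBuilder sb = new StringBuilder(parts.get(start));
            for (int end = start + 1; end < parts.size(); end++) {
                sb.append(parts.get(end));
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hyphen at the end of the first token: two tokens are rejoined.
        System.out.println(joinTrailingHyphens(Arrays.asList("hyphen-", "ated")));
        // Multiple hyphens internal to a single token: shingles are generated.
        System.out.println(catenatedShingles("hyphen-ated-surname"));
    }
}
```

Under those assumptions the first call yields [hyphenated], while the second
yields [hyphenated, hyphenatedsurname, atedsurname]. The second call includes
the intermediate combinations that catenateAll alone would not produce.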
> Add catenateShingles option to WordDelimiterFilter
> --------------------------------------------------
>
> Key: SOLR-5722
> URL: https://issues.apache.org/jira/browse/SOLR-5722
> Project: Solr
> Issue Type: Improvement
> Reporter: Greg Pendlebury
> Priority: Minor
> Labels: filter, newbie, patch
> Attachments: WDFconcatShingles.patch
>
>
> Apologies if I put this in the wrong spot. I'm attaching a patch (against
> current trunk) that adds support for a 'catenateShingles' option to the
> WordDelimiterFilter.
> We (National Library of Australia - NLA) are currently maintaining this as an
> internal modification to the Filter, but I believe it is generic enough to
> contribute upstream.
> Description:
> =========
> {code}
> /**
> * NLA Modification to the standard word delimiter to support various
> * hyphenation use cases. Primarily driven by requirements for
> * newspapers where words are often broken across line endings.
> *
> * e.g. "hyphenated-surname" is printed across a line ending and
> * turns out like "hyphen-ated-surname" or "hyphenated-sur-name".
> *
> * In this scenario the stock filter, with 'catenateAll' turned on, will
> * generate individual tokens plus one combined token, but not
> * sub-tokens like "hyphenated surname" and "hyphenatedsur name".
> *
> * So we add a new 'catenateShingles' to achieve this.
> */
> {code}
> Includes unit tests. As noted in one of them, CATENATE_WORDS and
> CATENATE_SHINGLES should be treated as mutually exclusive for sensible
> usage: enabling both can produce duplicate tokens (although the duplicates
> should have the same positions etc).
> I'm happy to work on it more if anyone finds problems with it.