John Berryman created SOLR-4930:
-----------------------------------
Summary: Make PathHierarchyTokenizer use regex and optionally
prefix the depth of the path.
Key: SOLR-4930
URL: https://issues.apache.org/jira/browse/SOLR-4930
Project: Solr
Issue Type: Improvement
Components: Schema and Analysis
Reporter: John Berryman
Priority: Minor
The PathHierarchyTokenizer lacks a couple of features that I think are commonly
needed.
1. Split and replace based upon regex.
2. Optionally prefix the token with the depth of the path token
Motivation: I recently had a client who asked me to index laws that were
organized in the chapters, sections, subsections, etc. The problem was that the
section number used a mixture of delimiters. Ex: 13.4-64.2, so I had to use
pattern replacement to map either delimiter to tilda. But the next problem was
that these could no longer be displayed as facets (at least not without extra
code on the front end). Also, I wanted to prefix the depth of the path at the
front of the token. Again, I can achieve this with pattern replacement - but it
is ugly and non-performant.
I propose we:
* update PathHierarchyTokenizer so that if the parameters for delimiter of
replacement are single character, then the behavior of PathHierarchyTokenizer
remains consistent, but if the length of these arguments is greater than one,
then they should be interpreted as regex.
* add a new parameter called depthPrefixNumChars that indicates how many
characters will be used for a depth prefix - this defaults to zero
Here's my current first stab at it:
https://github.com/o19s/statedecoded/blob/master/solr_home/statedecoded/src/src/main/java/com/o19s/RegexPathHierarchyTokenizer.java
This doesn't support the replacement or skip parameter yet. Before I go the
rest of the way, I wanted to gauge interest and see if others need this.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]