John Berryman created SOLR-4930:
-----------------------------------

             Summary: Make PathHierarchyTokenizer use regex and optionally 
prefix the depth of the path.
                 Key: SOLR-4930
                 URL: https://issues.apache.org/jira/browse/SOLR-4930
             Project: Solr
          Issue Type: Improvement
          Components: Schema and Analysis
            Reporter: John Berryman
            Priority: Minor


The PathHierarchyTokenizer lacks a couple of features that I think are commonly 
needed.

1. Split and replace based upon regex.
2. Optionally prefix each token with its depth in the path.

Motivation: I recently had a client who asked me to index laws that were 
organized into chapters, sections, subsections, etc. The problem was that the 
section numbers used a mixture of delimiters (e.g. 13.4-64.2), so I had to use 
pattern replacement to map either delimiter to a tilde. But the next problem was 
that these could no longer be displayed as facets (at least not without extra 
code on the front end). I also wanted to prefix each token with the depth of 
its path. Again, I can achieve this with pattern replacement, but it 
is ugly and performs poorly.

I propose we:

* update PathHierarchyTokenizer so that if the delimiter or replacement 
parameter is a single character, the behavior of PathHierarchyTokenizer 
remains unchanged, but if the length of either argument is greater than one, 
it is interpreted as a regex.
* add a new parameter called depthPrefixNumChars that indicates how many 
characters will be used for a depth prefix - this defaults to zero
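
To make the two proposed rules concrete, here is a minimal sketch in plain 
Java. It is not the Lucene Tokenizer API and not the code linked below; the 
class RegexPathSketch and the method tokenize are hypothetical names. It also 
assumes details the proposal does not spell out: the depth is 1-based and 
zero-padded to depthPrefixNumChars, and the matched delimiter text is kept in 
each token (no replacement).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a standalone helper, not a Lucene/Solr class.
class RegexPathSketch {

    /**
     * Returns one token per path level, e.g. "a", "a/b", "a/b/c".
     * A single-character delimiter is matched literally; a longer one is
     * compiled as a regex. If depthPrefixNumChars > 0, each token is
     * prefixed with its zero-padded, 1-based depth (an assumption).
     */
    static List<String> tokenize(String path, String delimiter, int depthPrefixNumChars) {
        Pattern pattern = delimiter.length() == 1
                ? Pattern.compile(Pattern.quote(delimiter))
                : Pattern.compile(delimiter);

        // Collect the end offset of every path level.
        List<Integer> ends = new ArrayList<>();
        Matcher m = pattern.matcher(path);
        while (m.find()) {
            // Skip a leading delimiter and zero-length regex matches.
            if (m.start() > 0 && m.end() > m.start()) {
                ends.add(m.start());
            }
        }
        // The full path is the deepest level.
        if (!path.isEmpty()) {
            ends.add(path.length());
        }

        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < ends.size(); i++) {
            String prefix = depthPrefixNumChars > 0
                    ? String.format("%0" + depthPrefixNumChars + "d", i + 1)
                    : "";
            tokens.add(prefix + path.substring(0, ends.get(i)));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Regex delimiter, no depth prefix:
        // [13, 13.4, 13.4-64, 13.4-64.2]
        System.out.println(tokenize("13.4-64.2", "[.-]", 0));
        // Same input with a two-character depth prefix:
        // [0113, 0213.4, 0313.4-64, 0413.4-64.2]
        System.out.println(tokenize("13.4-64.2", "[.-]", 2));
    }
}
```

Note that a single "." is treated as a literal dot here, not as the regex 
"match any character", which is what keeps existing single-character 
configurations behaving as before.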

Here's my first stab at it:
https://github.com/o19s/statedecoded/blob/master/solr_home/statedecoded/src/src/main/java/com/o19s/RegexPathHierarchyTokenizer.java
This doesn't support the replacement or skip parameters yet. Before I go the 
rest of the way, I wanted to gauge interest and see if others need this.


