Re: Extending StandardTokenizer Jflex to not split on '/'

Steve Rowe Fri, 14 Feb 2014 12:57:37 -0800

Welcome Diego,

I think you’re right about MidLetter - adding a char to it should disable 
splitting on that char, as long as there is a letter on both sides of it.  
(If you’d like that behavior to be extended to numeric digits, you should use 
MidNumLet instead.)
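
To make the rule concrete, here is a rough Java sketch of the idea. It is not 
the JFlex grammar and not Lucene code: the class MidLetterSketch and its methods 
are made up for illustration, and it ignores most of UAX#29 (Extend/Format 
chars, MidNum, ExtendNumLet, Katakana, and so on). It only shows that a 
"mid" char is kept when the chars on both sides qualify.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified illustration; NOT the generated StandardTokenizerImpl.
class MidLetterSketch {

    // Splits text into runs of letters/digits. A char from midLetter is kept
    // inside a token only when letters sit on both sides of it; a char from
    // midNumLet is kept when letters sit on both sides or digits sit on both sides.
    static List<String> tokenize(String text, String midLetter, String midNumLet) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || joins(text, i, midLetter, midNumLet)) {
                current.append(c);
            } else if (current.length() > 0) {
                // Any other char ends the token in progress.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // True when the char at index i should glue its neighbours together.
    static boolean joins(String text, int i, String midLetter, String midNumLet) {
        if (i == 0 || i == text.length() - 1) {
            return false; // nothing on one side, so the requirement cannot hold
        }
        char c = text.charAt(i);
        char prev = text.charAt(i - 1);
        char next = text.charAt(i + 1);
        boolean lettersBothSides = Character.isLetter(prev) && Character.isLetter(next);
        boolean digitsBothSides = Character.isDigit(prev) && Character.isDigit(next);
        if (midLetter.indexOf(c) >= 0) {
            return lettersBothSides;
        }
        if (midNumLet.indexOf(c) >= 0) {
            return lettersBothSides || digitsBothSides;
        }
        return false;
    }

    public static void main(String[] args) {
        // '/' treated as MidLetter: kept between letters, dropped elsewhere.
        System.out.println(tokenize("/one/two/three/ four", "/", ""));
        // '/' treated as MidLetter: digits on either side still cause a split.
        System.out.println(tokenize("1/two/3", "/", ""));
        // '/' treated as MidNumLet: digits on both sides are glued as well.
        System.out.println(tokenize("1/2", "", "/"));
    }
}
```

With '/' passed as the midLetter argument, a leading or trailing slash is 
dropped and a slash next to a digit still splits. Passing it as midNumLet 
instead is what extends the behavior to digits.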


I tested this by adding “/” to MidLetter in StandardTokenizerImpl.jflex 
(whitespace-compressed diff below):

    -MidLetter = (\p{WB:MidLetter}    | {MidLetterSupp})
    +MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

then running ‘ant jflex’ under lucene/analysis/common/, and the following text 
was split as indicated (I tested by adding the method below to 
TestStandardAnalyzer.java):

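  public void testMidLetterSlash() throws Exception {
    // Slashes between letters are kept; leading and trailing slashes are dropped.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four",
                                  new String[] { "one/two/three", "four" });
    // Slashes next to digits still split: MidLetter only applies between letters.
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3",
                                  new String[] { "1", "two", "3" });
  }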

So it works for me - are you regenerating the scanner (‘ant jflex’)?

FYI, I found a bug when I was testing the above: “http://example.com” is left 
intact when “/” is added to MidLetter, but it shouldn’t be.  Although ‘:’ and 
‘/’ are both in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should 
instead result in “http://example.com” being split into “http” and 
“example.com”.  Further testing indicates that this is a problem for MidLetter, 
MidNumLet and MidNum.  I’ve filed an issue: 
<https://issues.apache.org/jira/browse/LUCENE-5447>.
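
For reference, the split I would expect can be approximated with a plain regex. 
This is only a sketch of the letter-on-both-sides requirement, not what the 
scanner currently does and not how the grammar is written; the class 
UrlSplitSketch is hypothetical, ‘:’ and ‘/’ stand in for MidLetter, ‘.’ stands 
in for MidNumLet, and digits are ignored for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the expected behavior; not Lucene code.
class UrlSplitSketch {

    // One or more letters, then any number of (single mid char + one or more
    // letters). A mid char is therefore consumed only when a letter precedes
    // it and a letter follows it.
    private static final Pattern WORD =
        Pattern.compile("\\p{L}+(?:[:/.]\\p{L}+)*");

    static List<String> split(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "://" has no letter between the mid chars, so the token ends at "http".
        System.out.println(split("http://example.com")); // [http, example.com]
    }
}
```

The ‘:’ is followed by ‘/’ rather than a letter, so it cannot be absorbed into 
the token, and neither can either ‘/’.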

Steve

On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difer...@redhat.com> wrote:

> Hi guys, this is my first time posting on the Lucene list, so hello everyone.
> 
> I really like the way that the StandardTokenizer works, however I'd like for 
> it to not split tokens on / (forward slash).  I've been looking at 
> http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand 
> the rules, but I'm either misunderstanding or missing something.  If I 
> understand correctly, the symbols in MidLetter keep it from splitting a token 
> as long as there's alpha chars on either side.  I tried adding the forward 
> slash to the MidLetter and MidLetterSupp rules (tried different 
> combinations), but it still seems like it's splitting on it.
> 
> Does anyone have any tips or ideas?
> 
> Thanks
> 
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 

