Welcome Diego,

I think you’re right about MidLetter - adding a char to it should disable splitting on that char, as long as there is a letter on both sides of it. (If you’d like that behavior to be extended to numeric digits, you should use MidNumLet instead.)
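To make the rule concrete, here is a minimal sketch of it in plain Java. It is a toy model under stated assumptions, not the JFlex-generated scanner: the class name MidLetterSketch and its tokenize method are hypothetical, it handles only the letter-on-both-sides idea behind MidLetter, and it ignores the rest of the UAX #29 rules (MidNum, MidNumLet, ExtendNumLet, and so on).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the MidLetter idea; hypothetical, not Lucene's StandardTokenizer. */
final class MidLetterSketch {

    /**
     * Splits text into runs of letters and digits. A character listed in
     * midLetterChars stays inside a token only when a letter sits directly
     * on both sides of it; everywhere else it acts as a break.
     */
    static List<String> tokenize(String text, String midLetterChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                current.append(c);
                continue;
            }
            // The last char of the pending token is always text.charAt(i - 1).
            boolean letterBefore = current.length() > 0
                    && Character.isLetter(current.charAt(current.length() - 1));
            boolean letterAfter = i + 1 < text.length()
                    && Character.isLetter(text.charAt(i + 1));
            if (midLetterChars.indexOf(c) >= 0 && letterBefore && letterAfter) {
                current.append(c); // joined: letter on both sides
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // anything else ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("/one/two/three/ four", "/")); // [one/two/three, four]
        System.out.println(tokenize("1/two/3", "/"));              // [1, two, 3]
    }
}
```

In this sketch a digit next to the slash is enough to force a split, which is the difference between MidLetter and MidNumLet mentioned above.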
I tested this by adding "/" to MidLetter in StandardTokenizerImpl.jflex (compressed whitespace diff below):

-MidLetter = (\p{WB:MidLetter} | {MidLetterSupp})
+MidLetter = ([/\p{WB:MidLetter}] | {MidLetterSupp})

then running 'ant jflex' under lucene/analysis/common/, and the following text was split as indicated (I tested by adding the method below to TestStandardAnalyzer.java):

  public void testMidLetterSlash() throws Exception {
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "/one/two/three/ four", new String[] { "one/two/three", "four" });
    BaseTokenStreamTestCase.assertAnalyzesTo(a, "1/two/3", new String[] { "1", "two", "3" });
  }

So it works for me - are you regenerating the scanner ('ant jflex')?

FYI, I found a bug when I was testing the above: "http://example.com" is left intact when "/" is added to MidLetter, but it shouldn’t be; although ':' and '/' are in [/\p{WB:MidLetter}], the letter-on-both-sides requirement should instead result in "http://example.com" being split into "http" and "example.com". Further testing indicates that this is a problem for MidLetter, MidNumLet and MidNum. I’ve filed an issue: <https://issues.apache.org/jira/browse/LUCENE-5447>.

Steve

On Feb 14, 2014, at 1:42 PM, Diego Fernandez <difer...@redhat.com> wrote:

> Hi guys, this is my first time posting on the Lucene list, so hello everyone.
>
> I really like the way that the StandardTokenizer works, however I'd like for
> it to not split tokens on / (forward slash). I've been looking at
> http://unicode.org/reports/tr29/#Default_Word_Boundaries to try to understand
> the rules, but I'm either misunderstanding or missing something. If I
> understand correctly, the symbols in MidLetter keep it from splitting a token
> as long as there's alpha chars on either side. I tried adding the forward
> slash to the MidLetter and MidLetterSupp rules (tried different
> combinations), but it still seems like it's splitting on it.
>
> Does anyone have any tips or ideas?
>
> Thanks
>
> Diego Fernandez - 爱国
> Software Engineer
> US GSS Supportability - Diagnostics
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org