On 19/10/2011 15:17, Steven A Rowe wrote:
> Hi Paul,
> What version of Lucene are you using? The JFlex spec you quote below looks
> pre-v3.1?
Yes, we copied a version of StandardTokenizer from 2.4 to make some
changes. We are actually on 3.1 now, but haven't spent any time looking
at the new tokenizer JFlex code, which appears better.
Anyway, I finally have a proof of concept that I think will work. This time
I realised that if someone enters 'fred!!!' I don't want to just match
'fred', because then another token will be created for '!!!', so
I've created separate rules for matching:
fred (ALPHANUM)
fred!!! (EMAIL)
!!! (COMPANY)
I modified the JFlex grammar to catch these:
// basic word: a sequence of digits & letters (includes Thai to enable ThaiAnalyzer to function)
ALPHANUM = ({LETTER}|{THAI}|[:digit:])+
// 'PUNCTUATIONCONTROL' control/punctuation chars
CONTROLANDPUNC = ("!"|"*"|"^"|"."|"@"|"%"|"♠"|"\"")
COMPANY = ({CONTROLANDPUNC})+
//MUST CONTAIN Alphanumeric and Punctuation Characters
EMAIL =
({ALPHANUM}|{CONTROLANDPUNC})*{CONTROLANDPUNC}{ALPHANUM}({ALPHANUM}|{CONTROLANDPUNC})*
|
({ALPHANUM}|{CONTROLANDPUNC})*{ALPHANUM}{CONTROLANDPUNC}({ALPHANUM}|{CONTROLANDPUNC})*
%%
{EMAIL}    { return EMAIL; }
{ALPHANUM} { return ALPHANUM; }
{COMPANY}  { return COMPANY; }
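To sanity-check what the three rules above are meant to produce, here is a rough sketch in plain Java (no JFlex or Lucene involved). The class name TokenTypeSketch and the classify method are made up for illustration, and Character.isLetterOrDigit is only an approximation of the {LETTER}|{THAI}|[:digit:] macros. It relies on the observation that any run mixing both character classes must contain an adjacent ALPHANUM/CONTROLANDPUNC pair, which is what the EMAIL rule requires.

```java
final class TokenTypeSketch {
    // Assumption: same characters as the CONTROLANDPUNC macro (\u2660 is the spade).
    private static final String PUNC = "!*^.@%\u2660\"";

    // Returns the token type the grammar is expected to assign to a whole
    // token, or null if the text contains a character no rule covers.
    static String classify(String token) {
        boolean sawAlnum = false;
        boolean sawPunc = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                sawAlnum = true;
            } else if (PUNC.indexOf(c) >= 0) {
                sawPunc = true;
            } else {
                return null; // e.g. whitespace: not part of any of the three rules
            }
        }
        if (sawAlnum && sawPunc) return "EMAIL";    // mixed letters/digits and punctuation
        if (sawAlnum) return "ALPHANUM";            // letters/digits only
        if (sawPunc) return "COMPANY";              // punctuation only
        return null;                                // empty input
    }
}
```

Because JFlex takes the longest match, 'fred!!!' should come out as a single EMAIL token rather than 'fred' followed by '!!!', which is the behaviour the sketch mirrors.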
Then I have a filter that looks for tokens of type EMAIL and removes the
punctuation chars from them:
public final boolean incrementToken() throws java.io.IOException {
    if (!input.incrementToken()) {
        return false;
    }
    char[] buffer = termAtt.buffer();
    final int bufferLength = termAtt.length();
    final String type = typeAtt.type();
    // Compare by value; == only works if both sides are the same String instance.
    if (EMAIL.equals(type)) {
        // Remove control chars when they make up only part of the token.
        int upto = 0;
        for (int i = 0; i < bufferLength; i++) {
            char c = buffer[i];
            if (c == '!') {
                // Do nothing (drop the character).
            } else {
                buffer[upto++] = c;
            }
        }
        termAtt.setLength(upto);
    }
    return true;
}
I just need to improve the code to use a suitable list of control chars
rather than hardcoding individual chars.
This solution seems the closest fit to Lucene.
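For the "suitable list of control chars" part, something along these lines might do. It is only a sketch: PunctuationStripper and strip are hypothetical names, the character list is assumed to be the same as CONTROLANDPUNC in the grammar, and it works on a plain char[] so it can be tried without Lucene. Inside the filter, the returned length would be passed to termAtt.setLength().

```java
final class PunctuationStripper {
    // Assumption: the characters to drop are those in the CONTROLANDPUNC macro
    // (\u2660 is the spade character).
    private static final String STRIP_CHARS = "!*^.@%\u2660\"";

    // Compacts buffer[0..length) in place, dropping every character that is
    // in STRIP_CHARS, and returns the new logical length.
    static int strip(char[] buffer, int length) {
        int upto = 0;
        for (int i = 0; i < length; i++) {
            char c = buffer[i];
            if (STRIP_CHARS.indexOf(c) < 0) {
                buffer[upto++] = c; // keep anything not in the list
            }
        }
        return upto;
    }
}
```

A String lookup is fine for a handful of characters; for a much longer list a boolean table or a sorted array with binary search would be the obvious alternatives.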