I am implementing a filter that will remove certain characters from the
tokens - things like '(', etc. - but the chars to be removed will be
customizable.
This is what I have come up with - but it doesn't seem very efficient.
Is there a better way?
Should I be adjusting the token endOffset when I remove characters? If
I end up removing all characters, should I be returning null, rather
than returning a token with no text?
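import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
public class CharRemovingFilter extends TokenFilter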
{
StringBuffer temp = new StringBuffer();
Set charsToRemove;
/**
* Builds a Set from an array of chars to remove, appropriate for
* passing into the CharRemovingFilter constructor.
*/
public static final Set makeCharRemovalSet(char[] charsToRemove)
{
HashSet temp = new HashSet(charsToRemove.length);
for (int i = 0; i < charsToRemove.length; i++)
{
temp.add(new Character(charsToRemove[i]));
}
return temp;
}
public CharRemovingFilter(TokenStream in, Set charsToRemove)
{
super(in);
this.charsToRemove = charsToRemove;
}
public Token next() throws IOException
{
Token t = input.next();
if (t == null)
{
return null;
}
temp.setLength(0);
for (int i = 0; i < t.termText().length(); i++)
{
if (!charsToRemove.contains(new Character(t.termText().charAt(i))))
{
temp.append(t.termText().charAt(i));
}
}
Token returnValue = new Token(temp.toString(), t.startOffset(),
t.endOffset());
return returnValue;
}
}
And here is part of the Analyzer that uses it:
public final TokenStream tokenStream(String fieldname, final Reader reader)
{
TokenStream result = new WhitespaceTokenizer(reader);
result = new LowerCaseFilter(result);
if (stopTable != null)
{
result = new StopFilter(result, stopTable);
}
if (charRemovalTable != null)
{
result = new CharRemovingFilter(result, charRemovalTable);
}
return result;
}
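For comparison, here is a rough standalone sketch of just the stripping
step, written without any Lucene classes. It swaps the Set of Character
objects for a java.util.BitSet indexed by char value, so no Character is
allocated per lookup, and it only allocates a buffer once a char actually
has to be removed. CharStripper, makeRemovalSet and strip are names made
up for this sketch, not Lucene API, and whether it is really faster would
need measuring.

```java
import java.util.BitSet;

/** Hypothetical helper (not part of Lucene): strips unwanted chars from a term. */
final class CharStripper {
    private CharStripper() {}

    /** Builds a BitSet indexed by char value; a set bit means "remove this char". */
    static BitSet makeRemovalSet(char[] charsToRemove) {
        BitSet set = new BitSet();
        for (int i = 0; i < charsToRemove.length; i++) {
            set.set(charsToRemove[i]);
        }
        return set;
    }

    /** Returns text without the flagged chars, or text itself if nothing was removed. */
    static String strip(String text, BitSet remove) {
        StringBuffer out = null; // allocated lazily, on the first removed char
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (remove.get(c)) {
                if (out == null) {
                    // First removal: copy the untouched prefix, then continue char by char.
                    out = new StringBuffer(text.length());
                    out.append(text.substring(0, i));
                }
            } else if (out != null) {
                out.append(c);
            }
        }
        return out == null ? text : out.toString();
    }
}
```

The filter's next() could then pass t.termText() through strip() and
build the returned Token from the result, keeping the original offsets
as in the code above.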
Thanks,
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/