On Aug 18, 2005, at 3:51 PM, Dan Armbrust wrote:
I am implementing a filter that will remove certain characters from
the tokens - things like '(', etc. - but the chars to be removed will
be customizable.
This is what I have come up with - but it doesn't seem very
efficient. Is there a better way?
Without taking the time to look at your code much, here are some
things to note....
Should I be adjusting the token endOffset when I remove characters?
This really depends on what you plan on doing with the offsets. If
you're not using them at all, then it doesn't matter. But if you're
doing hit highlighting then it will matter, since the offsets provide
the positions to highlight. If you've got text that says "(foo)" and
you want searches for "foo" to highlight only "foo" but not "(foo)",
then you'll want to adjust the offsets accordingly (this is presuming
your filter sees "(foo)" as a single token).
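As a rough, Lucene-free sketch of that offset arithmetic (the helper name and plain-Java types here are illustrative, not part of the filter above): when only leading and trailing characters are stripped, you can bump startOffset past the removed prefix and pull endOffset back past the removed suffix:

```java
import java.util.Set;

public class OffsetDemo {
    // Hypothetical helper: given the raw token text and its original
    // offsets, returns {newStart, newEnd} adjusted for characters
    // removed from the leading and trailing edges of the token.
    static int[] adjustedOffsets(String text, Set<Character> remove,
                                 int start, int end) {
        int lead = 0;   // count removed chars at the front
        while (lead < text.length() && remove.contains(text.charAt(lead))) {
            lead++;
        }
        int trail = 0;  // count removed chars at the back
        while (trail < text.length() - lead
               && remove.contains(text.charAt(text.length() - 1 - trail))) {
            trail++;
        }
        return new int[] { start + lead, end - trail };
    }

    public static void main(String[] args) {
        // "(foo)" occupying offsets 10..15: highlighting "foo" alone
        // wants offsets 11..14.
        int[] offs = adjustedOffsets("(foo)", Set.of('(', ')'), 10, 15);
        System.out.println(offs[0] + "," + offs[1]); // 11,14
    }
}
```

Characters removed from the interior of a token have no single right answer for offsets, so this sketch only accounts for edge trimming.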
If I end up removing all characters, should I be returning null,
rather than returning a token with no text?
If you return null, the analysis process ends, thinking that is the
end of the token stream. Rather, what you want to do is grab the next
token and process it, and be sure to return successive tokens through
your filter, returning null only at the end of them all.
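A minimal stand-in for that loop, sketched without Lucene's classes (the nextToken helper and the Iterator-based stream are illustrative substitutes for TokenFilter.next() and the wrapped TokenStream):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class SkipEmptyDemo {
    // Yields the next filtered token, looping past any token that
    // becomes empty after char removal; returns null only when the
    // underlying stream is truly exhausted.
    static String nextToken(Iterator<String> input, Set<Character> remove) {
        while (input.hasNext()) {
            String raw = input.next();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < raw.length(); i++) {
                if (!remove.contains(raw.charAt(i))) {
                    sb.append(raw.charAt(i));
                }
            }
            if (sb.length() > 0) {
                return sb.toString(); // non-empty: emit it
            }
            // empty after filtering: loop and try the next token
        }
        return null; // only at true end-of-stream
    }

    public static void main(String[] args) {
        Iterator<String> in = List.of("(foo)", "()", "bar").iterator();
        Set<Character> remove = Set.of('(', ')');
        String t;
        while ((t = nextToken(in, remove)) != null) {
            System.out.println(t); // "()" is silently skipped
        }
    }
}
```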
Erik
public class CharRemovingFilter extends TokenFilter
{
    StringBuffer temp = new StringBuffer();
    Set charsToRemove;

    /**
     * Builds a Set from an array of chars to remove, appropriate for
     * passing into the CharRemovingFilter constructor.
     */
    public static final Set makeCharRemovalSet(char[] charsToRemove)
    {
        HashSet temp = new HashSet(charsToRemove.length);
        for (int i = 0; i < charsToRemove.length; i++)
        {
            temp.add(new Character(charsToRemove[i]));
        }
        return temp;
    }

    public CharRemovingFilter(TokenStream in, Set charsToRemove)
    {
        super(in);
        this.charsToRemove = charsToRemove;
    }

    public Token next() throws IOException
    {
        Token t = input.next();
        if (t == null)
        {
            return null;
        }
        temp.setLength(0);
        for (int i = 0; i < t.termText().length(); i++)
        {
            if (!charsToRemove.contains(new Character(t.termText().charAt(i))))
            {
                temp.append(t.termText().charAt(i));
            }
        }
        Token returnValue = new Token(temp.toString(), t.startOffset(),
                                      t.endOffset());
        return returnValue;
    }
}
And here is part of the Analyzer that uses it:
public final TokenStream tokenStream(String fieldname, final Reader reader)
{
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    if (stopTable != null)
    {
        result = new StopFilter(result, stopTable);
    }
    if (charRemovalTable != null)
    {
        result = new CharRemovingFilter(result, charRemovalTable);
    }
    return result;
}
Thanks,
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------