I am implementing a filter that will remove certain characters from the tokens - thing like '(', etc - but the chars to be removed will be customizable.

This is what I have come up with - but it doesn't seem very efficient. Is there a better way? Should I be adjusting the token endOffset when I remove characters? If I end up removing all characters, should I be returning null, rather than returning a token with no text?



public class CharRemovingFilter extends TokenFilter
{
   StringBuffer temp = new StringBuffer();
   Set          charsToRemove;

   /**
* Builds a Set from an array of chars to remove, appropriate for passing into the
    * CharRemovingFilter constructor.
    */
   public static final Set makeCharRemovalSet(char[] charsToRemove)
   {
       HashSet temp = new HashSet(charsToRemove.length);
       for (int i = 0; i < charsToRemove.length; i++)
       {
           temp.add(new Character(charsToRemove[i]));
       }
       return temp;
   }

   public CharRemovingFilter(TokenStream in, Set charsToRemove)
   {
       super(in);
       this.charsToRemove = charsToRemove;
   }

   public Token next() throws IOException
   {
       Token t = input.next();

       if (t == null)
       {
           return null;
       }

       temp.setLength(0);

       for (int i = 0; i < t.termText().length(); i++)
       {
if (!charsToRemove.contains(new Character(t.termText().charAt(i))))
           {
               temp.append(t.termText().charAt(i));
           }
       }

Token returnValue = new Token(temp.toString(), t.startOffset(), t.endOffset());

       return returnValue;
   }


And here is part of the Analyzer that uses it:

public final TokenStream tokenStream(String fieldname, final Reader reader)
   {
       TokenStream result = new WhitespaceTokenizer(reader);
       result = new LowerCaseFilter(result);
       if (stopTable != null)
       {
           result = new StopFilter(result, stopTable);
       }
       if (charRemovalTable != null)
       {
           result = new CharRemovingFilter(result, charRemovalTable);
       }

       return result;
   }

Thanks,

Dan

--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to