On Aug 18, 2005, at 3:51 PM, Dan Armbrust wrote:
I am implementing a filter that will remove certain characters from the tokens (things like '(', etc.), but the chars to be removed will be customizable.

This is what I have come up with - but it doesn't seem very efficient. Is there a better way?

Without taking the time to look at your code much, here are some things to note....

Should I be adjusting the token endOffset when I remove characters?

This really depends on what you plan on doing with the offsets. If you're not using them at all, then it doesn't matter. But if you're doing hit highlighting it will matter, since the offsets provide the positions to highlight. If you've got text that says "(foo)" and you want searches for "foo" to highlight only "foo" and not "(foo)", then you'll want to adjust the offsets accordingly (this presumes your filter is seeing "(foo)" as a single token).
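For the leading/trailing case the offset math is simple. Here is a minimal sketch in plain Java (no Lucene dependency; the `adjust` helper and its shape are my own illustration, not part of the Lucene API), assuming characters are only stripped from the ends of the token. Interior removals make the mapping between filtered text and original offsets ambiguous:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class OffsetAdjust {
    // Given the raw token text, its start offset in the original input,
    // and the set of characters being removed, return {newStart, newEnd}
    // covering only the surviving middle of the token.
    static int[] adjust(String raw, int start, Set<Character> drop) {
        int from = 0;
        int to = raw.length();
        while (from < to && drop.contains(Character.valueOf(raw.charAt(from)))) {
            from++; // skip removed leading chars
        }
        while (to > from && drop.contains(Character.valueOf(raw.charAt(to - 1)))) {
            to--; // skip removed trailing chars
        }
        return new int[] { start + from, start + to };
    }

    public static void main(String[] args) {
        Set<Character> drop = new HashSet<Character>(Arrays.asList('(', ')'));
        int[] r = adjust("(foo)", 10, drop);
        // token "(foo)" starting at offset 10 -> "foo" spans [11, 14)
        System.out.println(r[0] + "," + r[1]);
    }
}
```

With this, a highlighter pointed at the adjusted offsets would mark "foo" rather than "(foo)".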

If I end up removing all characters, should I be returning null, rather than returning a token with no text?

If you return null, the analysis process ends, thinking that was the end of the token stream. What you want to do instead is grab the next token from the input and process it, so that your filter keeps returning successive tokens and returns null only when the underlying stream is truly exhausted.
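In next() that means looping instead of returning the emptied-out token. A sketch of the pattern, using small stand-in classes in place of the real Lucene Token/TokenStream so it is self-contained (the names ListTokenStream and CharRemovingFilter2 are mine; the loop body is the point):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Stand-in for a Lucene Token: just text plus offsets.
class Tok {
    final String text;
    final int start, end;
    Tok(String text, int start, int end) { this.text = text; this.start = start; this.end = end; }
}

// Stand-in for a TokenStream backed by a fixed list; next() returns null at the end.
class ListTokenStream {
    private final Iterator<Tok> it;
    ListTokenStream(List<Tok> tokens) { it = tokens.iterator(); }
    Tok next() { return it.hasNext() ? it.next() : null; }
}

class CharRemovingFilter2 {
    private final ListTokenStream input;
    private final Set<Character> drop;

    CharRemovingFilter2(ListTokenStream input, Set<Character> drop) {
        this.input = input;
        this.drop = drop;
    }

    Tok next() {
        // Keep pulling tokens until one survives the filtering.  Returning
        // null as soon as one token empties out would truncate the stream.
        for (Tok t = input.next(); t != null; t = input.next()) {
            StringBuilder sb = new StringBuilder(t.text.length());
            for (int i = 0; i < t.text.length(); i++) {
                char c = t.text.charAt(i);
                if (!drop.contains(Character.valueOf(c))) {
                    sb.append(c);
                }
            }
            if (sb.length() > 0) {
                return new Tok(sb.toString(), t.start, t.end);
            }
            // token emptied out -> loop and try the next one
        }
        return null; // genuine end of stream
    }
}
```

A token consisting entirely of removable characters (say "(") is silently skipped, and the following token is returned in its place.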

    Erik





public class CharRemovingFilter extends TokenFilter
{
    StringBuffer temp = new StringBuffer();
    Set          charsToRemove;

    /**
     * Builds a Set from an array of chars to remove, appropriate for passing into the
     * CharRemovingFilter constructor.
     */
    public static final Set makeCharRemovalSet(char[] charsToRemove)
    {
        HashSet temp = new HashSet(charsToRemove.length);
        for (int i = 0; i < charsToRemove.length; i++)
        {
            temp.add(new Character(charsToRemove[i]));
        }
        return temp;
    }

    public CharRemovingFilter(TokenStream in, Set charsToRemove)
    {
        super(in);
        this.charsToRemove = charsToRemove;
    }

    public Token next() throws IOException
    {
        Token t = input.next();

        if (t == null)
        {
            return null;
        }

        temp.setLength(0);

        // grab the term text once rather than calling termText() every iteration
        String text = t.termText();
        for (int i = 0; i < text.length(); i++)
        {
            if (!charsToRemove.contains(new Character(text.charAt(i))))
            {
                temp.append(text.charAt(i));
            }
        }

        return new Token(temp.toString(), t.startOffset(), t.endOffset());
    }
}


And here is part of the Analyzer that uses it:

public final TokenStream tokenStream(String fieldname, final Reader reader)
{
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    if (stopTable != null)
    {
        result = new StopFilter(result, stopTable);
    }
    if (charRemovalTable != null)
    {
        result = new CharRemovingFilter(result, charRemovalTable);
    }

    return result;
}

Thanks,

Dan

--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

