On Aug 18, 2005, at 3:51 PM, Dan Armbrust wrote:
I am implementing a filter that will remove certain characters from
the tokens - things like '(', etc. - but the chars to be removed will
be customizable.
This is what I have come up with - but it doesn't seem very
efficient. Is there a better way?
Without taking the time to look at your code much, here are some
things to note....
Should I be adjusting the token endOffset when I remove characters?
This really depends on what you plan on doing with the offsets. If
you're not using them at all, then it doesn't matter. But if you're
doing hit highlighting then it will matter, since the offsets provide
the positions to highlight. If you've got text that says "(foo)" and
you want searches for "foo" to highlight only "foo" but not "(foo)",
then you'll want to adjust the offsets accordingly (this is presuming
your filter sees "(foo)" as a single token).
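As a rough, Lucene-free sketch of that offset arithmetic (the helper name and plain-Java types here are illustrative, not part of the filter above): when only leading and trailing characters are stripped, you can bump startOffset past the removed prefix and pull endOffset back past the removed suffix:

```java
import java.util.Set;

public class OffsetDemo {
    // Hypothetical helper: given the raw token text and its original
    // offsets, returns {newStart, newEnd} adjusted for characters
    // removed from the leading and trailing edges of the token.
    static int[] adjustedOffsets(String text, Set<Character> remove,
                                 int start, int end) {
        int lead = 0;   // count removed chars at the front
        while (lead < text.length() && remove.contains(text.charAt(lead))) {
            lead++;
        }
        int trail = 0;  // count removed chars at the back
        while (trail < text.length() - lead
               && remove.contains(text.charAt(text.length() - 1 - trail))) {
            trail++;
        }
        return new int[] { start + lead, end - trail };
    }

    public static void main(String[] args) {
        // "(foo)" occupying offsets 10..15: highlighting "foo" alone
        // wants offsets 11..14.
        int[] offs = adjustedOffsets("(foo)", Set.of('(', ')'), 10, 15);
        System.out.println(offs[0] + "," + offs[1]); // 11,14
    }
}
```

Characters removed from the interior of a token have no single right answer for offsets, so this sketch only accounts for edge trimming.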
If I end up removing all characters, should I be returning null,
rather than returning a token with no text?
If you return null, the analysis process ends, thinking that is the
end of the token stream. Rather, what you want to do is grab the next
token and process it, and be sure to return successive tokens through
your filter, returning null only at the end of them all.
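A minimal stand-in for that loop, sketched without Lucene's classes (the nextToken helper and the Iterator-based stream are illustrative substitutes for TokenFilter.next() and the wrapped TokenStream):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

public class SkipEmptyDemo {
    // Yields the next filtered token, looping past any token that
    // becomes empty after char removal; returns null only when the
    // underlying stream is truly exhausted.
    static String nextToken(Iterator<String> input, Set<Character> remove) {
        while (input.hasNext()) {
            String raw = input.next();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < raw.length(); i++) {
                if (!remove.contains(raw.charAt(i))) {
                    sb.append(raw.charAt(i));
                }
            }
            if (sb.length() > 0) {
                return sb.toString(); // non-empty: emit it
            }
            // empty after filtering: loop and try the next token
        }
        return null; // only at true end-of-stream
    }

    public static void main(String[] args) {
        Iterator<String> in = List.of("(foo)", "()", "bar").iterator();
        Set<Character> remove = Set.of('(', ')');
        String t;
        while ((t = nextToken(in, remove)) != null) {
            System.out.println(t); // "()" is silently skipped
        }
    }
}
```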
Erik
public class CharRemovingFilter extends TokenFilter
{
    StringBuffer temp = new StringBuffer();
    Set charsToRemove;

    /**
     * Builds a Set from an array of chars to remove, appropriate for
     * passing into the CharRemovingFilter constructor.
     */
    public static final Set makeCharRemovalSet(char[] charsToRemove)
    {
        HashSet temp = new HashSet(charsToRemove.length);
        for (int i = 0; i < charsToRemove.length; i++)
        {
            temp.add(new Character(charsToRemove[i]));
        }
        return temp;
    }

    public CharRemovingFilter(TokenStream in, Set charsToRemove)
    {
        super(in);
        this.charsToRemove = charsToRemove;
    }

    public Token next() throws IOException
    {
        Token t = input.next();
        if (t == null)
        {
            return null;
        }
        temp.setLength(0);
        for (int i = 0; i < t.termText().length(); i++)
        {
            if (!charsToRemove.contains(new Character(t.termText().charAt(i))))
            {
                temp.append(t.termText().charAt(i));
            }
        }
        Token returnValue = new Token(temp.toString(), t.startOffset(),
                                      t.endOffset());
        return returnValue;
    }
}
And here is part of the Analyzer that uses it:
public final TokenStream tokenStream(String fieldname, final Reader reader)
{
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    if (stopTable != null)
    {
        result = new StopFilter(result, stopTable);
    }
    if (charRemovalTable != null)
    {
        result = new CharRemovingFilter(result, charRemovalTable);
    }
    return result;
}
Thanks,
Dan
--
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------