On Thursday 18 August 2005 21:51, Dan Armbrust wrote: > I am implementing a filter that will remove certain characters from the > tokens - thing like '(', etc - but the chars to be removed will be > customizable. > > This is what I have come up with - but it doesn't seem very efficient. > Is there a better way? > Should I be adjusting the token endOffset when I remove characters? If > I end up removing all characters, should I be returning null, rather > than returning a token with no text? > > > > public class CharRemovingFilter extends TokenFilter > { > StringBuffer temp = new StringBuffer(); > Set charsToRemove; > > /** > * Builds a Set from an array of chars to remove, appropriate for > passing into the > * CharRemovingFilter constructor. > */ > public static final Set makeCharRemovalSet(char[] charsToRemove) > { > HashSet temp = new HashSet(charsToRemove.length); > for (int i = 0; i < charsToRemove.length; i++) > { > temp.add(new Character(charsToRemove[i])); > } > return temp; > } > > public CharRemovingFilter(TokenStream in, Set charsToRemove) > { > super(in); > this.charsToRemove = charsToRemove; > } > > public Token next() throws IOException > { > Token t = input.next(); > > if (t == null) > { > return null; > } > > temp.setLength(0); > > for (int i = 0; i < t.termText().length(); i++) > { > if (!charsToRemove.contains(new > Character(t.termText().charAt(i))))
I'd expect String.indexOf( .... .charAt(i)) to be faster than creating a new Character. For this for loop, using String.replace() of java 1.5 might be faster than hashing. Removing the replaced characters would need one more scan. What is the maximum int value of a Unicode character anyway? Other than that, when the int value of the input characters is small enough, one can create a table mapping all chars to their replacements and index this table. Removing the replaced chars can then be done by not incrementing the target index for a replacement character indicating deletion. Inlining the t.termText() method call will probably help in any case. Regards, Paul Elschot --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]