Damerian,

The technique I mentioned would work for you with a little tweaking: when you see consecutive capitalized tokens, just set the CharTermAttribute to the joined tokens and clear the previous token.
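A rough sketch of that joining logic in plain Java, with no Lucene classes involved. CapitalizedJoiner and its method names are made up for illustration, tokens are plain Strings rather than AttributeSource instances, and a single space is assumed as the joiner; inside a real TokenFilter the same hold-back logic would live in incrementToken().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: merges runs of capitalized tokens.
final class CapitalizedJoiner {

    // True if the token's first code point is an uppercase letter (Unicode category Lu).
    static boolean startsUpper(String token) {
        return !token.isEmpty()
                && Character.getType(token.codePointAt(0)) == Character.UPPERCASE_LETTER;
    }

    // Holds back one token at a time, so a capitalized token is never
    // emitted on its own when the token after it is also capitalized.
    static List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String pending = null; // the held-back "previous token"
        for (String tok : tokens) {
            if (pending == null) {
                pending = tok;
            } else if (startsUpper(pending) && startsUpper(tok)) {
                pending = pending + " " + tok; // join instead of emitting
            } else {
                out.add(pending);              // safe to emit the previous token now
                pending = tok;
            }
        }
        if (pending != null) {
            out.add(pending);                  // final token needs special handling
        }
        return out;
    }
}
```

For the tokens of "My name is James Bond" this yields [My, name, is, James Bond]. A capitalized sentence-initial word sitting directly before a name would be merged into it as well, so some extra check may be needed there.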
Another idea: you could use ShingleFilter with min size = max size = 2, followed by a filter extending FilteringTokenFilter whose accept() method examines each shingle and rejects the ones that don't qualify, something like the following. Notes: this is untested; I assume you will use the default shingle token separator " "; and this filter will reject all non-shingle terms, so you won't get anything but names, even if you configure ShingleFilter to emit single tokens.

    // Lucene 3.x package names; FilteringTokenFilter moved in later releases
    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.lucene.analysis.FilteringTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public final class MyNameFilter extends FilteringTokenFilter {
      // Two or more whitespace-separated words, each starting with an uppercase letter
      private static final Pattern NAME_PATTERN = Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      // Constructor added so the class compiles (Lucene 3.x signature; it changed later)
      public MyNameFilter(boolean enablePositionIncrements, TokenStream input) {
        super(enablePositionIncrements, input);
      }

      @Override
      public boolean accept() throws IOException {
        return NAME_PATTERN.matcher(termAtt).matches();
      }
    }

Steve

> -----Original Message-----
> From: Damerian [mailto:dameria...@gmail.com]
> Sent: Thursday, February 09, 2012 4:15 PM
> To: java-user@lucene.apache.org
> Subject: Re: Access next token in a stream
>
> On 9/2/2012 8:54 PM, Steven A Rowe wrote:
> > Hi Damerian,
> >
> > One way to handle your scenario is to hold on to the previous token, and
> > only emit a token after you reach at least the second token (or at
> > end-of-stream). Your incrementToken() method could look something like:
> >
> > 1. Get current attributes: input.incrementToken()
> > 2. If previous token does not exist:
> >    2a. Store current attributes as previous token (see AttributeSource#cloneAttributes)
> >    2b. Get current attributes: input.incrementToken()
> > 3. Check for & store conditions that will affect previous token's attributes
> > 4. Store current attributes as next token (see AttributeSource#cloneAttributes)
> > 5. Copy previous token into current attributes (see AttributeSource#copyTo);
> >    the target will be "this", which is an AttributeSource.
> > 6. Make changes based on conditions found in step #3 above
> > 7. set previous token = next token
> > 8. return true
> >
> > (Everywhere I say "token" I mean "instance of AttributeSource".)
> >
> > The final token in the input stream will need special handling, as will
> > single-token input streams.
> >
> > Good luck,
> > Steve
> >
> >> -----Original Message-----
> >> From: Damerian [mailto:dameria...@gmail.com]
> >> Sent: Thursday, February 09, 2012 2:19 PM
> >> To: java-user@lucene.apache.org
> >> Subject: Access next token in a stream
> >>
> >> Hello, I want to implement a custom filter. My question is quite simple,
> >> but I cannot find a solution to it no matter how I try:
> >>
> >> How can I access the TermAttribute of the token after the one I
> >> currently have in my stream?
> >>
> >> For example, in the phrase "My name is James Bond", if let's say I am at
> >> the token [My], I would like to be able to check the TermAttribute of
> >> the following token [name] and fix my position increment accordingly.
> >>
> >> Thank you in advance!
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
> Hi Steve,
> Thank you for your immediate reply. I will try your solution, but I feel
> that it does not solve my case.
> What I am trying to make is a filter that joins together two
> terms/tokens that start with a capital letter (it is trying to find all
> the names/surnames and make them one token). So in my aforementioned
> example, when I examine [James], even if I store the TermAttribute in a
> temporary token, how can I check the next one, [Bond], and join them
> without actually emitting a token (and therefore creating a term in my
> inverted index) that has [James] on its own?
> Thank you again for your insight; I would really appreciate any other
> views on the matter.
>
> Regards, Damerian
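
For reference, the shingle-and-filter idea from the top of this message can be tried out without Lucene. The sketch below is only a stand-in: NameShingles and its methods are hypothetical names, the shingles() method imitates what ShingleFilter with min size = max size = 2 and a " " separator is assumed to emit, and only NAME_PATTERN is taken from MyNameFilter as written above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for ShingleFilter + MyNameFilter, using plain Strings.
final class NameShingles {

    // Same pattern as in MyNameFilter: capitalized words separated by whitespace.
    static final Pattern NAME_PATTERN =
            Pattern.compile("\\p{Lu}\\S*(?:\\s\\p{Lu}\\S*)+");

    // Builds every pair of adjacent tokens, joined by a single space.
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            out.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return out;
    }

    // Keeps only the shingles that match NAME_PATTERN in full.
    static List<String> names(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String shingle : shingles(tokens)) {
            if (NAME_PATTERN.matcher(shingle).matches()) {
                out.add(shingle);
            }
        }
        return out;
    }
}
```

With the tokens of "My name is James Bond", the shingles are "My name", "name is", "is James" and "James Bond", and only "James Bond" survives the pattern check. Single tokens never match because the pattern requires at least two words.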