Re: Hyphenated word

Markus Wiederkehr Mon, 13 Jun 2005 06:45:40 -0700

I see, the list of exceptions makes this a lot more complicated than I
thought... Thanks a lot, Erik!


Markus

On 6/13/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> 
> On Jun 13, 2005, at 7:08 AM, Markus Wiederkehr wrote:
> > I work on an application that has to index OCR texts of scanned books.
> > Naturally, many words end up hyphenated across line breaks.
> >
> > I wonder if there is already an Analyzer or maybe a TokenFilter that
> > can merge those syllables back into whole words? It looks like Erik
> > Hatcher uses something like that at http://www.lucenebook.com/.
> 
> Markus - you're right, I did develop something to handle hyphenated
> words for lucenebook.com.  It was sort of a hack in that I had to
> build in a static list of exceptions in how I handled this, so you'll
> likely have to use caution as well.  The LiaAnalyzer is this:
> 
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>      TokenFilter filter = new DashSplitterFilter(
>                new HyphenatedFilter(
>                  new DashDashFilter(
>                    new LiaTokenizer(reader))));
> 
>      filter = new LengthFilter(3, filter);
>      filter = new StopFilter(filter, stopSet);
> 
>      if (stem) {
>        filter = new SnowballFilter(filter, "English");
>      }
> 
>      return filter;
>    }
> 
> 
> And my HyphenatedFilter is this:
> 
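> import java.io.IOException;
> import java.util.HashMap;
>
> import org.apache.lucene.analysis.Token;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
>
> public class HyphenatedFilter extends TokenFilter {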
>    private HashMap exceptions = new HashMap();
> 
>    private static final String[] EXCEPTION_LIST = {
>       "full-text", "information-retrieval", "license-code",
>       "old-fashioned", "well-designed", "free-form", "file-based",
>       "ramdirectory-based", "ram-based",
>       "index-modifying", "read-only",
>       "top-scoring", "most-recently-used", "queryparser-parsed",
>       "in-order", "per-document", "lower-caser", "domain-specific",
>       "high-level",
>       "utf-encoding", "non-english", "phraseprefix-it", "all-inclusive",
>       "date-range", "computation-intensive", "hits-returning",
>       "lower-level",
>       "number-padding", "utf-address-book", "third-party",
>       "plain-text", "google-like",
>       "re-add", "english-specific", "file-handling",
>       "already-created", "d-add", "d-add",
>       "hits-length", "hits-doc", "hits-score", "d-get", "writer-new",
>       "porteranalyzer-new",
>       "writer-set", "document-new", "doc-add", "field-keyword",
>       "field-unstored", "writer-add",
>       "writer-optimize", "queryparser-new", "porteranalyzer-new",
>       "parser-parse", "indexsearcher-new",
>       "hitcollector-new", "searcher-doc", "searcher-search",
>       "jakarta-lucene", "www-ibm", "java-specific",
>       "non-java", "vis--vis", "medium-sized", "browser-based",
>       "utf-before", "concept-based",
>       "natural-language", "queue-based", "high-likelihood", "slp-or",
>       "noisy-channel", "al-rasheed",
>       "hands-free", "top-notch", "google-esque", "search-config",
>       "java-related",
>       "lucene-so", "lucene-tar", "lucene-jar", "lucene-demos-jar",
>       "lucene-web", "lucene-webindex",
>       "command-line", "lucene-version", "issue-tracking"
>    };
> 
>    protected HyphenatedFilter(TokenStream tokenStream) {
>      super(tokenStream);
> 
>      for (int i = 0; i < EXCEPTION_LIST.length; i++) {
>        exceptions.put(EXCEPTION_LIST[i], "");
>      }
>    }
> 
>    private Token savedToken;
> 
>    public Token next() throws IOException {
> 
>      if (savedToken != null) {
>        Token token = savedToken;
>        savedToken = null;
>        return token;
>      }
> 
>      Token firstToken = input.next();
> 
>      if (firstToken == null)
>        return firstToken;
> 
> 
>      if (firstToken.termText().endsWith("-")) {
>        String firstPart;
>        firstPart = firstToken.termText();
> 
>        // consume next token
>        Token secondToken = input.next();
>        if (secondToken == null)
>          return firstToken;
> 
>        String termText = firstPart.substring(0, firstPart.length() - 1)
>            + secondToken.termText();
> 
>        if (exceptions.containsKey(firstPart + secondToken.termText())) {
>          savedToken = secondToken;
>          return firstToken;
>        }
> 
>        return new Token(termText, firstToken.startOffset(),
>            firstToken.endOffset() + secondToken.termText().length() + 1);
>      }
> 
>      return firstToken;
>    }
> }
> 
> Not all that pretty, I'm afraid, but by all means use it if it's useful.
> 
>      Erik
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
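For anyone who wants to try the merging rule from the quoted HyphenatedFilter without a Lucene setup, here is a minimal sketch of the same idea on plain strings. It is not Erik's code: the class name HyphenMergeSketch, the merge method, and the two-entry exception set are made up for illustration, and it ignores token offsets entirely. It only mirrors the decision logic: a token ending in "-" is joined with the following token unless the pair appears in the exception list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone illustration; not part of Lucene or lucenebook.com.
public class HyphenMergeSketch {

    // Joins "foo-" + "bar" into "foobar" unless "foo-bar" is a known exception.
    static List<String> merge(List<String> tokens, Set<String> exceptions) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String first = tokens.get(i++);

            // Not a trailing hyphen, or nothing follows: pass through unchanged.
            if (!first.endsWith("-") || i >= tokens.size()) {
                out.add(first);
                continue;
            }

            // Consume the next token, as the filter does.
            String second = tokens.get(i++);

            if (exceptions.contains(first + second)) {
                // Genuinely hyphenated term: keep both parts as separate tokens.
                out.add(first);
                out.add(second);
            } else {
                // Line-break hyphenation: drop the hyphen and join the halves.
                out.add(first.substring(0, first.length() - 1) + second);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> exceptions =
            new HashSet<>(Arrays.asList("full-text", "read-only"));
        List<String> tokens =
            Arrays.asList("hyphen-", "ated", "full-", "text", "search");

        // prints [hyphenated, full-, text, search]
        System.out.println(merge(tokens, exceptions));
    }
}
```

As in the quoted filter, the exception case leaves the trailing hyphen on the first token ("full-"); a later filter in the chain would have to deal with it.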


-- 
Always remember you're unique. Just like everyone else.
