Re: skip document header while indexing

Erik Hatcher Fri, 29 Apr 2005 06:51:33 -0700

On Apr 29, 2005, at 8:30 AM, Pablo Gomes Ludermir wrote:

Could you give me some pointers (example or website) to how I could do that?

Lucene's own source code has several analyzers that are worth investigating. We also include several in Lucene in Action that demonstrate additional features like incorporating synonym lookup with WordNet and metaphone (soundex-like) replacements. Visit http://www.lucenebook.com to grab the source code download.

The trick would be to add a TokenFilter that drops tokens until the first N have been skipped, and then passes every remaining token through unchanged.

For an example, here's the Analyzer I wrote for the lucenebook.com site:

public class LiaAnalyzer extends Analyzer {
  private Set stopSet;
  boolean stem = true;

  public LiaAnalyzer() {
    stopSet = StopFilter.makeStopSet(StopAnalyzer.ENGLISH_STOP_WORDS);

    // just a few words that would not be queried on
    stopSet.add("isn");
    stopSet.add("xyz");
    stopSet.add("bcd");
    stopSet.add("blt");
    stopSet.add("dhb");
    stopSet.add("ttc");
    stopSet.add("you");
    stopSet.add("our");
  }

  public LiaAnalyzer(boolean stem) {
    this();
    this.stem = stem;
  }

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenFilter filter = new DashSplitterFilter(
              new HyphenatedFilter(
                new DashDashFilter(
                  new LiaTokenizer(reader))));

    filter = new LengthFilter(3, filter);
    filter = new StopFilter(filter, stopSet);

    if (stem) {
      filter = new SnowballFilter(filter, "English");
    }

    return filter;
  }
}
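
The token-dropping filter suggested above could be sketched roughly as follows. This is only an illustration of the idea, not tested Lucene code: to keep it runnable without the Lucene jar, TokenSource and WhitespaceSource are hypothetical stand-ins for Lucene's TokenStream and a tokenizer, and SkipFirstTokensFilter and the tokens() helper are names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class SkipFirstTokensSketch {

  // Hypothetical stand-in for Lucene's TokenStream: next() returns null when exhausted.
  interface TokenSource {
    String next();
  }

  // Hypothetical stand-in for a tokenizer: splits the text on whitespace.
  static final class WhitespaceSource implements TokenSource {
    private final Iterator<String> words;

    WhitespaceSource(String text) {
      String trimmed = text.trim();
      List<String> list = trimmed.isEmpty()
          ? new ArrayList<String>()
          : Arrays.asList(trimmed.split("\\s+"));
      this.words = list.iterator();
    }

    public String next() {
      return words.hasNext() ? words.next() : null;
    }
  }

  // Drops the first skipCount tokens of the wrapped source, then passes the rest through.
  static final class SkipFirstTokensFilter implements TokenSource {
    private final TokenSource input;
    private int remainingToSkip;

    SkipFirstTokensFilter(TokenSource input, int skipCount) {
      this.input = input;
      this.remainingToSkip = Math.max(0, skipCount);
    }

    public String next() {
      // Consume and discard tokens until the quota is used up.
      while (remainingToSkip > 0) {
        remainingToSkip--;
        if (input.next() == null) {
          // The input had fewer tokens than we were asked to skip.
          remainingToSkip = 0;
          return null;
        }
      }
      return input.next();
    }
  }

  // Convenience helper for the example: collect whatever survives the filter.
  static List<String> tokens(String text, int skipCount) {
    TokenSource stream = new SkipFirstTokensFilter(new WhitespaceSource(text), skipCount);
    List<String> out = new ArrayList<String>();
    for (String t = stream.next(); t != null; t = stream.next()) {
      out.add(t);
    }
    return out;
  }
}
```

In a real analyzer the same counting logic would presumably live in a TokenFilter subclass wrapped around the existing filter chain inside tokenStream(), so that each new stream starts with a fresh count. Where in the chain it sits matters: placed before a stop filter it counts raw words, placed after it counts only the tokens that survived.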

        Erik

On 4/29/05, Erik Hatcher <[EMAIL PROTECTED]> wrote:


On Apr 29, 2005, at 7:50 AM, Pablo Gomes Ludermir wrote:

Hello all,
Is it possible to skip the first "xx" words while indexing a document? For instance, in the code below, I would like to skip the first "xx" words of "file" in the "CONTENTS_FIELD". Is that possible?
Document doc = new Document();
FileInputStream is = new FileInputStream(file);
Reader reader = new BufferedReader(new InputStreamReader(is));
doc.add(Field.Text(PATH_FIELD, artifactModel));
doc.add(Field.Text(CONTENTS_FIELD, reader, true));


I believe your best bet will be to put in a custom Analyzer that does
this.  It wouldn't be too hard to code a wrapper around an analyzer
that did this.

       Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

--
Pablo Gomes Ludermir
[EMAIL PROTECTED]
