How about making each line a separate document? You can worry about scaling later (e.g. the 32-bit limit on the number of docs in an index).
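
To make that concrete, here is a minimal sketch of the splitting step only, in plain Java with no Lucene dependency. It assumes the form feed (ASCII 12) marks a page break, as in your code, and keeps running zero-based counters like your pageCount and lineCount. The names LineDocuments, LineRecord and split are made up for illustration. Each record would then be added to the index as its own document, with the page and line numbers as stored fields; that part is left out because the field API differs between Lucene versions.

```java
import java.util.ArrayList;
import java.util.List;

public class LineDocuments {

    /** One non-empty line of text plus the page/line counters at that point. */
    public static final class LineRecord {
        public final int page;
        public final int line;
        public final String text;

        LineRecord(int page, int line, String text) {
            this.page = page;
            this.line = line;
            this.text = text;
        }
    }

    /**
     * Splits content into one record per non-empty line.
     * Assumption: '\f' is the page break; "\n", "\r" and "\r\n" are line breaks.
     * Counters are zero-based and never reset, mirroring the filter in the question.
     */
    public static List<LineRecord> split(String content) {
        List<LineRecord> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int pageCount = 0;
        int lineCount = 0;

        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            if (c == '\f') {
                // A page break ends the current line but only bumps the page counter.
                flush(records, current, pageCount, lineCount);
                pageCount++;
            } else if (c == '\n' || c == '\r') {
                // Treat "\r\n" as a single line break.
                if (c == '\r' && i + 1 < content.length() && content.charAt(i + 1) == '\n') {
                    i++;
                }
                flush(records, current, pageCount, lineCount);
                lineCount++;
            } else {
                current.append(c);
            }
        }
        // Emit whatever follows the last break.
        flush(records, current, pageCount, lineCount);
        return records;
    }

    /** Adds the buffered text as a record (if any) and clears the buffer. */
    private static void flush(List<LineRecord> records, StringBuilder current,
                              int page, int line) {
        if (current.length() > 0) {
            records.add(new LineRecord(page, line, current.toString()));
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String sample = "first line\nsecond line\n\fthird line on page two";
        for (LineRecord r : split(sample)) {
            System.out.println("page=" + r.page + " line=" + r.line + " text=" + r.text);
        }
    }
}
```

Since the break characters are consumed before any analyzer sees the text, WhitespaceAnalyzer can be used unchanged on each line, and the page/line position comes from the document rather than from per-term payloads.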

On Fri, Aug 6, 2010 at 11:37 AM, arun r <arun....@gmail.com> wrote:
> I am trying to create a custom analyzer that will check for pagebreak
> and linebreak and add the payload data for each term. In the custom
> filter I have this code:
>
> public boolean incrementToken() throws IOException {
>
>     if(input.incrementToken())
>     {
>         if(termAtt.term().equals(pageBreak)){
>             System.out.println("pageBreak");
>             pageCount++;
>         }
>         else if(termAtt.term().equals(lineBreak))
>         {
>             System.out.println("lineBreak");
>             lineCount++;
>         }
>         else
>             addPayload(lineCount, pageCount);
>
>         return true;
>     }
>     else
>         return false;
> }
>
> where pageBreak and lineBreak is defined as :
> int pageBreakAscii = 12;
> String pageBreak = new Character ((char) pageBreakAscii).toString();
> String lineBreak = System.getProperty("line.separator");
>
> And am using the WhitespaceAnalyzer tokenstream, which ignores the
> pageBreak and lineBreak. Is there a way to create a analyzer that will
> ignore the pagebreak and linebreak characters during search, but give
> access to them in incrementToken() in the filter ?
>
>
> --
> Where there is a will, there is a way !
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org