Sure, anything's possible. Whether Lucene is your best bet may be another
question <G>. But in this example, you're not using Lucene to do anything
except store the strings. By storing all the data as UN_TOKENIZED, all
you're doing is a regex match on the entire HTML text of each document. You
might as well put them in a database and do a "like" clause. Or store them
in files and read each file and do a regex. Or.....

My point is that this design doesn't leverage what Lucene does, which is
allow you to search quickly on terms. The body you're storing is just one
long string, not a series of tokens, so I question whether Lucene is
relevant here.

Unless you tokenize the body text and then do some interesting term
enumeration, I don't think Lucene is helping you.
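
For example, here's a rough sketch of tokenized indexing plus a term query,
assuming the Lucene 2.x-era API (StandardAnalyzer, Field.Index.TOKENIZED);
the index path, field name, and sample text are just placeholders, not your
crawler's code:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;

public class TokenizedBodyExample {
    public static void main(String[] args) throws Exception {
        // Index the body as TOKENIZED so Lucene builds an inverted index of
        // terms instead of storing one long opaque string.
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("body", "<table><th>test</th></table>",
                          Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Term-level queries now hit the inverted index directly.
        IndexSearcher searcher = new IndexSearcher("index");
        Hits hits = searcher.search(new TermQuery(new Term("body", "test")));
        System.out.println("matches: " + hits.length());
        searcher.close();
    }
}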

Best
Erick

On 10/3/06, John Bugger <[EMAIL PROTECTED]> wrote:

My crawler indexes crawled pages with this code:
Document doc = new Document();
doc.add(new Field("body", page.getHtmlData(), Store.YES, Index.UN_TOKENIZED));
doc.add(new Field("url", page.getUrl(), Store.YES, Index.UN_TOKENIZED));
doc.add(new Field("title", page.getTitle(), Store.YES, Index.TOKENIZED));
doc.add(new Field("id", Integer.toString(page.getId()), Store.YES, Index.NO));
try {
    indexWriter.addDocument(doc);
}
catch (Exception e) {
    log.error(e.getMessage());
}

I need to write an application that can search through the indexed pages'
HTML code using patterns like:
<table width="100%" height="50" style="border: 1px solid red;">
  *
  <th>*test*</th>
  *
</table>
This should match all documents containing such code, regardless of the
order of the tag attributes.
Is this possible with the Lucene engine?

Thanks!

