Sure, anything's possible. Whether Lucene is your best bet may be another question <G>. But in this example you're not using Lucene to do anything except store the strings. Since the body is stored UN_TOKENIZED, all you can do is a regex match against the entire HTML text of each document. You might as well put the pages in a database and use a "like" clause, or store them in files and run a regex over each one, or...
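To make that concrete: with the body stored as one string, an attribute-order-independent match comes down to plain regex work, not Lucene. Here's a minimal sketch using only java.util.regex (class and method names are illustrative, not from your crawler); chaining one lookahead per required attribute is what makes the order not matter:

```java
import java.util.List;
import java.util.regex.Pattern;

public class RegexScan {
    // Matches a <table> tag carrying all three required attributes, in any
    // order, by chaining one lookahead per attribute. [^>]* keeps each
    // lookahead inside the tag itself.
    static final Pattern TABLE = Pattern.compile(
        "<table\\b(?=[^>]*\\bwidth=\"100%\")"
        + "(?=[^>]*\\bheight=\"50\")"
        + "(?=[^>]*\\bstyle=\"border: 1px solid red;\")[^>]*>",
        Pattern.CASE_INSENSITIVE);

    static boolean matches(String html) {
        return TABLE.matcher(html).find();
    }

    public static void main(String[] args) {
        // The second page lists the attributes in a different order,
        // yet both of the first two match; the third does not.
        List<String> pages = List.of(
            "<table width=\"100%\" height=\"50\" style=\"border: 1px solid red;\"><th>a test b</th></table>",
            "<table style=\"border: 1px solid red;\" width=\"100%\" height=\"50\"></table>",
            "<table width=\"50%\"></table>");
        for (String p : pages) {
            System.out.println(matches(p));
        }
    }
}
```

Lucene contributes nothing to this loop except holding the strings, which is the point above.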
My point is that this design doesn't leverage what Lucene does, which is let you search quickly on terms. The body you're storing is just one long string, not a series of tokens, so I question whether Lucene is relevant here. Unless you tokenize the body text and then do some interesting term enumeration, I don't think Lucene is helping you.

Best
Erick

On 10/3/06, John Bugger <[EMAIL PROTECTED]> wrote:
My crawler indexes crawled pages with this code:

    Document doc = new Document();
    doc.add(new Field("body", page.getHtmlData(), Store.YES, Index.UN_TOKENIZED));
    doc.add(new Field("url", page.getUrl(), Store.YES, Index.UN_TOKENIZED));
    doc.add(new Field("title", page.getTitle(), Store.YES, Index.TOKENIZED));
    doc.add(new Field("id", Integer.toString(page.getId()), Store.YES, Index.NO));
    try {
        indexWriter.addDocument(doc);
    } catch (Exception e) {
        log.error(e.getMessage());
    }

I need to write an application that can search through the indexed pages' HTML code using code patterns like:

    <table width="100%" height="50" style="border: 1px solid red;">
    * <th>*test*</th> *
    </table>

This should match all documents with such code regardless of the order of the tag parameters. Is this possible with the Lucene engine? Thanks!
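The "tokenize the body text" route Erick suggests could start with something like the sketch below: break each tag into normalized terms (tag name plus one attribute=value token each), so that two orderings of the same tag index identically and an order-independent match becomes a matter of requiring all the terms. This is a rough, pure-Java illustration of the idea, not Lucene's Analyzer API; all names here are made up for the example:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagTokenizer {
    // Matches attr="value" pairs inside a tag.
    static final Pattern ATTR = Pattern.compile("([\\w-]+)\\s*=\\s*\"([^\"]*)\"");

    // Emits one token for the tag name and one per attribute=value pair,
    // lower-cased with quotes stripped. Indexing tokens like these in a
    // TOKENIZED field would let a conjunction of term queries match the
    // tag regardless of attribute order.
    static Set<String> tokenize(String tag) {
        Set<String> tokens = new LinkedHashSet<>();
        Matcher name = Pattern.compile("<\\s*(\\w+)").matcher(tag);
        if (name.find()) tokens.add(name.group(1).toLowerCase());
        Matcher attr = ATTR.matcher(tag);
        while (attr.find()) {
            tokens.add(attr.group(1).toLowerCase() + "=" + attr.group(2).toLowerCase());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two orderings of the same tag yield the same token set.
        String a = "<table width=\"100%\" height=\"50\" style=\"border: 1px solid red;\">";
        String b = "<table style=\"border: 1px solid red;\" height=\"50\" width=\"100%\">";
        System.out.println(tokenize(a).equals(tokenize(b)));
    }
}
```

Whether this is worth it over a database "like" clause depends on how many pages you have and how often you search them; the tokenization is what would finally give Lucene's inverted index something to do.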