Andreas,

Items 10 and 11 of the Lucene FAQ provide a (partial) answer.

--------------------------------------------------------------------------------

10. Can I use Lucene to crawl my site or other sites on the Internet?
No. Lucene does not know how to access external documents, nor does it know how 
to extract the content and links of HTML and other document formats. Lucene 
focuses on indexing and searching, and does that very well. 

--------------------------------------------------------------------------------

11. How can I extract the content of HTML pages?
Lucene (at least the current version) does not provide handlers for various 
document formats and leaves this task to the application. To extract content 
from HTML pages, you may use an HTML parser (there are several free ones on 
the Internet). If you have a hard time finding one, you can post a question to 
the Lucene User mailing list. 

(tip by T.J.Mather) Lucene includes an HTML parser in the demo/HTMLParser 
directory of the distribution. This is used by the demo/IndexHTML.java class. 
--------------------------------------------------------------------------------
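
To make the "leaves this task to the application" part of item 11 concrete,
here is a minimal sketch of pulling the text and the links out of an HTML page
before handing the text to an indexer. It is only an illustration under some
assumptions: it uses the HTML parser that ships with the JDK (javax.swing.text.html),
not the demo/HTMLParser from the Lucene distribution, and the names
HtmlTextExtractor, Result and extract are made up for this example. Fetching
the page and the indexing step itself are left out.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of Lucene): extracts text and link targets
// from an HTML string using the JDK's built-in HTML parser.
class HtmlTextExtractor {

    // Simple holder for what was found in the page.
    static final class Result {
        final String text;
        final List<String> links;

        Result(String text, List<String> links) {
            this.text = text;
            this.links = links;
        }
    }

    static Result extract(String html) {
        final StringBuilder text = new StringBuilder();
        final List<String> links = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Each text chunk is appended with a separating space.
                text.append(data).append(' ');
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Collect the href of every <a> element that has one.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore charset declarations in the page.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Result(text.toString().trim(), links);
    }

    public static void main(String[] args) {
        Result r = extract("<html><body><p>Hello "
                + "<a href=\"http://example.com/a\">world</a></p></body></html>");
        System.out.println("text:  " + r.text);
        System.out.println("links: " + r.links);
    }
}
```

The extracted text is what the application would then pass to Lucene as a
field value; the collected links are what a separate crawler (item 10) would
follow, since Lucene does not do that itself.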

Gregory

-----Original Message-----
From: Andreas Kuckartz [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 16 November 2004 16:37
To: [EMAIL PROTECTED]
Subject: Re: Mentor(s) required for a search engine project


I am no potential sponsor but would like to see a comparison to Apache Jakarta
Lucene (http://jakarta.apache.org/lucene/docs/index.html) which is implemented
in Java.

Andreas
