Andreas,

Items 10 and 11 of the Lucene FAQ provide a (partial) answer.

--------------------------------------------------------------------------------
10. Can I use Lucene to crawl my site or other sites on the Internet?

No. Lucene does not know how to access external documents, nor does it know how to extract the content and links of HTML and other document formats. Lucene focuses on indexing and searching, and does it well.

--------------------------------------------------------------------------------
11. How can I extract the content of HTML pages?

Lucene (at least the current version) does not provide handlers for various document formats and leaves this task to the application. To extract content from HTML pages, you may use an HTML parser (there are several free ones on the Internet). If you have a hard time finding one, you can post a question to the Lucene User mailing list.

(tip by T.J.Mather) Lucene includes an HTML parser in the demo/HTMLParser directory of the distribution. This is used by the demo/IndexHTML.java class.
--------------------------------------------------------------------------------

Gregory

-----Original Message-----
From: Andreas Kuckartz [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 16, 2004 16:37
To: [EMAIL PROTECTED]
Subject: Re: Mentor(s) required for a search engine project

I am not a potential sponsor, but I would like to see a comparison to Apache Jakarta Lucene (http://jakarta.apache.org/lucene/docs/index.html), which is implemented in Java.

Andreas
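
P.S. To make FAQ item 11 a bit more concrete, here is a minimal sketch of the "extract the content yourself, then hand it to Lucene" step. It does not use the parser from Lucene's demo/HTMLParser directory; as a stand-in it uses the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). The class and method names (HtmlTextExtractor, Collector, extractText, extractLinks) are made up for illustration. The Lucene indexing call is left out on purpose, since its API depends on the version you use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls text and link targets out of an HTML string
// so the application can pass them on to an indexer such as Lucene.
class HtmlTextExtractor {

    // Receives parser events and collects text chunks and href values.
    static class Collector extends HTMLEditorKit.ParserCallback {
        final StringBuilder text = new StringBuilder();
        final List<String> links = new ArrayList<>();

        @Override
        public void handleText(char[] data, int pos) {
            // Separate consecutive text chunks with a single space.
            if (text.length() > 0) {
                text.append(' ');
            }
            text.append(data);
        }

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            // Record the target of every <a href="..."> element.
            if (tag == HTML.Tag.A) {
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href != null) {
                    links.add(href.toString());
                }
            }
        }
    }

    static Collector parse(String html) {
        Collector collector = new Collector();
        try {
            // ignoreCharSet = true: do not abort on a <meta> charset declaration.
            new ParserDelegator().parse(new StringReader(html), collector, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return collector;
    }

    static String extractText(String html) {
        return parse(html).text.toString();
    }

    static List<String> extractLinks(String html) {
        return parse(html).links;
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Hello <b>world</b></p>"
                + "<a href=\"http://example.org/\">a link</a></body></html>";
        System.out.println(extractText(page));
        System.out.println(extractLinks(page));
    }
}
```

The extracted text is what the application would put into a Lucene document field; the collected links are what a separate crawler (which Lucene does not provide, per item 10) would follow.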