A little while ago I announced the existence of the Aperture project, founded by my company together with the DFKI institute.

We just released Aperture 2006.1 alpha 2, which may be of interest to all Lucene users dealing with crawling and text extraction.

The project page is located at:

        http://sourceforge.net/projects/aperture

To summarize, Aperture now has code for the following tasks:

- Crawling of file systems, websites and IMAP folders. An Outlook mailbox crawler is also in the works; any help is welcome.

- Text and metadata extraction from a large and growing number of document formats, e.g. MS Office files, MS Works, OpenOffice, OpenDocument, RTF, PDF, WordPerfect, Quattro, Presentations, HTML, XML, plain text...

- A robust magic number-based MIME type identifier, a must for choosing the right extractor for a given document.

- Security-related classes for handling self-signed certificates when communicating using SSL.
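
To illustrate what magic number-based identification means, here is a minimal Java sketch. It is not Aperture's actual API: the class name MagicMimeSketch, the method identify and the tiny signature table are made up for this example, and a real identifier would need many more signatures plus a look inside container formats such as ZIP and OLE2. The idea is simply to compare the first bytes of a document against known signatures instead of trusting the file extension.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of magic number-based MIME type detection (not Aperture code). */
final class MagicMimeSketch {

    /** A leading byte sequence and the MIME type it suggests. */
    private static final class Signature {
        final byte[] magic;
        final String mimeType;

        Signature(byte[] magic, String mimeType) {
            this.magic = magic;
            this.mimeType = mimeType;
        }
    }

    // Checked in order; the first matching signature wins.
    private static final List<Signature> SIGNATURES = new ArrayList<>();

    static {
        SIGNATURES.add(new Signature(
                "%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        SIGNATURES.add(new Signature(
                "{\\rtf".getBytes(StandardCharsets.US_ASCII), "application/rtf"));
        // OLE2 compound file header, used by legacy MS Office documents.
        // The type label is an unofficial one, chosen for illustration.
        SIGNATURES.add(new Signature(
                new byte[] {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1},
                "application/x-ole-storage"));
        // ZIP local file header; OpenDocument files are ZIP containers,
        // so a real identifier would inspect the archive further.
        SIGNATURES.add(new Signature(
                new byte[] {'P', 'K', 3, 4}, "application/zip"));
    }

    private MagicMimeSketch() {
    }

    /** Returns the MIME type suggested by the leading bytes, or a generic fallback. */
    static String identify(byte[] head) {
        for (Signature signature : SIGNATURES) {
            if (startsWith(head, signature.magic)) {
                return signature.mimeType;
            }
        }
        return "application/octet-stream";
    }

    /** True if data begins with exactly the bytes in prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In practice one would read only the first few bytes of each crawled document, pass them to identify, and use the result to select an extractor; the fallback type signals that no known signature matched.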

Most of the code is already in good shape. It is still labeled "alpha" because we only recently started applying Aperture in our own software, which may still lead to some (probably minor) API changes.

Future plans include continuously extending the set of extractors (e.g. for mp3, images and videos), adding support for Thunderbird and other mail clients, and adding support for expanding and crawling archives, address books, etc.

Furthermore, we are working on metadata storage facilities that build upon Lucene and Sesame, an RDF storage and query engine (see www.openrdf.org). This should combine the expressiveness of RDF and the performance and scalability of Sesame with Lucene's full-text indexing capabilities.

For questions please consider joining the aperture-devel mailing list.


Regards,

Christiaan Fluit.

--
[EMAIL PROTECTED]

Aduna
Prinses Julianaplein 14-b
3817 CS Amersfoort
The Netherlands

+31 33 465 9987 phone
+31 33 465 9987 fax

http://aduna.biz
