> Will it be possible to index them as it is? You'll not be indexing headers or the From field specially if you simply throw the text at the indexer. You'll not even be separating out separate e-mail messages. You certainly need to be able to separate out e-mail messages, to treat messages as separate documents. You'll also probably want to MIME-decode to index attachments etc. Also you may want to unzip etc. If you want to do this yourself, you should start by getting the Lucene In Action book and look at the content handlers for different file types etc etc, but reading the book will probably take a chunk out of your "couple of days". That's why taking Tropo off the shelf is probably your best route, assuming that Tropo's analysis is adequate for your needs, handling all of your attachment requirements.
Disclaimer: I'm not familiar with Tropo. -----Original Message----- From: Rob Staveley (Tom) [mailto:[EMAIL PROTECTED] Sent: 15 August 2006 13:26 To: java-user@lucene.apache.org Subject: RE: Indexing existing email archives OK, we're well off topic now. If you have follow up on this can I recommend that you don't do this via [EMAIL PROTECTED] You'll see from http://www.mozilla.org/support/thunderbird/faq#q2.10 that Thunderbird stores its mail in mbox format. If you use Dovecot for your imap server, you'll be able to use the mbox file directly. Just copy the file over. mbox is very easy to work with. You'll have seen from looking at it with your text editor that there are "From lines" followed by raw RFC 822 data. You can append one mbox file to another very simply. -----Original Message----- From: Suba Suresh [mailto:[EMAIL PROTECTED] Sent: 15 August 2006 12:46 To: java-user@lucene.apache.org Subject: Re: Indexing existing email archives The mail archives and saved mail folders can be opened up and read as a text file using textpad or wordpad. Will it be possible to index them as it is? The mail client is Thunderbird Mailbox. If I have to use third party software is there anything you can suggest? suba suresh. Rob Staveley (Tom) wrote: >> I have to have this working in next couple of days. > > I had a similar requirement and it look me several weeks to get > something working. I think you'll need to make use of as much off the > shelf software as you can, given your timescale. > > If you don't want to get your hands dirty with Lucene and like Tropo, > why not convert your local storage to mbox/maildir/mbx if it is not > already in the appropriate format and light up imapd while you index? > e.g. If your local format is PST (i.e. Outlook), take a look at > http://www.tldp.org/HOWTO/html_single/Outlook-to-Unix-Mailbox/ > > -----Original Message----- > From: Suba Suresh [mailto:[EMAIL PROTECTED] > Sent: 14 August 2006 16:50 > To: java-user@lucene.apache.org > Subject: Indexing existing email archives > > I was looking at "http://www.tropo.com/techno/java/lucene/imap.html" > and my understanding is it is used to retrieve and index the emails > that is on the email server. I have some stored emails in folders in > my local disk and huge list of email archives in another system. Is > there a way I could index them? > > I have to have this working in next couple of days. Any help and > suggestions are appreciated. > > thanks, > suba suresh. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
smime.p7s
Description: S/MIME cryptographic signature