Hi Dejan,

how do you query across both the email documents and the attachment documents if you want to present just one hit per email, even when the search term matches in the email and(!) in the corresponding attachment document?
Thanks
lude

On 8/15/06, Dejan Nenov <[EMAIL PROTECTED]> wrote:
The approach I find best is to create both Email documents, each containing a list of (and links to) all of its attachments, and individual Attachment documents. It gets a little tricky when you have a forwarded email containing an original email that contains a tar.gz attachment, which in turn contains the "actual" attached files :)

(Shameless promotion follows.) If you are a Windows user, for a _very_ good example get a copy of X1 Desktop (free, also distributed as Yahoo! Desktop Search), then right-click on the column headers and look at the available fields for email.

Dejan

-----Original Message-----
From: lude [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 15, 2006 10:29 AM
To: java-user@lucene.apache.org
Subject: Best Practice: emails and file-attachments

Hello,

does anybody have an idea of the best design approach for realizing the following?

The goal is to index emails and their corresponding file attachments. One email could contain, for example:

1 x subject
1 x sender-address
1 x to-addresses
1 x message-text
0..n x file-attachments (each contains a 'file-name' and the 'file-content')

How should I build the index?

First approach:
Each email plus its attachments gets one document with the following fields:
subject, sender_address, to_address, message_text, 1_attachment_name, 1_attachment_content, 2_attachment_name, 2_attachment_content, 3_attachment_name, 3_attachment_content

Disadvantage: Only three attachments can be indexed. It isn't a generic solution for indexing 'n' file attachments.

Second approach:
Each email gets one document with the main email data, plus 0 to n documents for its file attachments:
1 x email_id, subject, sender_address, to_address, message_text
0..n x email_id, attachment_name, attachment_content

Disadvantage: At query time it is difficult to aggregate the documents that belong together. One hit per email (including attachments) should be shown.

Any thoughts?
Thanks
lude
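
One possible way to get "one hit per email" with the second approach is to collapse the raw hits on the shared email_id after the search has run. The following is only a minimal sketch in plain Java, not Lucene API code: it assumes the search results have already been read into simple (email_id, doc type, score) values. The class name EmailHitCollapser, the Hit type and the "email"/"attachment" marker values are hypothetical and do not come from the thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Collapses raw index hits so that each email appears at most once. */
final class EmailHitCollapser {

    /** One raw hit: the email document itself or one of its attachment documents. */
    static final class Hit {
        final String emailId; // value of the shared email_id field
        final String docType; // "email" or "attachment" (hypothetical marker)
        final float score;    // relevance score reported by the search engine

        Hit(String emailId, String docType, float score) {
            this.emailId = emailId;
            this.docType = docType;
            this.score = score;
        }
    }

    /** Keeps the best-scoring hit per email_id, sorted by descending score. */
    static List<Hit> collapse(List<Hit> rawHits) {
        Map<String, Hit> bestPerEmail = new LinkedHashMap<>();
        for (Hit hit : rawHits) {
            Hit best = bestPerEmail.get(hit.emailId);
            if (best == null || hit.score > best.score) {
                bestPerEmail.put(hit.emailId, hit);
            }
        }
        List<Hit> result = new ArrayList<>(bestPerEmail.values());
        // List.sort is stable, so emails with equal scores keep first-seen order.
        result.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return result;
    }

    public static void main(String[] args) {
        // The search term matched email "mail-1" and one of its attachments,
        // plus an attachment that belongs to email "mail-2".
        List<Hit> raw = Arrays.asList(
                new Hit("mail-1", "email", 0.4f),
                new Hit("mail-2", "attachment", 0.9f),
                new Hit("mail-1", "attachment", 0.7f));
        for (Hit hit : collapse(raw)) {
            System.out.println(hit.emailId + " (best match in: " + hit.docType + ")");
        }
    }
}
```

Keeping the best-scoring hit per email is just one choice; summing the scores of an email and its attachments would be another. Either way the whole raw hit list has to be walked before the first result page can be shown, which may matter for very large result sets.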