[ https://issues.apache.org/jira/browse/TIKA-4391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17931277#comment-17931277 ]
Ross Johnson edited comment on TIKA-4391 at 2/27/25 7:46 PM:
-------------------------------------------------------------

I've worked a lot with msg files & normalizing attachments, so just thought I'd give a bit of a brain dump of related info.

In HTML bodies & RTF-encapsulated HTML bodies, inline images normally have _PidTagAttachmentHidden_ = true. Note that I have seen unusual emails where an image attachment has _PidTagAttachmentHidden_ = true, but there doesn't seem to be a place in the HTML where that image actually goes, i.e. no apparent reference in the HTML. This flag is also used to hide other non-image attachments, mostly related to calendar invites & calendar exceptions.

For real RTF bodies that reference attachments, things are a bit different. These attachments don't have _PidTagAttachmentHidden_ = true but rather have _PidTagRenderingPosition_ < 0xFFFFFFFF.

The main issue with RTF attachments is determining whether the attachment is actually fully shown inline or not. For example, an embedded message or normal binary file will just show a thumbnail in the body (stored in an OLE presentation stream). Other attachments, such as an Excel file, may show a selection of a worksheet inline, and clicking on that section in Outlook then opens the full Excel file. I think true inline images won't have any OLE presentation defined, indicating that the original image data is used inline directly instead.

> Detect inline images in msg files
> ---------------------------------
>
>                 Key: TIKA-4391
>                 URL: https://issues.apache.org/jira/browse/TIKA-4391
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>
> Images are stored as attachments. It would be helpful to be able to
> distinguish between "inline" images that are intended to be rendered in the
> email vs regular image attachments.
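The heuristics in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions, not Tika's or POI's actual API: the `InlineAttachmentHeuristic` class, the `Attachment` holder, its field names, and the `Verdict` values are all hypothetical. It assumes the caller has already read _PidTagAttachmentHidden_ and _PidTagRenderingPosition_ from the attachment, and has worked out separately whether the HTML references the image and whether an OLE presentation stream exists. The RTF branch encodes the comment's guess ("I think") about inline images, so treat it as unverified.

```java
// Hypothetical sketch: none of these types exist in Tika or POI.
class InlineAttachmentHeuristic {

    enum BodyType { HTML, RTF }

    enum Verdict { INLINE_IMAGE, HIDDEN_OTHER, RTF_EMBEDDED_OBJECT, REGULAR_ATTACHMENT }

    // A rendering position of 0xFFFFFFFF means "not rendered in the RTF body".
    static final long NOT_RENDERED = 0xFFFFFFFFL;

    // Values the caller is assumed to have extracted from the msg file.
    static final class Attachment {
        final boolean hidden;             // PidTagAttachmentHidden
        final long renderingPosition;     // PidTagRenderingPosition, as unsigned 32-bit
        final boolean hasOlePresentation; // an OLE presentation stream is present
        final boolean referencedInHtml;   // the HTML body refers to this attachment
        final boolean isImage;            // e.g. decided from the MIME type

        Attachment(boolean hidden, long renderingPosition, boolean hasOlePresentation,
                   boolean referencedInHtml, boolean isImage) {
            this.hidden = hidden;
            this.renderingPosition = renderingPosition;
            this.hasOlePresentation = hasOlePresentation;
            this.referencedInHtml = referencedInHtml;
            this.isImage = isImage;
        }
    }

    static Verdict classify(BodyType body, Attachment a) {
        if (body == BodyType.HTML) {
            // HTML and RTF-encapsulated HTML: inline images are normally hidden.
            if (!a.hidden) {
                return Verdict.REGULAR_ATTACHMENT;
            }
            if (a.isImage && a.referencedInHtml) {
                return Verdict.INLINE_IMAGE;
            }
            // Hidden but unreferenced image, or hidden non-image (calendar data).
            return Verdict.HIDDEN_OTHER;
        }
        // Real RTF: positioned attachments have a rendering position below 0xFFFFFFFF.
        if (a.renderingPosition >= NOT_RENDERED) {
            return Verdict.REGULAR_ATTACHMENT;
        }
        // Assumption from the comment: true inline images have no OLE presentation.
        if (a.isImage && !a.hasOlePresentation) {
            return Verdict.INLINE_IMAGE;
        }
        return Verdict.RTF_EMBEDDED_OBJECT;
    }
}
```

Keeping `HIDDEN_OTHER` separate from `INLINE_IMAGE` leaves room for the unusual emails mentioned above, where an image is flagged hidden but the HTML never references it.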