[ https://issues.apache.org/jira/browse/TIKA-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134180#comment-15134180 ]
Daniel Bonniot de Ruisselet commented on TIKA-1854:
---------------------------------------------------

The documents I'm processing sometimes have embedded files representing chemical structures. Having the storage class IDs allows me to know which embedded files are chemical structures, and which format they are in, so that I can process them accordingly.

What I mean about the content type is that metadata.get(Metadata.CONTENT_TYPE) already returns, for instance, "application/vnd.ms-excel" for embedded Excel documents. However, it is not populated for chemical or other formats. I might be mistaken, but it seems to me that the documentation you linked to is about the MIME type of the main (container) document. Is the same mechanism used to determine the MIME type of the embedded documents?

I think the specific formats I'm interested in are not in widespread use, so for contributions to Tika I'm rather focused on a generic solution. Getting the storage class IDs will definitely be useful in such cases. If custom MIME types worked for embedded documents, that could also be useful.


> Include the storage class ID of documents embedded in MS Office documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-1854
>                 URL: https://issues.apache.org/jira/browse/TIKA-1854
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Daniel Bonniot de Ruisselet
>            Assignee: Tim Allison
>         Attachments: class-id.patch
>
>
> When processing embedded documents using an EmbeddedDocumentExtractor, the
> storage class ID of the embedded document would be useful metadata to have,
> but it's currently missing.
> I'll promptly attach a patch implementing and testing this new feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
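
The use case described in the comment (fall back to the storage class ID when no content type is populated for an embedded document) could be sketched roughly as follows. This is only an illustration under stated assumptions: a plain Map stands in for Tika's Metadata object, the key name "embeddedStorageClassId" is assumed rather than taken from the patch, and the class ID and format name in the usage note are made-up placeholders, not real chemical formats.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/**
 * Illustrative sketch: resolves the format of an embedded document from its
 * metadata, preferring the content type and falling back to a lookup table
 * keyed by storage class ID.
 */
public class EmbeddedFormatResolver {
    /** Same string value as Tika's Metadata.CONTENT_TYPE. */
    public static final String CONTENT_TYPE = "Content-Type";
    /** ASSUMPTION: hypothetical key under which the class ID is exposed. */
    public static final String STORAGE_CLASS_ID = "embeddedStorageClassId";

    private final Map<String, String> formatsByClassId = new HashMap<>();

    /** Associates a storage class ID with a caller-chosen format name. */
    public void register(String classId, String format) {
        formatsByClassId.put(normalize(classId), format);
    }

    /** Returns the content type if present, else the format registered for the class ID. */
    public Optional<String> resolve(Map<String, String> metadata) {
        String contentType = metadata.get(CONTENT_TYPE);
        if (contentType != null && !contentType.isEmpty()) {
            return Optional.of(contentType);
        }
        String classId = metadata.get(STORAGE_CLASS_ID);
        if (classId == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(formatsByClassId.get(normalize(classId)));
    }

    /** Strips optional surrounding braces and upper-cases, so both spellings match. */
    private static String normalize(String classId) {
        String s = classId.trim();
        if (s.startsWith("{") && s.endsWith("}")) {
            s = s.substring(1, s.length() - 1);
        }
        return s.toUpperCase(Locale.ROOT);
    }
}
```

A caller would register its own IDs once, for example register("{00000000-0000-0000-0000-0000000000AA}", "chemical/x-example"), and then call resolve(...) on the metadata of each embedded document, for instance from within an EmbeddedDocumentExtractor. Embedded Excel documents would still resolve through their content type, while formats Tika does not recognize would resolve through the class ID table.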