Hi,

A while ago I added the http://wiki.apache.org/tika/MetadataDiscussion page to the Tika wiki.
Since then, with the help of Jukka Zitting, a solution has been described for using the current Tika library to capture nested document metadata and associate it with the text extracted for each nested document. What hasn't been accomplished is identifying a way to get at both the metadata and the text of nested documents without the user writing a ContentHandler.

Here are some possibilities for moving forward:

- Decide that anyone who wants to identify the text and metadata associated with each nested document must write their own ContentHandler and ParserDecorator that gather the text and associate it with the corresponding metadata.
- Point out easier ways to accomplish the same thing with the existing Tika libraries.
- Provide a new Parser and ContentHandler combination that gathers subdocument text and metadata together and provides a stream of events (maybe something other than XHTML) with easier recursive document and metadata handling.
- Come up with a way to add nested metadata to the XHTML stream without violating XHTML.

Are there any thoughts on how to move forward? Is it okay if users who want to extract nested documents with metadata resort to writing their own content handlers and parser decorators? Or would the Tika team prefer to offer an easier way for users to extract nested documents with metadata?

Paul
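
For reference, here is a minimal sketch of the kind of ContentHandler the first option asks each user to write. It uses only the JDK's SAX classes, not Tika itself, so it is an illustration rather than the solution described on the wiki page. It assumes that each nested document shows up in the XHTML stream wrapped in a `<div class="package-entry">` element; that marker is an assumption and may differ by parser and version. The class name `NestedDocumentHandler`, the `Doc` holder, and the `attachMetadata` hook (which a user's ParserDecorator would have to call for the document currently being parsed) are hypothetical names introduced for this example.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that splits an XHTML event stream into a tree of nested documents. */
class NestedDocumentHandler extends DefaultHandler {

    /** Text and metadata gathered for one (possibly nested) document. */
    static class Doc {
        final Map<String, String> metadata = new LinkedHashMap<>();
        final StringBuilder text = new StringBuilder();
        final List<Doc> children = new ArrayList<>();
    }

    // Assumption: a nested document is wrapped in <div class="package-entry">.
    private static final String NESTED_CLASS = "package-entry";

    private final Doc root = new Doc();
    // Documents that are currently open, innermost on top.
    private final Deque<Doc> open = new ArrayDeque<>();
    // One flag per open element: did that element start a nested document?
    private final Deque<Boolean> opensDoc = new ArrayDeque<>();

    NestedDocumentHandler() {
        open.push(root);
    }

    Doc getRoot() {
        return root;
    }

    /** Hypothetical hook: a ParserDecorator would pass the current document's metadata here. */
    void attachMetadata(Map<String, String> metadata) {
        open.peek().metadata.putAll(metadata);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the SAX parser is not namespace-aware.
        String name = localName.isEmpty() ? qName : localName;
        boolean nested = "div".equals(name) && NESTED_CLASS.equals(atts.getValue("class"));
        if (nested) {
            Doc child = new Doc();
            open.peek().children.add(child);
            open.push(child);
        }
        opensDoc.push(nested);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Close the nested document only when its own wrapper element ends.
        if (opensDoc.pop()) {
            open.pop();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text always goes to the innermost open document.
        open.peek().text.append(ch, start, length);
    }

    /** Demo helper: runs the handler over an XHTML string with the JDK SAX parser. */
    static Doc parse(String xhtml) {
        NestedDocumentHandler handler = new NestedDocumentHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample XHTML", e);
        }
        return handler.getRoot();
    }
}
```

With real Tika output the handler would be passed to the parser instead of the `parse` demo helper. Even in this reduced form the user has to track element nesting and wire metadata in by hand, which is the burden the other options try to remove.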