I just tried a PDF embedded within a .doc, and Tika extracted it. I didn't
test an MP3, so your mileage may vary.
You might want to use Tika (instructions below), or dive into its guts for
inspiration on using HWPF directly. The relevant classes are
org.apache.tika.parser.microsoft.WordExtractor and
org.apache.tika.parser.microsoft.AbstractPOIFSExtractor, both in the parsers
jar.
If you are going the Tika route:
Create a class that implements EmbeddedResourceHandler and implement its
"handle" method with something like this:
    // Assumption: outputDir and num are fields you define yourself,
    // e.g. outputDir derived from the File passed to your constructor.
    private final File outputDir;
    private int num = 0; // counter used to name unnamed embedded files

    @Override
    public void handle(String embeddedFileName, MediaType mediaType, InputStream is) {
        System.err.println("in handle: " + mediaType);
        if (embeddedFileName == null || embeddedFileName.equals("")) {
            embeddedFileName = "unnamed_file_" + num++;
        }
        // In case embeddedFileName comes with path information,
        // make sure to take just the name.
        String actualName = new File(embeddedFileName).getName();
        // Figure out what you want to call the file; this is just one option.
        File outFile = new File(outputDir, actualName);
        System.out.println("about to extract " + outFile);
        OutputStream os = null;
        try {
            os = new FileOutputStream(outFile);
            IOUtils.copy(is, os);
            os.flush();
        } catch (IOException e) {
            /* add logging */
        } finally {
            if (os != null) {
                try {
                    os.close();
                } catch (IOException e) {
                    // swallow
                }
            }
        }
    }
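For the "figure out what you want to call the file" step, here is a rough sketch that uses only the JDK (no Tika or POI calls). The class and method names (EmbeddedFileNames, safeName, uniqueTarget, save) are made up for illustration, and the naming scheme is just one possible choice. It wraps IOException in an unchecked exception on the assumption that it will be called from a "handle" method like the one above, which does not declare IOException.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for naming and saving embedded files; not part of Tika or POI.
class EmbeddedFileNames {

    // Returns a bare file name. Falls back to "unnamed_file_<index>" when the
    // embedded name is missing, and strips any path information otherwise.
    static String safeName(String embeddedFileName, int index) {
        String fallback = "unnamed_file_" + index;
        if (embeddedFileName == null || embeddedFileName.trim().isEmpty()) {
            return fallback;
        }
        // Embedded names may carry Windows-style separators, which
        // File.getName() does not split on when running on Unix.
        String normalized = embeddedFileName.replace('\\', '/');
        String name = new File(normalized).getName();
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return fallback;
        }
        return name;
    }

    // Picks a path inside outputDir that does not exist yet, prefixing
    // "1_", "2_", ... on collisions so the file extension is preserved.
    static Path uniqueTarget(Path outputDir, String name) {
        Path target = outputDir.resolve(name);
        int prefix = 1;
        while (Files.exists(target)) {
            target = outputDir.resolve(prefix + "_" + name);
            prefix++;
        }
        return target;
    }

    // Copies the stream into outputDir under a safe, unique name and
    // returns the path that was written.
    static Path save(InputStream in, Path outputDir, String embeddedFileName, int index) {
        try {
            Files.createDirectories(outputDir);
            Path target = uniqueTarget(outputDir, safeName(embeddedFileName, index));
            Files.copy(in, target);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(safeName("C:\\docs\\song.mp3", 0));
        System.out.println(safeName(null, 1));
    }
}
```

Inside "handle" you could then replace the naming and copying code with a single call such as EmbeddedFileNames.save(is, outputDir.toPath(), embeddedFileName, num++), assuming outputDir and num are fields of your handler.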
Then call Tika like this (assuming you've named your EmbeddedResourceHandler
"WithinDirectoryEmbeddedHandler" and f is the File you want to process):
    TikaInputStream is = TikaInputStream.get(f);
    try {
        ParserContainerExtractor containerExtractor = new ParserContainerExtractor();
        // The second argument is the extractor used to recurse into
        // any embedded containers.
        containerExtractor.extract(is, new ParserContainerExtractor(),
                new WithinDirectoryEmbeddedHandler(f));
    } finally {
        is.close();
    }
From: Chris Bamford [mailto:[email protected]]
Sent: Friday, June 07, 2013 8:32 AM
To: POI Users List
Subject: Extracting embedded files from HWPF docs
Hi guys,
Is there a way to extract files embedded into Word docs (.doc, not .docx),
using the HWPF package?
I understand that I can extract Pictures with
document.getPicturesTable().getAllPictures();
But I am specifically interested in non-picture files too (e.g. MP3).
Thanks,
- Chris
Chris Bamford
Senior Developer