On Fri, 9 Nov 2012, Norman M wrote:
I am using Apache Tika to extract text from PPT/PPTX files.

Is Poi really using streaming to parse files?

Some bits. XLS file processing is stream based. For PPT, the whole file gets processed first, and then the text parts are located and picked out.

File file = new File("temp.ppt");
URL url = file.toURI().toURL();
OutputStream outputStream = new ByteArrayOutputStream();

InputStream input = TikaInputStream.get(url, metadata);

Is there a reason why you're not passing the file to TikaInputStream, but going via the URL instead?


ContentHandler handler = new BodyContentHandler(outputStream);
parser.parse(input, handler, metadata, context);
String extractedText = outputStream.toString();

The text you extract will probably be fairly small, but the code above means it all has to be buffered first. You might want to look at processing the SAX events as they come in instead of buffering everything. That reduces memory use, especially for very large amounts of text.
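
As a rough illustration of handling the SAX events as they arrive, here is a minimal sketch that uses only the JDK's SAX classes. The names StreamingTextHandler, StreamingTextDemo and extract are made up for this example and are not part of Tika or POI. The handler writes each chunk of text to a Writer straight away instead of collecting it. Since Tika reports its output as SAX events to whatever ContentHandler it is given, a handler along these lines could presumably be passed to parser.parse() in place of the BodyContentHandler, with the Writer pointing at a file or socket. Unlike BodyContentHandler it does not filter for the body element, so it would also emit any other text in the document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards every chunk of text to a Writer as soon as it arrives. */
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;
    private long charsSeen = 0;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Write the chunk immediately; nothing is accumulated in this handler.
            out.write(ch, start, length);
            charsSeen += length;
        } catch (IOException e) {
            throw new SAXException(e);
        }
    }

    long getCharsSeen() {
        return charsSeen;
    }
}

public class StreamingTextDemo {
    /** Parses the XML with the JDK SAX parser (a stand-in for Tika here) and streams the text to out. */
    static long extract(String xml, Writer out) throws Exception {
        StreamingTextHandler handler = new StreamingTextHandler(out);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        out.flush();
        return handler.getCharsSeen();
    }

    public static void main(String[] args) throws Exception {
        // A StringWriter is used only to keep the demo self-contained;
        // a FileWriter or similar would avoid holding the text in memory.
        StringWriter sink = new StringWriter();
        long n = extract("<body><p>Slide one</p><p>Slide two</p></body>", sink);
        System.out.println(n + " chars: " + sink);
    }
}
```

The only state kept by the handler is a character count, so memory use stays flat no matter how much text comes through, provided the Writer itself does not buffer everything.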

Nick
