Re: Is Tika really using streaming to parse files?

Nick Burch Sat, 10 Nov 2012 14:11:57 -0800

On Fri, 9 Nov 2012, Norman M wrote:

I am using Apache Tika to extract text from PPT/PPTX files.


Is Poi really using streaming to parse files?

Only in parts. XLS file processing is stream-based; for PPT the whole file gets processed, and then the text parts are located and picked out.

File file = new File("temp.ppt");
URL url = file.toURI().toURL();
OutputStream outputStream = new ByteArrayOutputStream();

InputStream input = TikaInputStream.get(url, metadata);

Is there a reason why you're not passing the file to TikaInputStream, but going via the URL instead?

ContentHandler handler = new BodyContentHandler(outputStream);
parser.parse(input, handler, metadata,context);
String extractedText = outputStream.toString();

The text you extract will probably be fairly small, but the code above will mean it all has to get buffered first. You might want to look at processing the SAX events as they come in instead of buffering everything, to reduce memory use, especially for very large amounts of text.
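
As a rough illustration of handling the SAX events as they arrive, here is a minimal sketch of a handler that forwards each chunk of text to a Writer straight away rather than collecting it in memory. The class name StreamingTextHandler and the getCharsSeen() counter are made up for this example and are not part of Tika; it only uses the standard org.xml.sax classes from the JDK.

```java
import java.io.IOException;
import java.io.Writer;

import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not a Tika class): writes each text chunk to the
// sink as soon as the parser reports it, so nothing is accumulated here.
class StreamingTextHandler extends DefaultHandler {
    private final Writer sink;
    private long charsSeen = 0;

    StreamingTextHandler(Writer sink) {
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Forward only the reported slice of the parser's buffer.
            sink.write(ch, start, length);
            charsSeen += length;
        } catch (IOException e) {
            throw new SAXException("Failed to write extracted text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            // Push out anything the Writer itself may still be holding.
            sink.flush();
        } catch (IOException e) {
            throw new SAXException("Failed to flush extracted text", e);
        }
    }

    // Number of characters forwarded so far.
    long getCharsSeen() {
        return charsSeen;
    }
}
```

With Tika, something along these lines could be passed to parser.parse() in place of the buffering handler, probably wrapped as new BodyContentHandler(new StreamingTextHandler(writer)) so that only the body text reaches it, with the Writer pointing at a file or wherever the text should end up. Treat that wiring as an assumption to check against the Tika version in use.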


Nick
