I am using Apache Tika to extract text from PPT/PPTX files.

Tika uses Apache POI to extract the text.

I compared processing time and memory usage for POI against Aspose
(www.aspose.com).

The processing time and memory requirement for Tika (i.e. POI) are almost double
those of Aspose.

Is POI really using streaming to parse files? Why does it take so much more
memory than Aspose, which I thought reads the whole file into memory?

I found this thread
http://lucene.472066.n3.nabble.com/Large-xls-files-always-loaded-into-memory-td646710.html
where the Tika founder says that POI does not stream input files. That
thread is quite old; is this still the case?

My goal is to minimize the memory requirement.

Here is my code:

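import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

ParseContext context = new ParseContext();
Detector detector = new DefaultDetector();
Parser parser = new AutoDetectParser(detector);
context.set(Parser.class, parser);
Metadata metadata = new Metadata();

File file = new File("temp.ppt");
URL url = file.toURI().toURL();
// All extracted text is buffered in this in-memory stream
OutputStream outputStream = new ByteArrayOutputStream();

InputStream input = TikaInputStream.get(url, metadata);
ContentHandler handler = new BodyContentHandler(outputStream);

parser.parse(input, handler, metadata, context);

String extractedText = outputStream.toString();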

It looks like the whole extracted text is written to the output stream, which
may be the reason for the large memory consumption. How can I keep memory
usage as low as possible?
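
To make the output-side concern concrete, here is a minimal sketch of the
alternative I have in mind: a SAX handler that forwards each chunk of text to a
Writer (for example a FileWriter) as soon as it arrives, instead of collecting
everything in a ByteArrayOutputStream. The class name StreamingTextHandler and
its getCharsWritten() method are hypothetical names made up for this example;
as far as I know, Tika's own BodyContentHandler also has a constructor taking a
Writer, which should behave similarly. I have not measured whether this helps,
and it would only reduce the memory held for the extracted text, not whatever
POI needs to load the file itself.

```java
import java.io.IOException;
import java.io.Writer;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler (not part of Tika): writes each text chunk to the
 * given Writer immediately, so the full text is never held in memory here.
 */
class StreamingTextHandler extends DefaultHandler {
    private final Writer out;
    private long charsWritten = 0;

    StreamingTextHandler(Writer out) {
        this.out = out;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Forward the chunk straight to the sink; nothing is accumulated.
            out.write(ch, start, length);
            charsWritten += length;
        } catch (IOException e) {
            // SAX callbacks can only throw SAXException, so wrap the I/O error.
            throw new SAXException(e);
        }
    }

    /** Number of characters forwarded so far. */
    long getCharsWritten() {
        return charsWritten;
    }
}
```

In the code above this would replace the BodyContentHandler, i.e. something
like parser.parse(input, new StreamingTextHandler(writer), metadata, context),
with writer opened on a file and closed after parsing.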
 
Any response will be appreciated.

Thanks,
