[ https://issues.apache.org/jira/browse/TIKA-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821964#comment-16821964 ]
Tim Allison commented on TIKA-2849:
-----------------------------------

As for the POIFSContainerDetector, I do not currently believe that it is possible to do streaming detection and stop short of reading the full stream, largely because the part that contains the file type is stored at the end of the file. The underlying POIFSFileSystem loads the data off heap at least and limits the total to 2GB.

If someone can recommend a way to detect the subtypes of tika-office (doc, ppt, xls) without reading the full stream (and without relying on file names), please let us know.

> TikaInputStream copies the input stream locally
> -----------------------------------------------
>
>                 Key: TIKA-2849
>                 URL: https://issues.apache.org/jira/browse/TIKA-2849
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.20
>            Reporter: Boris Petrov
>            Assignee: Tim Allison
>            Priority: Major
>
> When doing "tika.detect(stream, name)" and the stream is a "TikaInputStream",
> execution gets to "TikaInputStream#getPath" which does a "Files.copy(in,
> path, REPLACE_EXISTING);" which is very, very bad. This input stream could
> be, as in our case, an input stream from a network file which is tens or
> hundreds of gigabytes large. Copying it locally is a huge waste of resources
> to say the least. Why does it do that and can I make it not do it? Or is this
> something that has to be fixed in Tika?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
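For readers unfamiliar with the pattern under discussion, here is a minimal sketch of what spooling a stream to a local file looks like, using only the JDK. It is not Tika's code: the class name `SpoolSketch` and the method `spoolToTempFile` are hypothetical, and the only detail taken from the report is the `Files.copy(in, path, REPLACE_EXISTING)` call. It shows why the cost grows with the full stream length: the copy does not return until the whole stream has been read and written to disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolSketch {

    // Hypothetical helper: drains the entire stream into a temporary file
    // and returns its path. Disk usage equals the full stream length.
    static Path spoolToTempFile(InputStream in) throws IOException {
        Path path = Files.createTempFile("spool-sketch-", ".tmp");
        // Same JDK call as quoted in the report; blocks until end of stream.
        Files.copy(in, path, StandardCopyOption.REPLACE_EXISTING);
        return path;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a remote stream; a real one could be far larger.
        byte[] data = new byte[1 << 20];
        try (InputStream in = new ByteArrayInputStream(data)) {
            Path spooled = spoolToTempFile(in);
            try {
                System.out.println("bytes on disk: " + Files.size(spooled));
            } finally {
                Files.deleteIfExists(spooled);
            }
        }
    }
}
```

With a stream of tens or hundreds of gigabytes, as in the report, the same call would write that many bytes to local storage before any detection logic could run.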