tballison commented on PR #2253:
URL: https://github.com/apache/tika/pull/2253#issuecomment-3052951215

   Thank you for opening this. I'm not sure that it fits within the current 
design goals of Tika, but there may be ways forward.
   
   I deeply respect sleuthkit and would be interested in pursuing whatever we 
can to work together.
   
   I may also be misunderstanding your use case and this PR. Please bear with me.
   
   My major concern is that Tika is intended to process individual files one at 
a time. Even with a single large docx or PDF, Tika can run out of memory.
   
   If we treat an entire filesystem as a single file (with the files it contains 
as embedded files), I think we're heading for serious problems.
   
   There are two ways I could see some kind of integration point with Tika.
   
   1) Create a PipesIterator and fetchers so that Tika could iterate through 
NTFS or any other format handled by sleuthkit.
   
   2) Create a standardized "unpackaging" API in Tika that would use the 
sleuthkit command line tool(s?) to extract binary files for further processing. 
I've seen lots of use cases where "unpackaging" is required rather than the 
usual parsing. This is typically a pre-parsing step that unpacks a bundle of 
files that someone packaged for transfer. For example, this can be useful with 
zips, PSTs, mbox, etc.
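
   To make option 2 a bit more concrete, here is a rough sketch of what an 
unpackaging step could look like. Everything about the shape is an assumption: 
`SleuthkitUnpackager` and its methods are hypothetical names, not an existing 
Tika API. It shells out to sleuthkit's `tsk_recover` tool (with `-a`, which as 
far as I know restricts recovery to allocated files) and assumes the tool is on 
the PATH. The extracted files could then be handed to Tika one at a time.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch only; this class does not exist in Tika.
final class SleuthkitUnpackager {

    // Builds the command line: tsk_recover -a <image> <outputDir>.
    // Assumption: "-a" means "allocated files only" in the installed sleuthkit.
    static List<String> buildCommand(Path image, Path outputDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add("tsk_recover");
        cmd.add("-a");
        cmd.add(image.toString());
        cmd.add(outputDir.toString());
        return cmd;
    }

    // Runs the external tool, then returns the extracted regular files so a
    // caller can process them individually rather than as one giant "file".
    static List<Path> unpack(Path image, Path outputDir)
            throws IOException, InterruptedException {
        Files.createDirectories(outputDir);
        Process process = new ProcessBuilder(buildCommand(image, outputDir))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tsk_recover exited with code " + exitCode);
        }
        return listRegularFiles(outputDir);
    }

    // Walks the output directory and collects regular files in sorted order.
    static List<Path> listRegularFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

   The point of the design is that the disk image is never loaded as one 
document: extraction happens out of process, and each file is parsed on its own 
afterwards. A real version would also need timeouts and a cap on extracted size.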

