tballison commented on PR #2253: URL: https://github.com/apache/tika/pull/2253#issuecomment-3052951215
Thank you for opening this. I'm not sure that it fits within the current design goals of Tika, but there may be ways forward. I deeply respect sleuthkit and would be interested in pursuing whatever we can to work together. I may also misunderstand your use case and this PR. Please bear with me.

My major concern is that Tika is intended to process individual files one at a time. Even with a single large docx or PDF, Tika can run out of memory. If we treat an entire filesystem as a single file (obviously with embedded files), I think we're heading for serious problems.

There are two ways I could see some kind of integration point with Tika.

1) Create a pipesiterator and fetchers so that Tika could iterate through NTFS or any other format handled by sleuthkit.
2) Create a standardized "Unpackaging" API in Tika that would use sleuthkit commandline(s?) to extract binary files for further processing.

There are lots of use cases I've seen where "unpackaging" is required rather than the usual parsing. This is typically a pre-parsing step required to unpackage a bundle of files that someone packages for transfer. For example, this can be useful with zips, PSTs, mbox etc.
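
To make the "unpackaging" idea concrete, here is a minimal sketch of such a pre-parsing step for the zip case, using only the JDK. The class name `ZipUnpackager` and its `unpack` method are hypothetical and are not part of Tika or sleuthkit; a sleuthkit-backed version would presumably have the same shape but shell out to the sleuthkit command line tools instead of reading a zip stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch: neither this class nor its method exists in Tika.
public class ZipUnpackager {

    /**
     * Writes every regular entry of the zip to outDir and returns the
     * extracted paths, so each file can later be processed on its own.
     */
    public static List<Path> unpack(Path zip, Path outDir) throws IOException {
        List<Path> extracted = new ArrayList<>();
        Path root = outDir.toAbsolutePath().normalize();
        Files.createDirectories(root);
        try (InputStream in = Files.newInputStream(zip);
             ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // parent directories are created on demand below
                }
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside outDir.
                if (target.equals(root) || !target.startsWith(root)) {
                    throw new IOException("Entry escapes output directory: " + entry.getName());
                }
                Files.createDirectories(target.getParent());
                // Copies only the current entry; the zip stream stays open.
                Files.copy(zis, target, StandardCopyOption.REPLACE_EXISTING);
                extracted.add(target);
            }
        }
        return extracted;
    }
}
```

Each returned path could then be handed to the usual parsing step one file at a time, which keeps the bundle itself from ever being treated as a single document.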