Oooh, this sounds like a great opportunity for some bike-shedding :) If I had my druthers, I would organize as:

tika-parsers-standard: whatever is required to extract text and metadata from 80%+ of stand-alone documents found on the web.
tika-parsers-archives: zip, pkg, etc.
tika-parsers-ocr: which might eventually have other modules like remote calls to AWS Rekognition/Textract?
tika-parsers-advanced: everything else

This would make it easy for me to configure for my standard use-case, and mix in when I want to look inside archives, or do OCR.

But as I said, it’s bike-shedding.

-- Ken

> On Mar 9, 2021, at 9:03 AM, Tim Allison <talli...@apache.org> wrote:
>
> All,
> I was recently chatting about Tika 2.x with some Tika friends and
> they had some hesitation about the names for the three high level
> parser modules.
>
> They are currently:
>
> tika-parsers-classic
> tika-parsers-extended
> tika-parsers-advanced
>
> The quibbles weren't with the delineation, but with the naming.
>
> In my mind, this is what I've been thinking as definitions:
>
> tika-parsers-classic -- with the exception of optional OCR, these
> should be lightish weight dependencies in pure java with no
> parsers/resources that require network calls.
>
> tika-parsers-extended -- these can require native libs and/or have
> heavier dependencies, including network calls.
>
> tika-parsers-advanced -- anything goes. dl4j as a dependency, etc.
>
> Some options for classic-> basic, base, ...what else?
>
> Any other recommendations for these names? Thank you!
>
> Best,
>
> Tim

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr