Oooh, this sounds like a great opportunity for some bike-shedding :)

If I had my druthers, I would organize as:

tika-parsers-standard: whatever is required to extract text and metadata from 
80%+ of stand-alone documents found on the web.
tika-parsers-archives: zip, pkg, etc.
tika-parsers-ocr: OCR support, which might eventually include other modules, 
such as remote calls to AWS Rekognition/Textract?
tika-parsers-advanced: everything else

This would make it easy for me to configure my standard use case, and to mix 
in the other modules when I want to look inside archives or do OCR.

But as I said, it’s bike-shedding.

— Ken


> On Mar 9, 2021, at 9:03 AM, Tim Allison <talli...@apache.org> wrote:
> 
> All,
>  I was recently chatting about Tika 2.x with some Tika friends and
> they had some hesitation about the names for the three high level
> parser modules.
> 
> They are currently:
> 
> tika-parsers-classic
> tika-parsers-extended
> tika-parsers-advanced
> 
> The quibbles weren't with the delineation, but with the naming.
> 
> In my mind, this is what I've been thinking as definitions:
> 
> tika-parsers-classic -- with the exception of optional OCR, these
> should be lightish weight dependencies in pure java with no
> parsers/resources that require network calls.
> 
> tika-parsers-extended -- these can require native libs and/or have
> heavier dependencies, including network calls.
> 
> tika-parsers-advanced -- anything goes. dl4j as a dependency, etc.
> 
> Some options for classic-> basic, base, ...what else?
> 
> Any other recommendations for these names?  Thank you!
> 
> Best,
> 
>           Tim

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr
