[ 
https://issues.apache.org/jira/browse/TIKA-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320330#comment-17320330
 ] 

Tim Allison commented on TIKA-3351:
-----------------------------------

I'm not happy with what I did, and I may fix it differently later.  The problem 
I'm trying to solve is that I want people to be able to add their own OCR 
parsers that handle pseudo mime types, e.g. image/ocr-jpeg.  However, at least 
in the PDFParser, the OCR parser can be called once per page, and a "parsed by" 
entry is getting added for every page.  I'd be ok with my current solution, 
which is roughly the same as [~peterkronenberg]'s: don't include the same 
parser twice.

However, to get the unit tests to pass, I had to require a wrong answer from 
the multiple parser test.  For example, if we had a reparse-on-failure parser, 
we might want the "parsed by" array to show that the same parser operated 
twice, and deduplicating hides that.
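
To make the trade-off concrete, here is a minimal sketch of the "don't include 
the same parser twice" rule.  It is not the actual Tika change: ParsedByList is 
a hypothetical stand-in for the multi-valued "parsed by" metadata field, and 
the parser names in the usage note are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the multi-valued "parsed by" metadata field.
// This is NOT Tika's Metadata class; it only illustrates the dedup rule.
class ParsedByList {
    private final List<String> parsers = new ArrayList<>();

    // Records a parser name only if it is not already present,
    // preserving the order in which parsers were first seen.
    // Returns true if the name was added, false if it was a duplicate.
    boolean record(String parserName) {
        if (parsers.contains(parserName)) {
            return false;
        }
        return parsers.add(parserName);
    }

    // Returns a copy so callers cannot modify the internal list.
    List<String> values() {
        return new ArrayList<>(parsers);
    }
}
```

With this rule, recording "example.PdfParser" once and "example.OcrParser" once 
per page leaves two entries no matter how many pages there are.  The cost is 
the one described above: a parser that legitimately runs twice, as in a 
reparse on failure, also shows up only once.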

I don't like what I've currently done, but it will at least stop the madness of 
adding the OCR parser for every page. :P

> Make list of parsers in metadata unique
> ---------------------------------------
>
>                 Key: TIKA-3351
>                 URL: https://issues.apache.org/jira/browse/TIKA-3351
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>
> The Parsed_By field in the metadata can have duplicates, since some parsers 
> can be called more than once.  Make this field contain each parser only once.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
