[ https://issues.apache.org/jira/browse/TIKA-3351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320330#comment-17320330 ]
Tim Allison commented on TIKA-3351:
-----------------------------------

I'm not happy with what I did. I may fix it differently later.

The problem I'm trying to solve is that I want people to be able to add their own OCR parsers that handle pseudo mime types, e.g. image/ocr-jpeg. However, at least in the PDFParser, the OCR parser can be called once per page, and that "parsed by" entry is getting added for every page.

I'd be ok with my current solution, which is roughly the same as [~peterkronenberg]'s: don't include the same parser twice. However, to get the unit tests to pass, I had to require a wrong answer from the multiple parser test. For example, if we had a reparse-on-failure parser, we might want the "parsed by" array to show that the same parser operated twice.

I don't like what I've currently done, but it will at least stop the madness of adding the OCR parser for every page. :P

> Make list of parsers in metadata unique
> ---------------------------------------
>
>                 Key: TIKA-3351
>                 URL: https://issues.apache.org/jira/browse/TIKA-3351
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Peter Kronenberg
>            Priority: Major
>
> The Parsed_By field in the metadata can have duplicates, since some parsers
> can be called more than once. Make this field contain each parser only once.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)