[ 
https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893906#comment-17893906
 ] 

Leszek Sliwko commented on TIKA-4314:
-------------------------------------

I’m glad you like it. I understand the one-parser-per-file concept for other 
parsers: running several makes sense for metadata, but it could cause issues 
when scraping content. However, using the {{SupplementingParser}} with 
{{ExternalParsers}} definitely makes sense, and possibly with 
{{TesseractOCRParser}} as well.

I’ve created my own side routine in the code to run all parsers after the main 
Tika parser, so this isn’t an issue.

Also, if you are introducing {{SupplementingParser}} for 
{{ExternalParsers}}, it would make sense to update {{ParserUtils}} as well:
{code:java}
public static void recordParserDetails(Parser parser, Metadata metadata) {
    List<String> parserClassNames;
    if (parser instanceof AbstractMultipleParser abstractMultipleParser) {
        parserClassNames = abstractMultipleParser.getAllParsers().stream()
                .map(ParserUtils::getParserClassname)
                .toList();
    } else {
        parserClassNames = List.of(getParserClassname(parser));
    }
    parserClassNames.forEach(className -> recordParserDetails(className, metadata));
}

public static String getParserClassname(Parser parser) {
    if (parser instanceof ExternalParser externalParser) {
        return externalParser.getClass().getName()
                + "(" + Arrays.toString(externalParser.getCommand()) + ")";
    } else if (parser instanceof ParserDecorator parserDecorator) {
        return parserDecorator.getWrappedParser().getClass().getName();
    } else {
        return parser.getClass().getName();
    }
}
{code}

> CompositeParser returns only one parser per content type
> --------------------------------------------------------
>
>                 Key: TIKA-4314
>                 URL: https://issues.apache.org/jira/browse/TIKA-4314
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.9.2
>            Reporter: Leszek Sliwko
>            Priority: Major
>         Attachments: CompositeParser.java, duration-test-2.avi, 
> geolocation-test-1.jpg, geolocation-test-2.jpg
>
>
> External parsers can have many supported content types, but information is 
> lost in CompositeParser:
>  
> public Map<MediaType, Parser> getParsers(ParseContext context) {
>   Map<MediaType, Parser> map = new HashMap<>();
>   for (Parser parser : parsers) {
>     for (MediaType type : parser.getSupportedTypes(context)) {
>       map.put(registry.normalize(type), parser);
>     }
>   }
>   return map;
> }
>  
> To reproduce: parse any AVI file (content type: video/x-msvideo). Only the 
> exiftool parser will be picked up, and the ffmpeg parser won't be executed.
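
The overwrite described in the quoted issue can be illustrated with a minimal, self-contained sketch. It does not use any Tika classes: {{NamedParser}}, {{lastOneWins}} and {{keepAll}} are hypothetical names, media types are plain strings, and the registration order (ffmpeg before exiftool) is an assumption chosen to match the reported symptom. The second method shows one possible alternative (a list of parsers per type), not necessarily how Tika would or should address it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParserLookupSketch {

    // Hypothetical stand-in for a parser: a name plus the media types it claims.
    static final class NamedParser {
        final String name;
        final List<String> supportedTypes;

        NamedParser(String name, List<String> supportedTypes) {
            this.name = name;
            this.supportedTypes = supportedTypes;
        }
    }

    // Same shape as the quoted getParsers loop: one slot per media type,
    // so a later parser silently replaces an earlier one for the same type.
    static Map<String, String> lastOneWins(List<NamedParser> parsers) {
        Map<String, String> map = new HashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.put(type, parser.name);
            }
        }
        return map;
    }

    // Alternative sketch: keep every parser per media type, in registration order.
    static Map<String, List<String>> keepAll(List<NamedParser> parsers) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.computeIfAbsent(type, k -> new ArrayList<>()).add(parser.name);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Assumed registration order: ffmpeg first, exiftool second.
        List<NamedParser> parsers = List.of(
                new NamedParser("ffmpeg", List.of("video/x-msvideo")),
                new NamedParser("exiftool", List.of("video/x-msvideo", "image/jpeg")));

        System.out.println(lastOneWins(parsers).get("video/x-msvideo")); // exiftool
        System.out.println(keepAll(parsers).get("video/x-msvideo"));     // [ffmpeg, exiftool]
    }
}
```

With a single slot per type, whichever parser is registered last for video/x-msvideo is the only one that remains in the map; the list-valued map keeps both.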



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
