[
https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893906#comment-17893906
]
Leszek Sliwko commented on TIKA-4314:
-------------------------------------
I’m glad you like it. I understand the one-parser-per-file approach for the
other parsers: running several makes sense for metadata, but it could cause
issues when extracting content. However, using the {{SupplementingParser}} with
{{ExternalParsers}} definitely makes sense, and possibly with
{{TesseractOCRParser}} as well.
I’ve created my own side routine in the code to run all parsers after the main
Tika parser, so this isn’t an issue.
Also, if you are introducing {{SupplementingParser}} for
{{ExternalParsers}}, it would make sense to update {{ParserUtils}} as well:
{code:java}
public static void recordParserDetails(Parser parser, Metadata metadata) {
    List<String> parserClassNames;
    if (parser instanceof AbstractMultipleParser abstractMultipleParser) {
        // Record every parser held by a multiple parser, not just the wrapper.
        parserClassNames = abstractMultipleParser.getAllParsers().stream()
                .map(ParserUtils::getParserClassname)
                .toList();
    } else {
        parserClassNames = List.of(getParserClassname(parser));
    }
    parserClassNames.forEach(className -> recordParserDetails(className, metadata));
}

public static String getParserClassname(Parser parser) {
    if (parser instanceof ExternalParser externalParser) {
        // Include the command so that different external parsers can be told apart.
        return externalParser.getClass().getName() + "("
                + Arrays.toString(externalParser.getCommand()) + ")";
    } else if (parser instanceof ParserDecorator parserDecorator) {
        return parserDecorator.getWrappedParser().getClass().getName();
    } else {
        return parser.getClass().getName();
    }
}
{code}
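To show the naming logic above in isolation, here is a minimal sketch that runs without Tika on the classpath. Everything in it is an assumption made for illustration: {{StubParser}}, {{StubPlainParser}}, {{StubExternalParser}}, {{StubMultipleParser}} and {{ParserNameSketch}} are hypothetical stand-ins, not Tika classes, and the command arrays are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: resolves a display name for each parser, expanding
// multi-parsers and appending the command line for external parsers.
class ParserNameSketch {

    static String parserClassName(StubParser parser) {
        if (parser instanceof StubExternalParser) {
            StubExternalParser external = (StubExternalParser) parser;
            // The command distinguishes e.g. an exiftool parser from an ffmpeg one.
            return external.getClass().getName() + "("
                    + Arrays.toString(external.getCommand()) + ")";
        }
        return parser.getClass().getName();
    }

    static List<String> parserClassNames(StubParser parser) {
        if (parser instanceof StubMultipleParser) {
            // Expand the wrapper into one name per wrapped parser.
            return ((StubMultipleParser) parser).getAllParsers().stream()
                    .map(ParserNameSketch::parserClassName)
                    .collect(Collectors.toList());
        }
        return List.of(parserClassName(parser));
    }

    // Illustrative input only; the commands are not taken from a real config.
    static StubParser sampleComposite() {
        return new StubMultipleParser(List.of(
                new StubExternalParser("exiftool"),
                new StubExternalParser("ffmpeg", "-i")));
    }

    public static void main(String[] args) {
        parserClassNames(sampleComposite()).forEach(System.out::println);
    }
}

// Stand-in for a parser interface (assumption, not the Tika interface).
interface StubParser {}

class StubPlainParser implements StubParser {}

class StubExternalParser implements StubParser {
    private final String[] command;

    StubExternalParser(String... command) {
        this.command = command;
    }

    String[] getCommand() {
        return command;
    }
}

class StubMultipleParser implements StubParser {
    private final List<StubParser> parsers;

    StubMultipleParser(List<StubParser> parsers) {
        this.parsers = parsers;
    }

    List<StubParser> getAllParsers() {
        return parsers;
    }
}
```

With names built this way, two external parsers that share one class still produce distinct entries, because the command is part of the recorded name.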
> CompositeParser returns only one parser per content type
> --------------------------------------------------------
>
> Key: TIKA-4314
> URL: https://issues.apache.org/jira/browse/TIKA-4314
> Project: Tika
> Issue Type: Bug
> Components: core
> Affects Versions: 2.9.2
> Reporter: Leszek Sliwko
> Priority: Major
> Attachments: CompositeParser.java, duration-test-2.avi,
> geolocation-test-1.jpg, geolocation-test-2.jpg
>
>
> External parsers can have many supported content types, but CompositeParser
> keeps only one parser per content type, so the others are lost:
>
> public Map<MediaType, Parser> getParsers(ParseContext context) {
>     Map<MediaType, Parser> map = new HashMap<>();
>     for (Parser parser : parsers) {
>         for (MediaType type : parser.getSupportedTypes(context)) {
>             map.put(registry.normalize(type), parser);
>         }
>     }
>     return map;
> }
>
> To reproduce, parse any AVI file (content type: video/x-msvideo). Only the
> exiftool parser will be picked up, and the ffmpeg parser won't be executed.
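
The overwrite described in the issue can be reproduced without Tika. The sketch below is only an illustration under stated assumptions: {{ParserMapSketch}} and {{NamedParser}} are hypothetical, plain strings stand in for {{MediaType}} and {{Parser}}, and both the registration order and the list-valued map are assumptions, not the fix adopted by the project.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of why a Map<type, parser> keeps one parser per type.
class ParserMapSketch {

    // Stand-in for a parser: a name plus the media types it claims to support.
    static final class NamedParser {
        final String name;
        final List<String> supportedTypes;

        NamedParser(String name, String... supportedTypes) {
            this.name = name;
            this.supportedTypes = List.of(supportedTypes);
        }
    }

    // Same shape as the quoted loop: put() replaces any earlier parser for a type.
    static Map<String, String> singleParserPerType(List<NamedParser> parsers) {
        Map<String, String> map = new HashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.put(type, parser.name);
            }
        }
        return map;
    }

    // One possible alternative: collect every parser registered for a type.
    static Map<String, List<String>> allParsersPerType(List<NamedParser> parsers) {
        Map<String, List<String>> map = new LinkedHashMap<>();
        for (NamedParser parser : parsers) {
            for (String type : parser.supportedTypes) {
                map.computeIfAbsent(type, key -> new ArrayList<>()).add(parser.name);
            }
        }
        return map;
    }

    // Assumed order: ffmpeg is registered first, exiftool second.
    static List<NamedParser> sampleParsers() {
        return List.of(
                new NamedParser("ffmpeg", "video/x-msvideo"),
                new NamedParser("exiftool", "video/x-msvideo", "image/jpeg"));
    }

    public static void main(String[] args) {
        System.out.println(singleParserPerType(sampleParsers()).get("video/x-msvideo"));
        System.out.println(allParsersPerType(sampleParsers()).get("video/x-msvideo"));
    }
}
```

In the single-valued map, whichever parser is registered last for a type wins. The list-valued map keeps both parsers for video/x-msvideo, at the cost of the caller having to decide how to run or combine them.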