[ 
https://issues.apache.org/jira/browse/TIKA-4314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17884572#comment-17884572
 ] 

Leszek Sliwko edited comment on TIKA-4314 at 9/25/24 11:12 AM:
---------------------------------------------------------------

Currently, FFmpeg is ignored in favor of ExifTool for analyzing AVI files 
(video/x-msvideo) - which is obviously not a desired result. I also recall that 
this feature worked fine in 2019 when I first started using Tika.

I'm not knowledgeable enough to say whether selecting only one parser per 
content type is the correct approach, but there are numerous command-line tools 
capable of extracting much more metadata. For example, I added the sox parser 
to extract the duration of WAV files, which none of the previous parsers could 
do.

If changing this behavior at the design level, i.e., running all parsers that 
support a given content type and merging the results, is not possible, I would 
suggest implementing this at least for external parsers. The 
{{CompositeExternalParser}} seems like a good starting point.
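
A minimal sketch of the "run every parser for a content type and merge the results" idea, in plain Java with no Tika dependency. ToolParser, parsersByType and parseAll are hypothetical names introduced only for illustration, and the metadata keys and values are made up; this is not how CompositeParser or CompositeExternalParser is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MultiParserSketch {

    /** Hypothetical stand-in for an external parser; not a Tika type. */
    static final class ToolParser {
        final String name;
        final List<String> types;          // content types the tool claims to support
        final Map<String, String> output;  // canned metadata the tool would return

        ToolParser(String name, List<String> types, Map<String, String> output) {
            this.name = name;
            this.types = types;
            this.output = output;
        }
    }

    /** Keeps every parser per content type, in registration order, instead of only one. */
    static Map<String, List<ToolParser>> parsersByType(List<ToolParser> parsers) {
        Map<String, List<ToolParser>> byType = new LinkedHashMap<>();
        for (ToolParser parser : parsers) {
            for (String type : parser.types) {
                byType.computeIfAbsent(type, t -> new ArrayList<>()).add(parser);
            }
        }
        return byType;
    }

    /** Runs all parsers for the type and merges their metadata; the first value seen for a key wins. */
    static Map<String, String> parseAll(Map<String, List<ToolParser>> byType, String type) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (ToolParser parser : byType.getOrDefault(type, List.of())) {
            parser.output.forEach(merged::putIfAbsent);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Illustrative placeholder values, not real tool output.
        List<ToolParser> parsers = List.of(
                new ToolParser("exiftool", List.of("video/x-msvideo"), Map.of("width", "640")),
                new ToolParser("ffmpeg", List.of("video/x-msvideo"), Map.of("duration", "12.5")));

        Map<String, List<ToolParser>> byType = parsersByType(parsers);
        System.out.println(parseAll(byType, "video/x-msvideo"));
    }
}
```

The merge policy (first value wins on a key conflict) is an arbitrary choice made for this sketch; a real implementation would have to decide how to handle conflicting values and parser failures.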


> CompositeParser returns only one parser per content type
> --------------------------------------------------------
>
>                 Key: TIKA-4314
>                 URL: https://issues.apache.org/jira/browse/TIKA-4314
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.9.2
>            Reporter: Leszek Sliwko
>            Priority: Major
>
> Several external parsers can support the same content type, but 
> CompositeParser keeps only one parser per type, so the others are lost:
>  
> public Map<MediaType, Parser> getParsers(ParseContext context) {
>   Map<MediaType, Parser> map = new HashMap<>();
>   for (Parser parser : parsers) {
>     for (MediaType type : parser.getSupportedTypes(context)) {
>       // put() replaces any parser already mapped to this type
>       map.put(registry.normalize(type), parser);
>     }
>   }
>   return map;
> }
>  
> To reproduce: parse any AVI file (content type: video/x-msvideo). Only the 
> ExifTool parser will be picked up, and the FFmpeg parser won't be executed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
