[ 
https://issues.apache.org/jira/browse/TIKA-4060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17730649#comment-17730649
 ] 

Nick Burch commented on TIKA-4060:
----------------------------------

0x494433 is the string ID3, which I think ought to be at the start of the 
file. It is in the handful of files I've found. The rest of the magic is pretty 
vague and a little prone to false positives, so I'm reluctant to match on the 
string "ID3" anywhere in the first 2 KB and then the vague 3 bytes somewhere 
else further on.

I've tried to make the matches a little "tighter" to hopefully reduce false 
positives, but I seem to have gone too tight. The test file I produced with ID3 
tags does have the ID3 at the start. The key sections of the hex dump are:

{{00000000 49 44 33 03 00 00 00 00 09 6b 54 50 45 31 00 00 |ID3......kTPE1..|}}
{{00000010 00 0c 00 00 00 54 65 73 74 20 41 72 74 69 73 74 |.....Test Artist|}}
{{...}}
{{00000090 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|}}
{{*}}
{{000004f0 00 00 00 00 00 ff f1 50 80 32 5f fc de 02 00 4c |.......P.2_....L|}}
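
As an illustration of why the ADTS data lands where it does in this dump: an ID3v2 tag starts with a 10-byte header whose bytes 6 to 9 hold the tag size as a syncsafe integer (7 bits per byte). Here 00 00 09 6b decodes to 1259, so the tag ends at 10 + 1259 = 0x4f5, which is exactly where the ff f1 appears. The Python below is only a sketch of that arithmetic. The function names are made up for this example, it ignores the optional ID3v2.4 footer, and it is not Tika code.

```python
# Sketch only: locate the end of a leading ID3v2 tag, then look for an
# ADTS-style sync pattern there. Names are hypothetical, not Tika APIs.
ID3_HEADER_LEN = 10


def id3v2_tag_end(data: bytes):
    """Return the offset just past a leading ID3v2 tag, or None if absent."""
    if len(data) < ID3_HEADER_LEN or data[:3] != b"ID3":
        return None
    size_bytes = data[6:10]
    if any(b & 0x80 for b in size_bytes):
        return None  # syncsafe bytes never have the high bit set
    size = 0
    for b in size_bytes:
        size = (size << 7) | b  # 7 payload bits per byte
    return ID3_HEADER_LEN + size


def looks_like_adts_sync(data: bytes, offset: int) -> bool:
    """Loose check: 0xFF, then a byte whose top nibble is F and whose two
    layer bits are 0 (that is, one of F0, F1, F8, F9)."""
    if offset + 2 > len(data):
        return False
    return data[offset] == 0xFF and (data[offset + 1] & 0xF6) == 0xF0


# First ten bytes of the dump above.
header = bytes.fromhex("4944330300000000096b")
print(hex(id3v2_tag_end(header)))  # 0x4f5
```

Checking for the sync bytes at the computed offset is much stricter than searching the whole first 2 KB, at the cost of needing the size field to be trustworthy.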

> Add magic to audio/aac in tika-mimetypes.xml
> --------------------------------------------
>
>                 Key: TIKA-4060
>                 URL: https://issues.apache.org/jira/browse/TIKA-4060
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments: 
> 067aece423d8694a891a61a45ac0e870914bc1314ef510ac40b36ca3397843ef, 
> cb1bec08898db7a733b42ac44bdd76b6177cd3a07a2435a83fd99b7453d564d1
>
>
> Currently tika-mimetypes only recognizes audio/aac files by the file 
> extension. PRONOM recently added support for identifying aac files, but the 
> signature is tricky. There are two signatures, shown below in PRONOM format. 
> Curly braces give the minimum and maximum number of bytes that may be 
> skipped before the subsequent pattern.
>  
> The first pattern is pretty basic; the second pattern is the first pattern 
> after an ID3 header of up to 2048 bytes.
>  
> ||Name|Audio Data Transport Stream sig.1|
> ||Description|An FF pattern from BOF with variation of byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
> ||Name|Audio Data Transport Stream sig.2|
> ||Description|ID3 tag variation with variable byte stream|
> ||Byte sequences|
> ||Position type|Absolute from BOF|
> ||Offset|0|
> ||Maximum Offset|0|
> ||Byte order| |
> ||Value|494433\{0-2045}FF(F0\|F1\|F8\|F9)(40\|41\|44\|45\|48\|49\|4C\|4D\|50\|51\|54\|55\|58\|59\|5C\|5D\|60\|61\|64\|65\|68\|69\|6C\|6D\|70\|71\|80\|81\|84\|85\|88\|89\|8C\|8D\|90\|91\|94\|95\|98\|99\|9C\|9D\|A0\|A1\|A4\|A5\|A8\|A9\|AC\|AD\|B0\|B1)(00\|01\|20\|40\|41\|60\|80\|81\|60\|A0\|C0\|C1\|E0)|
> |
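
For readers who want to try the two PRONOM signatures quoted above against sample files, here is a rough Python transcription. It is a sketch under assumptions: the function and constant names are invented for this example, the byte sets are copied from the Value rows (with the duplicated 60 collapsed), and {0-2045} is read as "skip between 0 and 2045 bytes". It is neither PRONOM's nor Tika's implementation.

```python
# Sketch only: hand transcription of the two quoted PRONOM signatures.
BYTE2 = {0xF0, 0xF1, 0xF8, 0xF9}
BYTE3 = set(bytes.fromhex(
    "4041444548494C4D5051545558595C5D"
    "6061646568696C6D7071"
    "8081848588898C8D9091949598999C9D"
    "A0A1A4A5A8A9ACADB0B1"
))
BYTE4 = set(bytes.fromhex("0001204041608081A0C0C1E0"))
MAX_GAP = 2045  # from the {0-2045} wildcard in sig.2


def adts_at(data: bytes, pos: int) -> bool:
    """True if the 4-byte FF xx xx xx pattern matches at pos."""
    return (
        len(data) >= pos + 4
        and data[pos] == 0xFF
        and data[pos + 1] in BYTE2
        and data[pos + 2] in BYTE3
        and data[pos + 3] in BYTE4
    )


def matches_sig1(data: bytes) -> bool:
    """sig.1: the pattern sits at the very beginning of the file."""
    return adts_at(data, 0)


def matches_sig2(data: bytes) -> bool:
    """sig.2: "ID3" at offset 0, then the pattern after 0 to 2045 bytes."""
    if data[:3] != b"ID3":
        return False
    return any(adts_at(data, pos) for pos in range(3, 3 + MAX_GAP + 1))
```

The ff f1 50 80 sequence at offset 0x4f5 in the hex dump from the comment satisfies sig.2 under this reading, since f1, 50 and 80 each appear in the corresponding byte set.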



--
This message was sent by Atlassian Jira
(v8.20.10#820010)