[
https://issues.apache.org/jira/browse/TIKA-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gregory Lepore updated TIKA-4074:
---------------------------------
Description:
The TeX Virtual Font (VF) format occurs 6,047 times in the second most recent
Common Crawl dataset (and over 3,000 times in the latest set). It has no known
MIME type. The magic is:
F7CA{9}F300{4}0010 at offset 0.
This signature will catch most TeX VF files, although some will be missed.
There were no false positives, so I think it's a good compromise that catches
the majority of sample files.
It would be nice to see the results of additional testing.
was:
The TeX Virtual Font (VF) format occurs 6,047 times in the second most recent
Common Crawl dataset. It has no known MIME type. The magic is:
F7CA{9}F300{4}0010 at offset 0.
This signature will catch most TeX VF files, although some will be missed.
There were no false positives, so I think it's a good compromise that catches
the majority of sample files.
It would be nice to see the results of additional testing.
> Add magic for TeX Virtual Font format
> -------------------------------------
>
> Key: TIKA-4074
> URL: https://issues.apache.org/jira/browse/TIKA-4074
> Project: Tika
> Issue Type: Sub-task
> Reporter: Gregory Lepore
> Priority: Minor
> Attachments: aebx10.vf, aebx12.vf, aebxsl10.vf
>
>
> The TeX Virtual Font (VF) format occurs 6,047 times in the second most recent
> Common Crawl dataset (and over 3,000 times in the latest set). It has no known
> MIME type. The magic is:
>
> F7CA{9}F300{4}0010 at offset 0.
>
> This signature will catch most TeX VF files, although some will be missed.
> There were no false positives, so I think it's a good compromise that catches
> the majority of sample files.
>
> It would be nice to see the results of additional testing.
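
For anyone who wants to run the additional testing, here is a minimal sketch
of a checker for the signature quoted above. It assumes that "{n}" in the
signature means "skip exactly n arbitrary bytes" and that the hex pairs are
literal byte values. The names VF_MAGIC and looks_like_tex_vf are made up for
this example and are not part of Tika.

```python
import re

# Assumption: "{9}" and "{4}" are wildcard gaps of exactly 9 and 4 bytes.
# The full pattern therefore spans 2 + 9 + 2 + 4 + 2 = 19 bytes.
# re.DOTALL lets "." match any byte value, including 0x0A (newline).
VF_MAGIC = re.compile(rb"\xF7\xCA.{9}\xF3\x00.{4}\x00\x10", re.DOTALL)


def looks_like_tex_vf(data: bytes) -> bool:
    """Return True if `data` starts with the proposed signature at offset 0."""
    # re.match anchors at the start of the buffer, i.e. offset 0.
    return VF_MAGIC.match(data) is not None
```

Reading the first 19 bytes of a file is enough, for example
looks_like_tex_vf(open(path, "rb").read(19)) with path pointing at one of the
attached samples. Files that fail this check but are still valid VF files
would be the misses mentioned in the description.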