[ https://issues.apache.org/jira/browse/TIKA-4081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732764#comment-17732764 ]
Gregory Lepore edited comment on TIKA-4081 at 6/14/23 9:10 PM:
---------------------------------------------------------------

My bad, I wrote some code last week to sort the Common Crawl files by the first two bytes (as a way of identifying similar files) and that one slipped in. I removed it and added three other ZIM files that match. To get to four bytes of magic you can add the fourth byte as 04 (per the specs), and you can even add the fifth byte, which is 05 or 06 for the major version.

https://wiki.openzim.org/wiki/ZIM_file_format_old_namespace

was (Author: g...@rhobard.com):
My bad, I wrote some code last week to sort the Common Crawl files by the first two bytes (as a way of identifying similar files) and that one slipped in. I removed it and added three other ZIM files that match.

> Add magic for ZIM format
> ------------------------
>
>                 Key: TIKA-4081
>                 URL: https://issues.apache.org/jira/browse/TIKA-4081
>             Project: Tika
>          Issue Type: Sub-task
>            Reporter: Gregory Lepore
>            Priority: Minor
>         Attachments:
> 0a91827fbe71425d5961bb7f876ff19f000101368d16bf8887829288b6ef62f5,
> 0ad57cb18df3c37d1d1387a5ae173296b3b20ede030de4a3d0d73211ce5b06b8,
> gutenberg_ar_all_2021-05-1.zim, gutenberg_ar_all_2021-05.zim,
> wikipedia_ln_all_nopic_2021-03.zim
>
>
> ZIM format files occur 3,152 times in the latest Common Crawl dataset. Magic
> is 5A494D04, which is ASCII "ZIM" followed by the byte 04, at offset 0.
>
> [https://en.wikipedia.org/wiki/ZIM_(file_format)]
>
> https://wiki.openzim.org/wiki/ZIM_file_format_old_namespace

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
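
For anyone wanting to sanity-check sample files against the magic discussed in this issue, here is a minimal sketch in Python. It is not Tika code and not how Tika defines magic. The names ZIM_MAGIC, KNOWN_MAJOR_VERSIONS and looks_like_zim are made up for illustration, and the optional fifth-byte check simply assumes the major version is 5 or 6, as suggested in the comment.

```python
# Hypothetical helper, not part of Tika: checks the first bytes of a file
# against the ZIM magic discussed in TIKA-4081.

ZIM_MAGIC = bytes.fromhex("5A494D04")  # ASCII "ZIM" followed by 0x04, at offset 0
KNOWN_MAJOR_VERSIONS = {5, 6}          # assumption taken from the comment above


def looks_like_zim(header: bytes, check_version: bool = False) -> bool:
    """Return True if `header` starts with the four-byte ZIM magic.

    With check_version=True, the fifth byte must also be 05 or 06.
    """
    if header[:4] != ZIM_MAGIC:
        return False
    if check_version:
        # Need at least five bytes to look at the major version byte.
        return len(header) >= 5 and header[4] in KNOWN_MAJOR_VERSIONS
    return True
```

To try it on one of the attachments, read the first few bytes with open(path, "rb").read(5) and pass them in. Matching on only the first two bytes ("ZI") is what let the unrelated file slip in, so the four-byte check is the stricter option, and the fifth byte narrows it further.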