Gregory Lepore created TIKA-4048:
------------------------------------
Summary: Gzipped WARC not identifying all assets
Key: TIKA-4048
URL: https://issues.apache.org/jira/browse/TIKA-4048
Project: Tika
Issue Type: Bug
Reporter: Gregory Lepore
Attachments: rec-20230518121844489398-5335604b8b23.warc,
rec-20230518121844489398-5335604b8b23.warc.gz,
rec-20230518121844489398-5335604b8b23.warc.gz.json,
rec-20230518121844489398-5335604b8b23.warc.json
The WARC parser works for uncompressed WARC files, but for gzipped WARC files
it appears that not all embedded files are being identified.
Processing a WARC.GZ file should return the same JSON output as the plain WARC
file, plus the metadata for the GZ container itself. However, in the attached
JSON outputs, the JPEG that is reported for the plain WARC file is missing from
the WARC.GZ.json file.
Additionally, the warc: metadata is not being returned for all files, although
this may be by design.
Attached are the two original files (the gzipped WARC and the plain WARC) and
the JSON output produced for each.
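
To make the difference between the two JSON outputs easier to see, the content
types reported for each file can be counted and compared. The following is only
a minimal sketch in Python. It assumes each JSON output is a list of metadata
objects and that each object has a "Content-Type" key; the function name
missing_content_types and the sample records are hypothetical and are not taken
from the attachments.

```python
from collections import Counter


def missing_content_types(plain_records, gz_records):
    """Return content types that occur more often in plain_records than in gz_records."""

    def count(records):
        # Drop parameters such as "; charset=UTF-8" so only the media type is compared.
        return Counter(
            str(r.get("Content-Type", "unknown")).split(";")[0].strip()
            for r in records
        )

    # Counter subtraction keeps only the entries with a positive remainder.
    return dict(count(plain_records) - count(gz_records))


# Hypothetical sample data standing in for the two JSON outputs.
plain = [
    {"Content-Type": "application/warc"},
    {"Content-Type": "text/html; charset=UTF-8"},
    {"Content-Type": "image/jpeg"},
]
gz = [
    {"Content-Type": "application/gzip"},
    {"Content-Type": "application/warc"},
    {"Content-Type": "text/html; charset=UTF-8"},
]

print(missing_content_types(plain, gz))
```

With the real attachments, each list would come from json.load() on the
corresponding .json file. The extra application/gzip entry for the container is
expected and is ignored, since only types missing from the gzipped output are
reported.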
--
This message was sent by Atlassian Jira
(v8.20.10#820010)