Gregory Lepore created TIKA-4048:
------------------------------------

             Summary: Gzipped WARC not identifying all assets
                 Key: TIKA-4048
                 URL: https://issues.apache.org/jira/browse/TIKA-4048
             Project: Tika
          Issue Type: Bug
            Reporter: Gregory Lepore
         Attachments: rec-20230518121844489398-5335604b8b23.warc, 
rec-20230518121844489398-5335604b8b23.warc.gz, 
rec-20230518121844489398-5335604b8b23.warc.gz.json, 
rec-20230518121844489398-5335604b8b23.warc.json

The WARC parser works for non-gzipped WARC files, but for gzipped WARC files it 
appears that not all embedded files are identified.

 

Processing a WARC.GZ file should return the same JSON output as the plain WARC 
file, with the addition of the GZ file metadata. However, in the attached JSON 
outputs, the JPEG present in the plain WARC file is not represented in the 
WARC.GZ.json file.
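
One way to confirm the discrepancy is to compare the content types listed in the two JSON outputs. The sketch below is illustrative only: it assumes each attached JSON file is a list of metadata objects, one per container or embedded file (the shape produced by tika-app's recursive JSON output), and that each object carries a "Content-Type" key. The function names are hypothetical helpers, not part of Tika.

```python
import json
from collections import Counter


def load(path):
    """Load a JSON file assumed to hold a list of Tika metadata objects."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def content_types(entries):
    """Count Content-Type values across a list of metadata dicts."""
    counts = Counter()
    for entry in entries:
        value = entry.get("Content-Type", "unknown")
        # Multi-valued fields may be serialized as a list; take the first value.
        if isinstance(value, list):
            value = value[0] if value else "unknown"
        # Drop parameters such as "; charset=UTF-8" before counting.
        counts[value.split(";")[0].strip()] += 1
    return counts


def missing_types(plain_entries, gz_entries):
    """Content types that occur more often in the plain output than in the gz output.

    Counter subtraction keeps only positive differences, so the extra gzip
    container entry expected in the gz output does not show up here.
    """
    return content_types(plain_entries) - content_types(gz_entries)
```

With the attachments saved locally, calling missing_types(load("rec-20230518121844489398-5335604b8b23.warc.json"), load("rec-20230518121844489398-5335604b8b23.warc.gz.json")) would be expected to list the JPEG's content type if it is indeed absent from the gzipped output.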

 

Additionally, the warc: metadata is not being returned for all files, although 
this may be by design. 
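
To see which entries lack the warc: metadata, a small check along these lines may help. It is a sketch under the same assumption as above (the JSON output is a list of metadata objects); the helper name is hypothetical, and it only looks for keys that start with the "warc:" prefix mentioned in this report.

```python
def entries_without_prefix(entries, prefix="warc:"):
    """Return the indices of metadata dicts that have no key starting with prefix."""
    return [
        index
        for index, entry in enumerate(entries)
        if not any(key.startswith(prefix) for key in entry)
    ]
```

Running it over both JSON outputs shows whether the entries without warc: keys are only the container files or also embedded records.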

 

Attached are two JSON files, one for the gzipped WARC file and one for the 
plain WARC file, along with the two original files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
