[ 
https://issues.apache.org/jira/browse/TIKA-4048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17747636#comment-17747636
 ] 

Tim Allison commented on TIKA-4048:
-----------------------------------

I looked into this. I don't think we can fix this on the Tika side. What we 
can do is revert the general default to "do not" decompress multiple 
(concatenated) gzip streams. Then, in the WARC wrapper/parser, we can set 
that option to "true".
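
To illustrate the behavior in question, here is a minimal sketch in Python 
using only the standard library. It is not Tika code. It assumes the gzipped 
WARC was written with each record as its own gzip member, which is the usual 
convention for .warc.gz files. The function names and the two sample records 
are hypothetical.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # tells zlib to expect a gzip header and trailer


def gunzip_first_member(data: bytes) -> bytes:
    """Decompress only the first gzip member; later members are ignored."""
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data) + d.flush()


def gunzip_all_members(data: bytes) -> bytes:
    """Decompress every concatenated gzip member in order."""
    out = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        out.append(d.decompress(data))
        out.append(d.flush())
        data = d.unused_data  # bytes following the member just decoded
    return b"".join(out)


# Hypothetical stand-ins for two WARC records, each compressed separately.
records = [b"WARC/1.0 record one\r\n", b"WARC/1.0 record two\r\n"]
warc_gz = b"".join(gzip.compress(r) for r in records)

print(gunzip_first_member(warc_gz))
print(gunzip_all_members(warc_gz))
```

A reader that stops after the first member sees only the first record, which 
would be consistent with embedded files going missing from the .warc.gz 
output. A reader that continues through all members sees every record.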

> Gzipped WARC not identifying all assets
> ---------------------------------------
>
>                 Key: TIKA-4048
>                 URL: https://issues.apache.org/jira/browse/TIKA-4048
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Gregory Lepore
>            Assignee: Tim Allison
>            Priority: Minor
>             Fix For: 2.8.1
>
>         Attachments: Screenshot 2023-05-30 at 3.49.19 PM.png, Screenshot 
> 2023-05-30 at 3.50.41 PM.png, rec-20230518121844489398-5335604b8b23.warc, 
> rec-20230518121844489398-5335604b8b23.warc.gz, 
> rec-20230518121844489398-5335604b8b23.warc.gz.json, 
> rec-20230518121844489398-5335604b8b23.warc.json
>
>
> The WARC parser works for non-gzipped WARC files, but for gzipped WARC files 
> it appears that not all embedded files are being identified.
>  
> Processing a WARC.GZ file should return the same JSON output as the plain 
> WARC file, with the addition of the GZ file metadata. However, in the 
> attached JSON outputs, the JPEG present in the plain WARC file is not 
> represented in the WARC.GZ.json file.
>  
> Additionally, the warc: metadata is not being returned for all files, 
> although this may be by design. 
>  
> Attached are two JSON files, one for the gzipped WARC file and one for the 
> plain WARC file, along with the two original files.



