[jira] [Commented] (TIKA-4276) Tika fails to detect damaged pdf

Tilman Hausherr (Jira) Wed, 10 Jul 2024 06:54:06 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864670#comment-17864670
 ]


Tilman Hausherr commented on TIKA-4276:
---------------------------------------

Your file starts with "1 0 obj" instead of "%PDF", so I'd say this isn't a 
bug. The file is truncated at the beginning, and a file could be truncated 
anywhere, so we'd need countless magic numbers to cover every case.
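
To illustrate the header point, here is a minimal sketch of a signature check 
at offset 0. It is not Tika's actual detection code; the class and method 
names (PdfHeaderCheck, startsWithPdfHeader) are hypothetical, and the sample 
byte strings are made up for the demonstration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: shows why a file whose first bytes
// are "1 0 obj" cannot match a "%PDF" signature anchored at offset 0.
class PdfHeaderCheck {

    // The signature mentioned in the comment above.
    private static final byte[] PDF_MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    // Returns true only if the buffer begins with the "%PDF" signature.
    static boolean startsWithPdfHeader(byte[] head) {
        if (head.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (head[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample data: an intact header and one truncated at the start.
        byte[] intact = "%PDF-1.7\n1 0 obj\n".getBytes(StandardCharsets.US_ASCII);
        byte[] truncated = "1 0 obj\n<< /Type /Catalog >>\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println("intact: " + startsWithPdfHeader(intact));       // intact: true
        System.out.println("truncated: " + startsWithPdfHeader(truncated)); // truncated: false
    }
}
```

A file cut off at a different point would begin with yet another byte 
sequence, which is why no single extra signature would cover the general case.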

> Tika fails to detect damaged pdf
> --------------------------------
>
>                 Key: TIKA-4276
>                 URL: https://issues.apache.org/jira/browse/TIKA-4276
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use Tika to check file type and extension. However, Tika detects some 
> damaged PDF files as text files.
> Could you make Tika detect a damaged PDF file as the PDF file type and 
> extension?
> The sample code follows. The tika-config.xml and the sample PDF file are at 
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
> version is 2.9.2, and the POI version is 5.2.3.
>  
>  
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>  
> import java.io.FileInputStream;
>  
> public class DetectDamagedPDF {
>  
>     public static void main(String[] args) {
>         try {
>             String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
>             Detector detector = config.getDetector();
>             Metadata metadata = new Metadata();
>             FileInputStream fis = new FileInputStream(filePath);
>             TikaInputStream stream = TikaInputStream.get(fis);
>             metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
>             MediaType mediaType = detector.detect(stream, metadata);
>             MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
>             String tikaExtension = mimeType.getExtension();
>             System.out.println("tikaExtension = " + tikaExtension);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
