[ 
https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-4276:
----------------------------------
    Description: 
We use Tika to detect file types and extensions. However, Tika detects some 
damaged PDF files as text files.

Could Tika detect such damaged PDF files as PDF, with the matching file type 
and extension?

Sample code follows. The tika-config.xml and the sample PDF file are available 
at [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]

The operating system is Ubuntu 20.04, the Java version is 21, the Tika version 
is 2.9.2, and the POI version is 5.2.3.

 

 
{code:java}
import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;

import java.io.FileInputStream;

public class DetectDamagedPDF {

    public static void main(String[] args) {
        String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
        String configPath = "/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml";
        // try-with-resources closes both streams, even if detection throws.
        try (FileInputStream fis = new FileInputStream(filePath);
             TikaInputStream stream = TikaInputStream.get(fis)) {
            TikaConfig config = new TikaConfig(configPath);
            Detector detector = config.getDetector();
            Metadata metadata = new Metadata();
            // Pass the file name along as a detection hint.
            metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
            MediaType mediaType = detector.detect(stream, metadata);
            // Look up the registered MimeType to get its preferred extension.
            MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
            String tikaExtension = mimeType.getExtension();
            System.out.println("tikaExtension = " + tikaExtension);
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
{code}
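
Until detection changes, a caller could add a cheap pre-check of its own. The following is only a sketch under a stated assumption: that the damage consists of junk bytes ahead of the "%PDF-" header. The issue does not say how the sample file is damaged, so this may not apply to it. PdfHeaderScan and findPdfHeader are hypothetical names, not part of Tika, and the 1024-byte window is an arbitrary choice modelled on the leniency some PDF readers are known to apply.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper (not part of Tika): looks for a displaced PDF header. */
final class PdfHeaderScan {

    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" marker that lies entirely within
     * the first {@code limit} bytes of {@code data}, or -1 if there is none.
     */
    static int findPdfHeader(byte[] data, int limit) {
        int end = Math.min(data.length, limit) - PDF_MARKER.length;
        for (int i = 0; i <= end; i++) {
            int j = 0;
            while (j < PDF_MARKER.length && data[i + j] == PDF_MARKER[j]) {
                j++;
            }
            if (j == PDF_MARKER.length) {
                return i;
            }
        }
        return -1;
    }

    /** Reads at most {@code limit} bytes from the start of the file and scans them. */
    static int findPdfHeader(String filePath, int limit) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get(filePath))) {
            byte[] prefix = in.readNBytes(limit);
            return findPdfHeader(prefix, limit);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PdfHeaderScan <file>");
            return;
        }
        int offset = findPdfHeader(args[0], 1024);
        System.out.println(offset >= 0
                ? "PDF header at offset " + offset
                : "no PDF header in the first 1024 bytes");
    }
}
```

An offset greater than 0 would suggest leading junk before the header. A caller could then treat the file as application/pdf even when the detector reports text/plain. A result of -1 means this particular assumption does not hold for the file.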
 

  was:
We use Tika to detect file types and extensions. However, Tika detects some 
damaged PDF files as text files.

Could Tika detect such damaged PDF files as PDF, with the matching file type 
and extension?

Sample code follows. The tika-config.xml and the sample PDF file are available 
at [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]

The operating system is Ubuntu 20.04, the Java version is 21, the Tika version 
is 2.9.2, and the POI version is 5.2.3.

 

 

import org.apache.tika.config.TikaConfig;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.mime.MediaType;
import org.apache.tika.mime.MimeType;

import java.io.FileInputStream;

public class DetectDamagedPDF {

    public static void main(String args[]) {
        try {
            String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
            TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
            Detector detector = config.getDetector();
            Metadata metadata = new Metadata();
            FileInputStream fis = new FileInputStream(filePath);
            TikaInputStream stream = TikaInputStream.get(fis);
            metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
            MediaType mediaType = detector.detect(stream, metadata);
            MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
            String tikaExtension = mimeType.getExtension();
            System.out.println("tikaExtension = " + tikaExtension);
        }
        catch(Exception ex) {
            ex.printStackTrace();
        }
    }
}

 


> Tika fails to detect damaged pdf
> --------------------------------
>
>                 Key: TIKA-4276
>                 URL: https://issues.apache.org/jira/browse/TIKA-4276
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Xiaohong Yang
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
