[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864670#comment-17864670 ]
Tilman Hausherr commented on TIKA-4276:
---------------------------------------

Your file starts with "1 0 obj" instead of with "%PDF" so I'd say this isn't
a bug. The file is truncated at the beginning, and it could be truncated
anywhere. We'd need countless magic numbers.

> Tika fails to detect damaged pdf
> --------------------------------
>
>                 Key: TIKA-4276
>                 URL: https://issues.apache.org/jira/browse/TIKA-4276
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use Tika to check file type and extension. However, Tika detects some
> damaged PDF files as text files.
> We wonder whether you can make Tika detect a damaged PDF file as the PDF
> file type and extension.
> The sample code follows. The tika-config.xml and the sample PDF file are at
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04. Java version is 21. Tika version is
> 2.9.2 and POI version is 5.2.3.
>
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>
> import java.io.FileInputStream;
>
> public class DetectDamagedPDF {
>
>     public static void main(String args[]) {
>         try {
>             String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
>             Detector detector = config.getDetector();
>             Metadata metadata = new Metadata();
>             FileInputStream fis = new FileInputStream(filePath);
>             TikaInputStream stream = TikaInputStream.get(fis);
>             metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
>             MediaType mediaType = detector.detect(stream, metadata);
>             MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
>             String tikaExtension = mimeType.getExtension();
>             System.out.println("tikaExtension = " + tikaExtension);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
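
To illustrate the point in the comment on TIKA-4276: magic-based detection looks for a fixed byte signature at the start of the file, so a file whose first bytes are "1 0 obj" does not match "%PDF". The following is a minimal sketch of that idea in plain Java. It is not Tika's detector and does not use the Tika API; the class name PdfMagicCheck, its methods, and the sample byte strings are hypothetical and exist only for illustration.

```java
// Sketch only: PdfMagicCheck and its methods are hypothetical names, not Tika API.
final class PdfMagicCheck {

    // The signature mentioned in the comment: an intact PDF starts with "%PDF".
    private static final byte[] PDF_MAGIC =
            "%PDF".getBytes(java.nio.charset.StandardCharsets.US_ASCII);

    /** Returns true if data begins with every byte of prefix, in order. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if the leading bytes match the "%PDF" signature. */
    static boolean hasPdfHeader(byte[] data) {
        return startsWith(data, PDF_MAGIC);
    }

    public static void main(String[] args) {
        // Made-up sample content: one intact start, one truncated at the beginning.
        byte[] intact = "%PDF-1.7\n1 0 obj\n"
                .getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        byte[] truncated = "1 0 obj\n<< /Type /Catalog >>\n"
                .getBytes(java.nio.charset.StandardCharsets.US_ASCII);

        System.out.println("intact: " + hasPdfHeader(intact));       // prints "intact: true"
        System.out.println("truncated: " + hasPdfHeader(truncated)); // prints "truncated: false"
    }
}
```

A signature such as "1 0 obj" could be added as a second pattern, but as the comment notes, a file can be truncated at any offset, so no finite list of prefixes would cover every damaged file.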