[ https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423040#comment-16423040 ]

Ewan Mellor commented on TIKA-2624:
-----------------------------------

There were definitely changes between 1.8 and 2.0, e.g. PDFBOX-1963. I think it's always been 1 == 72dpi though; I see that in their doc-comments dating back to 2014. Of course, it could easily have been buggy back then.

> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> ---------------------------------------------------------------------
>
>                 Key: TIKA-2624
>                 URL: https://issues.apache.org/jira/browse/TIKA-2624
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.17
>            Reporter: Ewan Mellor
>            Assignee: Tim Allison
>            Priority: Major
>
> Tika has two properties in {{PDFParser.properties}} that control what happens
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract
> for OCR. These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default
> 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which uses it as the
> metadata in the image (i.e. it doesn't control scaling at all, it's just an
> advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which
> uses it to specify the scale for rendering. This value is such that 1.0 ==
> 72dpi, and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then
> advertising 300dpi in the image metadata. This makes no sense to me, and is
> surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same
> DPI value in both places.
> We should keep the existing default DPI value, since Tesseract is trained at
> 300dpi by default, so this will mean that all stages between PDFRenderer and
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the
> PDF rendering and Tesseract will be 4x larger (144dpi to 300dpi). This will
> have a memory and temporary disk space impact, but I think that it's still
> best to have the whole pipeline using 300dpi. People who have memory
> constraints will need to reduce ocrDPI and make the corresponding changes on
> the Tesseract side.
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
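
For anyone checking the numbers in the description, the scale/DPI arithmetic can be sketched in plain Java as below. This is only an illustration and is not taken from the Tika or PDFBox sources: the class name OcrDpiSketch and its helper methods are hypothetical, and the only fact it relies on is the one stated above, that a render scale of 1.0 corresponds to 72dpi. The US Letter page size (612 x 792 points) is just an assumed example input.

```java
// Hypothetical helper, not part of Tika or PDFBox. It assumes only that a
// render scale of 1.0 corresponds to 72dpi, as stated in the issue description.
public final class OcrDpiSketch {

    /** Assumed relationship from the issue: scale 1.0 == 72dpi. */
    public static final float DPI_AT_SCALE_ONE = 72f;

    /** Effective rendering DPI for a given ocrImageScale-style value. */
    public static float scaleToDpi(float scale) {
        return scale * DPI_AT_SCALE_ONE;
    }

    /** Scale that would be needed to actually render at the given DPI. */
    public static float dpiToScale(float dpi) {
        return dpi / DPI_AT_SCALE_ONE;
    }

    /** Pixel length of a page dimension given in points (1/72 inch). */
    public static int pixels(float points, float dpi) {
        return Math.round(points / DPI_AT_SCALE_ONE * dpi);
    }

    /** Ratio of pixel counts between two DPIs; area grows with the square. */
    public static double pixelRatio(double fromDpi, double toDpi) {
        double linear = toDpi / fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        float defaultScale = 2.0f;   // ocrImageScale default quoted in the issue
        float advertisedDpi = 300f;  // ocrDPI default quoted in the issue

        System.out.println("rendered dpi    = " + scaleToDpi(defaultScale));
        System.out.println("advertised dpi  = " + advertisedDpi);
        System.out.println("scale for 300   = " + dpiToScale(advertisedDpi));

        // Assumed example page: US Letter, 612 x 792 points.
        System.out.println("letter @ 144dpi = "
                + pixels(612f, 144f) + " x " + pixels(792f, 144f));
        System.out.println("letter @ 300dpi = "
                + pixels(612f, 300f) + " x " + pixels(792f, 300f));
        System.out.println("pixel ratio     = " + pixelRatio(144, 300));
    }
}
```

Under these assumptions the default scale of 2.0 renders at 144dpi, a Letter page grows from 1224 x 1584 to 2550 x 3300 pixels at 300dpi, and the pixel-count ratio is (300/144)^2, a little over 4, which matches the "4x larger" estimate in the description.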