[CODE4LIB] Creating pdfs from images and their text

Padraic Stack Thu, 16 Jan 2014 09:21:34 -0800

Hi folks,

I have a number of typescript / manuscript images on which it is quite time consuming to run OCR. (Or more accurately it is quite time consuming to correct the OCR).

For some of these I have text files containing accurate transcriptions. In other cases I have TEI files with these transcriptions.
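
To illustrate the TEI case: one way to treat both kinds of transcription alike would be to reduce each TEI file to plain text first. The following is only a minimal sketch, assuming the files use the standard TEI P5 namespace and keep the transcription under text/body; the function name tei_body_text is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the files declare the standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_body_text(tei_xml: str) -> str:
    """Return the transcription inside <text><body> as plain text.

    Markup is dropped, runs of whitespace are collapsed, and empty
    lines are removed. Returns "" if no <body> element is found.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return ""
    # itertext() yields the text of every descendant in document order.
    raw = "".join(body.itertext())
    lines = [" ".join(line.split()) for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)
```

Anything in the TEI header (metadata, editorial notes) is ignored, so only the transcription itself would end up behind the image.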

What is a straightforward way to combine the text with overlaid images to create searchable pdfs?

I know my way around the command line and can follow tutorials but I'm not a programmer so the more straightforward the solution the better.

I have had a go with pdftkBuilder and a result can be seen here [https://www.dropbox.com/s/fxp6rnt24043aez/result3.pdf] but there are a number of problems:

1. it involves 'printing' the text to pdf and 'stamping' the image over it. The result entails a margin unless the image matches a standard paper size.
2. the underlying text doesn't match up to the image. I would love if it could but can live with it if it can't.
3. it is very time consuming - ideally I would like a solution that could be scripted and left to run.
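
On the margin in point 1: the margin goes away if the PDF page is made exactly the size of the image rather than a standard paper size. PDF page sizes are measured in points, with 72 points to the inch, so the arithmetic can be sketched as below. The function name page_size_points is made up for this example, and the scan resolution (dpi) is an assumption that has to come from the image metadata or the scanning settings.

```python
from typing import Tuple


def page_size_points(width_px: int, height_px: int, dpi: float) -> Tuple[float, float]:
    """Convert pixel dimensions to a PDF page size in points.

    One inch is 72 points, so points = pixels / dpi * 72.
    """
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (width_px * 72.0 / dpi, height_px * 72.0 / dpi)
```

For instance, a 2550 x 3300 pixel scan at 300 dpi works out to 612 x 792 points, which happens to be US Letter; any other pixel size simply gives a non-standard page that fits the image with no margin.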


Any advice would be greatly appreciated.


The best I have

--

Padraic


Padraic Stack | Digital Humanities Support Officer | NUI Maynooth | 
[email protected] |Phone: Mon: 01 474 7187 Tue - Fri: 01 474 7197
