Methodology for Document Archival to digital media

Duncan Patton a Campbell Tue, 12 Feb 2008 18:13:50 -0800

The following is proposed as a base methodology for hardcopy document 
archival to digital media.


The problem consists of making available, via the internet, the large corpus
of human knowledge contained exclusively in paper books, in a manner that
is searchable via electronic mechanisms: string search, inverted word
lists and semantic network composition should be enabled without undue
degradation of the original material's display properties.  Pictures,
tables and other graphical materials should be preserved in context
even if they are not incorporated in a searchable format.

In order to achieve a final output that retains the formatted structure of 
the original while being searchable by machine, the process of induction 
should record the original graphic layout in image format and also produce 
a mechanically derived extract of the original material's text content.  

In this demonstration a text of no special account is scanned as images,
with each page being subject to optical character recognition.  The
named image of each page is then stored in JPEG format with the OCR
extract for that page being recorded in a searchable database.  
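As an illustration only, the per-page record described above (image name plus
OCR extract) might look like the following sketch.  It uses Python's built-in
sqlite3 module as a stand-in for MySQL, and the table name, column names and
function names are hypothetical, not taken from the project's sources.

```python
import sqlite3

def open_archive(path=":memory:"):
    """Create (if needed) a one-table archive: one row per scanned page."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " page_no INTEGER PRIMARY KEY,"
        " image_name TEXT NOT NULL,"   # name of the stored JPEG file
        " ocr_text TEXT NOT NULL)"     # OCR extract for the same page
    )
    return db

def store_page(db, page_no, image_name, ocr_text):
    # INSERT OR REPLACE lets a rescanned page overwrite its earlier record.
    db.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)",
        (page_no, image_name, ocr_text),
    )
    db.commit()

def find_pages(db, token):
    """Return (page_no, image_name) for pages whose OCR text contains token."""
    rows = db.execute(
        "SELECT page_no, image_name FROM pages"
        " WHERE ocr_text LIKE ? ORDER BY page_no",
        ("%" + token + "%",),
    )
    return rows.fetchall()
```

Because each row keeps the image name next to the text, a text hit leads
directly back to the page image.  Note that LIKE treats % and _ in the token
as wildcards, so a real front end would need to escape them.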

A program allowing a user to scan pages and record them to a (SQL) database 
was written (in gprolog) to subject each scanned page to the following 
processes:

1. page scanned to .pnm via (sane)
2. OCR extract of text from .pnm (ocrad)
3. conversion of .pnm image to JPEG (gm convert)
4. storage of JPEG file name and OCR output to SQL record (MySQL)
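
Steps 1 to 3 above can be sketched as a chain of external commands.  The
original program is written in gprolog; the Python below is only an
illustration, and its function names and file naming are hypothetical.  The
scanimage options --mode and --resolution, and the accepted mode names, depend
on the SANE backend in use, so the values shown are assumptions.

```python
import subprocess

def scan_command(mode="Gray", resolution=300):
    # Step 1: scanimage (SANE's command-line front end) writes the image
    # to stdout; mode and resolution values are backend-dependent assumptions.
    return ["scanimage", "--format=pnm", "--mode", mode,
            "--resolution", str(resolution)]

def ocr_command(pnm_path):
    # Step 2: ocrad prints the recognized text to stdout by default.
    return ["ocrad", pnm_path]

def jpeg_command(pnm_path, jpeg_path):
    # Step 3: GraphicsMagick picks the output format from the file extension.
    return ["gm", "convert", pnm_path, jpeg_path]

def process_page(stem, mode="Gray", run=subprocess.run):
    """Run steps 1-3 for one page; return (jpeg_name, ocr_text) for step 4."""
    pnm_path, jpeg_path = stem + ".pnm", stem + ".jpg"
    with open(pnm_path, "wb") as out:
        run(scan_command(mode), stdout=out, check=True)
    ocr = run(ocr_command(pnm_path), capture_output=True, text=True, check=True)
    run(jpeg_command(pnm_path, jpeg_path), check=True)
    return jpeg_path, ocr.stdout
```

The returned pair is what step 4 would write to the SQL record.  Passing the
runner in as a parameter keeps the sketch testable on a machine without a
scanner attached.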

The operator is presented with some choices when scanning pages: the most 
applicable bit depth for the image (black&white, grayscale, color) and the 
option to redo the scan of a specific page.  These are minor optimizations 
available when hand scanning that can reduce the underlying storage 
requirements by some 2/3 as opposed to scanning everything in color.
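
The 2/3 figure is consistent with simple arithmetic on uncompressed scans: an
8-bit grayscale pixel takes one third of the space of a 24-bit color pixel.
The sketch below assumes a letter-size page (8.5 x 11 inches) scanned at
300 dpi; both values are illustrative and not taken from the project, and the
stored JPEG sizes will differ because JPEG is compressed.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of one scanned page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Bits per pixel for the three scan modes offered to the operator.
BITS = {"black&white": 1, "grayscale": 8, "color": 24}

sizes = {name: raw_page_bytes(8.5, 11, 300, bits) for name, bits in BITS.items()}

# Fraction of raw storage saved by scanning in grayscale instead of color.
saving = 1 - sizes["grayscale"] / sizes["color"]

for name, size in sizes.items():
    print(f"{name:12s} {size:>12,d} bytes")
print(f"grayscale saves {saving:.0%} relative to color")
```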

A trivial search/display mechanism was also then constructed for a LAMP server
using PHP/MySQL.  The current mechanism only allows for simple token search, but
because both the underlying image and OCR extracted text are preserved in 
correlation, much more sophisticated search mechanisms are possible.
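
One such mechanism, the inverted word list, can be sketched in a few lines.
This is an illustration in Python and not part of the PHP/MySQL front end;
the function names and the tokenizing rule (lowercase alphabetic runs only)
are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase alphabetic runs; digits and punctuation are dropped.
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """pages maps image name -> OCR text; returns token -> set of image names."""
    index = defaultdict(set)
    for image_name, text in pages.items():
        for token in tokenize(text):
            index[token].add(image_name)
    return index

def search(index, query):
    """Return the sorted image names whose text contains every query token."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # index.get avoids creating empty entries for unknown tokens.
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(hits)
```

Since the index maps words to image names, a multi-word query still resolves
to the page images on which all of the words appear.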

An eventual standard for such material, regardless of the actual implementation
(XML?), should consider retaining the functional characteristics described by 
this project.  Concerns such as copyright access and security should not be 
addressed at the physical storage layer but as part of a further access 
protocol not described or considered here.

This project can be viewed at http://neotext.ca/CPetrol and sources downloaded 
from http://neotext.ca/CPetrol/CPetrol_code.tar.gz.  A convenient search token 
to view might be oligocene.

Further information can be obtained by emailing me at [EMAIL PROTECTED]

Dhu
