The following is proposed as a base methodology for hardcopy document archival to digital media.
The problem consists of making available, via the internet, the large body of human knowledge contained exclusively in paper books, in a manner that is searchable by electronic mechanisms. String search, inverted word lists and semantic network composition should be enabled without undue degradation of the original material's display properties. Pictures, tables and other graphical materials should be preserved in context even if they are not incorporated in a searchable format.

In order to achieve a final output that retains the formatted structure of the original while being searchable by machine, the ingestion process should record the original graphic layout in image format and also produce a mechanically derived extract of the original material's text content.

In this demonstration a text of no special significance is scanned as images, with each page being subjected to optical character recognition (OCR). The named image of each page is then stored in JPEG format, with the OCR extract for that page being recorded in a searchable database.

A program allowing a user to scan pages and record them to an SQL database was written in gprolog. It subjects each scanned page to the following processes:

1. Scan of the page to .pnm (sane)
2. OCR extract of the text from the .pnm (ocrad)
3. Conversion of the .pnm image to JPEG (gm convert)
4. Storage of the JPEG file name and OCR output to an SQL record (MySQL)

The operator is presented with some choices when scanning pages: the most applicable bit depth for the image (black & white, grayscale, color) and the option to redo the scan of a specific page. These are minor optimizations available when hand scanning that can reduce the underlying storage requirements by some 2/3 as opposed to scanning everything in color.

A trivial search/display mechanism was also constructed for a LAMP server using PHP/MySQL.
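For readers who want a feel for the per-page flow, the four steps and the simple token search can be sketched roughly as follows. This is not the gprolog program from the project, only a minimal Python illustration under stated assumptions: sqlite3 stands in for MySQL so the sketch is self-contained, the table and column names (pages, page_no, image, ocr_text) and all function names are hypothetical, and the SANE mode names are backend-dependent and may differ on your scanner.

```python
import sqlite3
import subprocess

# Assumed mapping from the operator's choice to a SANE --mode value.
# Mode names are backend-dependent; check `scanimage --help` for your device.
SCAN_MODES = {"bw": "Lineart", "gray": "Gray", "color": "Color"}


def page_commands(stem, depth="gray"):
    """Build the argv lists for steps 1-3 for one page (nothing is executed)."""
    pnm, jpg = f"{stem}.pnm", f"{stem}.jpg"
    return {
        # scanimage writes the image to stdout; the caller redirects it to pnm
        "scan": ["scanimage", "--mode", SCAN_MODES[depth], "--format=pnm"],
        # ocrad prints the recognised text to stdout
        "ocr": ["ocrad", pnm],
        # GraphicsMagick picks the output format from the file extension
        "jpeg": ["gm", "convert", pnm, jpg],
    }


def process_page(stem, depth="gray"):
    """Run steps 1-3 with the external tools; return (jpeg_name, ocr_text)."""
    cmds = page_commands(stem, depth)
    with open(f"{stem}.pnm", "wb") as out:
        subprocess.run(cmds["scan"], stdout=out, check=True)
    ocr = subprocess.run(cmds["ocr"], capture_output=True, text=True, check=True)
    subprocess.run(cmds["jpeg"], check=True)
    return f"{stem}.jpg", ocr.stdout


def open_store(path=":memory:"):
    """Open the page store (sqlite3 here, as a stand-in for MySQL)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "page_no INTEGER PRIMARY KEY, image TEXT NOT NULL, ocr_text TEXT NOT NULL)"
    )
    return db


def store_page(db, page_no, image, ocr_text):
    """Step 4: record the JPEG file name and OCR text for one page."""
    # INSERT OR REPLACE lets a redone scan overwrite the earlier record.
    db.execute(
        "INSERT OR REPLACE INTO pages VALUES (?, ?, ?)", (page_no, image, ocr_text)
    )
    db.commit()


def search(db, token):
    """Simple token search: pages whose OCR text contains the token."""
    rows = db.execute(
        "SELECT page_no, image FROM pages WHERE ocr_text LIKE ? ORDER BY page_no",
        (f"%{token}%",),
    )
    return rows.fetchall()
```

A scanning session would call process_page for each page and pass the result to store_page; search then returns the page numbers and image names to display. Keeping the image name and the OCR text in the same record is what lets a text match lead straight back to the original page layout.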
The current mechanism only allows for simple token search, but because both the underlying image and the OCR-extracted text are preserved in correlation, much more sophisticated search mechanisms are possible. An eventual standard for such material (XML?), regardless of the actual implementation, should consider retaining the functional characteristics described by this project. Concerns such as copyright, access and security should not be addressed at the physical storage layer but as part of a further access protocol not described or considered here.

This project can be viewed at http://neotext.ca/CPetrol and the sources downloaded from http://neotext.ca/CPetrol/CPetrol_code.tar.gz. A convenient search token to try might be "oligocene".

Further information can be obtained by emailing me at [EMAIL PROTECTED]

Dhu