On Wed, 13 Feb 2008 06:46:11 +0200 Lars Noodin <[EMAIL PROTECTED]> wrote:
> Duncan Patton a Campbell wrote:
> > The following is proposed as a base methodology for paper copy document
> > archival to digital media.
>
> Fixed that for you. ;)
>
> >... subject each scanned page to the following processes:
> >
> > 1. page scanned to .pnm via (sane)
> > 2. OCR extract of text from .pnm (ocrad)
> > 3. conversion of .pnm image to ??? (gm convert)
>
> Use a lossless format for the output.  How the processes are tuned depends
> heavily on what is being scanned, what the scans are to be used for and
> how long they are to be kept.  Keep in mind that handling, especially
> scanning, is very damaging for older items.  Given the mention of OCR,
> guess you will not handle hand-written manuscripts.
>
> So, be especially sure to use a lossless format instead.  JPEG is not
> only lossy, but the nature of the compression artifacts makes it
> particularly unsuitable (and harmful) for printed text and line
> drawings.  Most business forms, such as a tax form, can be considered a
> bit of both.
>
> gzipped RAW would be better than JPEG.  Don't let the system produce a
> lossy format on the first pass, and don't make it easy to do so.  Or
> else some MBA will set it to produce highly compressed, lossy images,
> thus fucking your documents by virtue of no more time/money to do a
> rescan.  You and your system will then get the blame for the bad
> quality.  PNG or GIF are the main options.
>
> Besides, for the grayscale and black and white, GIF will usually produce
> a much smaller file.  Are more than 254 levels of grey needed?
>

No.  These are all good comments, thanks.  JPEG is probably not the best
image format.  I am concerned with simplicity of procedure and mass
availability.

> > 4. storage of JPEG file name and OCR output to SQL record (MySQL)
>
> If you are just linking the file name to metadata like checksum, date,
> etc. then MySQL might be overkill.  If you are wondering what metadata
> to include, then consider a subset of Dublin Core.
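
For what it is worth, steps 1 to 4 with a lossless output could be sketched
roughly as below.  This is only an illustration under assumptions, not a
tested procedure: Python's built-in sqlite3 stands in for MySQL (following
the "might be overkill" remark), PNG is picked as the lossless target, and
the helper names (pipeline_commands, run_page, store_record), the table
layout and the choice of Dublin Core elements (identifier, title, date,
format) plus a SHA-256 checksum are all made up for the example.
Scanner-specific scanimage options such as mode and resolution are left out
because they depend on the SANE backend in use.

```python
import hashlib
import sqlite3
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def pipeline_commands(stem):
    """Return the external commands for one page as argument lists."""
    pnm, txt, png = f"{stem}.pnm", f"{stem}.txt", f"{stem}.png"
    return {
        # scanimage writes the image to stdout; run_page redirects it.
        "scan": ["scanimage", "--format=pnm"],
        # ocrad reads the .pnm and writes the recognised text to txt.
        "ocr": ["ocrad", "-o", txt, pnm],
        # GraphicsMagick picks the output format from the file suffix.
        "convert": ["gm", "convert", pnm, png],
    }


def run_page(stem):
    """Scan, OCR and convert one page (needs sane, ocrad and gm installed)."""
    cmds = pipeline_commands(stem)
    with open(f"{stem}.pnm", "wb") as out:
        subprocess.run(cmds["scan"], stdout=out, check=True)
    subprocess.run(cmds["ocr"], check=True)
    subprocess.run(cmds["convert"], check=True)


def sha256_of(path):
    """Checksum the stored image so later corruption can be detected."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_record(db, image_path, ocr_text, title=""):
    """Link the image file name to its OCR text and a little metadata."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "identifier TEXT PRIMARY KEY, title TEXT, date TEXT, "
        "format TEXT, checksum TEXT, ocr_text TEXT)"
    )
    db.execute(
        "INSERT INTO pages VALUES (?, ?, ?, ?, ?, ?)",
        (
            Path(image_path).name,                          # identifier
            title,                                          # title
            datetime.now(timezone.utc).date().isoformat(),  # date archived
            "image/png",                                    # format
            sha256_of(image_path),                          # checksum
            ocr_text,
        ),
    )
    db.commit()
```

Per page this would be run_page("page0001"), then store_record(db,
"page0001.png", Path("page0001.txt").read_text()) with db opened via
sqlite3.connect().  Keeping the commands as plain argument lists makes it
harder for someone to quietly swap in a lossy output format later.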
>
> If the originals are valuable, then it might be useful to take a very
> high-quality, large-film photo and then scan the film.  The film
> positive (or negative) can also be used then to produce research quality
> reprints.  Or to serve as a condolence reminder when manuscript thieves
> are done going through the shopping list you publish on the net.
>

Mostly this is not a problem, as the vast bulk of already printed material
has a low material value.

> If you have not already, look at the methodology behind projects like the
> Electronic Beowulf, The Making of America, or Project Runeberg.
>

Yes.  Those are high-end projects, whereas this is more concerned with
mass application on low-end equipment ;-)

> If the system is to serve up a lot of images quickly, then also look at
> softraid for raid level 0 (striping) to speed up disk access:
>    http://www.openbsd.org/cgi-bin/man.cgi?query=softraid
>
> Regards,
> -Lars
>

Thanks,

Dhu