Re: Methodology for Document Archival to digital media

Duncan Patton a Campbell Tue, 12 Feb 2008 21:35:18 -0800

On Wed, 13 Feb 2008 06:46:11 +0200
Lars Noodin <[EMAIL PROTECTED]> wrote:


> Duncan Patton a Campbell wrote:
> > The following is proposed as a base methodology for paper copy document
> > archival to digital media.
>
> Fixed that for you. ;)
>
> > ... subject each scanned page to the following processes:
> >
> > 1. page scanned to .pnm via (sane)
> > 2. OCR extract of text from .pnm (ocrad)
> > 3. conversion of .pnm image to ??? (gm convert)
>
> Use a lossless format for the output.  How the processes are tuned depends
> heavily on what is being scanned, what the scans are to be used for and
> how long they are to be kept.  Keep in mind that handling, especially
> scanning, is very damaging for older items.  Given the mention of OCR,
> I guess you will not handle hand-written manuscripts.
>
> So, be especially sure to use a lossless format instead.  JPEG is not
> only lossy, but the nature of the compression artifacts makes it
> particularly unsuitable (and harmful) for printed text and line
> drawings.  Most business forms, such as a tax form, can be considered a
> bit of both.
>
> gzipped RAW would be better than JPEG.  Don't let the system produce a
> lossy format on the first pass, and don't make it easy to do so.  Or
> else some MBA will set it to produce highly compressed, lossy images,
> thus fucking your documents by virtue of no more time/money to do a
> rescan.  You and your system will then get the blame for the bad
> quality.  PNG or GIF are the main options.
>
> Besides, for the grayscale and black and white, GIF will usually produce
> a much smaller file.  Are more than 254 levels of grey needed?
>

No.

These are all good comments, thanks.  JPEG is probably not the best
image format.

I am concerned with simplicity of procedure and mass availability.
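
For reference, the three per-page steps quoted above (scan to .pnm,
OCR, convert) could be sketched roughly as below.  This is only an
illustration, not the procedure actually used: PNG as the output format
follows the lossless suggestion in the reply, and the function names and
the file naming scheme (page_commands, archive_page, stem) are
hypothetical.  It assumes scanimage (SANE), ocrad and gm (GraphicsMagick)
are installed and that the default scanner is the one wanted.

```python
import subprocess


def page_commands(stem):
    """Build the command lines for one page without running anything.

    `stem` is a hypothetical base name such as "page0001".
    """
    pnm, txt, png = stem + ".pnm", stem + ".txt", stem + ".png"
    return {
        # scanimage writes the image to standard output
        "scan": ["scanimage", "--format=pnm"],
        # ocrad reads the .pnm and writes recognised text to the -o file
        "ocr": ["ocrad", "-o", txt, pnm],
        # PNG is lossless, so no detail from the scan is discarded
        "convert": ["gm", "convert", pnm, png],
    }


def archive_page(stem):
    """Run scan, OCR and conversion for one page; stop on the first failure."""
    cmds = page_commands(stem)
    with open(stem + ".pnm", "wb") as out:
        subprocess.run(cmds["scan"], stdout=out, check=True)
    subprocess.run(cmds["ocr"], check=True)
    subprocess.run(cmds["convert"], check=True)
```

Calling archive_page("page0001") would leave page0001.pnm, page0001.txt
and page0001.png in the current directory.  Keeping the .pnm until the
PNG has been checked is probably the safer choice.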

> > 4. storage of JPEG file name and OCR output to SQL record (MySQL)
>
> If you are just linking the file name to metadata like checksum, date,
> etc. then MySQL might be overkill.  If you are wondering what metadata
> to include, then consider a subset of Dublin Core.
>
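
As a rough illustration of the point above about linking a file name to
a checksum, a date and a few Dublin Core fields, here is a minimal
sketch.  It uses SQLite from the Python standard library purely as a
lightweight stand-in; that choice, the table layout, the dc_ column
names and the function names are assumptions made for the example, not
something proposed in the thread.

```python
import hashlib
import sqlite3
from datetime import date


def open_catalog(path=":memory:"):
    """Open (or create) the catalog; the default is an in-memory database."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS page (
               dc_identifier TEXT PRIMARY KEY,  -- image file name
               dc_format     TEXT,              -- MIME type of the image
               dc_date       TEXT,              -- scan date, ISO 8601
               dc_title      TEXT,
               checksum      TEXT,              -- SHA-256 of the image bytes
               ocr_text      TEXT
           )"""
    )
    return db


def add_page(db, file_name, image_bytes, ocr_text, title="", scanned=None):
    """Record one scanned page together with its checksum and OCR output."""
    scanned = scanned or date.today().isoformat()
    checksum = hashlib.sha256(image_bytes).hexdigest()
    db.execute(
        "INSERT INTO page VALUES (?, ?, ?, ?, ?, ?)",
        (file_name, "image/png", scanned, title, checksum, ocr_text),
    )
    db.commit()
    return checksum
```

The stored checksum lets a later pass detect silent corruption by
re-hashing the image file and comparing the result.
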
> If the originals are valuable, then it might be useful to take a very
> high-quality, large-film photo and then scan the film.  The film
> positive (or negative) can also be used then to produce research quality
> reprints.  Or to serve as a condolence reminder when manuscript thieves
> are done going through the shopping list you publish on the net.
>

Mostly this is not a problem, as the vast bulk of already printed material
has a low material value.

> If you have not already, look at the methodology behind projects like the
> Electronic Beowulf, The Making of America, or Project Runeberg.
>

Yes.

These are high-end projects, whereas this is more concerned with
mass application on low-end equipment ;-)

> If the system is to serve up a lot of images quickly, then also look at
> softraid for raid level 0 (striping) to speed up disk access:
>   http://www.openbsd.org/cgi-bin/man.cgi?query=softraid
>
>
>
> Regards,
> -Lars
>
>

Thanks,

Dhu
