Re: [O] [OT] Scanning for archiving

Johnny Wed, 09 Nov 2011 01:13:38 -0800

Apologies for top-posting, but my comment is only inspired by the
conversation and doesn't exactly build on it, so here we go.


I use predominantly pdf in scanning, for one main reason only - it
handles *metadata* nicely (with gscan2pdf). This is nice for searching
later. When playing with DjVu, I didn't find an easy way to amend
metadata - is there any good working method and tools to recommend for
adding metadata for DjVu files?

Thanks.

Pieter Praet <pie...@praet.org> writes:

> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devn...@karl-voit.at> wrote:
>> Hi!
>> 
>> Inspired by «Total Recall»[3], a book by two MS Research guys, I
>> started life logging on my own two months ago.
>> 
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format.  One would think they'd walk
> their talk [1], no?
>
>> [...]
>> * Pieter Praet <pie...@praet.org> wrote:
>> >
>> > Using PDF for scanned documents results in *huge* files with a seriously
>> > disappointing image quality.  
>> 
>> I cannot reproduce that at all:
>> 
>> ,----
>> | vk@gary ~2d % l 2011-11-02_13-22-45.png
>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d %
>> `----
>> 
>> In this example, the compression of PDF is much better than the
>> original PNG one. PDF is only a container format.
>> 
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%.  Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.
>
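As a hedged aside on the compression point above: a minimal sketch of
choosing the compression explicitly when wrapping a scan in a PDF, rather
than relying on whatever default `convert' picks. It assumes ImageMagick's
`convert' is installed; `png_to_pdf' and `scan.png' are made-up names for
illustration. `-compress Zip' requests lossless compression, while
`-compress JPEG' together with `-quality' requests lossy JPEG.

```shell
# png_to_pdf is a hypothetical helper, not part of ImageMagick.
# Usage: png_to_pdf INPUT.png [lossless|jpeg] [JPEG_QUALITY]
png_to_pdf() {
    src=$1
    mode=${2:-lossless}
    quality=${3:-90}
    out="${src%.png}.pdf"    # scan.png -> scan.pdf
    case $mode in
        # lossless: keeps every pixel, suited to text and line art
        lossless) convert "$src" -compress Zip "$out" ;;
        # jpeg: lossy, quality given as 1-100
        jpeg)     convert "$src" -compress JPEG -quality "$quality" "$out" ;;
        *)        echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

# Example (commented out so that sourcing this has no side effects):
# png_to_pdf scan.png lossless && ls -l scan.png scan.pdf
```

Comparing the `ls -l' sizes for both modes on the same scan shows how much
of any saving comes from lossy compression rather than from the container.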
>
> I do admit that the whole quality vs. filesize statement I made
> regarding using PDF for scanned documents wasn't entirely correct:
> I cut some corners.
>
> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.
>
> JPEG was developed for storing images with smooth transitions and a high
> bit depth (i.e. photographs), not hard transitions and a low bit depth
> (i.e. documents), so you're likely to suffer a noticeable degradation in
> text quality, even when using 1:1 JPEG compression.
>
> You're using PNG compression though, so the whole JPEG deal doesn't apply.
>
> So, that just leaves the neverending stream of PDF security issues :)
>
>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>> 
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>> 
>
> How about the Million Book Project / Universal Digital Library [2] ?
> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.
>
> The list of participants and partners [3] (not to mention the magnitude
> and cost of their undertaking) is reason enough (for me, at least) to
> assume that DjVu is deemed to be rather future-proof.
>
> I'm guessing ISO standardization will be only a matter of time.
>
>> The goals of DjVu sound great but I get everything with PDF too.
>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>> 
>
> Ahhh, VHS vs. Betamax, over and over again...
>
> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to.  You don't *need* all devices/software to support the
> superior format.  Just get the ones that do (if there are any...), try
> to enlighten the people in your monkeysphere [4], and then let the free
> market do its work.  Joe Average Consumer will eventually follow (unless
> pornography is at stake, apparently), and the industry will be right on
> his tail.
>
>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware.  I find that rather disconcerting, to be honest :D
>
>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>> 
>
> For your sake, I hope you're right!
>
>> For scanned images I'd prefer PNG instead but the OS X Software of
>> my OfficeJet offers me the ability to generate PDF files where an
>> OCR software adds a searchable text layer above the scanned text.
>> This is *very* important to me since I am able to do full text
>> search on the content of my archived documents.
>> 
>
> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:
>
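>   #+begin_src sh
>     # Loop over the glob directly; parsing `ls' output breaks on odd filenames.
>     for i in "${HOME}"/msg/paper/inbox/*.png ; do
>         # tesseract appends ".txt" to the output base, giving <scan>.png.txt
>         tesseract "${i}" "${i}"
>     done
>   #+end_src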
>
> That way you can access your data even on text-only machines,
> and full-text search is only a `grep' away.
>
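To make the "only a `grep' away" remark concrete, here is a small sketch of
a full-text search over the OCR text files. It assumes the text files sit
next to the scans and end in `.txt'; `search_scans' is a hypothetical helper
name, and the default directory is simply the one used in the loop above.

```shell
# search_scans TERM [DIR]: list OCR text files that mention TERM.
search_scans() {
    term=$1
    dir=${2:-"$HOME/msg/paper/inbox"}
    # -i  ignore case
    # -l  print only the names of matching files
    # --  so that a TERM starting with "-" is not taken as an option
    grep -il -- "$term" "$dir"/*.txt
}

# Example: search_scans invoice
```

Each matching text file name points back to the scan it was generated from.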
>> And I plan to archive *all* of my documents. Really all of them.
>> 
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.
>
>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possibly fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>> 
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.  And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.
>
> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite.  Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.
>
>> [...]
>> 
>> Funny side fact: grayscale scan document settings produce slightly
>> larger files than colored ones.
>> 
>
> That's odd.  Probably depends on which type of compression is used.
>
>> > gscan2pdf also supports a number of OCR utils, but the UI for this is
>> > clumsy (aren't they all...), so you're better off using the CLI tools
>> > directly.  Tesseract is recommended.
>> 
>> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
>> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
>> PDF documents (OCR text above the scanned images) on GNU/Linux.
>> Unfortunately none of those (very cool projects) produced reliable
>> results on my side. The results vary from «no error but overlay font
>> size is incorrect and produces loss of layout» to «library error
>> messages I can not read or handle». 
>> 
>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>> 
>
> Sadly, I can only agree with this.  Google's involvement in Tesseract
> and OCRopus does instill hope though :)
>
>> > NOTE: When attempting something like this, a fast scanner with a *reliable*
>> > automatic document feeder will help prevent premature hair loss ;)
>> 
>> I have found several scanner products I was interested in:
>> 
>> "Canon imageFORMULA P-150": very small form factor with basic Linux
>> support. Price tag starts at € 260. Neat form factor and very
>> portable. Different version "P-150m" for Mac OS X.
>> 
>> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>> 
>> I ended up with the Office Jet Pro (mentioned above) at € 250
>> because I got flatbed scanner *and* ADF-scanner *and* a
>> full-duplex/full-color network printer with a very good
>> price-per-printed-page-ratio (better than many laser printers!). And
>> all of this with a cheaper price tag than any scan-only-product I
>> was interested in.
>> 
>> So far I am almost satisfied. «Almost»? Well, HP did a good job with
>> this printer but they made only a 90% solution on almost all levels.
>> Whereas 100% would be possible with small additional effort when
>> creating the printer. But those resulting 90% are pretty usable.
>> 
>>   3. http://qr.cx/sAHU
>> -- 
>> Karl Voit
>> 
>> 
>
>
> Peace

-- 
Johnny
