Apologies for top-posting, but my comment is only inspired by the conversation and doesn't exactly build on it, so here we go.
I predominantly use PDF for scanning, for one main reason: it handles
*metadata* nicely (with gscan2pdf), which helps when searching later.
When playing with DjVu, I didn't find an easy way to amend metadata. Is
there a good working method, or are there tools you can recommend, for
adding metadata to DjVu files?

Thanks.

Pieter Praet <pie...@praet.org> writes:

> On Mon, 7 Nov 2011 18:44:24 +0100, Karl Voit <devn...@karl-voit.at> wrote:
>> Hi!
>>
>> Inspired by «Total Recall»[3], a book of two MS Research guys, I
>> started life logging on my own two months ago.
>>
>
> Dammit, that's been on my reading list for almost 2 years now, and
> *still* it isn't available in ebook format. One would think they'd walk
> their talk [1], no?
>
>> [...]
>> * Pieter Praet <pie...@praet.org> wrote:
>> >
>> > Using PDF for scanned documents results in *huge* files with a seriously
>> > disappointing image quality.
>>
>> I can not copy that at all:
>>
>> ,----
>> | vk@gary ~2d % l 2011-11-02_13-22-45.png
>> | -rw------- 1 vk vk 103150 2011-11-02 13:22 2011-11-02_13-22-45.png
>> | vk@gary ~2d % convert 2011-11-02_13-22-45.png 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d % l 2011-11-02_13-22-45.pdf
>> | -rw-r--r-- 1 vk vk 96457 2011-11-07 18:12 2011-11-02_13-22-45.pdf
>> | vk@gary ~2d %
>> `----
>>
>> In this example, the compression of PDF is much better than the
>> original PNG one. PDF is only a container format.
>>
>
> The conversion to PDF has indeed reduced the filesize, but not for the
> reasons you might think: If you don't explicitly provide ImageMagick's
> `convert' with a compression level (`-quality' option), it will use a
> default of 75%. Thus I (perhaps incorrectly) infer that you've just
> lost 25% of the image quality for a meager 7% reduction in filesize.
>
>
> I do admit that the whole quality vs. filesize statement I made
> regarding using PDF for scanned documents wasn't entirely correct:
> I cut some corners.
>
> The real issue is that most folks use their scanner software to save
> directly to PDF, and for some reason, scanner software (especially the
> proprietary variety) predominantly uses JPEG compression as default when
> saving to PDF.
>
> JPEG was developed for storing images with smooth transitions and a high
> bit depth (i.e. photographs), not hard transitions and a low bit depth
> (i.e. documents), so you're likely to suffer a noticeable degradation in
> text quality, even when using 1:1 JPEG compression.
>
> You're using PNG compression though, so the whole JPEG deal doesn't apply.
>
> So, that just leaves the neverending stream of PDF security issues :)
>
>> > Consider storing your scans in DjVu format
>> > [1], which was developed specifically for this purpose.
>>
>> PDF is a common standard whereas DjVu is something I - as an
>> advanced computer user - never faced before in real life. I am not
>> sure whether any of my computers can handle DjVu files at all.
>>
>
> How about the Million Book Project / Universal Digital Library [2] ?
> Even though every computing device is most likely to support PDF, their
> collection is only available in TIFF and DjVu format.
>
> The list of participants and partners [3] (not to mention the magnitude
> and cost of their undertaking) is reason enough (for me, at least) to
> assume that DjVu is deemed to be rather future-proof.
>
> I'm guessing ISO standardization will be only a matter of time.
>
>> The goals of DjVu sound great but I get everything with PDF too.
>> Although I like the idea of OGG Vorbis, I re-ripped all my CDs using
>> mp3 again because I could not use many music devices or music
>> management software packages.
>>
>
> Ahhh, VHS vs. Betamax, over and over again...
>
> Companies only succeed in getting everyone stuck with mediocre tools if
> we allow them to. You don't *need* all devices/software to support the
> superior format.
> Just get the ones that do (if there are any...), try
> to enlighten the people in your monkeysphere [4], and then let the free
> market do its work. Joe Average Consumer will eventually follow (unless
> pornography is at stake, apparently), and the industry will be right on
> his tail.
>
>> I stick to the format *any* computer can handle without special
>> software products. [...]
>
> Somehow this implies that *every* computer is infected with Adobe's
> malware. I find that rather disconcerting, to be honest :D
>
>> [...] And I do think that I get a higher chance of
>> being able to read my documents twenty years from now.
>>
>
> For your sake, I hope you're right!
>
>> For scanned images I'd prefer PNG instead but the OS X Software of
>> my OfficeJet offers me the ability to generate PDF files where an
>> OCR software adds a searchable text layer above the scanned text.
>> This is *very* important to me since I am able to do full text
>> search on the content of my archived documents.
>>
>
> May be a bit less convenient in daily usage, but you could stick to your
> preference of keeping all your scans in PNG format by keeping the OCR
> output in a separate ASCII file:
>
>   #+begin_src sh
>     # Glob directly instead of parsing `ls', and quote the paths.
>     for i in "${HOME}"/msg/paper/inbox/*.png ; do
>         # tesseract appends `.txt' to the output base by itself
>         tesseract "${i}" "${i}"
>     done
>   #+end_src
>
> That way you can access your data even on text-only machines,
> and full-text search is only a `grep' away.
>
>> And I plan to archive *all* of my documents. Really all of them.
>>
>
> Then you'll probably be interested in Joey Hess' git-annex [5] to keep
> your archive versioned and in sync across all your devices.
>
>> Storage space does not matter (any more) to me since I have more
>> disk space now already than I could possibly fill with my lifetime
>> paper correspondence. And I do think that my disk space continues to
>> grow in future.
>>
>
> I'd argue it still does, otherwise you'd be keeping your scans in
> TIFF format.
> And digitized trees surely aren't the only type of
> correspondence you are (or will be) archiving.
>
> Efficiency should always play a major role IMO, even if the available
> resources are (perceived to be) infinite. Having a hangar instead of a
> garage doesn't warrant driving a schoolbus to work, even if it doesn't
> guzzle a drop of gas.
>
>> [...]
>>
>> Funny side fact: grayscale scan document settings produce slightly
>> larger files than colored ones.
>>
>
> That's odd. Probably depends on which type of compression is used.
>
>> > gscan2pdf also supports a number of OCR utils, but the UI for this is
>> > clumsy (aren't they all...), so you're better off using the CLI tools
>> > directly. Tesseract is recommended.
>>
>> I played around with ocropus, tesseract, ocroscript, hocr2pdf,
>> exactimage, ppa:gezakovacs/pdfocr, ... to generate those sandwich
>> PDF documents (OCR text above the scanned images) on GNU/Linux.
>> Unfortunately none of those (very cool projects) produced reliable
>> results on my side. The results vary from «no error but overlay font
>> size is incorrect and produces loss of layout» to «library error
>> messages I can not read or handle».
>>
>> Whereas the HP OfficeJet bundles its OS X software with OCR from
>> Readiris which produces perfect results even in different languages
>> and using a usable user interface.
>>
>
> Sadly, I can only agree with this. Google's involvement in Tesseract
> and OCRopus does instill hope though :)
>
>> > NOTE: When attempting something like this, a fast scanner with a *reliable*
>> > automatic document feeder will help prevent premature hair loss ;)
>>
>> I have found several scanner products I was interested in:
>>
>> "Canon imageFORMULA P-150": very small form factor with basic Linux
>> support. Price tag starts with € 260. Neat form factor and very
>> portable. Different version "P-150m" for Mac OS X.
>>
>> The authors of [3] use Fujitsu ScanSnap starting at € 400.
>>
>> I ended up with the Office Jet Pro (mentioned above) at € 250
>> because I got flatbed scanner *and* ADF-scanner *and* a
>> full-duplex/full-color network printer with a very good
>> price-per-printed-page-ratio (better than many laser printers!). And
>> all of this with a cheaper price tag than any scan-only-product I
>> was interested in.
>>
>> So far I am almost satisfied. «Almost»? Well, HP did a good job with
>> this printer but they made only a 90% solution on almost all levels.
>> Whereas 100% would be possible with small additional effort when
>> creating the printer. But those resulting 90% are pretty usable.
>>
>> 3. http://qr.cx/sAHU
>> --
>> Karl Voit
>>
>>
>
>
> Peace

--
Johnny
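
For reference, the "full-text search is only a `grep' away" idea quoted
above could look roughly like the sketch below. It is only an illustration
under assumptions: the function name search_scans is made up, and it
assumes every scan NAME.png has its OCR text in a sidecar file called
NAME.png.txt somewhere under the given directory. The thread does not
fix that layout, so adjust the suffix handling to whatever your OCR run
actually writes.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print the scans whose OCR
# sidecar text contains a pattern.
# Assumed layout: NAME.png sits next to its OCR output NAME.png.txt.
search_scans() {
    pattern=$1   # text to look for (matched case-insensitively)
    dir=$2       # directory holding the scans and their sidecar files

    # grep -l prints only the names of matching files, -i ignores case;
    # sed then strips the trailing `.txt' to get back to the image name.
    find "$dir" -type f -name '*.png.txt' -exec grep -il -e "$pattern" {} + |
        sort |
        sed 's/\.txt$//'
}
```

Called as `search_scans invoice "$HOME/msg/paper/inbox"`, it prints one
matching image path per line, which can then be handed to an image viewer.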