Re: Scanning docs for bitsavers

Paul Koning via cctalk Tue, 03 Dec 2019 09:47:55 -0800

> On Dec 2, 2019, at 11:12 PM, Grant Taylor via cctalk <cctalk@classiccmp.org> 
> wrote:
> 
> On 12/2/19 9:06 PM, Grant Taylor via cctalk wrote:
>> In my opinion, PDFs are the last place that computer usable data goes. 
>> Because getting anything out of a PDF as a data source is next to impossible.
>> Sure, you, a human, can read it and consume the data.
>> Try importing a simple table from a PDF and working with the data in 
>> something like a spreadsheet.  You can't do it.  The raw data is there.  But 
>> you can't readily use it.
>> This is why I say that a PDF is the end of the line for data.
>> I view it as effectively impossible to take data out of a PDF and do 
>> anything with it without first needing to reconstitute it before I can use 
>> it.
> 
> I'll add this:
> 
> PDF is a decent page layout format.  But trying to view the contents in any 
> different layout is problematic (at best).
> 
> Trying to use the result of a page layout as a data source is ... problematic.

That's hardly surprising.  These properties are precisely the intent of PDF.  
It's basically a portable variant of PostScript, with some cleanups (relatively 
sane Unicode support, transparency, hyperlinks, a few other things).  Its 
specific purpose is to encode page images, just as they appear on actual paper.
Indeed, PDF is often used as a "camera ready copy" format for material going 
to a print shop.  It works quite well for that.

For scanned documents, where each page is just an image, PDF is a decent 
container format.  For documents with actual text, it's far more problematic.
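
To make the "actual text" problem concrete, here is a rough Python sketch.  
The fragment list and the rebuild_rows helper are made up for illustration; 
they are not the output or API of any particular PDF library.  The assumption 
is only that a page's text comes out as positioned pieces in drawing order, 
so rows and columns have to be guessed from coordinates.

```python
# Hypothetical fragments, roughly as a PDF text extractor might report them:
# (x, y, text), with y the baseline in points. They arrive in drawing order,
# not reading order, and nothing marks rows or columns.
fragments = [
    (72.0, 700.2, "Part"),
    (200.0, 700.0, "Qty"),
    (200.0, 685.1, "4"),
    (72.0, 685.0, "DL11"),
    (72.0, 670.0, "KW11"),
    (200.0, 669.8, "1"),
]


def rebuild_rows(frags, y_tolerance=2.0):
    """Guess table rows by grouping fragments with nearly equal baselines."""
    rows = []  # list of (baseline_y, [(x, text), ...])
    # Walk from the top of the page down (larger y is higher on the page).
    for x, y, text in sorted(frags, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each guessed row, left-to-right order stands in for the columns.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


table = rebuild_rows(fragments)
for row in table:
    print(row)
```

Even in this tidy case the result depends on a tolerance picked by hand.  
Wrapped cells, merged columns, or rotated text would all defeat it, which is 
the reconstitution work Grant was describing.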

Using PDF as an intermediate form is every bit as inappropriate as using JPEG 
for line art or any other application where artefacts are impermissible.  The 
trouble (for both of these) is that many of the users don't know the 
limitations and blindly use the wrong tools.

        paul