Re: [ccp4bb] [RANT] Publication Data Formats

Chun Luo Wed, 17 Nov 2010 13:01:39 -0800

Adobe Acrobat Pro should convert any PDF file (text or image) into a format
recognized by a text-editing program. --Chun

From: CCP4 bulletin board [mailto:ccp...@jiscmail.ac.uk] On Behalf Of James
Stroud
Sent: Wednesday, November 17, 2010 12:36 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] [RANT] Publication Data Formats

On Nov 17, 2010, at 11:46 AM, Bosch, Juergen wrote:

On Nov 17, 2010, at 2:30 PM, James Holton wrote:

I once burned up an entire week trying to extract author, title, journal,
etc. from a pile of 300 "sdarticle.pdf" files.  It is NOT easy!

drag & drop, it is that easy :-) And you can even export it into text
afterwards.

James, 

you should have invested in this program:

The program you suggest might be able to do author, title, and journal for
many of the articles, but would likely bonk terribly on the "etc." part.

Anyone with a certain level of programming ability can dredge through PDF
files. That's not the point. The point is that computers, unlike people,
cannot yet decode semantics in the absence of structured context. Humans
are good at this task; computers are not. For example, you understand the
meaning of this paragraph, but a computer would just see a bunch of
words...

   <clause type="conditional" language="English">
       <condition>unless</condition>
       <subject>I</subject>
       <predicate>followed</predicate>
       <qualification>
           <adjective>certain</adjective>
           <adjective>prescribed</adjective>
           <object type="indirect">rules</object>
       </qualification>
   </clause>

(And still the computer would have a lot of trouble identifying exactly who
"I" was in that clause because the structure does not extend to broader
context--I have conveniently not added the 'author' attribute to the
<clause>.)

Unlike the above XML*, the PDF file format is not required to give any
indication of which of its bytes represent a certain type of data.
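
To make the contrast concrete, here is a minimal sketch of how little code it
takes to pull labelled fields out of the clause above once the structure is
there. It uses only Python's standard xml.etree.ElementTree module; the names
CLAUSE_XML and describe_clause are made up for this illustration, and the
choice of Python is an assumption, not something the thread settled on.

```python
import xml.etree.ElementTree as ET

# The clause from the message above, embedded as a string for illustration.
CLAUSE_XML = """
<clause type="conditional" language="English">
    <condition>unless</condition>
    <subject>I</subject>
    <predicate>followed</predicate>
    <qualification>
        <adjective>certain</adjective>
        <adjective>prescribed</adjective>
        <object type="indirect">rules</object>
    </qualification>
</clause>
"""


def describe_clause(xml_text):
    """Return the labelled parts of a <clause> element as a dict."""
    root = ET.fromstring(xml_text.strip())
    return {
        "type": root.get("type"),
        "language": root.get("language"),
        # No 'author' attribute was supplied, so this comes back as None:
        # the parser still cannot tell who "I" is.
        "author": root.get("author"),
        "subject": root.findtext("subject"),
        "predicate": root.findtext("predicate"),
        "adjectives": [a.text for a in root.iter("adjective")],
        "object": root.findtext("qualification/object"),
    }


if __name__ == "__main__":
    for key, value in describe_clause(CLAUSE_XML).items():
        print(key, "=", value)
```

Nothing comparable is guaranteed to work on a PDF: a text extractor may hand
back the same words, but without the labels saying which word is the subject
and which is the object.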

James

*Again, not advocating for XML or any other specific structured data format.
