Adobe Acrobat Pro should convert any PDF file (text or image) into a format recognized by a text-editing program. --Chun
From: CCP4 bulletin board [mailto:ccp...@jiscmail.ac.uk] On Behalf Of James Stroud
Sent: Wednesday, November 17, 2010 12:36 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] [RANT] Publication Data Formats

On Nov 17, 2010, at 11:46 AM, Bosch, Juergen wrote:

> On Nov 17, 2010, at 2:30 PM, James Holton wrote:
>
>> I once burned up an entire week trying to extract author, title, journal, etc. from a pile of 300 "sdarticle.pdf" files. It is NOT easy!
>
> drag & drop, it is that easy :-) And you can even export it into text afterwards.
>
> James, you should have invested into this program:

The program you suggest might be able to do author, title, and journal for many of the articles, but would likely bonk terribly on the "etc." part.

Anyone with a certain level of programming ability can dredge through PDF files. That's not the point. The point is that computers are unlike people in that they can not yet decode semantics in the absence of structured context. Humans are good at this task although computers are not.

For example, you understand the meaning of this paragraph, but a computer would just see a bunch of words...

    <clause type="conditional" language="English">
      <condition>unless</condition>
      <subject>I</subject>
      <predicate>followed</predicate>
      <qualification>
        <adjective>certain</adjective>
        <adjective>prescribed</adjective>
        <object type="indirect">rules</object>
      </qualification>
    </clause>

(And still the computer would have a lot of trouble identifying exactly who "I" was in that clause because the structure does not extend to broader context--I have conveniently not added the 'author' attribute to the <clause>.)

Unlike the above XML*, the PDF file format is not required to give any indication of which of its bytes represent a certain type of data.

James

*Again, not advocating for XML or any other specific structured data format.
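
To make the structured-context point concrete, here is a minimal sketch (not part of the original message) of what a program can recover from the XML clause above using Python's standard `xml.etree.ElementTree` module. The function name `describe_clause` and the dictionary keys are hypothetical, chosen only for illustration. Note that the missing 'author' attribute comes back as `None`, which is exactly the gap in broader context described above.

```python
import xml.etree.ElementTree as ET

# The clause from the message, reproduced verbatim as structured input.
CLAUSE = """
<clause type="conditional" language="English">
  <condition>unless</condition>
  <subject>I</subject>
  <predicate>followed</predicate>
  <qualification>
    <adjective>certain</adjective>
    <adjective>prescribed</adjective>
    <object type="indirect">rules</object>
  </qualification>
</clause>
"""


def describe_clause(xml_text):
    """Return the labelled parts of a <clause> element (hypothetical helper)."""
    root = ET.fromstring(xml_text.strip())
    return {
        "type": root.get("type"),
        "language": root.get("language"),
        # No 'author' attribute was supplied, so this is None.
        "author": root.get("author"),
        "condition": root.findtext("condition"),
        "subject": root.findtext("subject"),
        "predicate": root.findtext("predicate"),
        "adjectives": [a.text for a in root.findall("qualification/adjective")],
        "object": root.findtext("qualification/object"),
    }


info = describe_clause(CLAUSE)

# The same words without markup: nothing says which word plays which role.
flat_words = "unless I followed certain prescribed rules".split()

print(info)
print(flat_words)
```

With the markup, each word arrives with its role attached; with the flat word list, a program would have to guess the roles, which is the situation with text dredged out of a PDF.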