Let’s try this again without my hitting ‘send’ when I want to send it to 
drafts.  (Yay, mystery meat navigation in cell phone interfaces.)

>> On May 12, 2022, at 2:40 PM, Danielle Reay <dr...@drew.edu> wrote:
>> 
>> Hello,
>> 
>> We have a faculty member looking to create a dataset from an annotated
>> bibliography she compiled. Right now it exists as a word file and as a pdf.
>> The entries are relatively structured with a citation and an abstract, but
>> the document is about 150 pages long with multiple entries per page. Rather
>> than manually copy and paste everything to create the spreadsheet/csv, I
>> wanted to ask for suggestions or approaches to doing this by either
>> scraping or extracting structured data from the pdf. Thanks very much in
>> advance!

I haven’t had to do this in years, but I used to do it quite a bit.  (Including 
trying to extract information from our school’s course catalog and build 
cross-linked websites from it.)

First, start with whatever you have that’s the lowest level… in this case, the 
Word document.  From that, see if there’s any semantic content (did they use 
character or paragraph styles for formatting, or is it just ‘bold’ and 
‘italic’?)

If there’s semantic markup, then you might want to use something to extract 
data straight from those files… but that’s incredibly rare.
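
A quick way to check without clicking through every style menu: a .docx file 
is just a zip archive, and any paragraph styles in use show up as w:pStyle 
elements in word/document.xml.  A rough sketch in Perl (assuming Archive::Zip 
from CPAN; the filename is made up):

    use strict;
    use warnings;
    use Archive::Zip qw(:ERROR_CODES);

    # A .docx is a zip archive; the styles actually applied to
    # paragraphs appear as w:pStyle elements in word/document.xml.
    my $zip = Archive::Zip->new();
    $zip->read('bibliography.docx') == AZ_OK or die "can't read docx";
    my $xml = $zip->contents('word/document.xml');

    # Tally the paragraph styles in use.  If nothing (or only
    # "Normal") turns up, you're in plain bold/italic territory.
    my %styles;
    $styles{$1}++ while $xml =~ /<w:pStyle w:val="([^"]+)"/g;
    printf "%5d  %s\n", $styles{$_}, $_
        for sort { $styles{$b} <=> $styles{$a} } keys %styles;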

So instead, export to something that has just enough formatting to not lose 
information, but that lots of parsers exist for.  I tend to like HTML or RTF 
(Rich Text Format).
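
For example, if each entry exports as its own <p> with the title in italics 
(you’d have to check what the export actually produces; this is just a sketch 
using HTML::TreeBuilder from CPAN), the formatting alone gets you part of the 
way:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    # Assumes one entry per <p> with the title in <i>; adjust to
    # whatever the export actually produces.
    my $tree = HTML::TreeBuilder->new_from_file('bibliography.html');
    for my $p ($tree->look_down(_tag => 'p')) {
        my ($i) = $p->look_down(_tag => 'i');
        printf "%s\t%s\n", ($i ? $i->as_text : ''), $p->as_text;
    }
    $tree->delete;    # TreeBuilder trees want explicit cleanup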

Depending on exactly what’s in the file, you might even be able to export to 
just plain text and not lose too much.

From there, I tend to go through cycles of parsing and cleanup.  There are lots 
of parsers for bibliographic data out there these days, but it’s amazing what 
sorts of errors you end up with in manually maintained files.  (I’ve even given 
a talk or two about it.)

Basically, run the parser, then figure out what it missed.  I tend to write 
stuff that either does in-line replacement (regex-type stuff) or removes items 
as it finds them and moves them to a new file.
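
The in-line version is usually just perl -pi -e 's/foo/bar/' style one-liners.  
The move-to-a-new-file version looks roughly like this sketch; it assumes a 
plain-text export with blank lines between entries, and the entry pattern is a 
placeholder you’d tune to the real file:

    use strict;
    use warnings;
    use Text::CSV;

    local $/ = "";    # paragraph mode: read one blank-line-separated entry

    my $csv = Text::CSV->new({ binary => 1, eol => "\n" });
    open my $out,  '>', 'parsed.csv'   or die $!;
    open my $rest, '>', 'leftover.txt' or die $!;
    $csv->print($out, [ 'citation', 'abstract' ]);

    while (my $entry = <>) {
        # Placeholder pattern: the citation runs through a four-digit
        # year to the end of that sentence; the rest is the abstract.
        if ($entry =~ /\A(.+?\b(?:19|20)\d\d\b.*?\.)\s*\n(.+?)\s*\z/s) {
            $csv->print($out, [ $1, $2 ]);    # parsed: moved to the CSV
        } else {
            print $rest $entry, "\n";         # missed: kept for the next pass
        }
    }

Run it, eyeball leftover.txt, fix or extend the pattern, and repeat until 
leftover.txt is empty (or small enough to finish by hand).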

Both approaches have issues: sometimes things get parsed wrong, and it’s easier 
to restore the original file and clean it up there than in the new format, 
especially if you can find the problematic patterns that your parser is having 
trouble with.

Again, I haven’t done this for years, so I don’t know what new tools are out 
there, but I used to do much of my work in Perl, as it has really good regular 
expression and string manipulation support.  I know there are some PDF 
libraries for Perl, but I’ve luckily always been able to get the original 
source content and have never needed to parse PDFs directly.

Once you extract the data, I would try to set your faculty member up with a 
bibliography management tool so you can hopefully avoid having to do this again.

-Joe



Sent from a mobile device with a crappy on screen keyboard and obnoxious 
"autocorrect"
