Re: [CODE4LIB] scraping or extracting structured data from a pdf

Miles Fidelman Thu, 12 May 2022 11:56:44 -0700

Danielle Reay wrote:

Hello,


We have a faculty member looking to create a dataset from an annotated
bibliography she compiled. Right now it exists as a Word file and as a pdf.
The entries are relatively structured with a citation and an abstract, but
the document is about 150 pages long with multiple entries per page. Rather
than manually copy and paste everything to create the spreadsheet/csv, I
wanted to ask for suggestions or approaches to doing this by either
scraping or extracting structured data from the pdf. Thanks very much in
advance!

I'd sure like to find a tool for this as well. Though, in my case, the purpose would be to extract numbered requirements from RFPs.

There seems to be a distinct dearth of text analysis tools that can actually do structural analysis, based on numbering.
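
For what it's worth, here is a rough sketch of what numbering-based
extraction might look like. It is not an existing tool. It assumes the
PDF has already been converted to plain text by some other means, that
each requirement starts on its own line with a dotted number such as
"3.1.2", and that unnumbered lines continue the previous item. The names
NUMBERED and extract_numbered are made up for illustration.

```python
import re
from typing import List, Tuple

# Assumption: an item starts with a dotted number ("3", "3.1", "3.1.2"),
# optionally followed by "." or ")", then whitespace and the item text.
NUMBERED = re.compile(r"^\s*(\d+(?:\.\d+)*)[.)]?\s+(\S.*)$")


def extract_numbered(text: str) -> List[Tuple[str, int, str]]:
    """Return (number, depth, body) for each numbered item found in text."""
    items: list = []
    for line in text.splitlines():
        match = NUMBERED.match(line)
        if match:
            number = match.group(1)
            body = match.group(2).strip()
            # Depth is the count of dotted components: "3.1.2" has depth 3.
            items.append([number, number.count(".") + 1, body])
        elif items and line.strip():
            # Non-blank, unnumbered line: treat it as a continuation of the
            # most recent item (an assumption about the layout).
            items[-1][2] += " " + line.strip()
    return [tuple(item) for item in items]


if __name__ == "__main__":
    sample = (
        "1. Scope\n"
        "1.1 The system shall log events\n"
        "    to a file.\n"
        "1.2) The system shall encrypt data.\n"
    )
    for number, depth, body in extract_numbered(sample):
        print("  " * (depth - 1) + number + ": " + body)
```

Any line that happens to begin with a number (a year, a quantity) would
be picked up as an item, so real RFPs would likely need extra checks,
such as requiring each number to follow on from the previous one.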


Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.  .... Yogi Berra

Theory is when you know everything but nothing works.
Practice is when everything works but no one knows why.
In our lab, theory and practice are combined:
nothing works and no one knows why.  ... unknown
