RE: Reading & Extracting Data From PDF Files

Paul H. Tarver Mon, 01 May 2017 07:57:28 -0700

Darren,

I've generally been successful in avoiding PDFs and getting the same
information in a more "standardized" format such as text, CSV, Excel, etc.
However, I've always wondered whether my preference was based on my
unwillingness to devote the time required to understand PDF formats, or was
just a case of taking the path of least resistance. After all, why spend 100
hours figuring out a PDF version of a report when you can spend 5 hours
building an extract from a text version of the same report?

In the thread related to counting the number of lines in a file, there were
at least two references made to getting the number of lines in a PDF file,
which made me think perhaps someone had figured out how to work with a PDF
in the same sort of way we would work with an XML file. Generally, when I
work with XML, I import all the lines, use the tags as triggers and flags,
then strip the tags and markup, leaving all the good stuff behind in tables I
created in the parsing process. It was my hope someone had magically figured
out how to do something similar with PDFs, but clearly I'm not the only one
avoiding PDFs whenever possible.
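
As a rough illustration of the tag-driven XML approach described above, here
is a minimal Python sketch. The sample data, the tag names (row, name, qty)
and the function name parse_rows are all hypothetical and not from any real
project; it also assumes one tag per line, which real XML does not guarantee.

```python
import re

# Hypothetical sample data; the tag names are assumptions for illustration.
SAMPLE = """<report>
<row>
<name>Widget</name>
<qty>4</qty>
</row>
<row>
<name>Gadget</name>
<qty>7</qty>
</row>
</report>"""

# Matches a single-line element such as <name>Widget</name>.
TAG_LINE = re.compile(r"^<(\w+)>(.*)</\1>$")


def parse_rows(text, row_tag="row"):
    """Collect one dict per <row_tag> block, using tags as triggers."""
    rows, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if line == f"<{row_tag}>":        # opening tag: start a new record
            current = {}
        elif line == f"</{row_tag}>":     # closing tag: store the record
            if current is not None:
                rows.append(current)
            current = None
        elif current is not None:
            match = TAG_LINE.match(line)
            if match:                     # strip the markup, keep the value
                current[match.group(1)] = match.group(2)
    return rows


rows = parse_rows(SAMPLE)
```

The point of the sketch is that XML tags name the data they wrap, so they can
act as reliable flags; a PDF content stream offers no comparable labels.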

I think it is enough for me to know that it might be possible, given enough
time and enough money, to extract data from a PDF file, but I'll wait for the
day when I have NO OTHER OPTION on a project to justify spending the time.

Thanks! 

Paul H. Tarver
Tarver Program Consultants, Inc.

-----Original Message-----
From: Darren [mailto:[email protected]] 
Sent: Saturday, April 29, 2017 4:37 PM
To: [email protected]
Subject: RE: Reading & Extracting Data From PDF Files

It's all that font data and positional information that causes the grief. I
have spent a fair bit of time on PDFs with varying degrees of success.
It comes down to how they were created in the first place. I have always found
that I can consistently retrieve data from PDFs created by any given source,
but the rules that work for that source's PDFs do not transfer well to PDFs
from other sources.

There are a lot of PDFTOTEXT type tools out there. Those that I have
explored mostly do a "pretty good job" of extracting text but none is
perfect and all require varying degrees of manipulation post conversion to
extract the data.
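
To give a flavour of the post-conversion manipulation mentioned above, here
is a minimal Python sketch. It assumes the converter was asked to preserve
layout (for example, pdftotext has a -layout option), so that report columns
end up separated by runs of two or more spaces. The sample line and the
function name split_columns are hypothetical.

```python
import re


def split_columns(line):
    """Split a layout-preserved text line on runs of 2+ whitespace chars.

    Assumption: single spaces occur only inside a field, never between fields.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


# Hypothetical line as it might appear in converted report text.
cells = split_columns("Widget A     4    12.50")
```

This heuristic fails as soon as two columns are separated by a single space,
which is one reason the rules for one source rarely carry over to another.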

None have proven consistent in extracting the text and retaining layout
(insofar as layout can be retained between different fonts).

All that said I still think it should be possible to take the positional
data and font metrics and resolve that such that text is extracted and
layout retained. Not a simple exercise but for an entity focused on the task
achievable (I would have thought).
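
A minimal sketch of that idea in Python, under strong assumptions: the text
fragments have already been pulled out of the PDF as (x, y, text) tuples, y
has been flipped so that it grows down the page, and a single average
character width stands in for real font metrics. The names render_layout,
char_width and line_tol are hypothetical.

```python
def render_layout(fragments, char_width=6.0, line_tol=2.0):
    """Place (x, y, text) fragments on a monospaced character grid.

    char_width: assumed average glyph width in page units.
    line_tol:   fragments whose y differs by at most this much from the
                first fragment of a line are treated as the same line.
    """
    lines = []  # each entry: [y of first fragment, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        if lines and abs(y - lines[-1][0]) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append([y, [(x, text)]])

    rendered = []
    for _, items in lines:
        row = ""
        for x, text in sorted(items):
            col = int(round(x / char_width))
            if row and len(row) >= col:
                row += " "            # overlap: keep fragments separated
            else:
                row = row.ljust(col)  # pad with spaces up to target column
            row += text
        rendered.append(row)
    return "\n".join(rendered)


# Hypothetical fragments, deliberately out of order.
page = [(60, 10, "Qty"), (0, 22, "Widget"), (0, 10, "Name"), (60, 22.5, "4")]
text = render_layout(page)
```

With proportional fonts the single char_width is only an approximation, which
is where the real per-glyph font metrics would have to come in.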

In the end though, if you can avoid them, that is the best approach.

-----Original Message-----
From: ProfoxTech [mailto:[email protected]] On Behalf Of Ted
Roche
Sent: Sunday, 30 April 2017 6:39 AM
To: [email protected]
Subject: Re: Reading & Extracting Data From PDF Files

In my limited experience, PDF is trouble. PDF is essentially a printer output
file, derived from PostScript, encapsulated as a document. There's lots of
info about fonts and geometry, and there are letters placed in specific
places, not necessarily in the order you think, depending on the application
that generated the print image and the printer drivers used. It doesn't help
that there are lots of different PDF standards ("I love standards, that's
why I have so many!") and extensions to do things like provide
accessibility or to slim it down for web and visual presentation (vs.
high-resolution for print pre-press).

Recently, I was trying to copy one of my articles out of a PDF, and I found
that the two snaked columns that appeared to you and me meant nothing to the
PDF. Highlighting the text got line one of column one, then line one of
column two, all the way down. Pretty frustrating.
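
A minimal Python sketch of how the reading order could be restored for a
simple two-column page, assuming the text fragments are available as
(x, y, text) tuples with y growing down the page and that the x position
separating the columns is already known. The function name reading_order and
the column_split parameter are hypothetical.

```python
def reading_order(fragments, column_split):
    """Return fragment texts as left column top-down, then right column.

    column_split: x coordinate between the two columns (assumed known;
    detecting it automatically is a separate problem).
    """
    left = sorted((f for f in fragments if f[0] < column_split),
                  key=lambda f: f[1])
    right = sorted((f for f in fragments if f[0] >= column_split),
                   key=lambda f: f[1])
    return [text for _, _, text in left + right]


# Hypothetical fragments in the interleaved order a naive selection returns.
interleaved = [(50, 100, "L1"), (300, 100, "R1"),
               (50, 112, "L2"), (300, 112, "R2")]
ordered = reading_order(interleaved, column_split=200)
```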

There are some smart applications out there. Monarch was advertised years
ago

Here's an SO question, with some possibilities:

On Fri, Apr 28, 2017 at 11:19 AM, Paul H. Tarver <[email protected]> wrote:
> Original Thread: Getting count of rows in a text file -- best approach?
>
> A couple of times I've heard people mention reading in PDF files using 
> FileToStr and I want to know more about reading and extracting data 
> from PDF files. I do a lot of data conversion and interface work with 
> lots of file formats, but I've not been very successful at importing 
> and extracting data from PDF reports. Obviously a scanned image saved 
> as a PDF would have to be OCR'd first, but is there a reliable way 
> to extract data from PDF reports and if so, how? I'm sure I don't know 
> all the ins and outs of the PDF format, but when I try, I seem to get 
> a strange mix of formatting details and data combined in a random way.
>
> Am I being thick here or is there really a way that I can get any PDF 
> file from any client and then successfully extract the data elements 
> from that format?
>
> I'm prepared to be thought of as stupid but be gentle! :)
>
> Paul H. Tarver
> Tarver Program Consultants, Inc.
> Email: [email protected]
>
>
>
> -----Original Message-----
> From: Brant E. Layton [mailto:[email protected]]
> Sent: Wednesday, April 26, 2017 3:17 PM
> To: [email protected]
> Subject: RE: Getting count of rows in a text file -- best approach?
>
> My experience was moving PDF files in and out of SQL Server tables -
> found an abrupt truncation at the 16,777,184 mark...
>
> Brant Layton
> 480.964.1316
> On 4/26/2017 12:57 PM, [email protected] wrote:
>> RE: Getting count of rows in a text file -- best approach?
>
>
>

