Re: [go-nuts] Extracting table data out of PDFs

Konstantin Khomoutov Thu, 30 Jun 2016 09:05:26 -0700

On Thu, 30 Jun 2016 11:22:00 -0400
Shawn Milochik <shawn.m...@gmail.com> wrote:


>  I don't know of a Go solution, but if you are on Linux you could try
> pdftotext and parse the text. With the obvious caveat of "it depends
> on how the PDF was encoded."

I'm using this approach in one of my applications.
The only problem with pdftotext is that its command-line interface is
limited: you can't tell it to read PDF data from stdin and write the
results to stdout at the same time, so you have to resort to using
temporary files.
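
To illustrate the temporary-file dance, here is a minimal sketch rather
than the code from my application. The function name
convertViaTempFiles and the file names in.pdf / out.txt are made up for
the example; the only thing assumed about the tool is that it is
invoked as "tool [options] input-file output-file", which is how
pdftotext is called. It uses os.MkdirTemp and friends, so it needs
Go 1.16 or later.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// convertViaTempFiles (hypothetical helper) feeds in-memory PDF data to
// a converter that only works on files: it writes the data to a
// temporary input file, runs "tool opts... in out" and reads the
// result back from the temporary output file.
func convertViaTempFiles(tool string, pdf []byte, opts ...string) (string, error) {
	// A private directory keeps both files together and makes cleanup easy.
	dir, err := os.MkdirTemp("", "pdfconv")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	inPath := filepath.Join(dir, "in.pdf")
	outPath := filepath.Join(dir, "out.txt")
	if err := os.WriteFile(inPath, pdf, 0o600); err != nil {
		return "", err
	}

	// Copy opts so the caller's slice is never modified by append.
	args := append(append([]string{}, opts...), inPath, outPath)
	if msg, err := exec.Command(tool, args...).CombinedOutput(); err != nil {
		return "", fmt.Errorf("%s: %v: %s", tool, err, msg)
	}

	text, err := os.ReadFile(outPath)
	if err != nil {
		return "", err
	}
	return string(text), nil
}
```

A call would look like convertViaTempFiles("pdftotext", data, "-layout"),
provided pdftotext is on the PATH.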

The approach is to first play with `pdftotext` manually on a sample
data set, trying out its "-raw" and "-layout" options to see which
produces the most sensible results, and then go with that one.
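
If doing this by hand gets tedious, the comparison can be scripted. The
following is only a sketch: the helper names outputName and tryModes
and the "<name>.<mode>.txt" naming scheme are my own invention for the
example. It runs pdftotext once per option on the same sample file so
the outputs can be compared side by side.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
	"strings"
)

// outputName derives a per-mode output file name (hypothetical scheme):
// the sample's extension is replaced by ".<mode>.txt".
func outputName(sample, mode string) string {
	base := strings.TrimSuffix(sample, filepath.Ext(sample))
	return base + "." + strings.TrimPrefix(mode, "-") + ".txt"
}

// tryModes runs pdftotext on the sample once with -raw and once with
// -layout, writing each result to its own text file.
func tryModes(sample string) error {
	for _, mode := range []string{"-raw", "-layout"} {
		out := outputName(sample, mode)
		msg, err := exec.Command("pdftotext", mode, sample, out).CombinedOutput()
		if err != nil {
			return fmt.Errorf("pdftotext %s: %v: %s", mode, err, msg)
		}
		fmt.Println("wrote", out)
	}
	return nil
}
```

Calling tryModes on a sample PDF leaves two text files next to it, one
per option, which can then be diffed or simply eyeballed.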

I should note that I needed to extract a few strings, not tabular data,
so the OP might have better results with `pdftohtml` from the same
package: it can produce XML output, which can then be parsed with the
encoding/xml package.
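
A rough sketch of the parsing side follows; I have not used this in
production. It assumes the XML consists of <page> elements containing
<text> elements with integer "top" and "left" attributes, which is
what I'd expect from `pdftohtml -xml`, but do check against real
output. The type names, the rows helper and the sample document are
made up for the example. Text fragments sharing the same "top" value
are treated as one table row; real PDFs may need some tolerance there.

```go
package main

import (
	"encoding/xml"
	"sort"
	"strings"
)

// sampleXML is a hand-written stand-in for converter output, not real data.
const sampleXML = `<pdf2xml>
<page number="1">
<text top="10" left="200">B1</text>
<text top="10" left="50">A1</text>
<text top="30" left="50">A2</text>
</page>
</pdf2xml>`

// document mirrors the assumed layout: a root element holding pages.
type document struct {
	Pages []page `xml:"page"`
}

type page struct {
	Number int        `xml:"number,attr"`
	Texts  []textItem `xml:"text"`
}

// textItem is one positioned text fragment. Value holds only the direct
// character data, so text wrapped in nested markup such as <b> is skipped.
type textItem struct {
	Top   int    `xml:"top,attr"`
	Left  int    `xml:"left,attr"`
	Value string `xml:",chardata"`
}

// parse decodes the XML into the structs above.
func parse(data []byte) (document, error) {
	var doc document
	err := xml.Unmarshal(data, &doc)
	return doc, err
}

// rows groups a page's fragments by identical "top" coordinate and
// orders each group by "left", giving a crude table reconstruction.
func rows(p page) [][]string {
	byTop := map[int][]textItem{}
	for _, t := range p.Texts {
		byTop[t.Top] = append(byTop[t.Top], t)
	}
	tops := make([]int, 0, len(byTop))
	for top := range byTop {
		tops = append(tops, top)
	}
	sort.Ints(tops)

	var out [][]string
	for _, top := range tops {
		items := byTop[top]
		sort.Slice(items, func(i, j int) bool { return items[i].Left < items[j].Left })
		row := make([]string, len(items))
		for i, it := range items {
			row[i] = strings.TrimSpace(it.Value)
		}
		out = append(out, row)
	}
	return out
}
```

Grouping by position rather than by document order is deliberate: the
order in which fragments appear in the file need not match their
visual order on the page.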

> Worst-case you may be able to use
> tesseract OCR to generate text and then do the same thing.
> 
> https://packages.debian.org/sid/poppler-utils
> https://packages.debian.org/wheezy/tesseract-ocr
