On Thu, 30 Jun 2016 11:22:00 -0400 Shawn Milochik <shawn.m...@gmail.com> wrote:
> I don't know of a Go solution, but if you are on Linux you could try
> pdftotext and parse the text. With the obvious caveat of "it depends
> on how the PDF was encoded."

I'm using this approach in one of my applications.

The only problem with pdftotext is that its command-line interface is limited: you can't tell it to read PDF data from stdin and write the results to stdout at the same time, so you have to resort to temporary files.

The approach is to first play with `pdftotext` manually on a sample data set, trying out its "-raw" and "-layout" options to see which produces the most sensible results, and then go with that one.

I should note that I needed to extract only a few strings, not tabular data, so the OP might get better results with `pdftohtml` from the same package: it can produce XML output, which can be parsed with the encoding/xml package.

> Worst-case you may be able to use
> tesseract OCR to generate text and then do the same thing.
>
> https://packages.debian.org/sid/poppler-utils
> https://packages.debian.org/wheezy/tesseract-ocr
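For reference, the temporary-file approach can be sketched in Go roughly as follows. This is only a minimal sketch, not the code from my application: it assumes `pdftotext` is on the PATH and Go 1.16 or later (for os.MkdirTemp and friends), and the names `pdftotextArgs` and `extractText` as well as the temporary file names are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// pdftotextArgs builds the argument list for one pdftotext run.
// mode is "-raw", "-layout", or "" for the tool's default behaviour.
// (Hypothetical helper, introduced only for this sketch.)
func pdftotextArgs(mode, inPath, outPath string) []string {
	var args []string
	if mode != "" {
		args = append(args, mode)
	}
	return append(args, inPath, outPath)
}

// extractText writes the PDF bytes to a temporary file, runs pdftotext
// on it, and returns the contents of the temporary output file.
func extractText(pdf []byte, mode string) (string, error) {
	dir, err := os.MkdirTemp("", "pdftotext-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir) // removes both temporary files

	inPath := filepath.Join(dir, "in.pdf")
	outPath := filepath.Join(dir, "out.txt")
	if err := os.WriteFile(inPath, pdf, 0o600); err != nil {
		return "", err
	}

	cmd := exec.Command("pdftotext", pdftotextArgs(mode, inPath, outPath)...)
	if msg, err := cmd.CombinedOutput(); err != nil {
		return "", fmt.Errorf("pdftotext: %w: %s", err, msg)
	}

	text, err := os.ReadFile(outPath)
	if err != nil {
		return "", err
	}
	return string(text), nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: extract FILE.pdf [-raw|-layout]")
		return
	}
	pdf, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	mode := "-layout" // assumption: pick whichever mode worked best on your samples
	if len(os.Args) > 2 {
		mode = os.Args[2]
	}
	text, err := extractText(pdf, mode)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

Running it once with "-raw" and once with "-layout" on the same sample file is a quick way to compare the two outputs before settling on one; the returned string can then be searched with the strings or regexp packages.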