Using pdftohtml and then running regexes or a parser on top of its output seems to be the easiest solution for now. I also came across tabula-java, which looks interesting. Thank you, everyone, for the recommendations. I still haven't got multiple tables on a single page, or tables that overflow across pages, working correctly. No native Go FOSS library seems to exist as of now.

2016-06-30 21:34 GMT+05:30 Konstantin Khomoutov <flatw...@users.sourceforge.net>:
> On Thu, 30 Jun 2016 11:22:00 -0400
> Shawn Milochik <shawn.m...@gmail.com> wrote:
>
>> I don't know of a Go solution, but if you are on Linux you could try
>> pdftotext and parse the text. With the obvious caveat of "it depends
>> on how the PDF was encoded."
>
> I'm using this approach in one of my applications.
> The only problem with pdftotext is that its CLI interface is dumb in
> that you can't tell it to read PDF data from stdin and output the
> results to stdout at the same time, so you have to resort to using
> temporary files.
>
> The approach is to first manually play with `pdftotext` on the sample
> data set -- trying out its "-raw" and "-layout" options to see which
> produces the most sensible results and then go with it.
>
> I should note that I needed to extract few strings, not tabular data,
> so the OP might have better results with `pdftohtml` from the same
> package, which is able to produce XML output which can be parsed by
> means of the encoding/xml package.
>
>> Worst-case you may be able to use
>> tesseract OCR to generate text and then do the same thing.
>>
>> https://packages.debian.org/sid/poppler-utils
>> https://packages.debian.org/wheezy/tesseract-ocr

--
Sankar P
http://psankar.blogspot.com

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.