Using pdftohtml and then running regexes or a parser on its output
seems to be the easiest solution as of now. I came across tabula-java,
which also seems interesting. Thank you everyone for the
recommendations. I still haven't got multiple tables on a single page,
or tables overflowing across pages, working correctly. But no native
Go FOSS library seems to exist as of now.
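
For anyone trying the same route, here is a rough sketch of the "parser on
top" part. It assumes the input comes from `pdftohtml -xml`, where each
text fragment is a <text> element with integer "top" and "left"
attributes inside a <page> element. The type names, the Rows helper, the
tolerance value and the sample data are all made up for illustration.
Fragments wrapped in inline tags such as <b> would need extra handling
that is not shown here.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"sort"
	"strings"
)

// Text mirrors one <text> element (assumed pdftohtml -xml layout).
type Text struct {
	Top   int    `xml:"top,attr"`
	Left  int    `xml:"left,attr"`
	Value string `xml:",chardata"` // plain character data only
}

// Page mirrors one <page> element.
type Page struct {
	Number int    `xml:"number,attr"`
	Texts  []Text `xml:"text"`
}

// Doc is the document root; the root element name is not checked.
type Doc struct {
	Pages []Page `xml:"page"`
}

// Parse decodes the XML into a Doc.
func Parse(data []byte) (Doc, error) {
	var d Doc
	err := xml.Unmarshal(data, &d)
	return d, err
}

// Rows groups the fragments of one page into rows. A fragment joins the
// current row if its top is within tol of that row's first fragment;
// cells inside a row are ordered left to right.
func Rows(p Page, tol int) [][]string {
	texts := append([]Text(nil), p.Texts...) // copy, do not reorder the input
	sort.SliceStable(texts, func(i, j int) bool { return texts[i].Top < texts[j].Top })

	var rows [][]Text
	for _, t := range texts {
		n := len(rows)
		if n > 0 && t.Top-rows[n-1][0].Top <= tol {
			rows[n-1] = append(rows[n-1], t)
			continue
		}
		rows = append(rows, []Text{t})
	}

	out := make([][]string, 0, len(rows))
	for _, r := range rows {
		sort.SliceStable(r, func(i, j int) bool { return r[i].Left < r[j].Left })
		cells := make([]string, len(r))
		for i, t := range r {
			cells[i] = strings.TrimSpace(t.Value)
		}
		out = append(out, cells)
	}
	return out
}

// sample is hand-written test data, not real pdftohtml output.
const sample = `<pdf2xml>
<page number="1" height="1188" width="918">
<text top="100" left="300" width="40" height="12" font="0">Qty</text>
<text top="100" left="50" width="60" height="12" font="0">Item</text>
<text top="121" left="50" width="60" height="12" font="0">Widget</text>
<text top="120" left="300" width="40" height="12" font="0">3</text>
</page>
</pdf2xml>`

func main() {
	doc, err := Parse([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, p := range doc.Pages {
		for _, row := range Rows(p, 2) {
			fmt.Println(strings.Join(row, " | "))
		}
	}
}
```

Grouping by "top" alone does not separate two tables on the same page or
stitch a table that continues on the next page, which are exactly the
cases I have not solved yet.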

2016-06-30 21:34 GMT+05:30 Konstantin Khomoutov
<flatw...@users.sourceforge.net>:
> On Thu, 30 Jun 2016 11:22:00 -0400
> Shawn Milochik <shawn.m...@gmail.com> wrote:
>
>>  I don't know of a Go solution, but if you are on Linux you could try
>> pdftotext and parse the text. With the obvious caveat of "it depends
>> on how the PDF was encoded."
>
> I'm using this approach in one of my applications.
> The only problem with pdftotext is that its CLI interface is dumb in
> that you can't tell it to read PDF data from stdin and output the
> results to stdout at the same time, so you have to resort to using
> temporary files.
>
> The approach is to first manually play with `pdftotext` on the sample
> data set -- trying out its "-raw" and "-layout" options to see which
> produces the most sensible results and then go with it.
>
> I should note that I needed to extract a few strings, not tabular data,
> so the OP might have better results with `pdftohtml` from the same
> package, which is able to produce XML output which can be parsed by
> means of the encoding/xml package.
>
>> Worst-case you may be able to use
>> tesseract OCR to generate text and then do the same thing.
>>
>> https://packages.debian.org/sid/poppler-utils
>> https://packages.debian.org/wheezy/tesseract-ocr
>
> --
> You received this message because you are subscribed to a topic in the Google 
> Groups "golang-nuts" group.
> To unsubscribe from this topic, visit 
> https://groups.google.com/d/topic/golang-nuts/8NisCMXjQIw/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to 
> golang-nuts+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
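
The temporary-file approach to pdftotext described in the quoted message
could look roughly like the sketch below. It assumes pdftotext from
poppler-utils is on PATH and is called as `pdftotext [mode] in.pdf
out.txt`, with mode being "-raw" or "-layout" as mentioned above. The
function names ExtractText and pdftotextArgs are made up for this
example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// pdftotextArgs builds the argument list for one pdftotext run.
// mode is "-raw" or "-layout"; an empty mode leaves the tool's default.
func pdftotextArgs(mode, in, out string) []string {
	var args []string
	if mode != "" {
		args = append(args, mode)
	}
	return append(args, in, out)
}

// ExtractText writes the PDF bytes to a temporary file, runs pdftotext
// on it with a temporary output file, and returns the extracted text.
func ExtractText(pdf []byte, mode string) (string, error) {
	dir, err := os.MkdirTemp("", "pdftotext")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir) // removes both temporary files

	in := filepath.Join(dir, "in.pdf")
	out := filepath.Join(dir, "out.txt")
	if err := os.WriteFile(in, pdf, 0o600); err != nil {
		return "", err
	}

	cmd := exec.Command("pdftotext", pdftotextArgs(mode, in, out)...)
	if msg, err := cmd.CombinedOutput(); err != nil {
		return "", fmt.Errorf("pdftotext: %v: %s", err, msg)
	}

	text, err := os.ReadFile(out)
	return string(text), err
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: extract FILE.pdf")
		return
	}
	pdf, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read:", err)
		return
	}
	text, err := ExtractText(pdf, "-layout")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Print(text)
}
```

Trying both "-raw" and "-layout" on a few sample files, as suggested
above, is just a matter of changing the mode argument.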



-- 
Sankar P
http://psankar.blogspot.com

