Re: [go-nuts] PDF to text

robert engels Thu, 23 Jan 2025 09:56:24 -0800

You typically can’t convert a PDF to text and do what you are trying to do.


Look for PDF to XML converters - you need the “blocks” and the hierarchy in 
order to interpret most PDFs with any sort of complex formatting.
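
A rough Go sketch of getting that structure out of Poppler, assuming the
pdftotext binary is installed and on PATH. The -layout flag keeps the physical
page layout, and -bbox-layout emits XHTML with block, line, and word bounding
boxes (it needs a reasonably recent Poppler). The helper name popplerArgs and
its mode strings are made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// popplerArgs builds the argument list for Poppler's pdftotext.
// Mode "layout" keeps the physical page layout; mode "bbox" asks for XHTML
// with block/line/word bounding boxes. The trailing "-" sends output to stdout.
// The function name and mode strings are illustrative, not part of Poppler.
func popplerArgs(mode, pdfPath string) ([]string, error) {
	switch mode {
	case "layout":
		return []string{"-layout", pdfPath, "-"}, nil
	case "bbox":
		return []string{"-bbox-layout", pdfPath, "-"}, nil
	default:
		return nil, fmt.Errorf("unknown mode %q (want layout or bbox)", mode)
	}
}

func main() {
	if len(os.Args) < 3 {
		fmt.Fprintln(os.Stderr, "usage: prog layout|bbox file.pdf")
		return
	}
	args, err := popplerArgs(os.Args[1], os.Args[2])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Assumes pdftotext is on PATH; Output returns the captured stdout.
	out, err := exec.Command("pdftotext", args...).Output()
	if err != nil {
		fmt.Fprintf(os.Stderr, "pdftotext failed: %v\n", err)
		os.Exit(1)
	}
	os.Stdout.Write(out)
}
```

With "layout" the text comes out row by row rather than column by column; with
"bbox" you still have to parse the XHTML yourself to get at the coordinates.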

But even with XML, tables may not work, because there is no guarantee that the 
PDF authoring tool provided the table metadata. That is why most really good 
PDF -> XML converters use OCR and try to find the tables that way. There are 
several AI/ML-based automated OCR tools that do a pretty good job.
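
To illustrate what "finding the tables" from positions can look like, here is a
minimal Go sketch. It assumes you already have words with page coordinates
(from a bounding-box converter or an OCR tool); the Word type, rowsFromWords,
and the tolerance value are hypothetical names for this example. It groups
words into rows by Y position, sorts each row by X, and writes CSV. It treats
every word as its own cell, so real tables would also need column-gap
detection.

```go
package main

import (
	"encoding/csv"
	"log"
	"math"
	"os"
	"sort"
)

// Word is a hypothetical positioned text fragment, as a bounding-box
// converter or OCR tool might report it.
type Word struct {
	Text string
	X, Y float64 // top-left corner in page units
}

// rowsFromWords groups words whose Y lies within tol of the first word in the
// current row, then orders each row left to right by X.
func rowsFromWords(words []Word, tol float64) [][]string {
	sorted := append([]Word(nil), words...) // do not reorder the caller's slice
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y < sorted[j].Y })

	var rows [][]Word
	for _, w := range sorted {
		n := len(rows)
		if n > 0 && math.Abs(rows[n-1][0].Y-w.Y) <= tol {
			rows[n-1] = append(rows[n-1], w)
			continue
		}
		rows = append(rows, []Word{w})
	}

	out := make([][]string, 0, len(rows))
	for _, r := range rows {
		sort.SliceStable(r, func(i, j int) bool { return r[i].X < r[j].X })
		cells := make([]string, len(r))
		for i, w := range r {
			cells[i] = w.Text
		}
		out = append(out, cells)
	}
	return out
}

func main() {
	// Words arrive column by column, as in the original question.
	words := []Word{
		{"Item", 10, 100}, {"Widget", 10, 120}, {"Gadget", 10, 140},
		{"Qty", 80, 100.4}, {"3", 80, 120.2}, {"7", 80, 139.8},
	}
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(rowsFromWords(words, 2)); err != nil {
		log.Fatal(err)
	}
}
```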

Often though, a user/system creates a “parsing template” for the various 
documents it wants to parse (e.g. forms) and adds the additional metadata (e.g. 
identifying fields and tables) that says how each should be interpreted.
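
A parsing template can be as simple as named rectangles on the page. The Go
sketch below is one possible shape, not any particular product's format: the
Word, Region, and Template types, the field names, and the coordinates are all
invented for illustration. It assumes words with coordinates are already
available from a converter or OCR step.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Word is a hypothetical positioned text fragment from a converter or OCR step.
type Word struct {
	Text string
	X, Y float64
}

// Region is a rectangle on the page; X0/Y0 are inclusive, X1/Y1 exclusive.
type Region struct {
	X0, Y0, X1, Y1 float64
}

func (r Region) contains(w Word) bool {
	return w.X >= r.X0 && w.X < r.X1 && w.Y >= r.Y0 && w.Y < r.Y1
}

// Template maps a field name to the region where that field's value is expected.
type Template map[string]Region

// Apply collects the words that fall inside each region and joins them in
// reading order (top to bottom, then left to right). Fields whose region
// contains no words are left out of the result.
func (t Template) Apply(words []Word) map[string]string {
	buckets := make(map[string][]Word)
	for _, w := range words {
		for name, r := range t {
			if r.contains(w) {
				buckets[name] = append(buckets[name], w)
			}
		}
	}
	out := make(map[string]string, len(buckets))
	for name, ws := range buckets {
		sort.SliceStable(ws, func(i, j int) bool {
			if ws[i].Y != ws[j].Y {
				return ws[i].Y < ws[j].Y
			}
			return ws[i].X < ws[j].X
		})
		parts := make([]string, len(ws))
		for i, w := range ws {
			parts[i] = w.Text
		}
		out[name] = strings.Join(parts, " ")
	}
	return out
}

func main() {
	invoice := Template{
		"invoice_no": {X0: 400, Y0: 40, X1: 560, Y1: 60},
		"customer":   {X0: 40, Y0: 100, X1: 300, Y1: 120},
	}
	words := []Word{
		{"INV-1001", 420, 50}, {"Corp", 90, 110}, {"Acme", 50, 110}, {"Total", 50, 700},
	}
	fields := invoice.Apply(words)
	fmt.Println("invoice_no:", fields["invoice_no"])
	fmt.Println("customer:", fields["customer"])
}
```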


> On Jan 23, 2025, at 11:17 AM, Michael Bright <mjbrigh...@gmail.com> wrote:
> 
> 
> Hi Mike,
> 
> Not wanting to suggest that you take the Python route, but just sharing my 
> experience.
> 
> I've tried Acrobat Reader's "Save as Text" functionality, and also one or two 
> Python libraries to extract text from PDFs (PyPDF2 is the one I've settled 
> on).
> 
> But what I learnt - without really digging into the issue - is that PDF is a 
> pretty weird format where text from the same sentence/paragraph "floats 
> around" as separate objects.
> Bottom line is -  no matter what tool you use - you may find it really tricky 
> to get polished text from what seems like a simple PDF.
> 
> That said ... please please prove me wrong !
> If anyone has a good PDF extraction tool in an easy-to-use form, I'm 
> interested.
> 
> My own use case is to extract text from some partner training materials which 
> I regularly deliver so that I can do a diff to see what changed between 
> releases (obviously if they actually summarized that would be ideal, but they 
> point me to pdfdiff ... yuk).
> 
> I have some scripting that - just about - works but it's an absolute pain 
> having
> - sentence/paragraphs broken up into multiple lines (and not the same across 
> releases)
> - embedded code (in boxes) is indistinguishable from other text.
> 
> My 2cts.py,
> another Mike.
> 
> 
> On Thursday, January 23, 2025 at 1:30:33 PM UTC+1 Hugh Myrie wrote:
> Hi Mike,
> 
> Thanks for the suggestion! I'm interested in checking out your forked code. 
> It seems like a good alternative to what I'm currently using.
> 
> Hugh
> 
> On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel <mi...@newclarity.net> wrote:
> Hi Hugh,
> 
> I have been planning to do some Go work with PDF files, so your email 
> triggered me to do some research.
> 
> Not sure if using heussd/pdftotext-go is critical to you, or if you are just 
> trying to read text in a PDF?  I tried to get pdf2text installed but my dev 
> laptop is still running macOS Monterey and I couldn't get it working so I 
> looked for other options.
> 
> If you are just interested in reading PDF text and do not have a specific 
> need to use pdf2text then one of those others I looked at might work. I came 
> across a package originally developed by Russ Cox that was forked by many 
> others, and to evaluate it I forked one of those and then converted it from 
> using a reader to returning a slice of strings so I could easily split out 
> the new lines. (I could probably have made it work with the reader, but I was 
> just going for quick.)
> 
> If you think it can help your use-case, please check it out (but be aware, my 
> additions to the forked code are rather hacky):
> 
> https://github.com/mikeschinkel/go-pdf-content-reader
> 
> -Mike
> 
>> On Jan 22, 2025, at 11:08 AM, Hugh Myrie <hugh....@gmail.com> wrote:
>> 
>> I want to extract text from a PDF and preserve any table or at least convert 
>> it to a CSV. I am using the PDFtoText package (which uses the Poppler 
>> software). The text is extracted vertically (i.e. one column at a time) and 
>> each text is separated by a space. There is no line break making it 
>> difficult to manipulate. I want to extract the text horizontally to preserve 
>> and possibly add line breaks to allow for further manipulation.
>> 
>> Your help in this matter is appreciated. Suggest alternatives if available.
>> 
>> Here is the Go code:
>> 
>> package main
>> 
>> import (
>>     "fmt"
>>     "log"
>>     "os"
>> 
>>     pdftotext "github.com/heussd/pdftotext-go"
>> )
>> 
>> func main() {
>>     // Replace "test.pdf" with the path to your PDF file
>>     pdfPath := "test.pdf"
>>     // Read the whole file into memory (no separate os.Open is needed)
>>     content, err := os.ReadFile(pdfPath)
>>     if err != nil {
>>         log.Fatalf("Failed to read PDF file: %v", err)
>>     }
>>     // Extract text from the PDF file
>>     text, err := pdftotext.Extract(content)
>>     if err != nil {
>>         log.Fatalf("Failed to extract text from PDF file: %v", err)
>>     }
>>     // Print the extracted text
>>     fmt.Println(text)
>> }
>> 
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to golang-nuts...@googlegroups.com.
>> To view this discussion visit 
>> https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com.
> 
> 

