Re: [go-nuts] PDF to text

Robert Solomon Sat, 25 Jan 2025 08:42:38 -0800

Adobe's Acrobat can export PDFs to docx and xlsx. Not a cheap option, but it 
does work.



On Thursday, January 23, 2025 at 7:29:13 PM UTC-5 Hugh Myrie wrote:

> Hi Michael,
>
> You're absolutely right, PDF extraction can be a real headache! 
> I've tried Mike's suggestion, but unfortunately, it didn't quite work as 
> I'd hoped – it put each character on a separate line, which made it just as 
> difficult to work with. 
>
> I think I'll give OCR a shot and see if that yields better results. If 
> that doesn't pan out, I might explore some Python libraries, as you 
> suggested. 
>
> Thanks again for your input, it's much appreciated!
>
> Best regards,
> Hugh 
>
>
> On Thu, Jan 23, 2025, 12:49 PM Michael Bright <mjbri...@gmail.com> wrote:
>
>>
>> Hi Mike,
>>
>> Not wanting to suggest that you take the Python route, but just sharing 
>> my experience.
>>
>> I've tried Acrobat Reader's "Save as Text" functionality, and also one or 
>> two Python libraries to extract text from PDFs (PyPDF2 is the one I've 
>> settled on).
>>
>> But what I learnt - without really digging into the issue - is that PDF 
>> is a pretty weird format where text from the same sentence/paragraph 
>> "floats around" as separate objects.
>> Bottom line is -  no matter what tool you use - you may find it really 
>> tricky to get polished text from what seems like a simple PDF.
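To make the "floats around" point concrete, here is a minimal Go sketch of one common way to rebuild reading order from positioned fragments: sort by vertical position, group fragments that share roughly the same baseline, then sort each group left to right. The Fragment type, the tolerance value, and the sample coordinates are assumptions for illustration; they are not taken from any library mentioned in this thread.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Fragment is a hypothetical positioned piece of text: a string plus the
// coordinates it is drawn at. Real PDF libraries expose something similar,
// but this type is made up for the example.
type Fragment struct {
	X, Y float64
	Text string
}

// groupIntoLines orders fragments top to bottom, then merges every fragment
// whose Y is within tol of the first fragment of a row into one output line,
// with the row's fragments ordered left to right.
func groupIntoLines(frags []Fragment, tol float64) []string {
	sorted := append([]Fragment(nil), frags...)
	// PDF user space usually has Y growing upwards, so larger Y comes first.
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines []string
	for start := 0; start < len(sorted); {
		end := start + 1
		for end < len(sorted) && math.Abs(sorted[end].Y-sorted[start].Y) <= tol {
			end++
		}
		row := sorted[start:end]
		sort.SliceStable(row, func(i, j int) bool { return row[i].X < row[j].X })
		words := make([]string, len(row))
		for i, f := range row {
			words[i] = f.Text
		}
		lines = append(lines, strings.Join(words, " "))
		start = end
	}
	return lines
}

func main() {
	// Fragments listed column by column, the order that causes trouble.
	frags := []Fragment{
		{X: 10, Y: 700, Text: "Name"},
		{X: 10, Y: 680, Text: "Alice"},
		{X: 90, Y: 700.4, Text: "Qty"},
		{X: 90, Y: 680.2, Text: "3"},
	}
	for _, line := range groupIntoLines(frags, 1.0) {
		fmt.Println(line)
	}
}
```

The tolerance is the fragile part: too small and one visual line splits in two, too large and neighbouring lines merge, which is one reason results vary so much between documents.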
>>
>> That said ... please please prove me wrong !
>> If anyone has a good PDF extraction tool in an easy-to-use form, I'm 
>> interested.
>>
>> My own use case is to extract text from some partner training materials 
>> which I regularly deliver so that I can do a diff to see what changed 
>> between releases (obviously if they actually summarized that would be 
>> ideal, but they point me to pdfdiff ... yuk).
>>
>> I have some scripting that - just about - works but it's an absolute pain 
>> having
>> - sentence/paragraphs broken up into multiple lines (and not the same 
>> across releases)
>> - embedded code (in boxes) is indistinguishable from other text.
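For the line-break problem in particular, one possible workaround is to reflow each paragraph onto a single line before diffing, so that differences in wrapping between releases disappear. The Go sketch below assumes paragraphs are separated by blank lines in the extracted text, which will not hold for every PDF; the function name reflow and the sample strings are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// reflow joins the lines of each paragraph into a single line and collapses
// runs of whitespace. Paragraphs are assumed to be separated by blank lines.
func reflow(text string) []string {
	var paras []string
	var cur []string
	flush := func() {
		if len(cur) > 0 {
			paras = append(paras, strings.Join(cur, " "))
			cur = nil
		}
	}
	for _, line := range strings.Split(text, "\n") {
		words := strings.Fields(line)
		if len(words) == 0 {
			// Blank line: the current paragraph is complete.
			flush()
			continue
		}
		cur = append(cur, words...)
	}
	flush()
	return paras
}

func main() {
	// Two "releases" of the same text, wrapped differently.
	v1 := "PDF text often\nwraps   differently.\n\nSecond paragraph."
	v2 := "PDF text often wraps\ndifferently.\n\nSecond paragraph, edited."
	a, b := reflow(v1), reflow(v2)
	// Naive positional comparison; a real diff tool would handle
	// inserted or deleted paragraphs as well.
	for i := range a {
		if i < len(b) && a[i] != b[i] {
			fmt.Printf("paragraph %d changed:\n- %s\n+ %s\n", i+1, a[i], b[i])
		}
	}
}
```

This does nothing for the second point: code in boxes still looks like ordinary text once the layout information is gone.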
>>
>> My 2cts.py,
>> another Mike.
>>
>>
>> On Thursday, January 23, 2025 at 1:30:33 PM UTC+1 Hugh Myrie wrote:
>>
>> Hi Mike,
>>
>> Thanks for the suggestion! I'm interested in checking out your forked 
>> code. It seems like a good alternative to what I'm currently using.
>>
>> Hugh
>>
>> On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel <mi...@newclarity.net> 
>> wrote:
>>
>> Hi Hugh,
>>
>> I have been planning to do some Go work with PDF files, so your email 
>> triggered me to do some research.
>>
>> Not sure if using heussd/pdftotext-go is critical to you, or if you are 
>> just trying to read text in a PDF?  I tried to get pdf2text installed but 
>> my dev laptop is still running macOS Monterey and I couldn't get it working 
>> so I looked for other options.
>>
>> If you are just interested in reading PDF text and do not have a specific 
>> need to use pdf2text then one of those others I looked at might work. I came 
>> across a package originally developed by Russ Cox that was forked by many 
>> others, and to evaluate it I forked one of those and then converted it from 
>> using a reader to returning a slice of strings so I could easily split out 
>> the new lines. (I could probably have made it work with the reader, but I 
>> was just going for quick.)
>>
>> If you think it can help your use-case, please check it out (but be 
>> aware, my additions to the forked code are rather hacky):
>>
>> https://github.com/mikeschinkel/go-pdf-content-reader
>>
>> -Mike
>>
>> On Jan 22, 2025, at 11:08 AM, Hugh Myrie <hugh....@gmail.com> wrote:
>>
>> I want to extract text from a PDF and preserve any tables, or at least 
>> convert them to CSV. I am using the PDFtoText package (which uses the 
>> Poppler software). The text is extracted vertically (i.e. one column at a 
>> time), with each piece of text separated by a space. There are no line 
>> breaks, which makes it difficult to manipulate. I want to extract the text 
>> horizontally, to preserve and possibly add line breaks for further 
>> manipulation.
>>
>> Your help in this matter is appreciated. Suggest alternatives if 
>> available.
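One alternative worth trying is to call Poppler's pdftotext command directly with its -layout option, which asks the tool to keep the physical layout of the page instead of following the order of the content stream. The Go sketch below assumes the pdftotext binary is installed and on PATH; layoutArgs and extractLayout are hypothetical helper names, and whether the Go wrapper package exposes this option is not something this sketch relies on.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// layoutArgs builds the argument list for Poppler's pdftotext CLI.
// -layout keeps the physical layout; "-" sends the text to stdout.
func layoutArgs(pdfPath string) []string {
	return []string{"-layout", pdfPath, "-"}
}

// extractLayout runs pdftotext (assumed to be on PATH) and returns its output.
func extractLayout(pdfPath string) (string, error) {
	out, err := exec.Command("pdftotext", layoutArgs(pdfPath)...).Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext %s: %w", pdfPath, err)
	}
	return string(out), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: go run . file.pdf")
		return
	}
	text, err := extractLayout(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(text)
}
```

With -layout, table rows usually come out as single lines with columns padded by spaces, though the result still depends on how the PDF was produced.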
>>
>> Here is the Go code:
>>
>> package main
>>
>> import (
>>     "fmt"
>>     "log"
>>     "os"
>>
>>     pdftotext "github.com/heussd/pdftotext-go"
>> )
>>
>> func main() {
>>     // Replace "test.pdf" with the path to your PDF file
>>     pdfPath := "test.pdf"
>>     // Read the whole file into memory; Extract takes the raw bytes, so no separate os.Open is needed
>>     content, err := os.ReadFile(pdfPath)
>>     if err != nil {
>>         log.Fatalf("Failed to read PDF file: %v", err)
>>     }
>>     // Extract text from the PDF file
>>     text, err := pdftotext.Extract(content)
>>     if err != nil {
>>         log.Fatalf("Failed to extract text from PDF file: %v", err)
>>     }
>>     // Print the extracted text
>>     fmt.Println(text)
>> }
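For the CSV part of the question, a rough sketch is shown here. It assumes the text has already been extracted with its layout preserved (for example by Poppler's pdftotext with -layout), so that each table row is one line and columns are separated by two or more spaces. The sample text, the columnGap pattern, and the splitRows helper are assumptions for illustration; cells that themselves contain double spaces, or empty cells, will break this heuristic.

```go
package main

import (
	"encoding/csv"
	"os"
	"regexp"
	"strings"
)

// columnGap matches the runs of two or more spaces that separate columns
// in layout-preserving text output.
var columnGap = regexp.MustCompile(` {2,}`)

// splitRows turns layout text into rows of cells, skipping blank lines.
// Leading and trailing whitespace on each line is discarded.
func splitRows(text string) [][]string {
	var rows [][]string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		rows = append(rows, columnGap.Split(line, -1))
	}
	return rows
}

func main() {
	// Sample text standing in for layout-preserving extraction output.
	text := "Item        Qty   Price\nBlue pen    3     1.50\n\nNotebook    12    4.25\n"
	w := csv.NewWriter(os.Stdout)
	// WriteAll writes every row and flushes the writer.
	if err := w.WriteAll(splitRows(text)); err != nil {
		panic(err)
	}
}
```

Single spaces inside a cell ("Blue pen") survive because only gaps of two or more spaces count as column separators.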
>>
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to golang-nuts...@googlegroups.com.
>> To view this discussion visit 
>> https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com
>>  
>> <https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>

