Adobe Acrobat can export PDFs to docx and xlsx. It's not a cheap option, but it does work.
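For a no-cost route to the table-to-CSV goal in the original question quoted below, here is a minimal sketch in Go. It assumes the text was already extracted with Poppler's `pdftotext -layout`, which keeps each table row on one line, and that columns are separated by runs of two or more spaces. Both are assumptions about the input, and `layoutToCSV` and `columnGap` are names made up for illustration. Whether heussd/pdftotext-go passes `-layout` to Poppler is not confirmed here.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"regexp"
	"strings"
)

// columnGap matches a run of two or more spaces. Treating such a run as a
// column separator is an assumption about the layout-preserved text.
var columnGap = regexp.MustCompile(` {2,}`)

// layoutToCSV (hypothetical helper) turns layout-preserved text into CSV,
// writing one record per non-blank line.
func layoutToCSV(text string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	for _, line := range strings.Split(text, "\n") {
		// TrimSpace also drops carriage returns and form feeds.
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		if err := w.Write(columnGap.Split(line, -1)); err != nil {
			return "", err
		}
	}
	w.Flush()
	return sb.String(), w.Error()
}

func main() {
	// Made-up sample standing in for the output of `pdftotext -layout`.
	sample := "Item        Qty   Price\nBlue widget   2    9.50\n"
	out, err := layoutToCSV(sample)
	if err != nil {
		log.Fatalf("Failed to convert text to CSV: %v", err)
	}
	fmt.Print(out)
}
```

Cells that contain double spaces themselves, or columns separated by only a single space, will split in the wrong place, so treat this as a starting point rather than a general table parser.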
On Thursday, January 23, 2025 at 7:29:13 PM UTC-5 Hugh Myrie wrote:

> Hi Michael,
>
> You're absolutely right, PDF extraction can be a real headache!
> I've tried Mike's suggestion, but unfortunately, it didn't quite work as
> I'd hoped – it put each character on a separate line, which made it just as
> difficult to work with.
>
> I think I'll give OCR a shot and see if that yields better results. If
> that doesn't pan out, I might explore some Python libraries, as you
> suggested.
>
> Thanks again for your input, it's much appreciated!
>
> Best regards,
> Hugh
>
>
> On Thu, Jan 23, 2025, 12:49 PM Michael Bright <mjbri...@gmail.com> wrote:
>
>>
>> Hi Mike,
>>
>> Not wanting to suggest that you take the Python route, but just sharing
>> my experience.
>>
>> I've tried Acrobat Reader's "Save as Text" functionality, and also one or
>> two Python libraries to extract text from PDFs (PyPDF2 is the one I've
>> settled on).
>>
>> But what I learnt - without really digging into the issue - is that PDF
>> is a pretty weird format where text from the same sentence/paragraph
>> "floats around" as separate objects.
>> Bottom line is - no matter what tool you use - you may find it really
>> tricky to get polished text from what seems like a simple PDF.
>>
>> That said ... please please prove me wrong!
>> If anyone has a good pdf extraction tool in any easy to use form I'm
>> interested.
>>
>> My own use case is to extract text from some partner training materials
>> which I regularly deliver so that I can do a diff to see what changed
>> between releases (obviously if they actually summarized that would be
>> ideal, but they point me to pdfdiff ... yuk).
>>
>> I have some scripting that - just about - works but it's an absolute pain
>> having
>> - sentence/paragraphs broken up into multiple lines (and not the same
>>   across releases)
>> - embedded code (in boxes) is indistinguishable from other text.
>>
>> My 2cts.py,
>> another Mike.
>>
>>
>> On Thursday, January 23, 2025 at 1:30:33 PM UTC+1 Hugh Myrie wrote:
>>
>> Hi Mike,
>>
>> Thanks for the suggestion! I'm interested in checking out your forked
>> code. It seems like a good alternative to what I'm currently using.
>>
>> Hugh
>>
>> On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel <mi...@newclarity.net> wrote:
>>
>> Hi Hugh,
>>
>> I have been planning to do some Go work with PDF files, so your email
>> triggered me to do some research.
>>
>> Not sure if using heussd/pdftotext-go is critical to you, or if you are
>> just trying to read text in a PDF? I tried to get pdf2text installed but
>> my dev laptop is still running macOS Monterey and I couldn't get it working,
>> so I looked for other options.
>>
>> If you are just interested in reading PDF text and do not have a specific
>> need to use pdf2text, then one of those others I looked at might work. I came
>> across a package originally developed by Russ Cox that was forked by many
>> others, and to evaluate it I forked one of those and then converted it from
>> using a reader to returning a slice of strings so I could easily split out
>> the new lines. (I could probably have made it work with the reader, but I
>> was just going for quick.)
>>
>> If you think it can help your use-case, please check it out (but be
>> aware, my additions to the forked code are rather hacky):
>>
>> https://github.com/mikeschinkel/go-pdf-content-reader
>>
>> -Mike
>>
>> On Jan 22, 2025, at 11:08 AM, Hugh Myrie <hugh....@gmail.com> wrote:
>>
>> I want to extract text from a PDF and preserve any table or at least
>> convert it to a CSV. I am using the PDFtoText package (which uses the
>> Poppler software). The text is extracted vertically (i.e. one column at a
>> time) and each text is separated by a space. There is no line break, making
>> it difficult to manipulate. I want to extract the text horizontally to
>> preserve and possibly add line breaks to allow for further manipulation.
>>
>> Your help in this matter is appreciated. Suggest alternatives if
>> available.
>>
>> Here is the Go code:
>>
>> package main
>>
>> import (
>>     "fmt"
>>     "log"
>>     "os"
>>
>>     pdftotext "github.com/heussd/pdftotext-go"
>> )
>>
>> func main() {
>>     // Replace "test.pdf" with the path to your PDF file
>>     pdfPath := "test.pdf"
>>     // Open the PDF file
>>     f, err := os.Open(pdfPath)
>>     if err != nil {
>>         log.Fatalf("Failed to open PDF file: %v", err)
>>     }
>>     defer f.Close()
>>     // Read the file content
>>     content, err := os.ReadFile(pdfPath)
>>     if err != nil {
>>         log.Fatalf("Failed to read PDF file: %v", err)
>>     }
>>     // Extract text from the PDF file
>>     text, err := pdftotext.Extract(content)
>>     if err != nil {
>>         log.Fatalf("Failed to extract text from PDF file: %v", err)
>>     }
>>     // Print the extracted text
>>     fmt.Println(text)
>> }
>>
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "golang-nuts" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to golang-nuts...@googlegroups.com.
>> To view this discussion visit
>> https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com