Re: [go-nuts] PDF to text

Sharon Mafgaoker Thu, 23 Jan 2025 10:56:55 -0800

Hey,

I’m using Google Cloud Document AI:
https://cloud.google.com/document-ai

I send my PDF and get back the extracted text as a JSON object.

It works fast and is not expensive 🙏

I hope this helps.
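
A minimal sketch of reading the text out of such a JSON response in Go. It assumes the response has a top-level "document" object with a "text" field; the sample payload and the names processResponse and extractLines are made up for illustration, so check them against the JSON you actually receive.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// processResponse mirrors only the part of the response this sketch needs.
// Assumption: the JSON has a top-level "document" object with a "text" field.
type processResponse struct {
	Document struct {
		Text string `json:"text"`
	} `json:"document"`
}

// extractLines decodes the response and returns the non-empty, trimmed
// lines of the extracted text.
func extractLines(raw []byte) ([]string, error) {
	var resp processResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, fmt.Errorf("decode response: %w", err)
	}
	var lines []string
	for _, line := range strings.Split(resp.Document.Text, "\n") {
		if line = strings.TrimSpace(line); line != "" {
			lines = append(lines, line)
		}
	}
	return lines, nil
}

func main() {
	// Hypothetical sample payload, not real service output.
	raw := []byte(`{"document":{"text":"Invoice 42\nTotal 10.00\n"}}`)
	lines, err := extractLines(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

The struct decodes only the flat text; any layout information in the response is ignored here.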

Sharon Mafgaoker – Senior Solutions Architect

M. 050 995 99 16 | sha...@cloud5.co.il




On Thu, 23 Jan 2025 at 19:56 robert engels <reng...@ix.netcom.com> wrote:

> You typically can’t convert a PDF to text and do what you are trying to do.
>
> Look for PDF to XML converters - you need the “blocks” and the hierarchy
> in order to interpret most PDFs with any sort of complex formatting.
>
> But even with XML, tables may not work, because there is no guarantee that
> the PDF authoring tool provided the table metadata, which is why most
> really good PDF -> XML converters use OCR and try and find the tables that
> way. There are several AI/ML based automated OCR tools that do a pretty
> good job.
>
> Often though, a user/system creates a “parsing template” for the various
> documents it wants to parse (i.e. forms) and adds the additional metadata
> (e.g. identifies fields, and tables) for how it should be interpreted.
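
The "parsing template" idea above could be sketched roughly as follows: a template is a list of named page regions, and each positioned text block is assigned to the region that contains it. The Block and Region types, the coordinates, and the field names are all hypothetical; a real converter will report positions in its own format.

```go
package main

import "fmt"

// Block is a piece of text with its position on the page, as a
// layout-preserving converter might report it (hypothetical shape).
type Block struct {
	X, Y float64 // top-left corner, in page units
	Text string
}

// Region names a rectangular area of the page that holds one field.
type Region struct {
	Name           string
	X0, Y0, X1, Y1 float64
}

// contains reports whether the block's anchor point lies inside the region.
func (r Region) contains(b Block) bool {
	return b.X >= r.X0 && b.X < r.X1 && b.Y >= r.Y0 && b.Y < r.Y1
}

// applyTemplate assigns each block to the first region containing it and
// joins the texts of each field in input order, separated by spaces.
func applyTemplate(template []Region, blocks []Block) map[string]string {
	fields := make(map[string]string)
	for _, b := range blocks {
		for _, r := range template {
			if r.contains(b) {
				if fields[r.Name] != "" {
					fields[r.Name] += " "
				}
				fields[r.Name] += b.Text
				break
			}
		}
	}
	return fields
}

func main() {
	template := []Region{
		{Name: "invoice_no", X0: 400, Y0: 0, X1: 600, Y1: 50},
		{Name: "customer", X0: 0, Y0: 50, X1: 300, Y1: 120},
	}
	blocks := []Block{
		{X: 420, Y: 20, Text: "INV-0042"},
		{X: 10, Y: 60, Text: "Acme"},
		{X: 60, Y: 60, Text: "Ltd."},
		{X: 10, Y: 500, Text: "Page 1"}, // outside every region, ignored
	}
	fields := applyTemplate(template, blocks)
	fmt.Println(fields["invoice_no"])
	fmt.Println(fields["customer"])
}
```

One template per document type is needed, which is why this approach suits forms with a fixed layout.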
>
>
> On Jan 23, 2025, at 11:17 AM, Michael Bright <mjbrigh...@gmail.com> wrote:
>
>
> Hi Mike,
>
> Not wanting to suggest that you take the Python route, but just sharing my
> experience.
>
> I've tried Acrobat Reader's "Save as Text" functionality, and also one or
> two Python libraries to extract text from PDFs (PyPDF2 is the one I've
> settled on).
>
> But what I learnt - without really digging into the issue - is that PDF is
> a pretty weird format where text from the same sentence/paragraph "floats
> around" as separate objects.
> Bottom line is - no matter what tool you use - you may find it really
> tricky to get polished text from what seems like a simple PDF.
>
> That said ... please please prove me wrong !
> If anyone has a good pdf extraction tool in any easy to use form I'm
> interested.
>
> My own use case is to extract text from some partner training materials
> which I regularly deliver so that I can do a diff to see what changed
> between releases (obviously if they actually summarized that would be
> ideal, but they point me to pdfdiff ... yuk).
>
> I have some scripting that - just about - works but it's an absolute pain
> having
> - sentence/paragraphs broken up into multiple lines (and not the same
> across releases)
> - embedded code (in boxes) is indistinguishable from other text.
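
For the first pain point above (line breaks that differ between releases), one possible workaround is to reflow each paragraph onto a single line before diffing. This is only a sketch: it assumes paragraphs are separated by blank lines in the extracted text, which may not hold for every PDF, and the function name reflow is made up.

```go
package main

import (
	"fmt"
	"strings"
)

// reflow joins the lines of each paragraph into a single line and collapses
// runs of whitespace, so that re-wrapped text compares equal. Paragraphs are
// assumed to be separated by blank lines.
func reflow(text string) []string {
	var paragraphs []string
	var current []string
	flush := func() {
		if len(current) > 0 {
			paragraphs = append(paragraphs, strings.Join(current, " "))
			current = nil
		}
	}
	for _, line := range strings.Split(text, "\n") {
		words := strings.Fields(line)
		if len(words) == 0 {
			flush() // a blank line ends the current paragraph
			continue
		}
		current = append(current, words...)
	}
	flush()
	return paragraphs
}

func main() {
	oldText := "PDF is a pretty\nweird format.\n\nSecond paragraph."
	newText := "PDF is a pretty weird\nformat.\n\nSecond paragraph, edited."
	a, b := reflow(oldText), reflow(newText)
	// Naive paragraph-by-paragraph comparison; a real diff tool would also
	// handle inserted or deleted paragraphs.
	for i := range a {
		if i < len(b) && a[i] != b[i] {
			fmt.Printf("paragraph %d changed:\n- %s\n+ %s\n", i+1, a[i], b[i])
		}
	}
}
```

Writing the reflowed paragraphs one per line to a file and running an ordinary diff on the two files would give the same effect with better handling of insertions.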
>
> My 2cts.py,
> another Mike.
>
>
> On Thursday, January 23, 2025 at 1:30:33 PM UTC+1 Hugh Myrie wrote:
>
> Hi Mike,
>
> Thanks for the suggestion! I'm interested in checking out your forked
> code. It seems like a good alternative to what I'm currently using.
>
> Hugh
>
> On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel <mi...@newclarity.net> wrote:
>
> Hi Hugh,
>
> I have been planning to do some Go work with PDF files, so your email
> triggered me to do some research.
>
> Not sure if using heussd/pdftotext-go is critical to you, or if you are
> just trying to read text in a PDF?  I tried to get pdf2text installed but
> my dev laptop is still running macOS Monterey and I couldn't get it working
> so I looked for other options.
>
> If you are just interested in reading PDF text and do not have a specific
> need to use pdf2text then one of those others I looked at might work. I came
> across a package originally developed by Russ Cox that was forked by many
> others, and to evaluate it I forked one of those and then converted it from
> using a reader to returning a slice of strings so I could easily split out
> the new lines. (I could probably have made it work with the reader, but I
> was just going for quick.)
>
> If you think it can help your use-case, please check it out (but be aware,
> my additions to the forked code are rather hacky):
>
> https://github.com/mikeschinkel/go-pdf-content-reader
>
> -Mike
>
> On Jan 22, 2025, at 11:08 AM, Hugh Myrie <hugh....@gmail.com> wrote:
>
> I want to extract text from a PDF and preserve any tables, or at least
> convert them to CSV. I am using the PDFtoText package (which uses the
> Poppler software). The text is extracted vertically (i.e. one column at a
> time) and each piece of text is separated by a space. There are no line
> breaks, making the output difficult to manipulate. I want to extract the
> text horizontally, preserving (and possibly adding) line breaks to allow
> for further manipulation.
>
> Your help in this matter is appreciated. Suggest alternatives if available.
>
> Here is the Go code:
>
> package main
>
> import (
>     "fmt"
>     "log"
>     "os"
>
>     pdftotext "github.com/heussd/pdftotext-go"
> )
>
> func main() {
>     // Replace "test.pdf" with the path to your PDF file
>     pdfPath := "test.pdf"
>     // Read the whole file into memory; os.ReadFile opens and closes the
>     // file itself, so a separate os.Open is not needed
>     content, err := os.ReadFile(pdfPath)
>     if err != nil {
>         log.Fatalf("Failed to read PDF file: %v", err)
>     }
>     // Extract text from the PDF file
>     text, err := pdftotext.Extract(content)
>     if err != nil {
>         log.Fatalf("Failed to extract text from PDF file: %v", err)
>     }
>     // Print the extracted text
>     fmt.Println(text)
> }
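
For the column-ordering problem in the original question, one alternative worth trying is calling Poppler's pdftotext directly with its -layout option, which tries to keep the physical layout so that table rows stay on one line. The sketch below assumes pdftotext is on the PATH and treats two or more spaces as a column separator; that heuristic, and the names layoutText and splitRows, are assumptions for illustration, and the splitting will misalign rows that have empty cells.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"os"
	"os/exec"
	"regexp"
	"strings"
)

// columnGap matches two or more spaces, which this sketch treats as a
// column separator in layout-preserved output (a heuristic, not a rule).
var columnGap = regexp.MustCompile(` {2,}`)

// layoutText runs Poppler's pdftotext with -layout and returns its output.
// The trailing "-" sends the text to stdout instead of a file.
func layoutText(pdfPath string) (string, error) {
	out, err := exec.Command("pdftotext", "-layout", pdfPath, "-").Output()
	if err != nil {
		return "", fmt.Errorf("pdftotext: %w", err)
	}
	return string(out), nil
}

// splitRows turns layout text into rows of cells, skipping blank lines.
func splitRows(text string) [][]string {
	var rows [][]string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		rows = append(rows, columnGap.Split(line, -1))
	}
	return rows
}

func main() {
	// Sample text standing in for pdftotext output; pass a PDF path as the
	// first argument to run the real tool instead.
	text := "Item      Qty   Price\nWidget    2     9.50\n"
	if len(os.Args) > 1 {
		var err error
		if text, err = layoutText(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	// WriteAll flushes the writer and reports any write error.
	if err := csv.NewWriter(os.Stdout).WriteAll(splitRows(text)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Cells that contain a single space, such as "Blue Widget", stay in one piece because only runs of two or more spaces split a line.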
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com
> <https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
