Hey, I’m using https://cloud.google.com/document-ai
I’m sending my PDF and getting back the extracted text as a JSON object. It works fast and is not expensive 🙏
I hope this helps.

Sharon Mafgaoker – Senior Solutions Architect
M. 050 995 99 16 | sha...@cloud5.co.il

On Thu, 23 Jan 2025 at 19:56 robert engels <reng...@ix.netcom.com> wrote:

> You typically can’t convert a PDF to text and do what you are trying to do.
>
> Look for PDF to XML converters - you need the “blocks” and the hierarchy
> in order to interpret most PDFs with any sort of complex formatting.
>
> But even with XML, tables may not work, because there is no guarantee that
> the PDF authoring tool provided the table metadata, which is why most
> really good PDF -> XML converters use OCR and try to find the tables that
> way. There are several AI/ML based automated OCR tools that do a pretty
> good job.
>
> Often though, a user/system creates a “parsing template” for the various
> documents it wants to parse (i.e. forms) and adds the additional metadata
> (e.g. identifies fields, and tables) for how it should be interpreted.
>
>
> On Jan 23, 2025, at 11:17 AM, Michael Bright <mjbrigh...@gmail.com> wrote:
>
>
> Hi Mike,
>
> Not wanting to suggest that you take the Python route, but just sharing my
> experience.
>
> I've tried Acrobat Reader's "Save as Text" functionality, and also one or
> two Python libraries to extract text from PDFs (PyPDF2 is the one I've
> settled on).
>
> But what I learnt - without really digging into the issue - is that PDF is
> a pretty weird format where text from the same sentence/paragraph "floats
> around" as separate objects.
> Bottom line is - no matter what tool you use - you may find it really
> tricky to get polished text from what seems like a simple PDF.
>
> That said ... please please prove me wrong !
> If anyone has a good pdf extraction tool in any easy to use form I'm
> interested.
>
> My own use case is to extract text from some partner training materials
> which I regularly deliver so that I can do a diff to see what changed
> between releases (obviously if they actually summarized that would be
> ideal, but they point me to pdfdiff ... yuk).
>
> I have some scripting that - just about - works but it's an absolute pain
> having
> - sentence/paragraphs broken up into multiple lines (and not the same
> across releases)
> - embedded code (in boxes) is indistinguishable from other text.
>
> My 2cts.py,
> another Mike.
>
>
> On Thursday, January 23, 2025 at 1:30:33 PM UTC+1 Hugh Myrie wrote:
>
> Hi Mike,
>
> Thanks for the suggestion! I'm interested in checking out your forked
> code. It seems like a good alternative to what I'm currently using.
>
> Hugh
>
> On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel <mi...@newclarity.net> wrote:
>
> Hi Hugh,
>
> I have been planning to do some Go work with PDF files, so your email
> triggered me to do some research.
>
> Not sure if using heussd/pdftotext-go is critical to you, or if you are
> just trying to read text in a PDF? I tried to get pdf2text installed but
> my dev laptop is still running macOS Monterey and I couldn't get it working
> so I looked for other options.
>
> If you are just interested in reading PDF text and do not have a specific
> need to use pdf2text then one of those others I looked at might work. I came
> across a package originally developed by Russ Cox that was forked by many
> others, and to evaluate it I forked one of those and then converted it from
> using a reader to returning a slice of strings so I could easily split out
> the new lines. (I could probably have made it work with the reader, but I
> was just going for quick.)
>
> If you think it can help your use-case, please check it out (but be aware,
> my additions to the forked code are rather hacky):
>
> https://github.com/mikeschinkel/go-pdf-content-reader
>
> -Mike
>
> On Jan 22, 2025, at 11:08 AM, Hugh Myrie <hugh....@gmail.com> wrote:
>
> I want to extract text from a PDF and preserve any table or at least
> convert it to a CSV. I am using the PDFtoText package (which uses the
> Poppler software). The text is extracted vertically (i.e. one column at a
> time) and each text is separated by a space. There is no line break, making
> it difficult to manipulate. I want to extract the text horizontally to
> preserve and possibly add line breaks to allow for further manipulation.
>
> Your help in this matter is appreciated. Suggest alternatives if available.
>
> Here is the Go code:
>
> package main
>
> import (
> 	"fmt"
> 	"log"
> 	"os"
>
> 	pdftotext "github.com/heussd/pdftotext-go"
> )
>
> func main() {
> 	// Replace "test.pdf" with the path to your PDF file
> 	pdfPath := "test.pdf"
> 	// Open the PDF file
> 	f, err := os.Open(pdfPath)
> 	if err != nil {
> 		log.Fatalf("Failed to open PDF file: %v", err)
> 	}
> 	defer f.Close()
> 	// Read the file content
> 	content, err := os.ReadFile(pdfPath)
> 	if err != nil {
> 		log.Fatalf("Failed to read PDF file: %v", err)
> 	}
> 	// Extract text from the PDF file
> 	text, err := pdftotext.Extract(content)
> 	if err != nil {
> 		log.Fatalf("Failed to extract text from PDF file: %v", err)
> 	}
> 	// Print the extracted text
> 	fmt.Println(text)
> }
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> To view this discussion visit
> https://groups.google.com/d/msgid/golang-nuts/c19e212d-a81f-4525-ae0d-a9abb0b292fbn%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups "golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email to golang-nuts+unsubscr...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/golang-nuts/CA%2BKEDerYuXpKj2YrfGR1O_takr69uHHp9aOqNu5LL%3D3%3DsPJWAA%40mail.gmail.com.