Re: [go-nuts] PDF to text

2025-01-25 Thread Robert Solomon
Adobe's Acrobat can extract to docx and xlsx. Not a cheap option, but it does work. On Thursday, January 23, 2025 at 7:29:13 PM UTC-5 Hugh Myrie wrote: > Hi Michael, > > You're absolutely right, PDF extraction can be a real headache! > I've tried Mike's suggestion, but unfortunately, it didn't

Re: [go-nuts] PDF to text

2025-01-23 Thread Hugh Myrie
Hi Michael, You're absolutely right, PDF extraction can be a real headache! I've tried Mike's suggestion, but unfortunately, it didn't quite work as I'd hoped – it put each character on a separate line, which made it just as difficult to work with. I think I'll give OCR a shot and see if that yie

Re: [go-nuts] PDF to text

2025-01-23 Thread Robert Engels
Glyph boundaries maintain the positional information, but you still need to effectively treat it as an image - it’s just very coarse. Which leads to the OCR/vision AI model. If the PDF author is intentionally hindering the ability to “grab the data” then there is no text at all - and it is an image

Re: [go-nuts] PDF to text

2025-01-23 Thread Duncan Harris
Amusingly we wrote our PDF table extractor largely in Go: https://pdftables.com/ It identifies tables and cells by looking at the statistical distribution of glyph boundaries on the pages rather than inferring anything from the way the text is logically grouped within the PDF. There are many ap
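
To make the glyph-boundary idea above a bit more concrete, here is a minimal sketch in Go. It is not pdftables' actual algorithm, just one simple way to use positions instead of the PDF's logical text grouping. The Glyph type, the coordinate convention (Y measured from the top of the page), and the minGap and tol thresholds are all assumptions made up for illustration; a real extractor would get the boxes from whatever PDF library it uses.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
	"sort"
)

// Glyph is a hypothetical bounding box for one character on a page.
// Y is assumed to be the distance from the top of the page.
type Glyph struct {
	X0, X1 float64 // left and right edges
	Y      float64 // vertical position of the text line
	Text   string
}

// columnGaps returns the horizontal intervals of at least minGap width
// that no glyph on the page covers. These are treated as column separators.
func columnGaps(glyphs []Glyph, minGap float64) [][2]float64 {
	if len(glyphs) == 0 {
		return nil
	}
	spans := make([][2]float64, len(glyphs))
	for i, g := range glyphs {
		spans[i] = [2]float64{g.X0, g.X1}
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i][0] < spans[j][0] })

	var gaps [][2]float64
	right := spans[0][1] // rightmost edge covered so far
	for _, s := range spans[1:] {
		if s[0]-right >= minGap {
			gaps = append(gaps, [2]float64{right, s[0]})
		}
		if s[1] > right {
			right = s[1]
		}
	}
	return gaps
}

// columnOf returns the index of the column that contains x.
func columnOf(x float64, gaps [][2]float64) int {
	col := 0
	for _, g := range gaps {
		if x >= g[1] {
			col++
		}
	}
	return col
}

// toRows groups glyphs into text lines (Y values within tol of the first
// glyph in the line) and concatenates the glyphs of each line per column.
func toRows(glyphs []Glyph, gaps [][2]float64, tol float64) [][]string {
	sorted := append([]Glyph(nil), glyphs...)
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y < sorted[j].Y })

	var rows [][]string
	var line []Glyph
	flush := func() {
		if len(line) == 0 {
			return
		}
		sort.SliceStable(line, func(i, j int) bool { return line[i].X0 < line[j].X0 })
		cells := make([]string, len(gaps)+1)
		for _, g := range line {
			cells[columnOf(g.X0, gaps)] += g.Text
		}
		rows = append(rows, cells)
		line = nil
	}
	for _, g := range sorted {
		if len(line) > 0 && g.Y-line[0].Y > tol {
			flush()
		}
		line = append(line, g)
	}
	flush()
	return rows
}

func main() {
	// Made-up sample data: two text lines, two columns.
	glyphs := []Glyph{
		{0, 5, 10, "A"}, {5, 10, 10, "B"}, {40, 45, 10, "1"},
		{0, 5, 20, "C"}, {6, 11, 20, "D"}, {40, 45, 20, "2"}, {45, 50, 20, "3"},
	}
	gaps := columnGaps(glyphs, 10)
	w := csv.NewWriter(os.Stdout)
	if err := w.WriteAll(toRows(glyphs, gaps, 2)); err != nil {
		log.Fatal(err)
	}
}
```

Projecting every glyph onto the x-axis and looking for empty bands is about the crudest statistic you can compute here. It breaks on cells that span columns and on tables that only occupy part of a page, which is presumably where a more careful statistical treatment earns its keep.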

Re: [go-nuts] PDF to text

2025-01-23 Thread Sharon Mafgaoker
Hey, I’m using https://cloud.google.com/document-ai to send my PDF and get back the extracted text as a JSON object. It works fast and is not expensive 🙏 I hope this will help you. Sharon Mafgaoker – Senior Solutions Architect M. 050 995 99 16 | sha...@cloud5.co.il On Thu, 23 Jan 2025 at 19:56 r

Re: [go-nuts] PDF to text

2025-01-23 Thread robert engels
You typically can’t convert a PDF to text and do what you are trying to do. Look for PDF to XML converters - you need the “blocks” and the hierarchy in order to interpret most PDFs with any sort of complex formatting. But even with XML, tables may not work, because there is no guarantee that the

Re: [go-nuts] PDF to text

2025-01-23 Thread Michael Bright
Hi Mike, Not wanting to suggest that you take the Python route, but just sharing my experience. I've tried Acrobat Reader's "Save as Text" functionality, and also one or two Python libraries to extract text from PDFs (PyPDF2 is the one I've settled on). But what I learnt - without really dig

Re: [go-nuts] PDF to text

2025-01-23 Thread Hugh Myrie
Hi Mike, Thanks for the suggestion! I'm interested in checking out your forked code. It seems like a good alternative to what I'm currently using. Hugh On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel wrote: > Hi Hugh, > > I have been planning to do some Go work with PDF files, so your email > tri

Re: [go-nuts] PDF to text

2025-01-22 Thread Mike Schinkel
Hi Hugh, I have been planning to do some Go work with PDF files, so your email triggered me to do some research. Not sure if using heussd/pdftotext-go is critical to you, or if you are just trying to read text in a PDF? I tried to get pdf2text installed but my dev laptop is still running macO

[go-nuts] PDF to text

2025-01-22 Thread Hugh Myrie
I want to extract text from a PDF and preserve any table or at least convert it to a CSV. I am using the PDFtoText package (which uses the Poppler software). The text is extracted vertically (i.e. one column at a time) and each piece of text is separated by a space. There is no line break, making it diff