Adobe's Acrobat can extract to docx and xlsx. Not a cheap option, but it
does work.
On Thursday, January 23, 2025 at 7:29:13 PM UTC-5 Hugh Myrie wrote:
> Hi Michael,
>
> You're absolutely right, PDF extraction can be a real headache!
> I've tried Mike's suggestion, but unfortunately, it didn't
Hi Michael,
You're absolutely right, PDF extraction can be a real headache!
I've tried Mike's suggestion, but unfortunately, it didn't quite work as
I'd hoped – it put each character on a separate line, which made it just as
difficult to work with.
I think I'll give OCR a shot and see if that yie
Glyph boundaries maintain the positional information, but you still need to effectively treat the page as an image - it’s just very coarse. Which leads to the OCR/vision AI model. If the PDF author is intentionally hindering the ability to “grab the data”, then there is no text at all - and it is an image.
Amusingly we wrote our PDF table extractor largely in
Go: https://pdftables.com/
It identifies tables and cells by looking at the statistical distribution
of glyph boundaries on the pages
rather than inferring anything from the way the text is logically grouped
within the PDF.
There are many ap
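
To illustrate the glyph-boundary idea described above, here is a minimal Go sketch. It is not the pdftables.com implementation, which the message does not describe in detail; it is a simplified stand-in that projects glyph boxes onto the horizontal axis and reports wide stretches that no glyph covers, which are candidate column separators. The `Glyph` and `Gap` types, the `ColumnGaps` function, and the sample coordinates are all hypothetical, and a real extractor would first need to obtain glyph boxes from a PDF library.

```go
package main

import (
	"fmt"
	"sort"
)

// Glyph is a hypothetical bounding box for one character on a page.
// Only the horizontal extent is kept, since this sketch looks for columns.
type Glyph struct {
	X0, X1 float64 // left and right edges, in page units
}

// Gap is a horizontal interval that no glyph overlaps.
type Gap struct {
	Start, End float64
}

// ColumnGaps returns every horizontal interval at least minWidth wide
// that is not covered by any glyph. Such intervals are candidate
// column separators for a table.
func ColumnGaps(glyphs []Glyph, minWidth float64) []Gap {
	if len(glyphs) == 0 {
		return nil
	}
	// Work on a copy so the caller's slice keeps its order.
	sorted := make([]Glyph, len(glyphs))
	copy(sorted, glyphs)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].X0 < sorted[j].X0 })

	var gaps []Gap
	right := sorted[0].X1 // rightmost edge covered so far
	for _, g := range sorted[1:] {
		if g.X0-right >= minWidth {
			gaps = append(gaps, Gap{Start: right, End: g.X0})
		}
		if g.X1 > right {
			right = g.X1
		}
	}
	return gaps
}

func main() {
	// Made-up glyphs from two text rows that share two columns.
	glyphs := []Glyph{
		{10, 16}, {16, 22}, {22, 28}, // row 1, column 1
		{80, 86}, {86, 92}, // row 1, column 2
		{12, 18}, {18, 24}, // row 2, column 1
		{82, 88}, {88, 94}, {94, 100}, // row 2, column 2
	}
	for _, gap := range ColumnGaps(glyphs, 20) {
		fmt.Printf("column gap from %.0f to %.0f\n", gap.Start, gap.End)
	}
}
```

Running the same projection on the vertical axis would give candidate row separators. The `minWidth` threshold is the fragile part: it has to be wider than ordinary spacing between words but narrower than the gutter between columns.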
Hey,
I’m using
https://cloud.google.com/document-ai
I send my PDF and get back the extracted text as a JSON object.
It works fast and is not expensive 🙏
I hope this will help you.
Sharon Mafgaoker – Senior Solutions Architect
M. 050 995 99 16 | sha...@cloud5.co.il
On Thu, 23 Jan 2025 at 19:56 r
You typically can’t convert a PDF to text and do what you are trying to do.
Look for PDF to XML converters - you need the “blocks” and the hierarchy in
order to interpret most PDFs with any sort of complex formatting.
But even with XML, tables may not work, because there is no guarantee that the
Hi Mike,
Not wanting to suggest that you take the Python route, but just sharing my
experience.
I've tried Acrobat Reader's "Save as Text" functionality, and also one or
two Python libraries to extract text from PDFs (PyPDF2 is the one I've
settled on).
But what I learnt - without really dig
Hi Mike,
Thanks for the suggestion! I'm interested in checking out your forked code.
It seems like a good alternative to what I'm currently using.
Hugh
On Wed, Jan 22, 2025, 10:25 PM Mike Schinkel wrote:
> Hi Hugh,
>
> I have been planning to do some Go work with PDF files, so your email
> tri
Hi Hugh,
I have been planning to do some Go work with PDF files, so your email triggered
me to do some research.
Not sure if using heussd/pdftotext-go is critical to you, or if you are just
trying to read text in a PDF? I tried to get pdf2text installed but my dev
laptop is still running macO
I want to extract text from a PDF and preserve any table or at least
convert it to a CSV. I am using the PDFtoText package (which uses the
Poppler software). The text is extracted vertically (i.e. one column at a
time) and each text is separated by a space. There is no line break making
it diff
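
For the table-to-CSV goal described above, one possible approach is sketched below in Go. It assumes the text has already been extracted with the physical layout preserved, so that each table row is one line and columns are padded with spaces (Poppler's `pdftotext` has a `-layout` option intended for this; whether the wrapper package exposes it is not something this sketch assumes). The `SplitRow` function, the two-space threshold, and the sample text are hypothetical.

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
	"regexp"
	"strings"
)

// columnSep matches a run of two or more spaces. The assumption is that
// a single space separates words inside a cell, while padding between
// columns is always wider.
var columnSep = regexp.MustCompile(` {2,}`)

// SplitRow splits one line of layout-preserved text into cells.
// It returns nil for blank lines.
func SplitRow(line string) []string {
	line = strings.TrimSpace(line)
	if line == "" {
		return nil
	}
	return columnSep.Split(line, -1)
}

func main() {
	// Made-up sample of layout-preserved text for a three-column table.
	text := "Item          Qty    Unit price\n" +
		"Blue widget     3          4.50\n" +
		"Bolt           12          0.25\n"

	w := csv.NewWriter(os.Stdout)
	for _, line := range strings.Split(text, "\n") {
		row := SplitRow(line)
		if row == nil {
			continue
		}
		if err := w.Write(row); err != nil {
			log.Fatal(err)
		}
	}
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
}
```

This only works when every cell in a row is filled; an empty cell collapses into the surrounding padding and shifts the remaining cells left. Handling that case needs the column positions themselves, not just the whitespace between cells.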