To extract text from PDF documents, I am using Xpdf, a command-line tool on Windows that is also available for Linux-based systems.
It works well when called with shell() from LiveCode Community 8.x, although I have tested it only in the IDE on Windows. It should work on Linux and Mac as well.

If a PDF contains only images, with the text inside the image, you first need to run it through a character recognition (OCR) program. Different tools give different results when converting image characters in a PDF to embedded text, and I still find that Adobe Acrobat does the best job.

I needed this because some people had sent huge lists of numerical data as PDFs, which were impossible to extract otherwise, and doing it manually could have taken weeks. It is also helpful for building document management systems, where the words within associated documents need to be indexed.

Converting PDF to .docx (Word) usually does not give good results; the resulting documents are quite unclean. Extracting the text does not necessarily give meaningful text either, unless the original PDF is structured with clearly separated paragraphs, headlines, etc., ideally in a single top-to-bottom, left-to-right flow. So a lot of manual work will often be required.

Nevertheless, I cannot see PDF losing ground as the standard for many years to come. There are possibly billions of PDF documents around. What should replace it? And people are still printing.

Xpdf can generate a plain text file that can be read from LiveCode and processed further.
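As a minimal sketch of the extraction step, here is how pdftotext could be driven from a script. It is written in Python purely for illustration; the command line it builds is the same kind of string one would hand to shell() in LiveCode. The helper names build_pdftotext_command and extract_text are hypothetical, and the sketch assumes that the Xpdf pdftotext binary is on the PATH. The options used (-layout, -enc, -f, -l, and "-" as the output name to write to stdout) are taken from the pdftotext option list; please check them against the version you have installed.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str,
                            first_page: Optional[int] = None,
                            last_page: Optional[int] = None,
                            keep_layout: bool = True) -> List[str]:
    """Assemble the argument list for pdftotext (hypothetical helper)."""
    # Ask for UTF-8 output so non-ASCII characters survive the round trip.
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout, useful for tables.
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file name sends the text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path: str, **options) -> str:
    """Run pdftotext and return the extracted text (assumes it is on PATH)."""
    if shutil.which("pdftotext") is None:
        raise FileNotFoundError("pdftotext was not found on the PATH")
    result = subprocess.run(build_pdftotext_command(pdf_path, **options),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

Passing the arguments as a list avoids quoting problems with file names that contain spaces. When building a single command string for shell() instead, the path would need to be wrapped in quotes.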
*Open Source Xpdf*

http://www.xpdfreader.com/download.html
https://en.wikipedia.org/wiki/Pdftotext

Command line tools in Xpdf

The open source Xpdf toolkit also includes several command line tools which perform various functions on PDF files:

- *pdftotext*: converts PDF to text
- *pdftops*: converts PDF to PostScript
- *pdftoppm*: converts PDF pages to netpbm (PPM/PGM/PBM) image files
- *pdftopng*: converts PDF pages to PNG image files
- *pdftohtml*: converts PDF to HTML
- *pdfinfo*: extracts PDF metadata
- *pdfimages*: extracts raw images from PDF files
- *pdffonts*: lists fonts used in PDF files
- *pdfdetach*: extracts attached files from PDF files

Cross-platform

All of the Xpdf tools are available for Linux, Windows, and Mac. The viewer (xpdf / XpdfReader) uses the Qt toolkit.

Roland

_______________________________________________
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode