Document Management systems use PDFs almost exclusively. I think PDF is here to stay.
Bob S

> On May 13, 2018, at 08:05, Mike Bonner via use-livecode
> <use-livecode@lists.runrev.com> wrote:
>
> I ended up using pdftotext, and it worked like a charm. (Though I had to look
> up how to send it a file list using find.. Too long away from the shell.)
> I now have a little app that can do a word search for matching files and
> show either the extracted text or the original PDF using the browser
> widget.
>
> As far as being on the "make PDFs go away" bandwagon, yes I am.
> Unfortunately, they're still used all over the place. Insurance companies
> autogenerate a huge number of PDF reports, some of them built live through
> horribly slow, clunky, awful websites (insert a bunch of other words here to
> describe how NOT enjoyable it is to use their websites) that eventually
> (after you go through a huge number of different screens to get to the end
> result) hand you a PDF. /endInsuranceWebsiteVent
>
> Reminds me of when I worked as phone support for a "large computer
> manufacturer".. When there was a workflow issue, and slow call times due to
> waiting on page loads for vantive.. The answer usually ended up being..
> "Hey, it's already slow, so let's add 3 more required page loads that can take
> forever to complete, especially on busy days, thereby slowing things down
> even more..." /endPhoneSupportVent
>
> I seem to be on a "KISAF" kick lately. Keep It Simple And Fast
>
> On Sun, May 13, 2018 at 8:30 AM, R.H. via use-livecode
> <use-livecode@lists.runrev.com> wrote:
>
>> To extract text from a PDF document, I am using a command line tool called
>> Xpdf on Windows. It is also available for Linux-based systems.
>>
>> It was working well, using shell() on LiveCode Community 8x, but tested
>> only in the IDE on Windows.
>>
>> It should work with Linux and Mac as well.
>>
>> If PDFs just contain images where the text is in the image, you need to
>> first run them through a character recognition program.
>> Different tools generate different results when converting image characters
>> in a PDF to embedded text, and I still find that Acrobat from Adobe does the
>> best job.
>>
>> I needed this since some people had sent huge lists of numerical data in
>> PDF which were impossible to extract, and the manual method could have taken
>> weeks. Also, it is helpful for building Document Management Systems where
>> words within associated documents need to be indexed.
>>
>> Converting PDF to .docx formats (Word) usually does not give good results.
>> The resulting documents are quite unclean. Extracting the text also does
>> not necessarily result in meaningful text if the original PDF is not
>> structured with clearly separated paragraphs, headlines, etc., ideally in
>> one top-to-bottom and left-to-right flow. So, a lot of manual work will
>> often be required.
>>
>> Nevertheless, I cannot see PDF losing ground as the standard for
>> many years to come. There are possibly billions of PDF documents around.
>> What should replace it? And people are still printing.
>>
>> Xpdf can generate a pure text file that can be read from LiveCode and
>> processed further.
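A minimal sketch of the workflow described in this thread (run pdftotext over every PDF under a folder, then do a word search on the extracted text). It is written in Python rather than LiveCode, and it is only an illustration, not the code either poster actually used. The function names are hypothetical. It assumes pdftotext is on the PATH, relies on the documented behaviour that "-" as the output file sends the text to stdout, and passes -enc UTF-8 because Xpdf's default output encoding is Latin1.

```python
import re
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Return every .pdf file under root, sorted for stable output."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_command(pdf_path):
    """Build the pdftotext command line; the trailing "-" means stdout."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text(pdf_path):
    """Run pdftotext on one file and return the extracted text."""
    result = subprocess.run(
        build_command(pdf_path), capture_output=True, check=True
    )
    return result.stdout.decode("utf-8", errors="replace")


def contains_word(text, word):
    """Case-insensitive whole-word match on the extracted text."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def search_pdfs(root, word):
    """Return the PDFs under root whose extracted text contains word."""
    return [p for p in find_pdfs(root) if contains_word(extract_text(p), word)]
```

Calling search_pdfs("reports", "premium") would return the matching paths under a (hypothetical) reports folder. Image-only PDFs yield little or no text here, so they would still need a character recognition pass first, as noted above.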
>>
>> *Open Source Xpdf*
>>
>> http://www.xpdfreader.com/download.html
>>
>> https://en.wikipedia.org/wiki/Pdftotext
>>
>> Command line tools in Xpdf
>>
>> The open source Xpdf toolkit also includes several command line tools which
>> perform various functions on PDF files:
>>
>> - *pdftotext*: converts PDF to text
>> - *pdftops*: converts PDF to PostScript
>> - *pdftoppm*: converts PDF pages to netpbm (PPM/PGM/PBM) image files
>> - *pdftopng*: converts PDF pages to PNG image files
>> - *pdftohtml*: converts PDF to HTML
>> - *pdfinfo*: extracts PDF metadata
>> - *pdfimages*: extracts raw images from PDF files
>> - *pdffonts*: lists fonts used in PDF files
>> - *pdfdetach*: extracts attached files from PDF files
>>
>> Cross-platform
>>
>> All of the Xpdf tools are available for Linux, Windows, and Mac.
>>
>> The viewer (xpdf / XpdfReader) uses the Qt toolkit.
>>
>> Roland
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode