Re: PDF

Bob Sneidar via use-livecode Mon, 14 May 2018 08:23:01 -0700

Document Management systems use PDFs almost exclusively. I think PDF is here to 
stay.


Bob S


> On May 13, 2018, at 08:05 , Mike Bonner via use-livecode 
> <use-livecode@lists.runrev.com> wrote:
> 
> I ended up using pdftotext, it worked like a charm. (Though I had to look
> up how to send it a file list using find..  Too long away from the shell.)
> I now have a little app that can do a word search for matching files and
> show either the extracted text, or the original pdf using the browser
> widget.
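
For anyone wanting to try the same thing outside LiveCode, here is a minimal Python sketch of the workflow described above. It is not the actual app, just one way it could look: walk a folder for PDFs (standing in for find), run pdftotext on each one, then search the extracted text for a word. The function names find_pdfs, extract_all and files_matching are made up for illustration, and the sketch assumes pdftotext is installed and on the PATH.

```python
import shutil
import subprocess
from pathlib import Path


def find_pdfs(root):
    """Return every .pdf file under root (any letter case), sorted."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_all(root):
    """Run pdftotext on each PDF, writing name.txt next to name.pdf."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    for pdf in find_pdfs(root):
        # Basic invocation: pdftotext <input.pdf> <output.txt>
        subprocess.run(
            ["pdftotext", str(pdf), str(pdf.with_suffix(".txt"))],
            check=True,
        )


def files_matching(root, word):
    """Return extracted .txt files containing word (case-insensitive)."""
    needle = word.lower()
    hits = []
    for txt in sorted(Path(root).rglob("*.txt")):
        # errors="replace" because the output encoding depends on the build
        text = txt.read_text(encoding="utf-8", errors="replace")
        if needle in text.lower():
            hits.append(txt)
    return hits
```

Calling extract_all("reports") once and then files_matching("reports", "invoice") would list the text files to show; the matching PDF is the same path with a .pdf suffix. The folder name and search word here are placeholders.
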
> 
> As far as being on the "make pdfs go away" bandwagon, yes I am.
> Unfortunately, they're still used all over the place.  Insurance companies
> autogenerate a huge amount of pdf reports, some of them built live through
> horribly slow, clunky, awful websites (insert a bunch of other words here to
> describe how NOT enjoyable they are to use) that eventually, after dragging
> you through a huge number of different screens to get to the end result,
> hand you a pdf.    /endInsuranceWebsiteVent
> 
> Reminds me of when I worked as phone support for a "large computer
> manufacturer".. When there was a workflow issue, and slow call times due to
> waiting on page loads for vantive.. The answer usually ended up being..
> "Hey, it's already slow, so let's add 3 more required page loads that can take
> forever to complete especially on busy days, thereby slowing things down
> even more..."  /endPhoneSupportVent
> 
> I seem to be on a "KISAF" kick lately.  Keep It Simple And Fast
> 
> On Sun, May 13, 2018 at 8:30 AM, R.H. via use-livecode <
> use-livecode@lists.runrev.com> wrote:
> 
>> To extract text from a PDF document, I am using a command line tool called
>> Xpdf on Windows; it is also available for Linux-based systems.
>> 
>> It worked well using shell() on LiveCode Community 8.x, though I tested it
>> only in the IDE on Windows.
>> 
>> It should work with Linux and Mac as well.
>> 
>> If PDFs just contain images where the text is in the image, you first need
>> to run them through a character recognition program. Different tools
>> generate different results when converting image characters in PDF to
>> embedded text, and I still find that Acrobat from Adobe does the best job.
>> 
>> I needed this since some people had sent huge lists of numerical data in
>> PDF which were impossible to extract, and the manual method could have taken
>> weeks. Also, it is helpful for building Document Management Systems where
>> words within associated documents need to be indexed.
>> 
>> Converting PDF to .docx formats (Word) usually does not give good results.
>> The resulting documents are quite unclean. Extracting the text also does
>> not necessarily result in a meaningful text if the original PDF is not
>> structured with clearly separated paragraphs, headlines, etc. ideally in
>> one top-to-bottom and left-to-right flow. So, a lot of manual work will
>> often be required.
>> 
>> Nevertheless, I cannot see PDF losing ground as the standard for many years
>> to come. There are possibly billions of PDF documents around. What should
>> replace it? And people are still printing.
>> 
>> Xpdf can generate a pure text file that can be read from LiveCode and
>> processed further.
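
As a rough illustration of that "text out, process further" step, here is a small Python sketch rather than LiveCode. It asks pdftotext to write to stdout by passing "-" as the output file, then splits the result into pages on the form feed character that pdftotext puts at each page break. The -layout option is optional; it is used here on the assumption that column alignment matters (for example, lists of numbers). The names pdf_to_text and split_pages are hypothetical helpers, not part of Xpdf.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Return the text of a PDF, using "-" so pdftotext writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    # The default output encoding differs between builds, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")


def split_pages(text):
    """Split pdftotext output into pages on the form feed between pages."""
    pages = text.split("\f")
    # Drop the empty piece left behind by a trailing form feed.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages
```

From LiveCode the equivalent call would go through shell(), as described earlier in the thread; the page-splitting idea carries over unchanged.
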
>> 
>> *Open Source Xpdf*
>> 
>> http://www.xpdfreader.com/download.html
>> 
>> https://en.wikipedia.org/wiki/Pdftotext
>> 
>> Command line tools in Xpdf
>> 
>> The open source Xpdf toolkit also includes several command line tools which
>> perform various functions on PDF files:
>> 
>>   - *pdftotext*: converts PDF to text
>>   - *pdftops*: converts PDF to PostScript
>>   - *pdftoppm*: converts PDF pages to netpbm (PPM/PGM/PBM) image files
>>   - *pdftopng*: converts PDF pages to PNG image files
>>   - *pdftohtml*: converts PDF to HTML
>>   - *pdfinfo*: extracts PDF metadata
>>   - *pdfimages*: extracts raw images from PDF files
>>   - *pdffonts*: lists fonts used in PDF files
>>   - *pdfdetach*: extracts attached files from PDF files
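
The other tools in the list can be driven the same way as pdftotext. As one hedged example, pdfinfo prints its metadata as "Key: value" lines, so a few lines of Python are enough to turn that into a dictionary. The helper names parse_pdfinfo and pdf_info are invented for this sketch, and which keys appear depends on the PDF and on the pdfinfo build.

```python
import subprocess


def parse_pdfinfo(output):
    """Turn pdfinfo's "Key: value" lines into a dict, splitting on the first colon."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_info(pdf_path):
    """Run pdfinfo on one file and return its metadata as a dict."""
    result = subprocess.run(
        ["pdfinfo", str(pdf_path)],
        capture_output=True,
        check=True,
    )
    return parse_pdfinfo(result.stdout.decode("utf-8", errors="replace"))
```

Splitting on the first colon only keeps values that themselves contain colons, such as dates with times, in one piece.
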
>> 
>> Cross-platform
>> 
>> All of the Xpdf tools are available for Linux, Windows, and Mac.
>> 
>> The viewer (xpdf / XpdfReader) uses the Qt toolkit.
>> 
>> Roland
>> _______________________________________________
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>> 


