Re: edit pdf's

dircha Tue, 11 May 2004 21:07:34 -0700

Kevin Mark wrote:

On Tue, May 11, 2004 at 01:01:16PM -0400, Matt Price wrote:
thanks for the clues folks. pdftohtml -- which I confess I *did* already know about, sorry, should have said so -- won't work so well for me, I don't think; these are scanned-in texts from the JSTOR journal collection, and it's important I keep the pages in order...

as, er, someone mentioned earlier (don't have the thread in front of me at the moment), a complex process involving gimp and pdftops seems to be the best bet, but it's insanely labour-intensive for long documents, so I may forgo the whole project. thx all though.
You mentioned something that caught my eye, as it relates to a need in
FOSS that a friend of mine has: a replacement for the
PAPERPORT product that allows scanning in multipage docs, with the
ability to annotate pages, store OCR data with pages, and search the
archive, as well as a 'desktop environment app' that can show
virtual folders of documents with document thumbnails. PAPERPORT uses PDF
as its new format. Has anyone considered making such an app? There
are many law offices that would like this, as well as people who deal
with large document repositories.

I don't seem to have the root of this thread any longer.

However, have you looked into using pdfimages to extract the images and then gocr to extract the text from the images? You might want netpbm too if you go that route.
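
For what it's worth, here is a rough sketch of that route in Python, in case it saves some typing. It assumes pdfimages and gocr are installed and on the PATH, that each scanned page is stored as a single image in the PDF, and that pdfimages writes files named like page-000.pbm or page-000.ppm. The function names, the "page" prefix, and the form-feed page separator are my own choices for illustration, not anything those tools require. The images are sorted by their numeric suffix so the pages stay in order.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf, image_root):
    """Command that extracts embedded images as <image_root>-NNN.pbm/.ppm."""
    return ["pdfimages", str(pdf), str(image_root)]


def gocr_cmd(image, text_out):
    """Command that runs gocr on one PNM image and writes text to text_out."""
    return ["gocr", "-i", str(image), "-o", str(text_out)]


def page_images(workdir, prefix):
    """Return extracted images sorted by the number after the last hyphen."""
    workdir = Path(workdir)
    found = list(workdir.glob(prefix + "-*.pbm"))
    found += list(workdir.glob(prefix + "-*.ppm"))
    return sorted(found, key=lambda p: int(p.stem.rsplit("-", 1)[1]))


def ocr_pdf(pdf, workdir, prefix="page"):
    """Extract images from pdf, OCR each one, and join the text in page order."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_cmd(pdf, workdir / prefix), check=True)
    texts = []
    for image in page_images(workdir, prefix):
        out = image.with_suffix(".txt")
        subprocess.run(gocr_cmd(image, out), check=True)
        texts.append(out.read_text(errors="replace"))
    # Form feed between pages is an arbitrary separator chosen for this sketch.
    return "\n\f\n".join(texts)
```

Calling ocr_pdf("article.pdf", "work") would leave the images and one .txt file per image in work/ and return the combined text. If the scans need cleaning up before gocr sees them, that is where the netpbm tools would slot in, between the extraction and the OCR step.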

dircha
