Kevin Mark wrote:
On Tue, May 11, 2004 at 01:01:16PM -0400, Matt Price wrote:
thanks for the clues folks. pdftohtml -- which I confess I *did*
already know about, sorry, should have said so -- won't work so well
for me, I don't think; these are scanned-in texts from the jstor
journal collection, and it's important I keep the pages in order...


as, er, someone mentioned earlier (don't have the thread in front of
me at the moment), a complex process involving gimp and pdftops seems
to be the best bet, but it's insanely labour-intensive for long
documents, so I may forgo the whole project. thx all though.

You mentioned something that caught my eye, as it relates to a FOSS need that a friend of mine has: a replacement for the PAPERPORT product, which allows scanning in multipage docs, annotating pages, storing OCR data with the pages and searching the archive, and which has a 'desktop environment app' that can show the virtual folders of documents with document thumbnails. PAPERPORT uses pdf as their new format. Has anyone considered making such an app? There are many law offices that would like this, as well as people who deal with large document repositories.

I don't seem to have the root of this thread any longer.


However, have you looked into using pdfimages to extract the images and then gocr to extract the text from the images? You might want netpbm too if you go that route.
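
In case it helps, here is a rough Python sketch of that pdfimages-then-gocr route. It is untested against your jstor files and makes several assumptions: that pdfimages names its output image-root-NNN with a .pbm, .pgm or .ppm extension, that each scanned page is a single image (so image order matches page order), and that gocr accepts -i for the input image and -o for the output text file. The function names, the "page" image root and the form-feed page separator are my own inventions for illustration.

```python
import re
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf, image_root):
    """Build the pdfimages call: pdfimages PDF-file image-root."""
    return ["pdfimages", str(pdf), str(image_root)]


def gocr_cmd(image, text_out):
    """Build the gocr call that reads one image and writes one text file."""
    return ["gocr", "-i", str(image), "-o", str(text_out)]


def image_index(path):
    """Return the number pdfimages puts just before the extension."""
    match = re.search(r"-(\d+)\.(?:pbm|pgm|ppm)$", Path(path).name)
    if match is None:
        raise ValueError(f"not a pdfimages-style name: {path}")
    return int(match.group(1))


def in_extraction_order(paths):
    """Sort numerically, so that image 1000 comes after image 999."""
    return sorted(paths, key=image_index)


def ocr_pdf(pdf, workdir):
    """Extract the images, OCR each one, and join the text in order."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # "page" is an arbitrary image root chosen for this sketch.
    subprocess.run(pdfimages_cmd(pdf, workdir / "page"), check=True)
    images = in_extraction_order(workdir.glob("page-*.p[bgp]m"))
    texts = []
    for image in images:
        text_out = image.with_suffix(".txt")
        subprocess.run(gocr_cmd(image, text_out), check=True)
        texts.append(text_out.read_text(errors="replace"))
    # A form feed between pages keeps the page boundaries visible.
    return "\f".join(texts)
```

Something like ocr_pdf("article.pdf", "ocr-work") would then return the text of the whole document, where both names are placeholders. If the scans come out in a format or at a contrast gocr handles poorly, a netpbm conversion step could go between the two commands.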

dircha

