x.pi...@gmail.com writes:

> Hi list.
> I am trying to build a workflow to mine data from PDFs into Org mode.
> I prefer to read in Emacs, since I have fast dictionary lookup in it and
> many other things.
> There are two tools I think are useful for converting PDFs to text:
> cuneiform to extract text, and pdfimages for image extraction.
> Cuneiform is better than the other text extractors I have tried at
> handling two-column PDFs.

PdfEdit seems interesting as well.

http://sourceforge.net/projects/pdfedit
http://www.cs.unb.ca/~bremner/blog/posts/pdf2text/

PS: I have no experience using PdfEdit, so I do not know how it fares with images and captions.

> A PDF is split into pages and each of them is processed separately.
> Using these two programs and some scripting, I believe it is possible to
> convert a PDF into an Org file. However, there are two issues I would like to
> solve.
> 1) Is there any way to extract figure captions from a PDF?
> 2) I have no solution for formulas and Greek letters. The only way to
> handle them would be
> to consult an image of the page.
> Any suggestions? Has anybody tried something similar?
> Thanks.
> Petro.
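
For what it is worth, the per-page pipeline described in the quoted message (split into pages, OCR each page with cuneiform, pull images with pdfimages, write Org) could be scripted roughly as below. This is only a sketch under several assumptions, not a tested workflow: it assumes pdftoppm and pdfimages from poppler-utils and the Linux cuneiform CLI are installed, that the flags shown match your versions (check each tool's --help), and that cuneiform needs a rendered page image rather than the PDF itself. The function names and the page-NNN file naming are made up for illustration. The rendered page image is linked under each heading so that formulas and Greek letters can be checked by eye, as suggested in point 2.

```python
from pathlib import Path
import subprocess


def page_commands(pdf, page, workdir, lang="eng"):
    """Return the argv lists that process one page (flags are assumptions)."""
    stem = Path(workdir) / f"page-{page:03d}"
    p = str(page)
    return [
        # Render the page to one PNG, since cuneiform reads images, not PDFs.
        ["pdftoppm", "-f", p, "-l", p, "-r", "300", "-png", "-singlefile",
         str(pdf), str(stem)],
        # OCR the rendered page to plain text.
        ["cuneiform", "-l", lang, "-f", "text", "-o", f"{stem}.txt", f"{stem}.png"],
        # Extract the images embedded in that page.
        ["pdfimages", "-f", p, "-l", p, "-png", str(pdf), f"{stem}-img"],
    ]


def page_to_org(page, text, images, page_image=None):
    """Format one page as an Org subtree: heading, OCR text, file links."""
    lines = [f"* Page {page}", text.strip()]
    if page_image is not None:
        # Link the full page render for checking formulas and Greek letters.
        lines.append(f"Page image: [[file:{page_image}]]")
    lines += [f"[[file:{img}]]" for img in images]
    return "\n".join(lines) + "\n"


def convert(pdf, pages, workdir):
    """Run the external tools for each page and return the combined Org text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    parts = []
    for page in pages:
        for argv in page_commands(pdf, page, workdir):
            subprocess.run(argv, check=True)
        stem = f"page-{page:03d}"
        text = (workdir / f"{stem}.txt").read_text(encoding="utf-8", errors="replace")
        images = sorted(workdir.glob(f"{stem}-img-*.png"))
        parts.append(page_to_org(page, text, images, workdir / f"{stem}.png"))
    return "\n".join(parts)
```

Something like `Path("paper.org").write_text(convert("paper.pdf", range(1, 11), "out"))` would then produce one Org heading per page. It does nothing about figure captions (point 1); those would still have to be picked out of the OCR text by hand or by some heuristic.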