Hi list.
I am trying to set up a workflow for mining data from PDFs into Org mode.
I prefer to read in Emacs, since it gives me fast dictionary lookup and
many other things.
There are two tools I think are useful for converting PDFs to text:
cuneiform to extract the text, and pdfimages to extract the images.
Cuneiform handles two-column PDFs better than the other text extractors
I have tried.
The PDF is split into pages, and each page is processed separately.
Using these two programs and some scripting, I believe it is possible to
convert a PDF into an Org file. However, there are two issues I would like to
solve.
1) Is there any way to extract figure captions from a PDF?
2) I have no solution for formulas and Greek letters. The only way to handle them
would be to consult an image of the page.
Any suggestions? Has anybody tried something similar?
Thanks.
Petro.