Hi Chris, thanks for the fast reply and all the recommendations; they helped me a lot! As you suggested, I used the PDFMiner module to extract the text from the PDF files, and then with file.xreadlines() I located the lines where my keyword ("factors" in this case) appears. So far I extract just the lines, but I'm wondering whether it's possible to extract only the whole sentences in which my keyword ("factors" in this case) occurs.
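For the sentence part, something like this rough sketch is what I have in mind (untested; it splits on '.', '!' or '?' followed by whitespace, which is naive about abbreviations like "e.g.", and sentences_with_keyword is just a name I made up):

import re

def sentences_with_keyword(text, keyword):
    # naive sentence split: a '.', '!' or '?' followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if keyword in s]

# e.g.:
# for s in sentences_with_keyword(open("article.txt").read(), "factors"):
#     print s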
Up to now I used the following script:

import os
import subprocess

path = "C:\\PDF"  # insert the path to the directory of interest here
dirList = os.listdir(path)
for fname in dirList:
    # os.path.splitext avoids the rstrip(".pdf") trap, which would also
    # strip trailing "p", "d" and "f" letters from the file name itself
    output = os.path.splitext(fname)[0] + ".txt"
    subprocess.call(["C:\\Python26\\python.exe", "pdf2txt.py",
                     "-o", output, fname])
    print fname
    txt = open(output)  # renamed so the built-in file() is not shadowed
    for line in txt.xreadlines():
        if "driving" in line:
            print line

-------------------------------------------------------
Robert Pazur
Mobile: +421 948 001 705
Skype: ruegdeg

2011/5/6 Chris Rebert <c...@rebertia.com>:
> On Thu, May 5, 2011 at 2:26 PM, Robert Pazur <pazurrob...@gmail.com> wrote:
> > Dear all,
> > I would like to access some text and count the occurrences as follows:
> > I have got a lot of PDFs with scientific articles, and I want to preview
> > which words are usually related with, for example, "determinants".
> > As an example, an article contains the sentence "...elevation is the
> > most important determinant...".
> > How can I acquire the "elevation" string?
> > Of course I don't know where the sentence in the article is located or
> > which particular word could be there.
> > Any suggestions?
>
> Extract the text using PDFMiner[1], pyPdf[2], or PageCatcher[3]. Then
> use something similar to n-grams on the extracted text, filtering out
> those that don't contain "determinant(s)". Then just keep a word
> frequency table for the remaining n-grams.
>
> Not-quite-pseudo-code:
>
> from collections import defaultdict, deque
>
> N = 7  # length of n-grams to consider; tune as needed
> buf = deque(maxlen=N)
> targets = frozenset(("determinant", "determinants"))
> steps_until_gone = 0
> word2freq = defaultdict(int)
> for word in words_from_pdf:
>     if word in targets:
>         steps_until_gone = N
>     buf.append(word)
>     if steps_until_gone:
>         for related_word in buf:
>             if related_word not in targets:
>                 word2freq[related_word] += 1
>         steps_until_gone -= 1
> for count, word in sorted((v, k) for k, v in word2freq.iteritems()):
>     print(word, ':', count)
>
> Making this more efficient and less naive is left as an exercise to
> the reader. There may very well already be something similar but more
> sophisticated in NLTK[4]; I've never used it, so I dunno.
>
> [1]: http://www.unixuser.org/~euske/python/pdfminer/index.html
> [2]: http://pybrary.net/pyPdf/
> [3]: http://www.reportlab.com/software/#pagecatcher
> [4]: http://www.nltk.org/
>
> Cheers,
> Chris
> --
> http://rebertia.com
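P.S. Following your pointer to NLTK, here is a sentence-level variant of your frequency idea that I sketched (untested; it assumes NLTK is installed with its "punkt" tokenizer data downloaded, and "article.txt" is just a placeholder file name):

import nltk
from collections import defaultdict

targets = set(["determinant", "determinants"])
word2freq = defaultdict(int)

text = open("article.txt").read()
for sentence in nltk.sent_tokenize(text):
    words = [w.lower() for w in nltk.word_tokenize(sentence)]
    if targets & set(words):  # the sentence mentions a target word
        for w in words:
            if w.isalpha() and w not in targets:
                word2freq[w] += 1

# print co-occurring words, most frequent first
for word, count in sorted(word2freq.items(), key=lambda kv: -kv[1]):
    print word, ':', count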