On Saturday, May 2, 2015 at 2:52:32 PM UTC+5:30, Peter Otten wrote:
> wrote:
>
> > I have several millions of documents in several folders and subfolders
> > on my machine. I tried to write a script as follows, to extract all the
> > .doc files and to convert them to text, but it seems to be taking too
> > much time.
> >
> > import os
> > from fnmatch import fnmatch
> > import win32com.client
> > import zipfile, re
> >
> > def listallfiles2(n):
> >     root = 'C:\Cand_Res'
> >     pattern = "*.doc"
> >     list1 = []
> >     for path, subdirs, files in os.walk(root):
> >         for name in files:
> >             if fnmatch(name, pattern):
> >                 file_name1 = os.path.join(path, name)
> >                 if ".doc" in file_name1:
> >                     # EXTRACTING ONLY .DOC FILES
> >                     if ".docx" not in file_name1:
> >                         #print "It is A Doc file$$:", file_name1
> >                         try:
> >                             doc = win32com.client.GetObject(file_name1)
> >                             text = doc.Range().Text
> >                             text1 = text.encode('ascii', 'ignore')
> >                             text_word = text1.split()
> >                             #print "Text for Document File Is:", text1
> >                             list1.append(text_word)
> >                             print "It is a Doc file"
> >                         except:
> >                             print "DOC ISSUE"
> >
> > But it seems it is taking too much time to convert to text and to append
> > to the list. Is there any way I may do it faster? I am using Python 2.7
> > on Windows 7 Professional Edition. Apologies for any indentation error.
> >
> > If anyone may kindly suggest a solution.
>
> It will not help the first time through your documents, but if you write
> the words for the Word documents to one .txt file per .doc, and the
> original files rarely change, you can read from the .txt files when you
> run your script a second time. Just make sure that the .txt is newer than
> the corresponding .doc by checking the file time.
>
> In short: use a caching strategy.
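The caching strategy Peter describes — skip the slow Word extraction whenever a sidecar .txt is newer than its .doc — could be sketched like this. The helper names `is_cache_fresh` and `cached_text` are my own, not from the original script, and the extraction step is passed in as a function so the win32com code can be plugged in unchanged:

```python
import os

def is_cache_fresh(doc_path, txt_path):
    """True when a cached .txt exists and is at least as new as the .doc,
    so the expensive Word COM extraction can be skipped."""
    if not os.path.exists(txt_path):
        return False
    return os.path.getmtime(txt_path) >= os.path.getmtime(doc_path)

def cached_text(doc_path, extract):
    """Return the document's text, reading the sidecar .txt when it is
    fresh and calling extract(doc_path) (e.g. the win32com code) otherwise.
    Freshly extracted text is written back so the next run hits the cache."""
    txt_path = os.path.splitext(doc_path)[0] + '.txt'
    if is_cache_fresh(doc_path, txt_path):
        with open(txt_path) as f:
            return f.read()
    text = extract(doc_path)
    with open(txt_path, 'w') as f:
        f.write(text)
    return text
```

On the second pass over millions of unchanged documents this turns each conversion into a single stat call plus a plain-text read, which is where the speedup comes from.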
Thanks Peter. I'll surely check on that.

Regards,
Subhabrata Banerjee.
--
https://mail.python.org/mailman/listinfo/python-list