On Friday, May 1, 2015 at 5:58:50 PM UTC+5:30, subhabrat...@gmail.com wrote:
> Dear Group,
>
> I have several million documents in several folders and subfolders on my
> machine. I tried to write the following script to find all the .doc files
> and convert them to text, but it seems to be taking too much time.
>
> import os
> from fnmatch import fnmatch
> import win32com.client
>
> def listallfiles2(n):
>     root = r'C:\Cand_Res'
>     pattern = "*.doc"
>     list1 = []
>     for path, subdirs, files in os.walk(root):
>         for name in files:
>             if fnmatch(name, pattern):
>                 file_name1 = os.path.join(path, name)
>                 # extract only .doc files, skipping .docx
>                 if file_name1.endswith(".doc"):
>                     try:
>                         doc = win32com.client.GetObject(file_name1)
>                         text = doc.Range().Text
>                         text1 = text.encode('ascii', 'ignore')
>                         text_word = text1.split()
>                         list1.append(text_word)
>                         print "It is a Doc file"
>                     except Exception:
>                         print "DOC ISSUE"
>     return list1
>
> But it seems to be taking too much time to convert each file to text and
> append it to the list. Is there any way I can do this faster? I am using
> Python 2.7 on Windows 7 Professional Edition. Apologies for any
> indentation errors.
>
> If anyone can kindly suggest a solution.
>
> Regards,
> Subhabrata Banerjee.
Thanks. You are right, it is the conversions that are taking the time; I will certainly look into that. The rest of it works fine.

Regards,
Subhabrata Banerjee.
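For reference, here is a rough, untested sketch of the change I plan to try: start a single Word instance through win32com.client.Dispatch and open each document in it, rather than creating a new COM object per file with GetObject. The folder path, the *.doc filter and the ASCII-only handling are carried over from the script above; the rest is an assumption about where the time is going (Word and pywin32 must of course be installed).

import os
from fnmatch import fnmatch
import win32com.client

def extract_doc_texts(root=r'C:\Cand_Res', pattern="*.doc"):
    texts = []
    # Start Word once and reuse it for every document.
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    try:
        for path, subdirs, files in os.walk(root):
            for name in files:
                if fnmatch(name, pattern) and not name.endswith(".docx"):
                    file_name = os.path.join(path, name)
                    try:
                        doc = word.Documents.Open(file_name)
                        text = doc.Range().Text.encode('ascii', 'ignore')
                        texts.append(text.split())
                        doc.Close(False)   # close without saving
                    except Exception:
                        print "DOC ISSUE:", file_name
    finally:
        word.Quit()   # shut Word down once, at the end
    return texts

If that is still too slow, splitting the folder list across a few worker processes would be the next thing to try, since the conversion itself appears to be the bottleneck rather than the directory walk.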