[david]
> I want to compare PDF-PDF files and WORD-WORD files.

OK. Well, that's clear enough.

> It seems that the right way is :
> First, extract text from PDF file or Word file.
> Then, use Difflib to compare these text files.

When you say "it seems that the right way is..." I'll assume
that this way meets your requirements. It wouldn't be the right
way if, for example, you wanted to treat different header levels
as different, or to consider embedded graphics as significant etc.

> Would you please give me some more information
> about the external diff tools?

Well, I could mention the names of the ones which I might
use (WinMerge and GNU diff), but I'm sure there are many of
them around the place, and you're far better off doing this:

http://www.google.co.uk/search?q=diff+tools

In case you didn't realise, the "difflib" I referred to is
a Python module from the standard library:

<screendump>
Python 2.4.2 (#67, Sep 28 2005, 12:41:11) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> `difflib`
"<module 'difflib' from 'c:\\python24\\lib\\difflib.pyc'>"
>>>
</screendump>

> There some Python scripts that can extract text
> from PDF or WORD file?

Well, I'm sure there are, but my honest opinion is that,
unless you've got some compelling reason to do this in
Python, you're better off using, say:

+ antiword: http://www.winfield.demon.nl/
+ pdftotext from xpdf: http://www.foolabs.com/xpdf/home.html

If you really wanted to go with Python (for the learning
experience, if nothing else) then the most obvious candidates are:

+ Word: use the pywin32 modules to automate Word and save the
  document as text: http://pywin32.sf.net/

Something like this (assumes doc called c:\temp\test.doc exists):

<code>
import win32com.client

# Start Word via COM; EnsureDispatch also makes the wd* constants available
word = win32com.client.gencache.EnsureDispatch ("Word.Application")
doc = word.Documents.Open (FileName="c:/temp/test.doc")
# Save a plain-text copy of the document
doc.SaveAs (FileName="c:/temp/test2.txt", FileFormat=win32com.client.constants.wdFormatText)
word.Quit ()
del word

# Read the text version back in
text = open ("c:/temp/test2.txt").read ()
print text
</code>

+ PDF: David Boddie's pdftools looks like about the only possibility:
  (ducks as a thousand people jump on him and point out the alternatives)
  http://www.boddie.org.uk/david/Projects/Python/pdftools/

Something like this might do the business. I'm afraid I've no idea
how to determine where the line-breaks are. This was the first time
I'd used pdftools, and the fact that I could do this much is a credit
to its usability!

<code>
from pdftools.pdffile import PDFDocument
from pdftools.pdftext import Text

def contents_to_text (contents):
  # Walk the (possibly nested) contents list, yielding each text fragment
  for item in contents:
    if isinstance (item, list):
      for i in contents_to_text (item):
        yield i
    elif isinstance (item, Text):
      yield item.text

doc = PDFDocument ("c:/temp/test.pdf")
n_pages = doc.count_pages ()
text = []

# Page numbers start at 1
for n_page in range (1, n_pages+1):
  print "Page", n_page
  page = doc.read_page (n_page)
  contents = page.read_contents ().contents
  text.extend (contents_to_text (contents))

# Fragments are joined with no separator, so line-breaks are lost
print "".join (text)
</code>

TJG
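PS For the comparison step itself, here is a minimal sketch of what the difflib part might look like once both documents have been turned into plain text by whichever route you pick. The function name diff_text, the labels old.txt / new.txt and the sample strings are all made up for illustration; only difflib.unified_diff comes from the standard library.

```python
import difflib


def diff_text(old_text, new_text, old_name="old.txt", new_name="new.txt"):
    """Return a unified diff of two strings as a list of lines.

    diff_text and the default labels are hypothetical names used for
    illustration only.
    """
    # unified_diff works on sequences of lines; keep the line endings
    # so the pieces can be joined back together unchanged
    old_lines = old_text.splitlines(True)
    new_lines = new_text.splitlines(True)
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile=old_name, tofile=new_name))


if __name__ == "__main__":
    # Stand-ins for the text extracted from two versions of a document
    old = "first line\nsecond line\nthird line\n"
    new = "first line\n2nd line\nthird line\n"
    print("".join(diff_text(old, new)))
```

In practice you'd read the two extracted text files and pass their contents in. If you'd rather have a side-by-side view in a browser, difflib.HtmlDiff is worth a look as well.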