Hello,

Henry Chang, on Sun, 23 Oct 2022 20:12:45 -0400, wrote:
> I have successfully converted a PDF file of a US patent into .png, then into .txt
> by using pdftoppm and tesseract.

pdftoppm may re-rasterize the pages. Better to use pdfimages, which will just take
the images from the PDF unmodified.

> I found that USPTO provides plain text files in .xml files.
>
> From the USPTO website, we downloaded an XML full-text data file, ipg221011.xml.
> This file contains lots of XML files of U.S. patent data. How can I convert this
> .xml file into plain text files of US patents?

That XML file doesn't seem to actually contain the patent text.

Samuel
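
For reference, a minimal sketch of the pdfimages route suggested above, assuming poppler-utils and tesseract are installed and on the PATH. The file name patent.pdf, the pages/ directory, and the helper names are hypothetical. Note that -png converts the extracted images to PNG without re-rendering the page; pdfimages also has -all, which writes each image in its native format.

```python
import subprocess
from pathlib import Path


def pdfimages_cmd(pdf, prefix):
    # -png writes the embedded images as <prefix>-000.png, <prefix>-001.png, ...
    # (the page is not re-rendered, unlike with pdftoppm)
    return ["pdfimages", "-png", str(pdf), str(prefix)]


def tesseract_cmd(image, lang="eng"):
    # tesseract takes an output base name and appends ".txt" itself
    image = Path(image)
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf, workdir="pages"):
    # Hypothetical helper: extract the images, then OCR each one in order.
    out = Path(workdir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_cmd(pdf, out / "page"), check=True)
    for image in sorted(out.glob("page-*.png")):
        subprocess.run(tesseract_cmd(image), check=True)
    return sorted(out.glob("page-*.txt"))
```

Calling ocr_pdf("patent.pdf") would then leave one .txt file per extracted image in pages/. This only works well when each PDF page is a single scanned image, which is an assumption about the input.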