Re: How to convert a XML file of US patent into a plain text file on a Linux platform?

Samuel Thibault Mon, 24 Oct 2022 11:53:51 -0700

Hello,

Henry Chang, on Sun, 23 Oct 2022 20:12:45 -0400, wrote:
> I have successfully converted a PDF file of a US patent into .png, then into .txt
> by using pdftoppm and tesseract.


pdftoppm may re-rasterize the pages. It is better to use pdfimages, which
just extracts the images from the PDF unmodified.
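
As a rough illustration of that pipeline, here is a minimal Python sketch that
runs pdfimages and then tesseract on each extracted image. The function names,
the "page" output prefix and the working directory are made up for the example.
It assumes the poppler-utils and tesseract command-line tools are installed,
and it uses the -png option of pdfimages, which decodes the embedded images to
PNG without rendering the page again.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Build the pdfimages call that writes embedded images as <prefix>-NNN.png."""
    # -png stores the extracted images as PNG files instead of the default PPM/PBM.
    return ["pdfimages", "-png", str(pdf_path), str(prefix)]


def tesseract_command(image_path):
    """Build the tesseract call for one image."""
    # tesseract takes an output base name and appends ".txt" itself.
    output_base = Path(image_path).with_suffix("")
    return ["tesseract", str(image_path), str(output_base)]


def ocr_pdf(pdf_path, workdir):
    """Extract the images of a PDF into workdir and OCR each of them."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    prefix = workdir / "page"  # hypothetical prefix, any name works
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)
    for image in sorted(workdir.glob("page-*.png")):
        subprocess.run(tesseract_command(image), check=True)
```

Calling ocr_pdf("patent.pdf", "out") would then leave one .txt file next to
each extracted image in the out directory.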

> I found that USPTO provides plain text in .xml files.
> 
> From the USPTO website, we downloaded an XML full-text data file, ipg221011.xml.
> This file contains lots of XML files of U.S. patent data. How can I convert this
> .xml file into plain text files of US patents?

That XML file doesn't actually seem to contain the patent text.
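
One way to check what such a file really holds is to split it into its
individual XML documents and dump whatever text each one contains. The Python
sketch below does that with the standard library only. It assumes that every
embedded document begins with its own "<?xml" declaration line, which I have
not verified for this particular file, and the function names are made up for
the example.

```python
import sys
import xml.etree.ElementTree as ET


def split_documents(lines):
    """Yield each embedded XML document as one string.

    Assumption: a line starting with "<?xml" marks the start of a new document.
    """
    current = []
    for line in lines:
        if line.lstrip().startswith("<?xml") and current:
            yield "".join(current)
            current = []
        current.append(line)
    if current:
        yield "".join(current)


def document_text(xml_string):
    """Return all text nodes of one document, with whitespace collapsed."""
    root = ET.fromstring(xml_string)
    return " ".join(" ".join(root.itertext()).split())


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as handle:
        for number, doc in enumerate(split_documents(handle), start=1):
            try:
                text = document_text(doc)
            except ET.ParseError as error:
                # Skip documents the standard parser cannot handle.
                print(f"document {number}: parse error: {error}", file=sys.stderr)
                continue
            with open(f"doc-{number:05d}.txt", "w", encoding="utf-8") as out:
                out.write(text + "\n")
```

Run with the .xml file as its only argument, it writes one doc-NNNNN.txt per
embedded document, which makes it easy to see how much text is actually there.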

Samuel
