If you have a JVM lying around, you can extract docx text with Apache Tika.
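
If ExtractText (or anything else) needs to call it as an external command, the Tika route could be wrapped roughly like this. It is only a sketch: it assumes the standalone tika-app jar has been downloaded somewhere, and the jar path and both function names are made up for illustration. The --text option asks tika-app for plain-text output.

```python
import subprocess


def tika_text_command(tika_jar, docx_path):
    """Build the command line for tika-app's plain-text mode.

    tika_jar is the path to a locally downloaded tika-app jar
    (hypothetical location; adjust for your system).
    """
    return ["java", "-jar", tika_jar, "--text", docx_path]


def extract_with_tika(tika_jar, docx_path):
    """Run tika-app and return the extracted text (needs a JVM on PATH)."""
    result = subprocess.run(
        tika_text_command(tika_jar, docx_path),
        capture_output=True,
        text=True,
        check=True,  # raise if java or Tika exits with an error
    )
    return result.stdout
```

Starting a JVM for every attachment is slow, so this probably only makes sense for low mail volumes.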
--
Peter West
p...@ehealth.id.au
“I am the vine; you are the branches.”

> On 7 May 2021, at 2:30 pm, John Hardin <jhar...@impsec.org> wrote:
>
> On Thu, 6 May 2021, Alex wrote:
>
>> Hi,
>>
>> I'm trying to use the latest ExtractText plugin, but the docx2txt
>> program the plugin references is no longer available from
>> http://docx2txt.sourceforge.net
>
>> Do you have any recommendations for an alternative...?
>
> Perhaps one of (from Stack Overflow):
>
>   unzip -p some.docx word/document.xml |\
>     sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
>
>   unzip -p document.docx word/document.xml |\
>     sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
>
>   unzip -p document.docx word/document.xml |\
>     sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
>
> ...though html2text might be better than sed for reliably de-XMLizing the
> document text.
>
> There's also this:
>
>   http://abisource.com/downloads/wv/
>
> There's conflicting information on whether Antiword groks .docx; you may want
> to try it and see. It may be available from your distro, otherwise:
>
>   http://www.winfield.demon.nl/index.html
>
> It might be worthwhile to use native perl utilities to unzip the file,
> extract the document.xml content and pass it through XML::XPath to extract
> the text, but that would probably involve code changes to ExtractText rather
> than just configuring it to use an external utility.
>
> Caveat: I have never looked at the ExtractText plugin.
>
> --
>  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
>  jhar...@impsec.org                         pgpk -a jhar...@impsec.org
>  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
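
For what it's worth, John's last idea above (unzip the file natively, pull out word/document.xml, and walk it with an XML parser instead of sed) can be sketched in a few lines. The thread talks about Perl and XML::XPath; this is the same idea in Python with only the standard library, not anything ExtractText actually does, and the function name docx_to_text is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    `path` may be a filename or a file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t runs
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Unlike the sed one-liners, this leaves the XML entity decoding to the parser. It still ignores tabs, line breaks, headers, footers and footnotes, and text inside nested paragraphs (text boxes) may come out twice, so treat it as a starting point only.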