If you have a JVM lying around, you can extract docx text with Apache Tika.
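
As a rough sketch of what that could look like from a shell: the Tika command-line app (the tika-app jar) prints plain text when given --text. The jar location and the tika_text helper name below are assumptions for illustration, not something ExtractText ships with.

```shell
# Assumed location of the downloaded tika-app jar; override via the environment.
TIKA_JAR="${TIKA_JAR:-/opt/tika/tika-app.jar}"

# Hypothetical helper: print the plain text of one document on stdout.
tika_text() {
  # --text asks tika-app for plain-text output instead of its default XHTML
  java -jar "$TIKA_JAR" --text "$1"
}
```

Usage would be something like `tika_text some.docx`. Note that each call starts a JVM, which may be slow if it runs once per message.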


—
Peter West
p...@ehealth.id.au
“I am the vine; you are the branches.”

> On 7 May 2021, at 2:30 pm, John Hardin <jhar...@impsec.org> wrote:
> 
> On Thu, 6 May 2021, Alex wrote:
> 
>> Hi,
>> 
>> I'm trying to use the latest ExtractText plugin, but the docx2txt
>> program the plugin references is no longer available from
>> http://docx2txt.sourceforge.net
> 
>> Do you have any recommendations for an alternative...?
> 
> Perhaps one of (from Stack Overflow):
> 
> unzip -p some.docx word/document.xml |\
>   sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'
> 
> unzip -p document.docx word/document.xml |\
>   sed -e 's/<\/w:p>/\n/g; s/<[^>]\{1,\}>//g; s/[^[:print:]\n]\{1,\}//g'
> 
> unzip -p document.docx word/document.xml |\
>   sed -e 's/<\/w:p>/ /g; s/<[^>]\{1,\}>/ /g; s/[^[:print:]]\{1,\}/ /g'
> 
> ...though html2text might be better than sed for reliably de-XMLizing the 
> document text.
> 
> There's also this:
> 
>  http://abisource.com/downloads/wv/
> 
> There's conflicting information on whether Antiword groks .docx; you may want 
> to try it and see. It may be available from your distro, otherwise:
> 
>  http://www.winfield.demon.nl/index.html
> 
> It might be worthwhile to use native perl utilities to unzip the file, 
> extract the document.xml content and pass it through XML::XPath to extract 
> the text, but that would probably involve code changes to ExtractText rather 
> than just configuring it to use an external utility.
> 
> Caveat: I have never looked at the ExtractText plugin.
> 
> 
> -- 
> John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
> jhar...@impsec.org                         pgpk -a jhar...@impsec.org
> key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
> -----------------------------------------------------------------------
>  Are you a mildly tech-literate politico horrified by the level of
>  ignorance demonstrated by lawmakers gearing up to regulate online
>  technology they don't even begin to grasp? Cool. Now you have a
>  tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
> -----------------------------------------------------------------------
> 2 days until the 76th anniversary of VE day
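
One gap in the sed pipelines quoted above is that stripping tags leaves XML entities such as &amp; and &lt; in the output. A minimal sketch of a filter that also decodes the five predefined entities follows. It assumes GNU sed (for \n in the replacement, as in the quoted commands), and the function name docx_xml_to_text is made up for illustration.

```shell
# Hypothetical filter: reads word/document.xml on stdin, writes text on stdout.
docx_xml_to_text() {
  # 1. end each paragraph with a newline (GNU sed: \n in the replacement)
  # 2. strip all remaining tags
  # 3. decode predefined entities, with &amp; last to avoid double-decoding
  sed -e 's/<\/w:p>/\n/g' \
      -e 's/<[^>]\{1,\}>//g' \
      -e 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g' \
      -e "s/&apos;/'/g" \
      -e 's/&amp;/\&/g'
}
```

It would slot into the same pipeline, e.g. `unzip -p some.docx word/document.xml | docx_xml_to_text`. Numeric character references (such as &#8217;) are not handled here, so a real XML parser is still the more reliable route.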
