Yes, there is no glossary support, and I don't think templates are supported very well either, if at all. I tried once to read a template and save it as a document to another file, and things didn't go well. I'm sure this just scratches the surface. Of course you are looking at things from an extraction point of view, and I am looking at things from a document creation point of view. The two are likely very different.
On Mon, Dec 12, 2016 at 7:36 PM, Allison, Timothy B. <talli...@mitre.org> wrote:

> This is very helpful, Mark. Thank you. Y, I'd add handling of the
> glossary document, as well.
>
> As I was working on the SAX parser for Tika, it "feels" more robust from
> an extraction standpoint because it is extracting all "w:t",...with a few
> exceptions (deltext, moveFrom, alternatecontent, etc). Still needs more
> work, but it sounds from the list you've compiled that the new parser might
> not be a bad idea...if the sole goal is extraction.
>
> -----Original Message-----
> From: Murphy, Mark [mailto:murphym...@metalexmfg.com]
> Sent: Monday, December 12, 2016 3:56 PM
> To: 'POI Developers List' <dev@poi.apache.org>
> Subject: RE: got docx?
>
> Lol, just from looking through the code, and standard, there are a number
> of things that I know are not handled or not handled properly in XWPF. A
> quick subset from the top of my head includes:
> * Pictures that are not inlined in the main document, header, or footer
>   parts.
> * Sections
> * SDT content
> * Alternate content
> * Many of the shared portions of the spec
> * Tables have problems
> * Versions - This is a tag that gets added to every node telling which
>   save (version) it was created for.
> * Revisions - This is the stuff that tells what was changed and how. Which
>   nodes were inserted, or changed, or deleted, or moved, and when, and by
>   whom.
>
> There are thousands of hours left just to get it to version1 of the spec.
>
> But yes, thanks Dominik for providing this batch of test documents. It
> should help prioritize fixes.
>
> -----Original Message-----
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Monday, December 12, 2016 9:58 AM
> To: POI Developers List <dev@poi.apache.org>
> Cc: d...@tika.apache.org
> Subject: RE: got docx?
>
> To close the loop and share my gratitude publicly...
>
> Thank you, Dominik, for transferring 41k, 5GB of docx/dotx to our
> regression corpus!
>
> I’ve already found a number of “areas for improvement” in Tika's
> experimental docx SAX parser, and a few areas for improvement in POI's
> XWPFDocument/DOM parser…all thanks to your documents and your common crawl
> code.
>
> Thank you!
>
> Cheers,
>
> Tim
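
For readers following the thread, here is a rough sketch of the "extract every w:t, with a few exceptions" idea mentioned in the quoted message. It is not Tika's or POI's code. It is written in Python with the standard-library SAX parser purely for illustration, and the names `TextRunHandler`, `extract_text`, `SKIPPED` and `SAMPLE` are made up for this example. Which elements to skip is also an assumption: here the contents of `w:moveFrom` and `mc:Fallback` are ignored, and `w:delText` is dropped simply because it is not a `w:t` element.

```python
import io
import xml.sax
import xml.sax.handler

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Assumption for this sketch: subtrees whose text should not be extracted.
SKIPPED = {(W_NS, "moveFrom"), (MC_NS, "Fallback")}


class TextRunHandler(xml.sax.handler.ContentHandler):
    """Collects character data found inside w:t elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._in_text = False
        self._skip_depth = 0  # > 0 while inside a skipped subtree

    def startElementNS(self, name, qname, attrs):
        if self._skip_depth or name in SKIPPED:
            self._skip_depth += 1
        elif name == (W_NS, "t"):
            self._in_text = True

    def endElementNS(self, name, qname):
        if self._skip_depth:
            self._skip_depth -= 1
        elif name == (W_NS, "t"):
            self._in_text = False

    def characters(self, content):
        if self._in_text:
            self.chunks.append(content)


def extract_text(xml_bytes):
    """Return the concatenated w:t text of one WordprocessingML part."""
    handler = TextRunHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(handler.chunks)


# Hand-written sample part, not taken from any real document.
SAMPLE = (
    '<w:document xmlns:w="%s" xmlns:mc="%s"><w:body><w:p>'
    '<w:r><w:t>kept </w:t></w:r>'
    '<w:del><w:r><w:delText>deleted </w:delText></w:r></w:del>'
    '<w:moveFrom><w:r><w:t>moved </w:t></w:r></w:moveFrom>'
    '<mc:AlternateContent>'
    '<mc:Choice Requires="wps"><w:r><w:t>choice</w:t></w:r></mc:Choice>'
    '<mc:Fallback><w:r><w:t>fallback</w:t></w:r></mc:Fallback>'
    '</mc:AlternateContent>'
    '</w:p></w:body></w:document>' % (W_NS, MC_NS)
).encode("utf-8")

if __name__ == "__main__":
    print(extract_text(SAMPLE))
```

To try this on a real file, the main part can usually be read with `zipfile.ZipFile(path).read("word/document.xml")` and passed to `extract_text`. Headers, footers, footnotes and the glossary document live in separate parts and would each need the same treatment.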