All,

Again, my apologies for being late, but the results might still be useful for work towards 4.1.1.

http://162.242.228.174/reports/poi-4.1.0-reports.zip

Some tentative observations:

1) There was a new and non-replicable set of problems with the XSSFBParser.
2) The emf/wmf regressions are responsible for the decrease in attachments and common words.
3) It looks like there are spacing/newline problems with the updated emf/wmf code, but that might be on Tika's side.
4) The large increase in common words in ooxml that were formerly tika-ooxml is caused by ZipSalvager. On the Tika side, we're now creating a valid zip from truncated zips and rerunning the parse. So, we used to get the content via the PkgParser, and that content would have gone into "attachments".
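
For anyone unfamiliar with the salvage step mentioned in observation 4, here is a rough sketch of the general idea: stream through a truncated zip, copy every entry that can still be read into a fresh archive, and stop at the first read error. This is not Tika's actual ZipSalvager code. The class name ZipSalvageSketch and the method salvage are made up for illustration, and the sketch uses only java.util.zip from the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration only; not the Tika implementation.
public class ZipSalvageSketch {

    /**
     * Copies every entry that can still be read from a possibly truncated
     * zip stream into a new, well-formed zip and returns its bytes.
     */
    public static byte[] salvage(InputStream truncated) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out);
             ZipInputStream zis = new ZipInputStream(truncated)) {
            byte[] buf = new byte[8192];
            try {
                ZipEntry entry;
                // Streaming read: relies on local file headers only, so a
                // missing central directory does not prevent reading entries.
                while ((entry = zis.getNextEntry()) != null) {
                    zos.putNextEntry(new ZipEntry(entry.getName()));
                    int n;
                    while ((n = zis.read(buf)) != -1) {
                        zos.write(buf, 0, n);
                    }
                    zos.closeEntry();
                }
            } catch (IOException e) {
                // Input ended mid-entry (or was otherwise unreadable): keep
                // what was copied so far. Closing zos finishes any open entry
                // and writes a fresh central directory.
            }
        }
        return out.toByteArray();
    }
}
```

The salvaged bytes can then be handed back to the normal parser, which would explain why the text now shows up as main content rather than under "attachments".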