[ https://issues.apache.org/jira/browse/TIKA-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17983691#comment-17983691 ]
Tim Allison commented on TIKA-4438:
-----------------------------------

The local changes I made to the EMF parsing made things slightly worse than in the original reports: CHANGE_IN_COMMON_TOKENS_B went from 199724 to 195436. That's still a ~3% gain over what we were getting before, and there are clearly areas for improvement.

Yes, [~tilman], I agree on those files. The challenge is that the font calculations and location info aren't as mature in POI's hemf parser (or in my understanding of how to use it) as they are in PDFBox. Also, a bunch of EMF files I've now seen don't include coordinate information in the text records, so you have to fall back to previous records.

I think we're basically saying the same thing. :D

> Prepare for 3.2.1 release
> -------------------------
>
>                 Key: TIKA-4438
>                 URL: https://issues.apache.org/jira/browse/TIKA-4438
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tika-3.2.1-reports.tgz, tika-3.2.1b.tgz
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)