[ https://issues.apache.org/jira/browse/TIKA-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721843#comment-16721843 ]
Hudson commented on TIKA-2800:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #365 (See [https://builds.apache.org/job/tika-2.x-windows/365/])
TIKA-2800 -- add num unique alphabetic tokens and num unique common (tallison: rev c7f292b5abb08096f6f4870326a16929cb326a33)
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/AbstractProfiler.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/ExtractProfiler.java
* (edit) tika-eval/src/main/resources/comparison-reports.xml
* (edit) tika-eval/src/main/java/org/apache/tika/eval/db/Cols.java
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenResult.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/SimpleComparerTest.java

> Include num of unique common/alphabetic tokens (types) in tika-eval
> -------------------------------------------------------------------
>
>                 Key: TIKA-2800
>                 URL: https://issues.apache.org/jira/browse/TIKA-2800
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 2.0.0, 1.20
>
>
> We include token and unique token (type) counts in tika-eval. We should
> include type counts for alphabetic and common words. If one tool is
> incorrectly duplicating/triplicating content dramatically, that would
> incorrectly inflate the "common_tokens" sum for that tool.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
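
To illustrate the concern in the issue description, here is a minimal sketch of how a raw common-token sum grows when extracted content is duplicated, while a count of unique common tokens (types) stays flat. This is not the tika-eval implementation: the class name `CommonTokenSketch`, the three-word `COMMON` set, and the whitespace tokenizer are all assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not code from tika-eval.
final class CommonTokenSketch {

    // Stand-in for a per-language common-words list (assumed for this example).
    static final Set<String> COMMON =
            new HashSet<>(Arrays.asList("the", "quick", "fox"));

    // Lowercases and splits on whitespace; a real tokenizer is more involved.
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Sum of common-word occurrences: every repeat is counted again.
    static int commonTokens(String text) {
        int count = 0;
        for (String token : tokenize(text)) {
            if (COMMON.contains(token)) {
                count++;
            }
        }
        return count;
    }

    // Number of distinct common words (types): repeats are counted once.
    static int uniqueCommonTokens(String text) {
        Set<String> seen = new HashSet<>();
        for (String token : tokenize(text)) {
            if (COMMON.contains(token)) {
                seen.add(token);
            }
        }
        return seen.size();
    }
}
```

With the assumed word list, the text "the quick brown fox" gives a common-token sum of 3; the same text repeated three times gives 9, although the number of unique common tokens is 3 in both cases. Comparing the two figures is one way a tool that duplicates content could be spotted.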