[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166755#comment-14166755
 ] 

Tim Allison commented on TIKA-1442:
-----------------------------------

This is in response to our discussion on TIKA-1419.

Yes, I agree that we should flag unparseable files in the test set so that we 
don't have to open them manually again and again just to confirm that the 
output is still junk, only different junk. If you send me a junk list, I'll add 
a junk column to my local db for those files and include it in future dumps.  
Once we make this testing public, it would be great to create a UI that lets 
people flag extracted text as "great, let's use this extracted text as a gold 
standard for text/metadata extraction" or flag source docs as unparseable.

In my dev-dev version of the extractor comparison code, I include the top 10 
most frequent words in the doc and a count of how many of those are English 
stop words.  As you suggest, that's a reasonable indicator (if the docs are 
English) that something might have gone wrong.
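
As a rough illustration of that check (not the actual comparison code, which 
is Java and is not shown here), a minimal Python sketch might look like the 
following. The function name `top_word_stats`, the tokenizer regex, and the 
tiny `STOP_WORDS` list are all assumptions made for the example; a real run 
would use a fuller stop-word list.

```python
import re
from collections import Counter

# Hypothetical, deliberately small English stop-word list for illustration.
STOP_WORDS = frozenset({
    "the", "of", "and", "a", "to", "in", "is", "that",
    "it", "for", "on", "with", "as", "was", "be",
})


def top_word_stats(text, n=10):
    """Return the n most frequent words and how many of them are stop words."""
    # Lowercase, then keep runs of letters only (digits and punctuation split words).
    words = re.findall(r"[^\W\d_]+", text.lower())
    top = Counter(words).most_common(n)
    stop_hits = sum(1 for word, _ in top if word in STOP_WORDS)
    return top, stop_hits


if __name__ == "__main__":
    top, stop_hits = top_word_stats(
        "The cat and the dog sat in the garden of the house"
    )
    print(top)
    print("stop words among top words:", stop_hits)
```

For English text, a stop-word count near zero among the top words would be 
the signal to look at the extraction by hand.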

Another thing that would make manual review a whole lot easier would be a UI 
with a word-level diff.
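
To make the idea concrete, here is a small sketch of a word-level diff between 
two extractions of the same document, using Python's standard `difflib`. The 
function name `word_diff` and the `[-...-]` / `{+...+}` markers are choices 
made for this example only; nothing like this exists in the comparison code 
yet.

```python
import difflib


def word_diff(old_text, new_text):
    """Mark words removed from old_text as [-...-] and words added as {+...+}."""
    old_words = old_text.split()
    new_words = new_text.split()
    # Compare word sequences rather than characters; autojunk is disabled so
    # that very frequent words are not treated as junk in long documents.
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(old_words[i1:i2])
            continue
        if i1 < i2:  # words present only in the old extraction
            out.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j1 < j2:  # words present only in the new extraction
            out.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    print(word_diff("the quick brown fox", "the quick red fox"))
```

Whitespace differences are ignored here because both texts are split on 
whitespace first, which is probably what a reviewer wants when comparing two 
extractor versions.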

What other statistics could we use to help guide the manual review?



> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
