[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1442:
------------------------------
    Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx

[~tilman], thank you, again, for all of your work on this.

Tika community, if you have a chance, take a look at the attached comparison 
file and recommend other statistics that would be useful for file comparison 
(TIKA-1332) and junk detection (TIKA-1443).

I added the following columns:
language id: language and confidence score
top10words
count of the top 10 words that are stopwords in English (based on Lucene's 
StandardAnalyzer's list). I need to make this language specific: if the 
langid component says "so", we need to count the number of "so" stopwords.
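
To make the stopword column concrete, here is a minimal sketch of a language-keyed count. It is not the code behind the spreadsheet: the tokenizer, the function names (top_words, stopword_count) and the STOPWORDS table are all assumptions, and the English entries are a small illustrative subset rather than Lucene's full list.

```python
import re
from collections import Counter

# Hypothetical per-language stopword table, keyed by the language code the
# langid component returns. The "en" entries are an illustrative subset only.
STOPWORDS = {
    "en": {"a", "an", "and", "in", "is", "it", "of", "the", "to"},
}


def top_words(text, n=10):
    """Return the n most frequent lowercase word tokens in text."""
    tokens = re.findall(r"\w+", text.lower())
    return [word for word, _ in Counter(tokens).most_common(n)]


def stopword_count(text, lang, n=10):
    """Count how many of the top n words are stopwords for lang.

    Returns None when no stopword list is available for that language,
    so a missing list is not confused with a count of zero.
    """
    stop = STOPWORDS.get(lang)
    if stop is None:
        return None
    return sum(1 for word in top_words(text, n) if word in stop)
```

Returning None for an unsupported language keeps "no list available" distinguishable from "no stopwords found", which matters if a low count is later used as a junk signal.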

I renamed some of the column headers.  I finally had a chance to break out 
Manning and Schütze: what I had labeled "token overlap" is actually the Dice 
coefficient.
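
For reference, the Dice coefficient of two sets A and B is 2|A ∩ B| / (|A| + |B|). The sketch below computes it over the sets of distinct tokens in two extracted texts. Whether the spreadsheet uses distinct tokens or token counts is not stated here, so the set-based variant, the tokenizer and the function names are assumptions.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"\w+", text.lower())


def dice(text_a, text_b):
    """Dice coefficient over the distinct tokens of two texts."""
    a, b = set(tokenize(text_a)), set(tokenize(text_b))
    if not a and not b:
        return 1.0  # two empty extractions are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))
```

Under this set-based definition, a score of 1.0 means the two texts share the same vocabulary; token frequencies and token order can still differ.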

I added a vlookup column for [~tilman]'s notes. 

I cannot figure out why I'm getting different lang id confidence scores for a 
given file pair if the Dice Coefficient is 1.0.  I need to look into this.

All a work in progress...

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
