[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

Tim Allison (JIRA) Wed, 22 Oct 2014 17:18:07 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180783#comment-14180783
 ]


Tim Allison commented on TIKA-1442:
-----------------------------------

Top10Words: top 10 most frequent tokens
NumTop10EnStopWords: of the top 10 most frequent tokens, how many are English 
stopwords

As above, if NumTop10EnStopWords proves to be of any use, we'll want to add 
stopwords for other languages and calculate the number of stop words _for that 
language_ that are in the top 10 most frequent.
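
To make the two columns concrete, here is a minimal Python sketch of how they could be computed. It is not the code used for the attached spreadsheets. The function names, the `STOPWORDS` table and its tiny word lists are all hypothetical, and a real run would load full stopword lists per language.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword lists keyed by language code.
# A real comparison would load fuller lists for each language.
STOPWORDS = {
    "en": {"the", "of", "and", "a", "to", "in", "is"},
    "de": {"der", "die", "das", "und", "in", "zu", "ist"},
}


def top_n_words(tokens, n=10):
    """Return the n most frequent tokens, most frequent first."""
    return [tok for tok, _ in Counter(tokens).most_common(n)]


def num_top_n_stopwords(tokens, lang, n=10):
    """Count how many of the top-n tokens are stopwords for `lang`."""
    stop = STOPWORDS[lang]
    return sum(1 for tok in top_n_words(tokens, n) if tok in stop)


tokens = "the cat and the dog and the bird sat in the tree".split()
print(top_n_words(tokens))
print(num_top_n_stopwords(tokens, "en"))
print(num_top_n_stopwords(tokens, "de"))
```

Keying the stopword lists by language code means the same function covers the per-language count once a language has been identified for the doc.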

On a side note, I figured out how a pair of docs can have a perfect Dice 
coefficient but differing lang id confidence scores: the Dice coefficient is 
calculated on the tokens produced by Lucene's ICUTokenizer+ICUFoldingFilter, 
whereas the lang id score is calculated on the string itself.  I suspect that 
for those doc pairs with a lower lang id score, there will be more junk that 
was "cleaned" out by the Analyzer.
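
A rough Python illustration of that effect follows. The `analyze` function is only a crude stand-in for ICUTokenizer+ICUFoldingFilter (it strips combining marks, casefolds and keeps word-character runs), and computing Dice over token sets rather than token counts is an assumption. The two sample strings are made up.

```python
import re
import unicodedata


def analyze(text):
    """Crude stand-in for tokenizing plus folding (NOT the Lucene ICU chain):
    decompose, drop combining marks, casefold, keep runs of word characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.findall(r"\w+", stripped.casefold())


def dice(tokens_a, tokens_b):
    """Dice coefficient over token sets: 2|A & B| / (|A| + |B|)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty docs are treated as identical
    return 2 * len(a & b) / (len(a) + len(b))


# Made-up strings: doc_b carries junk that the analysis step discards.
doc_a = "Résumé of the report"
doc_b = "resume ~~~ OF ||| the ### report ***"

print(doc_a == doc_b)                          # the strings differ
print(dice(analyze(doc_a), analyze(doc_b)))    # the token sets do not
```

Anything that scores the strings directly sees the extra symbols in `doc_b`, while the token-based comparison does not.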

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>         Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
