[ 
https://issues.apache.org/jira/browse/TIKA-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155789#comment-17155789
 ] 

Clark Perkins commented on TIKA-3131:
-------------------------------------

I'm pretty sure this was just an oversight when copying defaults from PDFBox, 
so I went ahead and opened a PR to fix them.

> PDFParserConfig default values were accidentally swapped
> --------------------------------------------------------
>
>                 Key: TIKA-3131
>                 URL: https://issues.apache.org/jira/browse/TIKA-3131
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: Clark Perkins
>            Priority: Major
>
> When default values were added for averageCharTolerance and spacingTolerance 
> as a part of TIKA-3091, their values appear to have been inadvertently 
> swapped.
> From PDFBox:
> {noformat}
>     private float spacingTolerance = .5f;
>     private float averageCharTolerance = .3f;
> {noformat}
> From Tika 1.24.1:
> {noformat}
>     //The character width-based tolerance value used to estimate where spaces in text should be added
>     //Default taken from PDFBox.
>     private Float averageCharTolerance = 0.5f;
>     //The space width-based tolerance value used to estimate where spaces in text should be added
>     //Default taken from PDFBox.
>     private Float spacingTolerance = 0.3f;
> {noformat}
> This effective change in defaults has caused PDFParser to start adding more 
> spaces than it did in 1.24 and earlier.
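
To illustrate why swapping the two values would plausibly produce extra
spaces, here is a minimal sketch. It assumes a simplified rule loosely modeled
on the PDFBox text stripper: a space is inserted when the gap between two
glyphs exceeds the smaller of (space width * spacingTolerance) and (average
character width * averageCharTolerance). The class name, method name, and
glyph metrics are hypothetical, and this is not the actual PDFBox or Tika code.

```java
// Simplified sketch, NOT the PDFBox source: a space is inserted when the gap
// between two glyphs exceeds the smaller of two width-based thresholds.
final class SpacingSketch {

    // All names and parameters here are illustrative assumptions.
    static boolean insertsSpace(float gap, float spaceWidth, float avgCharWidth,
                                float spacingTolerance, float averageCharTolerance) {
        float deltaSpace = spaceWidth * spacingTolerance;           // space width-based threshold
        float deltaCharWidth = avgCharWidth * averageCharTolerance; // character width-based threshold
        return gap > Math.min(deltaSpace, deltaCharWidth);
    }

    public static void main(String[] args) {
        // Made-up glyph metrics: a space glyph half as wide as the average character.
        float gap = 1.0f, spaceWidth = 2.5f, avgCharWidth = 5.0f;

        // PDFBox defaults: spacingTolerance = .5f, averageCharTolerance = .3f
        System.out.println("PDFBox defaults insert a space: "
                + insertsSpace(gap, spaceWidth, avgCharWidth, 0.5f, 0.3f));

        // Tika 1.24.1 defaults: spacingTolerance = 0.3f, averageCharTolerance = 0.5f
        System.out.println("Tika 1.24.1 defaults insert a space: "
                + insertsSpace(gap, spaceWidth, avgCharWidth, 0.3f, 0.5f));
    }
}
```

With these made-up metrics the effective threshold drops from 1.25 to roughly
0.75 once the values are swapped, so a gap of 1.0 that was previously ignored
is now treated as a word break. Real fonts will differ, but whenever the space
glyph is narrower than the average character, the swap lowers the threshold in
the same direction.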



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
