[ 
https://issues.apache.org/jira/browse/TIKA-4744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084198#comment-18084198
 ] 

ASF GitHub Bot commented on TIKA-4744:
--------------------------------------

tballison opened a new pull request, #2847:
URL: https://github.com/apache/tika/pull/2847

   Previously we split the input into chunks and then selected the best chunk. 
That is less than ideal when the chunks differ in size. This PR drops chunking 
entirely. If we need better performance later, we can replicate our probing 
algorithm from 3.x. For now, we process everything the user sends in, subject 
to a cap that is applied by default.




> Further xhtml fixes in 4.x
> --------------------------
>
>                 Key: TIKA-4744
>                 URL: https://issues.apache.org/jira/browse/TIKA-4744
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> In prep for 4.0.0-beta-1, I ran the full corpus with the strict xhtml 
> validator on. This surfaced a few further areas for improvement.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
