This is an automated email from the ASF dual-hosted git repository.
tballison pushed a change to branch TIKA-4745-more-junk-charset
in repository https://gitbox.apache.org/repos/asf/tika.git
from 6730ade8bd TIKA-4745 -- further efficiency improvements
add 0621874e07 TIKA-4745 -- further efficiency improvements
add bb57967c86 TIKA-4745 -- further efficiency improvements, respond to copilot
No new revisions were added by this update.
Summary of changes:
.../html/StandardCharsets_unsupported_by_IANA.txt  |   0
.../tika/parser/html/HtmlEncodingDetectorTest.java |   0
.../html/StandardHtmlEncodingDetectorTest.java     |   0
.../ml/chardetect/MojibusterEncodingDetector.java  |   3 +-
.../NaiveBayesBigramEncodingDetector.java          |  55 +++++++----
.../apache/tika/ml/junkdetect/BigramTables.java    |   3 +
.../apache/tika/ml/junkdetect/JunkDetector.java    |  80 +++++++++-------
.../ml/junkdetect/JunkFilterEncodingDetector.java  | 104 +++++++++++----------
.../tika/ml/junkdetect/TextQualityFeatures.java    |  52 +++++++++--
.../tika-parser-html-module/pom.xml                |   3 +
10 files changed, 195 insertions(+), 105 deletions(-)
rename {tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module => tika-encoding-detectors/tika-encoding-detector-html}/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt (100%)
rename {tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module => tika-encoding-detectors/tika-encoding-detector-html}/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java (100%)
rename {tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module => tika-encoding-detectors/tika-encoding-detector-html}/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java (100%)