This is an automated email from the ASF dual-hosted git repository.

tballison pushed a change to branch TIKA-4745-more-junk-charset
in repository https://gitbox.apache.org/repos/asf/tika.git


    from 6730ade8bd TIKA-4745 -- further efficiency improvements
     add 0621874e07 TIKA-4745 -- further efficiency improvements
     add bb57967c86 TIKA-4745 -- further efficiency improvements, respond to copilot

No new revisions were added by this update.

Summary of changes:
 .../html/StandardCharsets_unsupported_by_IANA.txt  |   0
 .../tika/parser/html/HtmlEncodingDetectorTest.java |   0
 .../html/StandardHtmlEncodingDetectorTest.java     |   0
 .../ml/chardetect/MojibusterEncodingDetector.java  |   3 +-
 .../NaiveBayesBigramEncodingDetector.java          |  55 +++++++----
 .../apache/tika/ml/junkdetect/BigramTables.java    |   3 +
 .../apache/tika/ml/junkdetect/JunkDetector.java    |  80 +++++++++-------
 .../ml/junkdetect/JunkFilterEncodingDetector.java  | 104 +++++++++++----------
 .../tika/ml/junkdetect/TextQualityFeatures.java    |  52 +++++++++--
 .../tika-parser-html-module/pom.xml                |   3 +
 10 files changed, 195 insertions(+), 105 deletions(-)
 rename {tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module => tika-encoding-detectors/tika-encoding-detector-html}/src/main/resources/org/apache/tika/parser/html/StandardCharsets_unsupported_by_IANA.txt (100%)
 rename {tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module => tika-encoding-detectors/tika-encoding-detector-html}/src/test/java/org/apache/tika/parser/html/HtmlEncodingDetectorTest.java (100%)
 rename {tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module => tika-encoding-detectors/tika-encoding-detector-html}/src/test/java/org/apache/tika/parser/html/StandardHtmlEncodingDetectorTest.java (100%)
