This is an automated email from the ASF dual-hosted git repository.

tballison pushed a change to branch junk-detector-v6
in repository https://gitbox.apache.org/repos/asf/tika.git


    from eaa72ad066 Merge branch 'main' into junk-detector-v6
     new 3efaa019ff v6 mods
     new 49eb7b4884 junk-detector: corpus diagnostic tools for v7 sizing
     new c9bc39e641 junk-detector: add --min-bigram-count to TrainJunkModel
     new a24d53259e checkpoint
     new 8c91c28a58 junk-detector: move training choices into JunkDetectorTrainingConfig
     new 0e08c2d80a checkpoint

The 6 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../apache/tika/ml/junkdetect/JunkDetector.java    | 361 +++++-------
 .../org/apache/tika/ml/junkdetect/V7Tables.java    | 204 +++++++
 .../ml/junkdetect/tools/AnalyzeHanByBlock.java     | 201 +++++++
 .../ml/junkdetect/tools/BoundaryBigramAudit.java   | 170 ++++++
 .../ml/junkdetect/tools/BuildJunkTrainingData.java | 253 +++++---
 .../ml/junkdetect/tools/CountPerScriptBigrams.java | 326 +++++++++++
 .../tools/JunkDetectorTrainingConfig.java          | 195 +++++++
 .../ml/junkdetect/tools/LineScriptFractions.java   | 155 +++++
 .../tika/ml/junkdetect/tools/ScriptCensus.java     | 165 ++++++
 .../tika/ml/junkdetect/tools/TrainJunkModel.java   | 649 ++++++++++-----------
 .../org/apache/tika/ml/junkdetect/junkdetect.bin   | Bin 465105 -> 2810396 bytes
 .../tika/ml/junkdetect/JunkDetectorSmokeTest.java  |   7 -
 .../tika/ml/junkdetect/JunkDetectorV6Test.java     | 376 ------------
 .../tika/ml/junkdetect/JunkDetectorV7Test.java     | 351 +++++++++++
 .../tools/JunkDetectorTrainingConfigTest.java      | 102 ++++
 15 files changed, 2498 insertions(+), 1017 deletions(-)
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/V7Tables.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/AnalyzeHanByBlock.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/BoundaryBigramAudit.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/CountPerScriptBigrams.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/JunkDetectorTrainingConfig.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/LineScriptFractions.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/ScriptCensus.java
 delete mode 100644 tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkDetectorV6Test.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkDetectorV7Test.java
 create mode 100644 tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/tools/JunkDetectorTrainingConfigTest.java
