This is an automated email from the ASF dual-hosted git repository.
tballison pushed a change to branch junk-detector-v6
in repository https://gitbox.apache.org/repos/asf/tika.git
from eaa72ad066 Merge branch 'main' into junk-detector-v6
new 3efaa019ff v6 mods
new 49eb7b4884 junk-detector: corpus diagnostic tools for v7 sizing
new c9bc39e641 junk-detector: add --min-bigram-count to TrainJunkModel
new a24d53259e checkpoint
new 8c91c28a58 junk-detector: move training choices into JunkDetectorTrainingConfig
new 0e08c2d80a checkpoint
The 6 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
.../apache/tika/ml/junkdetect/JunkDetector.java | 361 +++++-------
.../org/apache/tika/ml/junkdetect/V7Tables.java | 204 +++++++
.../ml/junkdetect/tools/AnalyzeHanByBlock.java | 201 +++++++
.../ml/junkdetect/tools/BoundaryBigramAudit.java | 170 ++++++
.../ml/junkdetect/tools/BuildJunkTrainingData.java | 253 +++++---
.../ml/junkdetect/tools/CountPerScriptBigrams.java | 326 +++++++++++
.../tools/JunkDetectorTrainingConfig.java | 195 +++++++
.../ml/junkdetect/tools/LineScriptFractions.java | 155 +++++
.../tika/ml/junkdetect/tools/ScriptCensus.java | 165 ++++++
.../tika/ml/junkdetect/tools/TrainJunkModel.java | 649 ++++++++++-----------
.../org/apache/tika/ml/junkdetect/junkdetect.bin | Bin 465105 -> 2810396 bytes
.../tika/ml/junkdetect/JunkDetectorSmokeTest.java | 7 -
.../tika/ml/junkdetect/JunkDetectorV6Test.java | 376 ------------
.../tika/ml/junkdetect/JunkDetectorV7Test.java | 351 +++++++++++
.../tools/JunkDetectorTrainingConfigTest.java | 102 ++++
15 files changed, 2498 insertions(+), 1017 deletions(-)
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/V7Tables.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/AnalyzeHanByBlock.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/BoundaryBigramAudit.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/CountPerScriptBigrams.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/JunkDetectorTrainingConfig.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/LineScriptFractions.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/tools/ScriptCensus.java
delete mode 100644 tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkDetectorV6Test.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/JunkDetectorV7Test.java
create mode 100644 tika-ml/tika-ml-junkdetect/src/test/java/org/apache/tika/ml/junkdetect/tools/JunkDetectorTrainingConfigTest.java