[
https://issues.apache.org/jira/browse/TIKA-4745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18086381#comment-18086381
]
ASF GitHub Bot commented on TIKA-4745:
--------------------------------------
Copilot commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363223702
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
* once when scanning the text (avoiding a redundant binary search per
* codepoint).
*/
+ /** Small per-bigram log-prob penalty subtracted from the case-folded
+ * (lowercase) value when scoring an uppercase pair. All-caps is a genuinely
+ * weaker/rarer signal than lowercase, so it should score a hair BELOW its
+ * lowercase form, not equal to it — and the margin guards the edge case where
+ * an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+ * bigram would otherwise score like real lowercase text. Kept small (0.25):
+ * the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+ * 0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+ * "somewhat less languagey" principle. */
+ private static final double CASE_FOLD_PENALTY = 0.25;
+
private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
BigramTables tables) {
+ double direct = Double.NaN;
if (idxA >= 0 && idxB >= 0) {
int slot = lookupBigramSlot(tables, idxA, idxB);
if (slot >= 0) {
- return dequantize(tables.bigramValues[slot],
+ direct = dequantize(tables.bigramValues[slot],
tables.bigramQuantMin, tables.bigramQuantMax);
}
}
+ // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+ // a SEEN lowercase pair is real text wearing a different case (all-caps
+ // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+ // Score it as the BETTER of its own log-prob and its lowercase twin's —
+ // i.e. max(direct, fold). max (not fold-only-on-miss) is essential: real
+ // all-caps bigrams ARE present in training (from headings) but rare, so the
+ // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+ // otherwise bypass the fold and floor. This is the discriminator raw
+ // probability cannot be: all-caps real text and all-caps mojibake are both
+ // improbable, but only real text has a SEEN lowercase twin. Gated on BOTH
+ // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+ // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+ // is borrowed when that pair is actually seen, so all-caps mojibake
+ // (lowercase form also unseen) floors.
+ // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+ // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+ // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+ // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+ // still floors). Each uppercase letter is folded; sentinels/digits/glue
+ // pass through unchanged. Folding the edges too is what fully rescues short
+ // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+ // uppercase-letter unigram backoff.
+ if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+ && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+ int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+ int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;
Review Comment:
`JunkDetector` uses TOKEN_START/TOKEN_END sentinels with values above
`Character.MAX_CODE_POINT` (0x10FFFF). In `scorePairF1`, calling
`Character.isUpperCase/isLowerCase/toLowerCase` on these sentinel values can
throw `IllegalArgumentException` on some JDKs, which would crash scoring for
any text (sentinels are emitted for every run). Guard the `Character.*` calls
with `Character.isValidCodePoint(...)` so the sentinels are treated as
non-letters while still allowing edge bigram folding when the *other* side is
uppercase.
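
A minimal sketch of the guard suggested in this comment, not the actual change in the PR. The class name `CaseFoldGate`, the helper names, and the sentinel values are hypothetical; the only assumption taken from the comment is that the sentinels lie above `Character.MAX_CODE_POINT`. Each `Character.*` call is made only for valid code points, so a sentinel is never passed to them and simply counts as a non-letter, while an edge pair still folds when the other side is uppercase.

```java
final class CaseFoldGate {
    // Hypothetical sentinel values; assumed only to lie above Character.MAX_CODE_POINT.
    static final int TOKEN_START = Character.MAX_CODE_POINT + 1;
    static final int TOKEN_END = Character.MAX_CODE_POINT + 2;

    // True only for valid code points that are uppercase letters.
    static boolean isUpper(int cp) {
        return Character.isValidCodePoint(cp) && Character.isUpperCase(cp);
    }

    // True only for valid code points that are lowercase letters.
    static boolean isLower(int cp) {
        return Character.isValidCodePoint(cp) && Character.isLowerCase(cp);
    }

    // Folds uppercase letters; sentinels, digits and glue pass through unchanged.
    static int fold(int cp) {
        return isUpper(cp) ? Character.toLowerCase(cp) : cp;
    }

    // Gate: at least one uppercase letter and no lowercase letter in the pair.
    static boolean shouldFold(int cpA, int cpB) {
        return (isUpper(cpA) || isUpper(cpB)) && !isLower(cpA) && !isLower(cpB);
    }
}
```

With this shape, `shouldFold(TOKEN_START, 0x041C)` (a sentinel followed by Cyrillic "М") passes the gate, and a mixed-case pair such as `('a', 'B')` does not.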
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java:
##########
@@ -286,6 +286,41 @@ public void testHtml5Charset() throws Exception {
assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
}
+ /**
+ * A page that declares {@code charset=euc-kr} but actually uses UHC (MS949)
+ * extension Hangul must be decoded with the superset {@code x-windows-949},
+ * not strict EUC-KR (which U+FFFDs the extension syllables). Mirrors the
+ * promotion {@code AutoDetectReader} already applies for non-HTML.
+ * {@code CONTENT_ENCODING} still reports the detected charset; the superset
+ * actually used is recorded in {@code DECODED_CHARSET}.
+ *
+ * @see org.apache.tika.detect.CharsetSupersets
+ */
+ @Test
+ public void testEucKrPromotedToMs949Superset() throws Exception {
+ // U+AC02 is outside EUC-KR (KS X 1001) but inside x-windows-949 (MS949);
+ // its MS949 bytes 0x81 0x41 decode to U+FFFD followed by 'A' under strict
+ // EUC-KR, so a correct decode proves the superset was used. U+D55C U+AD6D
+ // ("Korea") is a normal EUC-KR syllable pair.
+ String test = "<html><head><meta charset=\"euc-kr\" />" +
+ "<title>title</title></head>" +
+ "<body><p>\uAC02 \uD55C\uAD6D</p></body></html>";
+ Metadata metadata = new Metadata();
+ BodyContentHandler handler = new BodyContentHandler();
+ try (TikaInputStream tis = TikaInputStream.get(test.getBytes("x-windows-949"))) {
+ new JSoupParser().parse(tis, handler, metadata, new ParseContext());
+ }
+ // Metadata reports the *detected* charset ...
+ assertEquals("EUC-KR", metadata.get(Metadata.CONTENT_ENCODING));
+ // ... but decoding used the superset, recorded in DECODED_CHARSET.
+ assertEquals("x-windows-949", metadata.get(TikaCoreProperties.DECODED_CHARSET));
Review Comment:
The assertion hard-codes `"x-windows-949"`, but `Charset.name()` returns the
JVM's canonical name, which can vary (e.g., `windows-949` vs `x-windows-949`).
Since `JSoupParser` stores `decodeAs.name()`, the test should compare against
the runtime canonical name for the charset to avoid platform-dependent failures.
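
One way to follow this suggestion, sketched under assumptions: the helper class `CanonicalCharsetName` below is hypothetical and not part of Tika. It resolves a charset name or alias through `java.nio.charset.Charset.forName` and returns whatever canonical name the running JVM reports, so the expected value in a test is derived at runtime instead of being hard-coded.

```java
import java.nio.charset.Charset;

final class CanonicalCharsetName {
    // Resolves a charset name or alias to the canonical name reported by this JVM.
    // Throws UnsupportedCharsetException if the JVM does not provide the charset.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    // In the test above, the assertion could then read (a sketch, not the PR's code):
    //   assertEquals(CanonicalCharsetName.canonicalName("x-windows-949"),
    //           metadata.get(TikaCoreProperties.DECODED_CHARSET));
}
```

This keeps the test's intent (the superset charset was used for decoding) while leaving the exact spelling of the name to the JVM.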
> Small improvements to lang detection, charset detection and junk detection
> --------------------------------------------------------------------------
>
> Key: TIKA-4745
> URL: https://issues.apache.org/jira/browse/TIKA-4745
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I ran a regression test in prep for the 4.0.0-beta-1 release. There are a
> number of smallish things that we can clean up in the components listed in
> the title.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)