Copilot commented on code in PR #2872:
URL: https://github.com/apache/tika/pull/2872#discussion_r3363223702
##########
tika-ml/tika-ml-junkdetect/src/main/java/org/apache/tika/ml/junkdetect/JunkDetector.java:
##########
@@ -973,15 +980,70 @@ public static int packBigramKey(int idxA, int idxB) {
* once when scanning the text (avoiding a redundant binary search per
* codepoint).
*/
+ /** Small per-bigram log-prob penalty subtracted from the case-folded
+ * (lowercase) value when scoring an uppercase pair. All-caps is a genuinely
+ * weaker/rarer signal than lowercase, so it should score a hair BELOW its
+ * lowercase form, not equal to it — and the margin guards the edge case where
+ * an all-caps *mojibake* decode whose lowercase twin happens to be a seen
+ * bigram would otherwise score like real lowercase text. Kept small (0.25):
+ * the lowercase/junk margin is ~0.8 logit, and δ=0.5 thinned it to ~0.1, so
+ * 0.25 leaves all-caps clearly clean (~0.5 above junk) while honoring the
+ * "somewhat less languagey" principle. */
+ private static final double CASE_FOLD_PENALTY = 0.25;
+
private static double scorePairF1(int cpA, int idxA, int cpB, int idxB,
BigramTables tables) {
+ double direct = Double.NaN;
if (idxA >= 0 && idxB >= 0) {
int slot = lookupBigramSlot(tables, idxA, idxB);
if (slot >= 0) {
- return dequantize(tables.bigramValues[slot],
+ direct = dequantize(tables.bigramValues[slot],
tables.bigramQuantMin, tables.bigramQuantMax);
}
}
+ // Case-folded backoff: an ALL-UPPERCASE pair that is the case variant of
+ // a SEEN lowercase pair is real text wearing a different case (all-caps
+ // headings / emphasis, e.g. Greek "ΚΑΤΑΛΟΓΟΣ", Russian "МУЗЕЙ"), NOT junk.
+ // Score it as the BETTER of its own log-prob and its lowercase twin's —
+ // i.e. max(direct, fold). max (not fold-only-on-miss) is essential: real
+ // all-caps bigrams ARE present in training (from headings) but rare, so the
+ // direct lookup hits a low value (МУ −12.4 vs lowercase му −6.7) and would
+ // otherwise bypass the fold and floor. This is the discriminator raw
+ // probability cannot be: all-caps real text and all-caps mojibake are both
+ // improbable, but only real text has a SEEN lowercase twin. Gated on BOTH
+ // codepoints being uppercase (case-CONSISTENT) so alternating-case junk
+ // ("tHiS") stays unfolded and floors; and only the lowercase twin's value
+ // is borrowed when that pair is actually seen, so all-caps mojibake
+ // (lowercase form also unseen) floors.
+ // Gate = "at least one uppercase letter AND no LOWERCASE letter" — so it
+ // folds both an interior all-caps pair (МУ) AND an edge pair where the other
+ // side is a sentinel or glue (^М, Й$, "М."), but NOT a mixed-case pair (the
+ // lowercase letter in "aB"/"tHiS" trips the gate, so case-inconsistent junk
+ // still floors). Each uppercase letter is folded; sentinels/digits/glue
+ // pass through unchanged. Folding the edges too is what fully rescues short
+ // all-caps headings, whose ^X/X$ bigrams would otherwise floor on the rare
+ // uppercase-letter unigram backoff.
+ if ((Character.isUpperCase(cpA) || Character.isUpperCase(cpB))
+ && !Character.isLowerCase(cpA) && !Character.isLowerCase(cpB)) {
+ int lcA = Character.isUpperCase(cpA) ? Character.toLowerCase(cpA) : cpA;
+ int lcB = Character.isUpperCase(cpB) ? Character.toLowerCase(cpB) : cpB;
Review Comment:
`JunkDetector` uses TOKEN_START/TOKEN_END sentinels with values above
`Character.MAX_CODE_POINT` (0x10FFFF). In `scorePairF1`, calling
`Character.isUpperCase/isLowerCase/toLowerCase` on these sentinel values can
throw `IllegalArgumentException` on some JDKs, which would crash scoring for
any text (sentinels are emitted for every run). Guard the `Character.*` calls
with `Character.isValidCodePoint(...)` so the sentinels are treated as
non-letters while still allowing edge bigram folding when the *other* side is
uppercase.
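
A minimal sketch of the suggested guard, not the PR's actual fix. The class name, the helper names, and the two sentinel values are hypothetical and chosen only for illustration; the real `TOKEN_START`/`TOKEN_END` constants in `JunkDetector` may differ. The guard is written so that it does not matter whether a given JDK throws for out-of-range values: the `Character.*` classification calls are simply never reached for them.

```java
final class SentinelSafeCase {
    // Hypothetical sentinels, placed just above Character.MAX_CODE_POINT (0x10FFFF).
    static final int TOKEN_START = Character.MAX_CODE_POINT + 1;
    static final int TOKEN_END = Character.MAX_CODE_POINT + 2;

    // Out-of-range values short-circuit to false, so they count as non-letters.
    static boolean isUpper(int cp) {
        return Character.isValidCodePoint(cp) && Character.isUpperCase(cp);
    }

    static boolean isLower(int cp) {
        return Character.isValidCodePoint(cp) && Character.isLowerCase(cp);
    }

    // Uppercase letters are folded; sentinels, digits, and glue pass through unchanged.
    static int foldIfUpper(int cp) {
        return isUpper(cp) ? Character.toLowerCase(cp) : cp;
    }

    // Same gate as in the diff: at least one uppercase letter and no lowercase letter.
    static boolean foldGate(int cpA, int cpB) {
        return (isUpper(cpA) || isUpper(cpB)) && !isLower(cpA) && !isLower(cpB);
    }

    public static void main(String[] args) {
        // Edge bigram: sentinel followed by CYRILLIC CAPITAL LETTER EM (U+041C).
        System.out.println(foldGate(TOKEN_START, 0x041C));
        // Mixed-case pair: the lowercase letter closes the gate.
        System.out.println(foldGate('a', 'B'));
    }
}
```

With helpers like these, an edge pair such as a start sentinel followed by an uppercase letter still passes the gate, while the sentinel itself is returned unchanged by `foldIfUpper`.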
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java:
##########
@@ -286,6 +286,41 @@ public void testHtml5Charset() throws Exception {
assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
}
+ /**
+ * A page that declares {@code charset=euc-kr} but actually uses UHC (MS949)
+ * extension Hangul must be decoded with the superset {@code x-windows-949},
+ * not strict EUC-KR (which U+FFFDs the extension syllables). Mirrors the
+ * promotion {@code AutoDetectReader} already applies for non-HTML.
+ * {@code CONTENT_ENCODING} still reports the detected charset; the superset
+ * actually used is recorded in {@code DECODED_CHARSET}.
+ *
+ * @see org.apache.tika.detect.CharsetSupersets
+ */
+ @Test
+ public void testEucKrPromotedToMs949Superset() throws Exception {
+ // U+AC02 is outside EUC-KR (KS X 1001) but inside x-windows-949 (MS949);
+ // its MS949 bytes 0x81 0x41 decode to U+FFFD followed by 'A' under strict
+ // EUC-KR, so a correct decode proves the superset was used. U+D55C U+AD6D
+ // ("Korea") is a normal EUC-KR syllable pair.
+ String test = "<html><head><meta charset=\"euc-kr\" />" +
+ "<title>title</title></head>" +
+ "<body><p>\uAC02 \uD55C\uAD6D</p></body></html>";
+ Metadata metadata = new Metadata();
+ BodyContentHandler handler = new BodyContentHandler();
+ try (TikaInputStream tis = TikaInputStream.get(test.getBytes("x-windows-949"))) {
+ new JSoupParser().parse(tis, handler, metadata, new ParseContext());
+ }
+ // Metadata reports the *detected* charset ...
+ assertEquals("EUC-KR", metadata.get(Metadata.CONTENT_ENCODING));
+ // ... but decoding used the superset, recorded in DECODED_CHARSET.
+ assertEquals("x-windows-949", metadata.get(TikaCoreProperties.DECODED_CHARSET));
Review Comment:
The assertion hard-codes `"x-windows-949"`, but `Charset.name()` returns the
JVM's canonical name, which can vary (e.g., `windows-949` vs `x-windows-949`).
Since `JSoupParser` stores `decodeAs.name()`, the test should compare against
the runtime canonical name for the charset to avoid platform-dependent failures.
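
One way to do this, sketched under assumptions: resolve the expected value through `Charset.forName(...).name()` instead of a string literal. The class and helper names below are hypothetical. The runnable part uses the `UTF8` alias, which resolves on every JVM; the `x-windows-949` line is shown only as a comment because that charset is not guaranteed to be installed on every runtime.

```java
import java.nio.charset.Charset;

final class CanonicalCharsetName {
    // Resolves any supported name or alias to the canonical name this JVM reports.
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        // "UTF8" is an alias; name() reports the canonical form.
        System.out.println(canonicalName("UTF8"));

        // In the test, the expected value could be derived the same way
        // (assumes the runtime ships the x-windows-949 charset):
        // assertEquals(Charset.forName("x-windows-949").name(),
        //         metadata.get(TikaCoreProperties.DECODED_CHARSET));
    }
}
```

Because both sides of the assertion then come from the same `Charset.name()` lookup, the comparison no longer depends on which spelling a particular JVM treats as canonical.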