Copilot commented on code in PR #2887:
URL: https://github.com/apache/tika/pull/2887#discussion_r3380658532
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/chm/ChmLzxcControlData.java:
##########
@@ -225,8 +225,10 @@ private long unmarshalUInt32(byte[] data, long dest)
throws ChmParsingException
if (4 > getDataRemained()) {
throw new ChmParsingException("4 > dataLenght");
}
- dest = data[this.getCurrentPlace()] | data[this.getCurrentPlace() + 1]
<< 8 |
- data[this.getCurrentPlace() + 2] << 16 |
data[this.getCurrentPlace() + 3] << 24;
+ dest = (data[this.getCurrentPlace()] & 0xff) |
+ (data[this.getCurrentPlace() + 1] & 0xff) << 8 |
+ (data[this.getCurrentPlace() + 2] & 0xff) << 16 |
+ (data[this.getCurrentPlace() + 3] & 0xff) << 24;
Review Comment:
`unmarshalUInt32` still builds the value using 32-bit `int` arithmetic. If
the most-significant byte is >= 0x80, the `(… & 0xff) << 24` term sets the sign
bit of the intermediate `int`, and the final `int` is widened to a negative
`long`. For true unsigned 32-bit parsing, do the shifts/ORs in `long` (or mask
the final result with `0xffffffffL`).
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/chm/TestChmLzxcControlData.java:
##########
@@ -124,4 +124,22 @@ public void testGetSignaure() {
chmLzxcControlData.getSignature().length);
}
+ @Test
+ public void testUInt32HighBitNotSignExtended() throws Exception {
+ // size = 0x00008000 little-endian; the 0x80 byte has bit 7 set, so an
+ // unmasked shift sign-extends it and yields -32768 instead of 32768
+ byte[] data = new byte[ChmConstants.CHM_LZXC_MIN_LEN];
+ data[1] = (byte) 0x80;
+ byte[] sig = ChmConstants.LZXC.getBytes(UTF_8);
+ System.arraycopy(sig, 0, data, 4, sig.length);
+ data[8] = 0x01; // version
+ data[12] = 0x02; // resetInterval
+ data[16] = 0x02; // windowSize
+ data[20] = 0x02; // windowsPerReset
+
+ ChmLzxcControlData control = new ChmLzxcControlData();
+ control.parse(data, control);
+ assertEquals(0x8000L, control.getSize());
+ }
Review Comment:
This test covers sign-extension in a non-leading byte (0x00008000), but the
same `int`-intermediate issue also shows up when the most-significant byte is
>= 0x80 (e.g., 0x80000000 becomes a negative `long`). Adding a second case here
will prevent regressions for that scenario too.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/chm/ChmItsfHeader.java:
##########
@@ -395,8 +395,10 @@ private long unmarshalUInt32(byte[] data, long dest)
throws TikaException {
if (4 > getDataRemained()) {
throw new TikaException("4 > dataLenght");
}
- dest = data[this.getCurrentPlace()] | data[this.getCurrentPlace() + 1]
<< 8 |
- data[this.getCurrentPlace() + 2] << 16 |
data[this.getCurrentPlace() + 3] << 24;
+ dest = (data[this.getCurrentPlace()] & 0xff) |
+ (data[this.getCurrentPlace() + 1] & 0xff) << 8 |
+ (data[this.getCurrentPlace() + 2] & 0xff) << 16 |
+ (data[this.getCurrentPlace() + 3] & 0xff) << 24;
Review Comment:
Like `ChmLzxcControlData`, this `unmarshalUInt32` still uses `int`
intermediates. If the top byte is >= 0x80, the computed `int` becomes negative
and then widens to a negative `long`, which is incorrect for unsigned 32-bit
fields. Cast to `long` before shifting/ORing (or apply `& 0xffffffffL`).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]