Copilot commented on code in PR #2887:
URL: https://github.com/apache/tika/pull/2887#discussion_r3380658532


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/chm/ChmLzxcControlData.java:
##########
@@ -225,8 +225,10 @@ private long unmarshalUInt32(byte[] data, long dest) 
throws ChmParsingException
         if (4 > getDataRemained()) {
             throw new ChmParsingException("4 > dataLenght");
         }
-        dest = data[this.getCurrentPlace()] | data[this.getCurrentPlace() + 1] 
<< 8 |
-                data[this.getCurrentPlace() + 2] << 16 | 
data[this.getCurrentPlace() + 3] << 24;
+        dest = (data[this.getCurrentPlace()] & 0xff) |
+                (data[this.getCurrentPlace() + 1] & 0xff) << 8 |
+                (data[this.getCurrentPlace() + 2] & 0xff) << 16 |
+                (data[this.getCurrentPlace() + 3] & 0xff) << 24;

Review Comment:
   `unmarshalUInt32` still builds the value using 32-bit `int` arithmetic. If 
the most-significant byte is >= 0x80, the `(… & 0xff) << 24` term sets the sign 
bit of the intermediate `int`, and the final `int` is widened to a negative 
`long`. For true unsigned 32-bit parsing, do the shifts/ORs in `long` (or mask 
the final result with `0xffffffffL`).



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/chm/TestChmLzxcControlData.java:
##########
@@ -124,4 +124,22 @@ public void testGetSignaure() {
                 chmLzxcControlData.getSignature().length);
     }
 
+    @Test
+    public void testUInt32HighBitNotSignExtended() throws Exception {
+        // size = 0x00008000 little-endian; the 0x80 byte has bit 7 set, so an
+        // unmasked shift sign-extends it and yields -32768 instead of 32768
+        byte[] data = new byte[ChmConstants.CHM_LZXC_MIN_LEN];
+        data[1] = (byte) 0x80;
+        byte[] sig = ChmConstants.LZXC.getBytes(UTF_8);
+        System.arraycopy(sig, 0, data, 4, sig.length);
+        data[8] = 0x01;  // version
+        data[12] = 0x02; // resetInterval
+        data[16] = 0x02; // windowSize
+        data[20] = 0x02; // windowsPerReset
+
+        ChmLzxcControlData control = new ChmLzxcControlData();
+        control.parse(data, control);
+        assertEquals(0x8000L, control.getSize());
+    }

Review Comment:
   This test covers sign-extension in a non-leading byte (0x00008000), but the 
same `int`-intermediate issue also shows up when the most-significant byte is 
>= 0x80 (e.g., 0x80000000 becomes a negative `long`). Adding a second case here 
will prevent regressions for that scenario too.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/chm/ChmItsfHeader.java:
##########
@@ -395,8 +395,10 @@ private long unmarshalUInt32(byte[] data, long dest) 
throws TikaException {
         if (4 > getDataRemained()) {
             throw new TikaException("4 > dataLenght");
         }
-        dest = data[this.getCurrentPlace()] | data[this.getCurrentPlace() + 1] 
<< 8 |
-                data[this.getCurrentPlace() + 2] << 16 | 
data[this.getCurrentPlace() + 3] << 24;
+        dest = (data[this.getCurrentPlace()] & 0xff) |
+                (data[this.getCurrentPlace() + 1] & 0xff) << 8 |
+                (data[this.getCurrentPlace() + 2] & 0xff) << 16 |
+                (data[this.getCurrentPlace() + 3] & 0xff) << 24;

Review Comment:
   Like `ChmLzxcControlData`, this `unmarshalUInt32` still uses `int` 
intermediates. If the top byte is >= 0x80, the computed `int` becomes negative 
and then widens to a negative `long`, which is incorrect for unsigned 32-bit 
fields. Cast to `long` before shifting/ORing (or apply `& 0xffffffffL`).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to