mattcasters opened a new issue, #7346:
URL: https://github.com/apache/hop/issues/7346

   ### Apache Hop version?
   
   2.18.1
   
   ### Java version?
   
   OpenJDK 21
   
   ### Operating system
   
   Linux
   
   ### What happened?
   
   # Bug: `ValueMetaBase.typeCompare()` compares `byte[]` incorrectly for 
sorting
   
   ## Summary
   
   Hop's binary field comparison in `ValueMetaBase.typeCompare()` does not 
match standard database `bytea` ordering. Transforms that sort or merge on 
binary columns (e.g. `SortRows`, `MergeRows`) can therefore produce a different 
order than SQL `ORDER BY` on the same data.
   
   This affects Data Vault hash keys and other binary identifiers stored in 
PostgreSQL `bytea` columns.
   
   ## Affected component
   
   | Item | Value |
   |------|-------|
   | **Class** | `org.apache.hop.core.row.value.ValueMetaBase` |
   | **Method** | `typeCompare(Object data1, Object data2)` — `TYPE_BINARY` 
branch |
   | **Related class** | `ValueMetaBinary` (does not override `typeCompare()`; 
inherits base behavior) |
   | **Affected transforms** | Any transform using `IValueMeta.compare()` on 
binary fields, including `SortRows` and `MergeRows` |
   
   ## Environment
   
   - **Apache Hop:** 2.18.1 (likely affects other versions with the same 
implementation)
   - **Java:** 21
   - **Database:** PostgreSQL (`bytea` column, e.g. Data Vault hash key)
   
   ## Steps to reproduce
   
   1. Create a table with a binary hash key column, e.g. `product_hk bytea`.
   2. Load a dataset large enough to expose ordering differences (hash values 
with bytes ≥ 128 are especially likely to fail).
   3. Run two reads of the same table in a pipeline:
      - **Reference:** `SELECT * FROM hub_syn_product ORDER BY product_hk` 
(Postgres sort)
      - **Compare:** `SELECT * FROM hub_syn_product` → `SortRows` on 
`product_hk`
   4. Merge the two streams with `MergeRows` on `product_hk`.
   5. Observe mismatches (`new` / `deleted` / `changed` flags) even though the 
underlying rows are identical.
   
   **Example pipeline structure:**
   
   ```
   TableInput (ORDER BY product_hk) → REF ─┐
                                           ├→ MergeRows (diff) → Memory group by
   TableInput (no ORDER BY) → SortRows ───┘
   ```
   
   ## Expected behavior
   
   Hop's sort order on `byte[]` / binary fields should match PostgreSQL `ORDER 
BY` on `bytea`:
   
   - Compare bytes lexicographically as **unsigned** values (0–255)
   - Compare common prefix bytes first
   - If one value is a prefix of the other, the shorter value sorts first
   
   This is equivalent to Java's `Arrays.compareUnsigned(byte[], byte[])`.
   
   ## Actual behavior
   
   Hop uses a flawed comparison in `typeCompare()` for `TYPE_BINARY`:
   
   ```java
   case TYPE_BINARY:
     byte[] b1 = (byte[]) data1;
     byte[] b2 = (byte[]) data2;
   
     int byteLength = Math.min(b1.length, b2.length);
   
     cmp = b1.length - b2.length;   // Bug 1: length compared before content
     if (cmp == 0) {
       for (int i = 0; i < byteLength; i++) {
         cmp = b1[i] - b2[i];       // Bug 2: signed byte comparison
         if (cmp != 0) {
           cmp = cmp < 0 ? -1 : 1;
           break;
         }
       }
     }
     break;
   ```
   
   ### Bug 1: Length compared before content
   
   When arrays differ in length, Hop compares lengths only and never inspects 
the shared prefix.
   
   | Value A | Value B | PostgreSQL `ORDER BY` | Hop (current) |
   |---------|---------|----------------------|---------------|
   | `[0x02]` | `[0x01, 0x00]` | A > B | A < B ❌ |
   
   ### Bug 2: Signed byte comparison
   
   Bytes are compared using signed subtraction (`-128`..`127`), not unsigned 
(`0`..`255`).
   
   | Value A | Value B | PostgreSQL `ORDER BY` | Hop (current) |
   |---------|---------|----------------------|---------------|
   | `[0xFF]` (255) | `[0x00]` (0) | A > B | A < B ❌ |
   
   Hash keys frequently contain high-byte values, so this shows up in real Data 
Vault workloads.
   
   ## Suggested fix
   
   Replace the manual loop with unsigned lexicographic comparison:
   
   ```java
   case TYPE_BINARY:
     cmp = Integer.signum(Arrays.compareUnsigned((byte[]) data1, (byte[]) 
data2));
     break;
   ```
   
   `Integer.signum()` keeps the return value consistent with the documented 
`-1` / `0` / `+1` contract of `compare()`.
   
   ## Suggested unit tests
   
   Add tests to `ValueMetaBaseTest` (or `ValueMetaBinary`-specific tests):
   
   ```java
   @Test
   void testCompareBinaryUnsignedAndLength() throws HopValueException {
     IValueMeta binaryMeta = new ValueMetaBinary("hash");
   
     // Unsigned: 0xFF (255) > 0x00 (0)
     assertEquals(1, binaryMeta.compare(new byte[] {(byte) 0xFF}, new byte[] 
{0}));
     assertEquals(-1, binaryMeta.compare(new byte[] {0}, new byte[] {(byte) 
0xFF}));
   
     // Content before length
     assertEquals(1, binaryMeta.compare(new byte[] {2}, new byte[] {1, 0}));
     assertEquals(-1, binaryMeta.compare(new byte[] {1, 0}, new byte[] {2}));
   
     // Prefix ordering
     assertEquals(-1, binaryMeta.compare(new byte[] {1, 2, 3}, new byte[] {1, 
2, 3, 4}));
     assertEquals(1, binaryMeta.compare(new byte[] {1, 2, 3, 4}, new byte[] {1, 
2, 3}));
   }
   ```
   
   ## Impact
   
   - Incorrect sort order for binary/hash key columns
   - `MergeRows` / diff operations report false differences
   - Any logic depending on binary key ordering may behave inconsistently vs 
the source database
   
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: API


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to