mattcasters opened a new issue, #7346:
URL: https://github.com/apache/hop/issues/7346
### Apache Hop version?
2.18.1
### Java version?
OpenJDK 21
### Operating system
Linux
### What happened?
# Bug: `ValueMetaBase.typeCompare()` compares `byte[]` incorrectly for
sorting
## Summary
Hop's binary field comparison in `ValueMetaBase.typeCompare()` does not
match standard database `bytea` ordering. Transforms that sort or merge on
binary columns (e.g. `SortRows`, `MergeRows`) can therefore produce a different
order than SQL `ORDER BY` on the same data.
This affects Data Vault hash keys and other binary identifiers stored in
PostgreSQL `bytea` columns.
## Affected component
| Item | Value |
|------|-------|
| **Class** | `org.apache.hop.core.row.value.ValueMetaBase` |
| **Method** | `typeCompare(Object data1, Object data2)` — `TYPE_BINARY`
branch |
| **Related class** | `ValueMetaBinary` (does not override `typeCompare()`;
inherits base behavior) |
| **Affected transforms** | Any transform using `IValueMeta.compare()` on
binary fields, including `SortRows` and `MergeRows` |
## Environment
- **Apache Hop:** 2.18.1 (likely affects other versions with the same
implementation)
- **Java:** 21
- **Database:** PostgreSQL (`bytea` column, e.g. Data Vault hash key)
## Steps to reproduce
1. Create a table with a binary hash key column, e.g. `product_hk bytea`.
2. Load a dataset large enough to expose ordering differences (hash values
with bytes ≥ 128 are especially likely to fail).
3. Run two reads of the same table in a pipeline:
- **Reference:** `SELECT * FROM hub_syn_product ORDER BY product_hk`
(Postgres sort)
- **Compare:** `SELECT * FROM hub_syn_product` → `SortRows` on
`product_hk`
4. Merge the two streams with `MergeRows` on `product_hk`.
5. Observe mismatches (`new` / `deleted` / `changed` flags) even though the
underlying rows are identical.
**Example pipeline structure:**
```
TableInput (ORDER BY product_hk) → REF ─┐
├→ MergeRows (diff) → Memory group by
TableInput (no ORDER BY) → SortRows ───┘
```
## Expected behavior
Hop's sort order on `byte[]` / binary fields should match PostgreSQL `ORDER
BY` on `bytea`:
- Compare bytes lexicographically as **unsigned** values (0–255)
- Compare common prefix bytes first
- If one value is a prefix of the other, the shorter value sorts first
This is equivalent to Java's `Arrays.compareUnsigned(byte[], byte[])`.
## Actual behavior
Hop uses a flawed comparison in `typeCompare()` for `TYPE_BINARY`:
```java
case TYPE_BINARY:
byte[] b1 = (byte[]) data1;
byte[] b2 = (byte[]) data2;
int byteLength = Math.min(b1.length, b2.length);
cmp = b1.length - b2.length; // Bug 1: length compared before content
if (cmp == 0) {
for (int i = 0; i < byteLength; i++) {
cmp = b1[i] - b2[i]; // Bug 2: signed byte comparison
if (cmp != 0) {
cmp = cmp < 0 ? -1 : 1;
break;
}
}
}
break;
```
### Bug 1: Length compared before content
When arrays differ in length, Hop compares lengths only and never inspects
the shared prefix.
| Value A | Value B | PostgreSQL `ORDER BY` | Hop (current) |
|---------|---------|----------------------|---------------|
| `[0x02]` | `[0x01, 0x00]` | A > B | A < B ❌ |
### Bug 2: Signed byte comparison
Bytes are compared using signed subtraction (`-128`..`127`), not unsigned
(`0`..`255`).
| Value A | Value B | PostgreSQL `ORDER BY` | Hop (current) |
|---------|---------|----------------------|---------------|
| `[0xFF]` (255) | `[0x00]` (0) | A > B | A < B ❌ |
Hash keys frequently contain high-byte values, so this shows up in real Data
Vault workloads.
## Suggested fix
Replace the manual loop with unsigned lexicographic comparison:
```java
case TYPE_BINARY:
cmp = Integer.signum(Arrays.compareUnsigned((byte[]) data1, (byte[])
data2));
break;
```
`Integer.signum()` keeps the return value consistent with the documented
`-1` / `0` / `+1` contract of `compare()`.
## Suggested unit tests
Add tests to `ValueMetaBaseTest` (or `ValueMetaBinary`-specific tests):
```java
@Test
void testCompareBinaryUnsignedAndLength() throws HopValueException {
IValueMeta binaryMeta = new ValueMetaBinary("hash");
// Unsigned: 0xFF (255) > 0x00 (0)
assertEquals(1, binaryMeta.compare(new byte[] {(byte) 0xFF}, new byte[]
{0}));
assertEquals(-1, binaryMeta.compare(new byte[] {0}, new byte[] {(byte)
0xFF}));
// Content before length
assertEquals(1, binaryMeta.compare(new byte[] {2}, new byte[] {1, 0}));
assertEquals(-1, binaryMeta.compare(new byte[] {1, 0}, new byte[] {2}));
// Prefix ordering
assertEquals(-1, binaryMeta.compare(new byte[] {1, 2, 3}, new byte[] {1,
2, 3, 4}));
assertEquals(1, binaryMeta.compare(new byte[] {1, 2, 3, 4}, new byte[] {1,
2, 3}));
}
```
## Impact
- Incorrect sort order for binary/hash key columns
- `MergeRows` / diff operations report false differences
- Any logic depending on binary key ordering may behave inconsistently vs
the source database
### Issue Priority
Priority: 3
### Issue Component
Component: API
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]