> On 4 Nov 2019, at 02:13, sebb <seb...@gmail.com> wrote:
>
> On Mon, 4 Nov 2019 at 00:53, Claude Warren <cla...@xenei.com> wrote:
>>
>> I think the way to prove they behave correctly is to test the results
>> against the original "C" implementation. With this in mind I proposed the
>> changes.
>>
>> I agree that there should be a way to revert to the old code, as there is
>> code in the wild that depends on the earlier implementation.
>>
>> I know that there are broken implementations in other Apache projects, but
>> those are not libraries (e.g. Cassandra).
>>
>> I would suggest that the changes I provided be renamed hash128Standard and
>> hash32Standard (or some other designator to show that they differ from the
>> original CODEC implementation).
>>
>> Alternatively, hash128 and hash32 could be deprecated and replaced with
>> hashx86_32Old and hashx64_128Old, with corrected versions provided as
>> hashx86_32 and hashx64_128. Obviously this leaves the door open to a
>> hashx86_128 implementation.
>>
>> My opinion is that we should provide Murmur3 hash implementations that
>> produce the same results as the "C"/C++ implementations, for
>> cross-application compatibility.
>
> I agree: but the original implementation has far fewer methods.
> So how does one check the Codec methods that have no corresponding C/C++
> method?
AFAICT the methods that accept a short, int, or long are convenience methods that avoid creating a byte[] for the data. The JUnit tests demonstrate this by passing the same values through a ByteBuffer and then calling the corresponding byte[] methods. These methods assume the endian order of the primitive datatypes as written by a default ByteBuffer (big-endian). So perhaps this should be added to the Javadoc, with an example of the equivalent call using the byte[] method (see the sketches below).

Note that these convenience methods are not broken. The bug affects input byte[] lengths that are not a multiple of 4; the fix addresses the leftover 1, 2, or 3 tail bytes when computing the hash. The broken methods are:

public static int hash32(final byte[] data)
public static int hash32(final String data)
public static int hash32(final byte[] data, final int length)
public static int hash32(final byte[] data, final int length, final int seed)
public static int hash32(final byte[] data, final int offset, final int length, final int seed)

This class is also broken:

public static class IncrementalHash32
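First, a minimal sketch of the ByteBuffer equivalence described above. It assumes the 1.13 convenience overloads (e.g. hash64(long) alongside hash64(byte[])); the actual JUnit tests in the codec repository may differ:

import java.nio.ByteBuffer;
import org.apache.commons.codec.digest.MurmurHash3;

public class ConvenienceEquivalenceDemo {
    public static void main(String[] args) {
        final long value = 0x0123456789ABCDEFL;
        // ByteBuffer's default byte order is big-endian, so these are the
        // bytes the long convenience method is assumed to hash.
        final byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        // Expected to print true if the two overloads are equivalent.
        System.out.println(MurmurHash3.hash64(value) == MurmurHash3.hash64(bytes));
    }
}

Second, a hedged sketch of the broken tail step and its fix (method and variable names are illustrative, not the exact codec source; the multiply/rotate constants are the published MurmurHash3 x86_32 constants):

private static int mixTail(final byte[] data, final int index, final int tailLength, int hash) {
    int k1 = 0;
    switch (tailLength) { // the leftover 1, 2 or 3 bytes
    case 3:
        // Broken form was: k1 ^= data[index + 2] << 16;
        // A negative byte sign-extends, setting every bit above bit 7.
        k1 ^= (data[index + 2] & 0xff) << 16; // fixed: mask, then shift
        // fall through
    case 2:
        k1 ^= (data[index + 1] & 0xff) << 8; // fixed
        // fall through
    case 1:
        k1 ^= data[index] & 0xff; // fixed
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= 0x1b873593;
        hash ^= k1;
    }
    return hash;
}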
> Maybe the Codec implementation needs to be drastically pruned as well.
>
>> Claude
>>
>> On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote:
>>
>>> As I see it, there are two use cases here.
>>>
>>> 1) Commons Codec code is used exclusively, in which case it is
>>> essential that the output does not change for a given seed.
>>>
>>> 2) Commons Codec is used in conjunction with other implementations, in
>>> which case it is essential that Codec is aligned with the other
>>> implementations.
>>>
>>> Both use cases seem valid to me, so I think it's important to allow
>>> both old and new algorithms to co-exist going forward.
>>>
>>> However:
>>> Should the default behaviour change (with an option to revert)?
>>> Or would it be better to make the corrected behaviour optional?
>>> Or should we deliberately break the API to force users to choose
>>> between old and new behaviour?
>>>
>>> Note that Wikipedia says that the x86 and x64 versions of the
>>> algorithm generate different results in the case of the 128-bit hash.
>>> However Commons Codec does not have different implementations for x86
>>> and x64; it is not clear which has been implemented.
>>> The Javadoc also fails to mention that there is a 64-bit hash, and it
>>> lacks <p> and <a> tags in the header as appropriate.
>>>
>>> The class clearly needs additional test data; it's possible that there
>>> are discrepancies in other hash sizes as well.
>>>
>>> The code has multiple implementations for each hash size depending on
>>> the parameter types, which increases the number of tests needed.
>>>
>>> Note that the original C++ implementation only has 3 hash methods:
>>> MurmurHash3_x86_32
>>> MurmurHash3_x86_128
>>> MurmurHash3_x64_128
>>>
>>> AFAICT these all take the same parameters:
>>> - unsigned byte array
>>> - length
>>> - unsigned 32-bit seed
>>> - pointer to output buffer
>>>
>>> No idea why the Codec version allows additional input such as one (or
>>> two) longs, etc.
>>> Unless there is a defined standard for how these should be handled, it
>>> will be impossible to prove that they behave correctly.
>>>
>>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>
>>>>> On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote:
>>>>>
>>>>> I feel like I am missing something basic in the assumption of this issue:
>>>>> there is no such thing as an unsigned int in Java, and the ticket talks
>>>>> about (C?) unsigned ints. Please help me understand how or why we should
>>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
>>>>
>>>> When a byte is converted to an int there is sign extension if it is
>>>> negative. A negative byte will have all bits above the 8th bit set to 1.
>>>> So if the byte is negative, then when it is converted to an int for bit
>>>> shift and xor operations the raw bits are not the same.
>>>>
>>>> These are not the same:
>>>>
>>>> byte b = -1;
>>>> (int) b != (b & 0xff);
>>>> b << 8 != (b & 0xff) << 8;
>>>> b << 16 != (b & 0xff) << 16;
>>>>
>>>> The original code uses the 0xff mask for most of the murmur3 algorithm.
>>>> It has been missed from the final steps applied to the last 3 bytes in
>>>> the hash32 algorithm variant.
>>>>
>>>> Alex
>>>>
>>>>> Thank you,
>>>>> Gary
>>>>>
>>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote:
>>>>>
>>>>>> There is an error in the current Murmur3 code introduced by sign
>>>>>> extension errors. This is documented in CODEC-264.[1]
>>>>>>
>>>>>> I have created a pull request to fix it.[2]
>>>>>>
>>>>>> While the code changes did not change any of the existing Murmur3
>>>>>> tests, I did add new tests that failed until the changes were applied.
>>>>>>
>>>>>> However, I am concerned that the commons-codec Murmur3 code is in the
>>>>>> wild and this change will impact some hashes.
>>>>>>
>>>>>> Therefore, I am asking for a discussion concerning how to apply any
>>>>>> patches like CODEC-264 to hashing algorithms that are in the wild.
>>>>>>
>>>>>> Thx,
>>>>>> Claude
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
>>>>>> [2] https://github.com/apache/commons-codec/pull/27
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
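To make the quoted sign-extension explanation concrete, here is a minimal runnable demonstration (the class name is hypothetical). It prints the differing bit patterns for a negative byte:

public class SignExtensionDemo {
    public static void main(String[] args) {
        final byte b = -1; // bit pattern 11111111
        // Widening byte -> int sign-extends: the upper 24 bits become 1.
        System.out.printf("(int) b          = 0x%08X%n", (int) b);          // 0xFFFFFFFF
        // Masking first keeps only the low 8 bits.
        System.out.printf("b & 0xff         = 0x%08X%n", b & 0xff);         // 0x000000FF
        // The divergence propagates through the shifts used by the hash.
        System.out.printf("b << 16          = 0x%08X%n", b << 16);          // 0xFFFF0000
        System.out.printf("(b & 0xff) << 16 = 0x%08X%n", (b & 0xff) << 16); // 0x00FF0000
    }
}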