Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren Mon, 11 Nov 2019 08:27:40 -0800

https://github.com/apache/commons-codec/pull/27


On Mon, Nov 11, 2019 at 4:25 PM Claude Warren <[email protected]> wrote:

> I took the approach that I would leave the original code there and add new
> methods hash128_x64 and hash32_x86.  I also marked the older methods as
> deprecated with a note that the implementation has a sign extension error
> and to use the new methods for new development and cases where the change
> does not impact the code in the wild.
>
> Claude
>
> On Mon, Nov 4, 2019 at 10:50 AM Claude Warren <[email protected]> wrote:
>
>> There are a number of issues with the format and potential bugs in the
>> Codec Murmur3 code. ( See spotbugs, PMD, and codestyle reports)  The one
>> that tripped me up was the mix of tab/space lines.  Anyway, these issues
>> can be fixed in later pull requests.  I think is one overarching question
>> and one specific question:
>>
>> In general, what is the policy for how do we deal with digest
>> implementations that have a bug and that have been released into the wild?
>>
>> Specifically in this case, what do we do to solve the problem?
>>
>> I was in favor of the following specific action:
>>
>> 1. Deprecate all affected methods and document the issue.
>> 2. Move the deprecated methods into a new class called HiveMurmur3 and
>> have all deprecation point to this class.
>> 3. Create a new Murmur3 class with method names like hash128_x64 (ones
>> that do not conflict with the old names).
>>
>> But the more I think about it the more I think that we should follow KISS
>> and
>>
>> 1. Create a new Murmur3 class with method names like hash128_x64 (ones
>> that do not conflict with the old names).
>> 2. Update the javadocs for the defective methods with notes indicating
>> they have defects.  Perhaps write a readme type document explaining in
>> detail the issue.
>>
>> I think in general we should follow the same action with any digest
>> defect: New methods and Javadoc old.
>>
>> Claude
>>
>>
>>
>>
>> On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert <[email protected]>
>> wrote:
>>
>>>
>>>
>>> > On 4 Nov 2019, at 02:13, sebb <[email protected]> wrote:
>>> >
>>> > On Mon, 4 Nov 2019 at 00:53, Claude Warren <[email protected] <mailto:
>>> [email protected]>> wrote:
>>> >>
>>> >> I think the way to prove they behave correctly is to test the results
>>> >> against the original "C" implementation.  With this in mind I
>>> proposed the
>>> >> changes.
>>> >>
>>> >> I agree that there should be a way to revert to the old code as there
>>> is
>>> >> code in the wild that depends on the earlier implementation.
>>> >>
>>> >> I know that there are broken implementations in other Apache
>>> projects, but
>>> >> those are not libraries (i.e. Cassandra).
>>> >>
>>> >> I would suggest that perhaps the changes that I provided be renamed as
>>> >> hash128Standard and hash32Standard (or some other designator to show
>>> that
>>> >> it is different from the original CODEC implementation).
>>> >>
>>> >> Alternatively hash128 and hash32 could be deprecated  and replaced
>>> with
>>> >> hashx86_32Old, and hashx64_128Old and provide the corrected versions
>>> for
>>> >> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
>>> >> hashx86_128 implementation.
>>> >>
>>> >> My opinion is that we should provide Murmur3 hash implementations that
>>> >> produce the same results as the "C"/C++ implementations for cross
>>> >> application compatibility.
>>> >
>>> > I agree: but the original implementation has far fewer methods.
>>> > So how does one check the Codec methods that have no corresponding
>>> C/C++ method?
>>>
>>> AFAICT the methods that accept a short, int, or long are convenience
>>> methods to avoid creating a byte[] for the data. The JUnit tests
>>> demonstrate this by passing the same values to a ByteBuffer and then
>>> calling the corresponding byte[] methods. The methods do assume the endian
>>> order of the primitive datatypes as processed by a default ByteBuffer (Big
>>> Endian). So perhaps this should be added to the Javadoc with an example for
>>> the equivalent call to the method using the byte[].
>>>
>>> Note that these methods will not be broken. The bug in the code is for
>>> input byte[] lengths that are not a factor of 4. The fix is for any of the
>>> leftover 1, 2, or 3 bytes when computing the hash.
>>>
>>> The broken methods are:
>>>
>>> public static int hash32(final byte[] data)
>>> public static int hash32(final String data)
>>> public static int hash32(final byte[] data, final int length)
>>> public static int hash32(final byte[] data, final int length, final int
>>> seed)
>>> public static int hash32(final byte[] data, final int offset, final int
>>> length, final int seed)
>>>
>>> This class is also broken:
>>> public static class IncrementalHash32
>>>
>>> >
>>> > Maybe the Codec implementation needs to be drastically pruned as well.
>>> >
>>> >> Claude
>>> >>
>>> >> On Sun, Nov 3, 2019 at 11:34 PM sebb <[email protected]> wrote:
>>> >>
>>> >>> As I see it, there are two use cases here.
>>> >>>
>>> >>> 1) Commons Codec code is used exclusively, in which case it is
>>> >>> essential that the output does not change for a given seed.
>>> >>>
>>> >>> 2) Commons Codec is used in conjunction with other implementations,
>>> in
>>> >>> which case it is essential that Codec is aligned with other
>>> >>> implementations
>>> >>>
>>> >>> Both use cases seem valid to me, so I think it's important to allow
>>> >>> both old and new algorithms to co-exist going forward.
>>> >>>
>>> >>> However:
>>> >>> Should the default behaviour change (with option to revert)?
>>> >>> Or would it be better to make the corrected behaviour optional?
>>> >>>
>>> >>> Or deliberately break the API to force users to choose between old
>>> and
>>> >>> new behaviour?
>>> >>>
>>> >>> Note that Wikipedia says that the x86 and x64 versions of the
>>> >>> algorithm generate different results in the case of the 128 bit hash.
>>> >>> However Commons Codec does not have different implementations for x86
>>> >>> and x64; it is not clear which has been implemented.
>>> >>> The JavaDoc also fails to mention that there is a 64bit hash.
>>>
>>> And lacks <p> and <a> tags in the header as appropriate.
>>>
>>> >>>
>>> >>> The class clearly needs additional test data; it's possible that
>>> there
>>> >>> are discrepancies in other hash sizes as well.
>>> >>>
>>> >>> The code has multiple implementations for each hash size depending on
>>> >>> the parameter types, which increases the number of tests needed.
>>> >>>
>>> >>> Note that the original C++ implementation only has 3 hash methods:
>>> >>> MurmurHash3_x86_32
>>> >>> MurmurHash3_x86_128
>>> >>> MurmurHash3_x64_128
>>> >>>
>>> >>> AFAICT these all take the same parameters:
>>> >>> - unsigned byte array
>>> >>> - length
>>> >>> - unsigned 32 bit seed
>>> >>> - pointer to output buffer
>>> >>>
>>> >>> No idea why the Codec version allows additional input such as one (or
>>> >>> two) longs, etc.
>>> >>> Unless there is a defined standard for how these should be handled,
>>> it
>>> >>> will be impossible to prove that they behave correctly.
>>> >>>
>>> >>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <[email protected]>
>>> >>> wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>> On 3 Nov 2019, at 21:45, Gary Gregory <[email protected]>
>>> wrote:
>>> >>>>>
>>> >>>>> I feel like I am missing something basic in the assumption of this
>>> >>> issue:
>>> >>>>> there is no such thing as an unsigned int in Java and the ticket
>>> talks
>>> >>>>> about (C?) unsigned ints. Please help me understand how or why we
>>> >>> should
>>> >>>>> care about C vs. Java ints. Are we comparing apples to oranges
>>> here?
>>> >>>>
>>> >>>> When a byte is converted to an int there is sign extension if it is
>>> >>> negative. A negative byte will have all bits above the 8th bit set
>>> to 1. So
>>> >>> if the byte is negative then when converted to an int for bit shift
>>> and xor
>>> >>> operations the raw bits are not the same.
>>> >>>>
>>> >>>> These are not the same:
>>> >>>>
>>> >>>> byte b = -1;
>>> >>>> (int) b != (b & 0xff);
>>> >>>> b << 8 != (b & 0xff) << 8;
>>> >>>> b << 16 != (b & 0xff) << 16;
>>> >>>>
>>> >>>> The original code has the use of the 0xff mask for most of the
>>> murmur3
>>> >>> algorithm. It has been missed from the final steps applied the the
>>> last 3
>>> >>> bytes in the hash32 algorithm variant.
>>> >>>>
>>> >>>> Alex
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>> Thank you,
>>> >>>>> Gary
>>> >>>>>
>>> >>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <[email protected]> wrote:
>>> >>>>>
>>> >>>>>> There is an error in the current Murmur3 code introduced by sign
>>> >>> extension
>>> >>>>>> errors.  This is documented in CODEC-264.[1]
>>> >>>>>>
>>> >>>>>> I have created a pull request to fix it.[2]
>>> >>>>>>
>>> >>>>>> While the code changes did not change any of the existing Murmur3
>>> >>> tests, I
>>> >>>>>> did add new tests that failed until the changes were applied.
>>> >>>>>>
>>> >>>>>> However, I am concerned that the commons-codec Murmur3 code is in
>>> the
>>> >>> wild
>>> >>>>>> and this change will impact some hashes.
>>> >>>>>>
>>> >>>>>> Therefore, I am asking for a discussion concerning how to apply
>>> any
>>> >>> patches
>>> >>>>>> like CODEC-264 to hashing algorithms that are in the wild.
>>> >>>>>>
>>> >>>>>> Thx,
>>> >>>>>> Claude
>>> >>>>>>
>>> >>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
>>> >>>>>> [2] https://github.com/apache/commons-codec/pull/27
>>> >>>>>>
>>> >>>>
>>> >>>>
>>> >>>>
>>> ---------------------------------------------------------------------
>>> >>>> To unsubscribe, e-mail: [email protected]
>>> >>>> For additional commands, e-mail: [email protected]
>>> >>>>
>>> >>>
>>> >>> ---------------------------------------------------------------------
>>> >>> To unsubscribe, e-mail: [email protected]
>>> >>> For additional commands, e-mail: [email protected]
>>> >>>
>>> >>>
>>> >>
>>> >> --
>>> >> I like: Like Like - The likeliest place on the web
>>> >> <http://like-like.xenei.com>
>>> >> LinkedIn: http://www.linkedin.com/in/claudewarren
>>> >
>>> > ---------------------------------------------------------------------
>>> > To unsubscribe, e-mail: [email protected] <mailto:
>>> [email protected]>
>>> > For additional commands, e-mail: [email protected] <mailto:
>>> [email protected]>
>>>
>>
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
>>
>
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [CODEC] Sign Extension Error in Murmur3

Reply via email to