Re: [CODEC] Sign Extension Error in Murmur3

Claude Warren Sun, 03 Nov 2019 16:53:52 -0800

I think the way to prove they behave correctly is to test the results
against the original "C" implementation.  With this in mind I proposed the
changes.


I agree that there should be a way to revert to the old code as there is
code in the wild that depends on the earlier implementation.

I know that there are broken implementations in other Apache projects, but
those are not libraries (i.e. Cassandra).

I would suggest that perhaps the changes that I provided be renamed as
hash128Standard and hash32Standard (or some other designator to show that
it is different from the original CODEC implementation).

Alternatively hash128 and hash32 could be deprecated  and replaced with
hashx86_32Old, and hashx64_128Old and provide the corrected versions for
hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
hashx86_128 implementation.

My opinion is that we should provide Murmur3 hash implementations that
produce the same results as the "C"/C++ implementations for cross
application compatibility.

Claude

On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote:

> As I see it, there are two use cases here.
>
> 1) Commons Codec code is used exclusively, in which case it is
> essential that the output does not change for a given seed.
>
> 2) Commons Codec is used in conjunction with other implementations, in
> which case it is essential that Codec is aligned with other
> implementations
>
> Both use cases seem valid to me, so I think it's important to allow
> both old and new algorithms to co-exist going forward.
>
> However:
> Should the default behaviour change (with option to revert)?
> Or would it be better to make the corrected behaviour optional?
>
> Or deliberately break the API to force users to choose between old and
> new behaviour?
>
> Note that Wikipedia says that the x86 and x64 versions of the
> algorithm generate different results in the case of the 128 bit hash.
> However Commons Codec does not have different implementations for x86
> and x64; it is not clear which has been implemented.
> The JavaDoc also fails to mention that there is a 64bit hash.
>
> The class clearly needs additional test data; it's possible that there
> are discrepancies in other hash sizes as well.
>
> The code has multiple implementations for each hash size depending on
> the parameter types, which increases the number of tests needed.
>
> Note that the original C++ implementation only has 3 hash methods:
> MurmurHash3_x86_32
> MurmurHash3_x86_128
> MurmurHash3_x64_128
>
> AFAICT these all take the same parameters:
> - unsigned byte array
> - length
> - unsigned 32 bit seed
> - pointer to output buffer
>
> No idea why the Codec version allows additional input such as one (or
> two) longs, etc.
> Unless there is a defined standard for how these should be handled, it
> will be impossible to prove that they behave correctly.
>
> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com>
> wrote:
> >
> >
> >
> > > On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote:
> > >
> > > I feel like I am missing something basic in the assumption of this
> issue:
> > > there is no such thing as an unsigned int in Java and the ticket talks
> > > about (C?) unsigned ints. Please help me understand how or why we
> should
> > > care about C vs. Java ints. Are we comparing apples to oranges here?
> >
> > When a byte is converted to an int there is sign extension if it is
> negative. A negative byte will have all bits above the 8th bit set to 1. So
> if the byte is negative then when converted to an int for bit shift and xor
> operations the raw bits are not the same.
> >
> > These are not the same:
> >
> > byte b = -1;
> > (int) b != (b & 0xff);
> > b << 8 != (b & 0xff) << 8;
> > b << 16 != (b & 0xff) << 16;
> >
> > The original code has the use of the 0xff mask for most of the murmur3
> algorithm. It has been missed from the final steps applied the the last 3
> bytes in the hash32 algorithm variant.
> >
> > Alex
> >
> >
> >
> > >
> > > Thank you,
> > > Gary
> > >
> > > On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote:
> > >
> > >> There is an error in the current Murmur3 code introduced by sign
> extension
> > >> errors.  This is documented in CODEC-264.[1]
> > >>
> > >> I have created a pull request to fix it.[2]
> > >>
> > >> While the code changes did not change any of the existing Murmur3
> tests, I
> > >> did add new tests that failed until the changes were applied.
> > >>
> > >> However, I am concerned that the commons-codec Murmur3 code is in the
> wild
> > >> and this change will impact some hashes.
> > >>
> > >> Therefore, I am asking for a discussion concerning how to apply any
> patches
> > >> like CODEC-264 to hashing algorithms that are in the wild.
> > >>
> > >> Thx,
> > >> Claude
> > >>
> > >> [1] https://issues.apache.org/jira/browse/CODEC-264
> > >> [2] https://github.com/apache/commons-codec/pull/27
> > >>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: [CODEC] Sign Extension Error in Murmur3

Reply via email to