On Mon, 4 Nov 2019 at 00:53, Claude Warren <cla...@xenei.com> wrote:
>
> I think the way to prove they behave correctly is to test the results
> against the original "C" implementation.  With this in mind I proposed the
> changes.
>
> I agree that there should be a way to revert to the old code as there is
> code in the wild that depends on the earlier implementation.
>
> I know that there are broken implementations in other Apache projects, but
> those are not libraries (i.e. Cassandra).
>
> I would suggest that perhaps the changes that I provided be renamed as
> hash128Standard and hash32Standard (or some other designator to show that
> it is different from the original CODEC implementation).
>
> Alternatively hash128 and hash32 could be deprecated  and replaced with
> hashx86_32Old, and hashx64_128Old and provide the corrected versions for
> hashx86_32 and hashx64_128.  Obviously this leaves the door open to a
> hashx86_128 implementation.
>
> My opinion is that we should provide Murmur3 hash implementations that
> produce the same results as the "C"/C++ implementations for cross
> application compatibility.

I agree: but the original implementation has far fewer methods.
So how does one check the Codec methods that have no corresponding C/C++ method?

Maybe the Codec implementation needs to be drastically pruned as well.

> Claude
>
> On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote:
>
> > As I see it, there are two use cases here.
> >
> > 1) Commons Codec code is used exclusively, in which case it is
> > essential that the output does not change for a given seed.
> >
> > 2) Commons Codec is used in conjunction with other implementations, in
> > which case it is essential that Codec is aligned with other
> > implementations
> >
> > Both use cases seem valid to me, so I think it's important to allow
> > both old and new algorithms to co-exist going forward.
> >
> > However:
> > Should the default behaviour change (with option to revert)?
> > Or would it be better to make the corrected behaviour optional?
> >
> > Or deliberately break the API to force users to choose between old and
> > new behaviour?
> >
> > Note that Wikipedia says that the x86 and x64 versions of the
> > algorithm generate different results in the case of the 128 bit hash.
> > However Commons Codec does not have different implementations for x86
> > and x64; it is not clear which has been implemented.
> > The JavaDoc also fails to mention that there is a 64bit hash.
> >
> > The class clearly needs additional test data; it's possible that there
> > are discrepancies in other hash sizes as well.
> >
> > The code has multiple implementations for each hash size depending on
> > the parameter types, which increases the number of tests needed.
> >
> > Note that the original C++ implementation only has 3 hash methods:
> > MurmurHash3_x86_32
> > MurmurHash3_x86_128
> > MurmurHash3_x64_128
> >
> > AFAICT these all take the same parameters:
> > - unsigned byte array
> > - length
> > - unsigned 32 bit seed
> > - pointer to output buffer
> >
> > No idea why the Codec version allows additional input such as one (or
> > two) longs, etc.
> > Unless there is a defined standard for how these should be handled, it
> > will be impossible to prove that they behave correctly.
> >
> > On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com>
> > wrote:
> > >
> > >
> > >
> > > > On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote:
> > > >
> > > > I feel like I am missing something basic in the assumption of this
> > issue:
> > > > there is no such thing as an unsigned int in Java and the ticket talks
> > > > about (C?) unsigned ints. Please help me understand how or why we
> > should
> > > > care about C vs. Java ints. Are we comparing apples to oranges here?
> > >
> > > When a byte is converted to an int there is sign extension if it is
> > negative. A negative byte will have all bits above the 8th bit set to 1. So
> > if the byte is negative then when converted to an int for bit shift and xor
> > operations the raw bits are not the same.
> > >
> > > These are not the same:
> > >
> > > byte b = -1;
> > > (int) b != (b & 0xff);
> > > b << 8 != (b & 0xff) << 8;
> > > b << 16 != (b & 0xff) << 16;
> > >
> > > The original code has the use of the 0xff mask for most of the murmur3
> > algorithm. It has been missed from the final steps applied the the last 3
> > bytes in the hash32 algorithm variant.
> > >
> > > Alex
> > >
> > >
> > >
> > > >
> > > > Thank you,
> > > > Gary
> > > >
> > > > On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote:
> > > >
> > > >> There is an error in the current Murmur3 code introduced by sign
> > extension
> > > >> errors.  This is documented in CODEC-264.[1]
> > > >>
> > > >> I have created a pull request to fix it.[2]
> > > >>
> > > >> While the code changes did not change any of the existing Murmur3
> > tests, I
> > > >> did add new tests that failed until the changes were applied.
> > > >>
> > > >> However, I am concerned that the commons-codec Murmur3 code is in the
> > wild
> > > >> and this change will impact some hashes.
> > > >>
> > > >> Therefore, I am asking for a discussion concerning how to apply any
> > patches
> > > >> like CODEC-264 to hashing algorithms that are in the wild.
> > > >>
> > > >> Thx,
> > > >> Claude
> > > >>
> > > >> [1] https://issues.apache.org/jira/browse/CODEC-264
> > > >> [2] https://github.com/apache/commons-codec/pull/27
> > > >>
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > > For additional commands, e-mail: dev-h...@commons.apache.org
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> > For additional commands, e-mail: dev-h...@commons.apache.org
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
For additional commands, e-mail: dev-h...@commons.apache.org

Reply via email to