On Mon, 4 Nov 2019 at 00:53, Claude Warren <cla...@xenei.com> wrote: > > I think the way to prove they behave correctly is to test the results > against the original "C" implementation. With this in mind I proposed the > changes. > > I agree that there should be a way to revert to the old code as there is > code in the wild that depends on the earlier implementation. > > I know that there are broken implementations in other Apache projects, but > those are not libraries (i.e. Cassandra). > > I would suggest that perhaps the changes that I provided be renamed as > hash128Standard and hash32Standard (or some other designator to show that > it is different from the original CODEC implementation). > > Alternatively hash128 and hash32 could be deprecated and replaced with > hashx86_32Old, and hashx64_128Old and provide the corrected versions for > hashx86_32 and hashx64_128. Obviously this leaves the door open to a > hashx86_128 implementation. > > My opinion is that we should provide Murmur3 hash implementations that > produce the same results as the "C"/C++ implementations for cross > application compatibility.
I agree: but the original implementation has far fewer methods. So how does one check the Codec methods that have no corresponding C/C++ method? Maybe the Codec implementation needs to be drastically pruned as well. > Claude > > On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote: > > > As I see it, there are two use cases here. > > > > 1) Commons Codec code is used exclusively, in which case it is > > essential that the output does not change for a given seed. > > > > 2) Commons Codec is used in conjunction with other implementations, in > > which case it is essential that Codec is aligned with other > > implementations > > > > Both use cases seem valid to me, so I think it's important to allow > > both old and new algorithms to co-exist going forward. > > > > However: > > Should the default behaviour change (with option to revert)? > > Or would it be better to make the corrected behaviour optional? > > > > Or deliberately break the API to force users to choose between old and > > new behaviour? > > > > Note that Wikipedia says that the x86 and x64 versions of the > > algorithm generate different results in the case of the 128 bit hash. > > However Commons Codec does not have different implementations for x86 > > and x64; it is not clear which has been implemented. > > The JavaDoc also fails to mention that there is a 64bit hash. > > > > The class clearly needs additional test data; it's possible that there > > are discrepancies in other hash sizes as well. > > > > The code has multiple implementations for each hash size depending on > > the parameter types, which increases the number of tests needed. > > > > Note that the original C++ implementation only has 3 hash methods: > > MurmurHash3_x86_32 > > MurmurHash3_x86_128 > > MurmurHash3_x64_128 > > > > AFAICT these all take the same parameters: > > - unsigned byte array > > - length > > - unsigned 32 bit seed > > - pointer to output buffer > > > > No idea why the Codec version allows additional input such as one (or > > two) longs, etc. > > Unless there is a defined standard for how these should be handled, it > > will be impossible to prove that they behave correctly. > > > > On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com> > > wrote: > > > > > > > > > > > > > On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote: > > > > > > > > I feel like I am missing something basic in the assumption of this > > issue: > > > > there is no such thing as an unsigned int in Java and the ticket talks > > > > about (C?) unsigned ints. Please help me understand how or why we > > should > > > > care about C vs. Java ints. Are we comparing apples to oranges here? > > > > > > When a byte is converted to an int there is sign extension if it is > > negative. A negative byte will have all bits above the 8th bit set to 1. So > > if the byte is negative then when converted to an int for bit shift and xor > > operations the raw bits are not the same. > > > > > > These are not the same: > > > > > > byte b = -1; > > > (int) b != (b & 0xff); > > > b << 8 != (b & 0xff) << 8; > > > b << 16 != (b & 0xff) << 16; > > > > > > The original code has the use of the 0xff mask for most of the murmur3 > > algorithm. It has been missed from the final steps applied the the last 3 > > bytes in the hash32 algorithm variant. > > > > > > Alex > > > > > > > > > > > > > > > > > Thank you, > > > > Gary > > > > > > > > On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote: > > > > > > > >> There is an error in the current Murmur3 code introduced by sign > > extension > > > >> errors. This is documented in CODEC-264.[1] > > > >> > > > >> I have created a pull request to fix it.[2] > > > >> > > > >> While the code changes did not change any of the existing Murmur3 > > tests, I > > > >> did add new tests that failed until the changes were applied. > > > >> > > > >> However, I am concerned that the commons-codec Murmur3 code is in the > > wild > > > >> and this change will impact some hashes. > > > >> > > > >> Therefore, I am asking for a discussion concerning how to apply any > > patches > > > >> like CODEC-264 to hashing algorithms that are in the wild. > > > >> > > > >> Thx, > > > >> Claude > > > >> > > > >> [1] https://issues.apache.org/jira/browse/CODEC-264 > > > >> [2] https://github.com/apache/commons-codec/pull/27 > > > >> > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org > > For additional commands, e-mail: dev-h...@commons.apache.org > > > > > > -- > I like: Like Like - The likeliest place on the web > <http://like-like.xenei.com> > LinkedIn: http://www.linkedin.com/in/claudewarren --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org For additional commands, e-mail: dev-h...@commons.apache.org