> On 4 Nov 2019, at 02:13, sebb <seb...@gmail.com> wrote:
>
> On Mon, 4 Nov 2019 at 00:53, Claude Warren <cla...@xenei.com> wrote:
>>
>> I think the way to prove they behave correctly is to test the results
>> against the original "C" implementation. With this in mind I proposed the
>> changes.
>>
>> I agree that there should be a way to revert to the old code, as there is
>> code in the wild that depends on the earlier implementation.
>>
>> I know that there are broken implementations in other Apache projects, but
>> those are not libraries (e.g. Cassandra).
>>
>> I would suggest that the changes I provided be renamed hash128Standard and
>> hash32Standard (or some other designator to show that they differ from the
>> original CODEC implementation).
>>
>> Alternatively, hash128 and hash32 could be deprecated and replaced with
>> hashx86_32Old and hashx64_128Old, with corrected versions provided as
>> hashx86_32 and hashx64_128. Obviously this leaves the door open to a
>> hashx86_128 implementation.
>>
>> My opinion is that we should provide Murmur3 hash implementations that
>> produce the same results as the "C"/C++ implementations, for
>> cross-application compatibility.
>
> I agree: but the original implementation has far fewer methods.
> So how does one check the Codec methods that have no corresponding C/C++
> method?
AFAICT the methods that accept a short, int, or long are convenience methods that avoid creating a byte[] for the data. The JUnit tests demonstrate this by passing the same values through a ByteBuffer and then calling the corresponding byte[] methods. These methods assume the endian order of the primitive datatypes as written by a default ByteBuffer (big-endian). So perhaps this should be added to the Javadoc, with an example of the equivalent call using the byte[] method (see the sketches below).

Note that these convenience methods are not broken. The bug affects input byte[] lengths that are not a multiple of 4; the fix addresses the leftover 1, 2, or 3 tail bytes when computing the hash. The broken methods are:

public static int hash32(final byte[] data)
public static int hash32(final String data)
public static int hash32(final byte[] data, final int length)
public static int hash32(final byte[] data, final int length, final int seed)
public static int hash32(final byte[] data, final int offset, final int length, final int seed)

This class is also broken:

public static class IncrementalHash32
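First, a minimal sketch of the ByteBuffer equivalence described above. It assumes the 1.13 convenience overloads (e.g. hash64(long) alongside hash64(byte[])); the actual JUnit tests in the codec repository may differ:

import java.nio.ByteBuffer;
import org.apache.commons.codec.digest.MurmurHash3;

public class ConvenienceEquivalenceDemo {
    public static void main(String[] args) {
        final long value = 0x0123456789ABCDEFL;
        // ByteBuffer's default byte order is big-endian, so these are the
        // bytes the long convenience method is assumed to hash.
        final byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        // Expected to print true if the two overloads are equivalent.
        System.out.println(MurmurHash3.hash64(value) == MurmurHash3.hash64(bytes));
    }
}

Second, a hedged sketch of the broken tail step and its fix (method and variable names are illustrative, not the exact codec source; the multiply/rotate constants are the published MurmurHash3 x86_32 constants):

private static int mixTail(final byte[] data, final int index, final int tailLength, int hash) {
    int k1 = 0;
    switch (tailLength) { // the leftover 1, 2 or 3 bytes
    case 3:
        // Broken form was: k1 ^= data[index + 2] << 16;
        // A negative byte sign-extends, setting every bit above bit 7.
        k1 ^= (data[index + 2] & 0xff) << 16; // fixed: mask, then shift
        // fall through
    case 2:
        k1 ^= (data[index + 1] & 0xff) << 8; // fixed
        // fall through
    case 1:
        k1 ^= data[index] & 0xff; // fixed
        k1 *= 0xcc9e2d51;
        k1 = Integer.rotateLeft(k1, 15);
        k1 *= 0x1b873593;
        hash ^= k1;
    }
    return hash;
}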
> Maybe the Codec implementation needs to be drastically pruned as well.
>
>> Claude
>>
>> On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote:
>>
>>> As I see it, there are two use cases here.
>>>
>>> 1) Commons Codec code is used exclusively, in which case it is
>>> essential that the output does not change for a given seed.
>>>
>>> 2) Commons Codec is used in conjunction with other implementations, in
>>> which case it is essential that Codec is aligned with the other
>>> implementations.
>>>
>>> Both use cases seem valid to me, so I think it's important to allow
>>> both old and new algorithms to co-exist going forward.
>>>
>>> However:
>>> Should the default behaviour change (with an option to revert)?
>>> Or would it be better to make the corrected behaviour optional?
>>> Or should we deliberately break the API to force users to choose
>>> between old and new behaviour?
>>>
>>> Note that Wikipedia says that the x86 and x64 versions of the
>>> algorithm generate different results in the case of the 128-bit hash.
>>> However Commons Codec does not have different implementations for x86
>>> and x64; it is not clear which has been implemented.
>>> The Javadoc also fails to mention that there is a 64-bit hash, and it
>>> lacks <p> and <a> tags in the header as appropriate.
>>>
>>> The class clearly needs additional test data; it's possible that there
>>> are discrepancies in other hash sizes as well.
>>>
>>> The code has multiple implementations for each hash size depending on
>>> the parameter types, which increases the number of tests needed.
>>>
>>> Note that the original C++ implementation only has 3 hash methods:
>>> MurmurHash3_x86_32
>>> MurmurHash3_x86_128
>>> MurmurHash3_x64_128
>>>
>>> AFAICT these all take the same parameters:
>>> - unsigned byte array
>>> - length
>>> - unsigned 32-bit seed
>>> - pointer to output buffer
>>>
>>> No idea why the Codec version allows additional input such as one (or
>>> two) longs, etc.
>>> Unless there is a defined standard for how these should be handled, it
>>> will be impossible to prove that they behave correctly.
>>>
>>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com> wrote:
>>>>
>>>>> On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote:
>>>>>
>>>>> I feel like I am missing something basic in the assumption of this issue:
>>>>> there is no such thing as an unsigned int in Java, and the ticket talks
>>>>> about (C?) unsigned ints. Please help me understand how or why we should
>>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
>>>>
>>>> When a byte is converted to an int there is sign extension if it is
>>>> negative. A negative byte will have all bits above the 8th bit set to 1.
>>>> So if the byte is negative, then when it is converted to an int for bit
>>>> shift and xor operations the raw bits are not the same.
>>>>
>>>> These are not the same:
>>>>
>>>> byte b = -1;
>>>> (int) b != (b & 0xff);
>>>> b << 8 != (b & 0xff) << 8;
>>>> b << 16 != (b & 0xff) << 16;
>>>>
>>>> The original code uses the 0xff mask for most of the murmur3 algorithm.
>>>> It has been missed from the final steps applied to the last 3 bytes in
>>>> the hash32 algorithm variant.
>>>>
>>>> Alex
>>>>
>>>>> Thank you,
>>>>> Gary
>>>>>
>>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote:
>>>>>
>>>>>> There is an error in the current Murmur3 code introduced by sign
>>>>>> extension errors. This is documented in CODEC-264.[1]
>>>>>>
>>>>>> I have created a pull request to fix it.[2]
>>>>>>
>>>>>> While the code changes did not change any of the existing Murmur3
>>>>>> tests, I did add new tests that failed until the changes were applied.
>>>>>>
>>>>>> However, I am concerned that the commons-codec Murmur3 code is in the
>>>>>> wild and this change will impact some hashes.
>>>>>>
>>>>>> Therefore, I am asking for a discussion concerning how to apply any
>>>>>> patches like CODEC-264 to hashing algorithms that are in the wild.
>>>>>>
>>>>>> Thx,
>>>>>> Claude
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
>>>>>> [2] https://github.com/apache/commons-codec/pull/27
>>
>> --
>> I like: Like Like - The likeliest place on the web
>> <http://like-like.xenei.com>
>> LinkedIn: http://www.linkedin.com/in/claudewarren
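To make the quoted sign-extension explanation concrete, here is a minimal runnable demonstration (the class name is hypothetical). It prints the differing bit patterns for a negative byte:

public class SignExtensionDemo {
    public static void main(String[] args) {
        final byte b = -1; // bit pattern 11111111
        // Widening byte -> int sign-extends: the upper 24 bits become 1.
        System.out.printf("(int) b          = 0x%08X%n", (int) b);          // 0xFFFFFFFF
        // Masking first keeps only the low 8 bits.
        System.out.printf("b & 0xff         = 0x%08X%n", b & 0xff);         // 0x000000FF
        // The divergence propagates through the shifts used by the hash.
        System.out.printf("b << 16          = 0x%08X%n", b << 16);          // 0xFFFF0000
        System.out.printf("(b & 0xff) << 16 = 0x%08X%n", (b & 0xff) << 16); // 0x00FF0000
    }
}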