There are a number of formatting issues and potential bugs in the Codec Murmur3 code (see the SpotBugs, PMD, and code style reports). The one that tripped me up was the mix of tab/space indentation. Anyway, these issues can be fixed in later pull requests. I think there is one overarching question and one specific question:
In general, what is the policy for how we deal with digest implementations that have a bug and that have been released into the wild? Specifically in this case, what do we do to solve the problem?

I was in favor of the following specific action:

1. Deprecate all affected methods and document the issue.
2. Move the deprecated methods into a new class called HiveMurmur3 and have all deprecations point to this class.
3. Create a new Murmur3 class with method names like hash128_x64 (ones that do not conflict with the old names).

But the more I think about it, the more I think that we should follow KISS and:

1. Create a new Murmur3 class with method names like hash128_x64 (ones that do not conflict with the old names).
2. Update the Javadoc for the defective methods with notes indicating they have defects. Perhaps write a README-type document explaining the issue in detail.

I think in general we should follow the same action with any digest defect: new methods, and Javadoc notes on the old ones.

Claude

On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
> > On 4 Nov 2019, at 02:13, sebb <seb...@gmail.com> wrote:
> >
> > On Mon, 4 Nov 2019 at 00:53, Claude Warren <cla...@xenei.com> wrote:
> >>
> >> I think the way to prove they behave correctly is to test the results
> >> against the original "C" implementation. With this in mind I proposed
> >> the changes.
> >>
> >> I agree that there should be a way to revert to the old code, as there is
> >> code in the wild that depends on the earlier implementation.
> >>
> >> I know that there are broken implementations in other Apache projects,
> >> but those are not libraries (i.e. Cassandra).
> >>
> >> I would suggest that perhaps the changes that I provided be renamed as
> >> hash128Standard and hash32Standard (or some other designator to show
> >> that it is different from the original CODEC implementation).
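For context on the defect itself, which is detailed later in the thread: the bug affects only the final 1-3 "tail" bytes of the input, where the `& 0xff` mask that prevents sign extension was omitted. Below is a minimal sketch of the tail-mixing step in the style of the reference MurmurHash3_x86_32 algorithm; the class and method names are illustrative, not the actual Codec source.

```java
public class TailMaskDemo {

    /** Tail mixing with the 0xff mask, as in the reference algorithm. */
    static int mixTailMasked(byte[] tail, int len) {
        int k1 = 0;
        switch (len & 3) {
            case 3: k1 ^= (tail[2] & 0xff) << 16; // fall through
            case 2: k1 ^= (tail[1] & 0xff) << 8;  // fall through
            case 1: k1 ^= (tail[0] & 0xff);
        }
        return k1;
    }

    /** The same step without the mask: negative bytes sign-extend into k1. */
    static int mixTailUnmasked(byte[] tail, int len) {
        int k1 = 0;
        switch (len & 3) {
            case 3: k1 ^= tail[2] << 16; // fall through
            case 2: k1 ^= tail[1] << 8;  // fall through
            case 1: k1 ^= tail[0];
        }
        return k1;
    }

    public static void main(String[] args) {
        // Non-negative tail bytes: both variants agree.
        byte[] ok = { 1, 2, 3 };
        System.out.println(mixTailMasked(ok, 3) == mixTailUnmasked(ok, 3)); // true

        // A negative (>= 0x80) tail byte exposes the difference.
        byte[] bad = { (byte) 0x80, (byte) 0xff, 1 };
        System.out.println(mixTailMasked(bad, 3) == mixTailUnmasked(bad, 3)); // false
    }
}
```

This also suggests why existing tests did not catch the bug: inputs whose length is a multiple of 4, or whose tail bytes are all below 0x80, hash identically under both variants.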
> >>
> >> Alternatively, hash128 and hash32 could be deprecated and replaced with
> >> hashx86_32Old and hashx64_128Old, and provide the corrected versions for
> >> hashx86_32 and hashx64_128. Obviously this leaves the door open to a
> >> hashx86_128 implementation.
> >>
> >> My opinion is that we should provide Murmur3 hash implementations that
> >> produce the same results as the "C"/C++ implementations for cross
> >> application compatibility.
> >
> > I agree: but the original implementation has far fewer methods.
> > So how does one check the Codec methods that have no corresponding
> > C/C++ method?
>
> AFAICT the methods that accept a short, int, or long are convenience
> methods to avoid creating a byte[] for the data. The JUnit tests
> demonstrate this by passing the same values to a ByteBuffer and then
> calling the corresponding byte[] methods. The methods do assume the endian
> order of the primitive data types as processed by a default ByteBuffer
> (big-endian). So perhaps this should be added to the Javadoc with an
> example of the equivalent call to the method using a byte[].
>
> Note that these methods will not be broken. The bug in the code is for
> input byte[] lengths that are not a multiple of 4. The fix is for the
> leftover 1, 2, or 3 bytes when computing the hash.
>
> The broken methods are:
>
> public static int hash32(final byte[] data)
> public static int hash32(final String data)
> public static int hash32(final byte[] data, final int length)
> public static int hash32(final byte[] data, final int length, final int seed)
> public static int hash32(final byte[] data, final int offset, final int length, final int seed)
>
> This class is also broken:
>
> public static class IncrementalHash32
>
> > Maybe the Codec implementation needs to be drastically pruned as well.
> >
> >> Claude
> >>
> >> On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote:
> >>
> >>> As I see it, there are two use cases here.
> >>>
> >>> 1) Commons Codec code is used exclusively, in which case it is
> >>> essential that the output does not change for a given seed.
> >>>
> >>> 2) Commons Codec is used in conjunction with other implementations, in
> >>> which case it is essential that Codec is aligned with other
> >>> implementations.
> >>>
> >>> Both use cases seem valid to me, so I think it's important to allow
> >>> both old and new algorithms to co-exist going forward.
> >>>
> >>> However:
> >>> Should the default behaviour change (with an option to revert)?
> >>> Or would it be better to make the corrected behaviour optional?
> >>> Or deliberately break the API to force users to choose between old and
> >>> new behaviour?
> >>>
> >>> Note that Wikipedia says that the x86 and x64 versions of the
> >>> algorithm generate different results in the case of the 128-bit hash.
> >>> However Commons Codec does not have different implementations for x86
> >>> and x64; it is not clear which has been implemented.
> >>> The Javadoc also fails to mention that there is a 64-bit hash.
>
> And lacks <p> and <a> tags in the header as appropriate.
>
> >>> The class clearly needs additional test data; it's possible that there
> >>> are discrepancies in other hash sizes as well.
> >>>
> >>> The code has multiple implementations for each hash size depending on
> >>> the parameter types, which increases the number of tests needed.
> >>>
> >>> Note that the original C++ implementation only has 3 hash methods:
> >>> MurmurHash3_x86_32
> >>> MurmurHash3_x86_128
> >>> MurmurHash3_x64_128
> >>>
> >>> AFAICT these all take the same parameters:
> >>> - unsigned byte array
> >>> - length
> >>> - unsigned 32-bit seed
> >>> - pointer to an output buffer
> >>>
> >>> No idea why the Codec version allows additional input such as one (or
> >>> two) longs, etc.
> >>> Unless there is a defined standard for how these should be handled, it
> >>> will be impossible to prove that they behave correctly.
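Alex's explanation elsewhere in the thread, that the primitive overloads are convenience methods assuming default (big-endian) ByteBuffer order, can be illustrated with a short sketch. The equivalence to an actual Codec hash32(long) call is the thread's claim, not verified here; the class name is illustrative.

```java
import java.nio.ByteBuffer;

public class LongBytesDemo {
    public static void main(String[] args) {
        long value = 0x0102030405060708L;

        // A ByteBuffer writes multi-byte values big-endian by default,
        // so the most significant byte of the long comes first.
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();

        System.out.println(bytes[0]); // 1 (most significant byte)
        System.out.println(bytes[7]); // 8 (least significant byte)

        // A convenience overload taking the long would then be expected
        // to give the same result as the byte[] method on this array,
        // which is how the JUnit tests described above check them.
    }
}
```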
> >>>
> >>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>
> >>>>> On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote:
> >>>>>
> >>>>> I feel like I am missing something basic in the assumption of this issue:
> >>>>> there is no such thing as an unsigned int in Java, and the ticket talks
> >>>>> about (C?) unsigned ints. Please help me understand how or why we should
> >>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
> >>>>
> >>>> When a byte is converted to an int there is sign extension if it is
> >>>> negative. A negative byte will have all bits above the 8th bit set to 1.
> >>>> So if the byte is negative, then when it is converted to an int for bit
> >>>> shift and xor operations the raw bits are not the same.
> >>>>
> >>>> These are not the same:
> >>>>
> >>>> byte b = -1;
> >>>> (int) b != (b & 0xff);
> >>>> b << 8 != (b & 0xff) << 8;
> >>>> b << 16 != (b & 0xff) << 16;
> >>>>
> >>>> The original code uses the 0xff mask for most of the Murmur3 algorithm.
> >>>> It has been missed from the final steps applied to the last 3 bytes in
> >>>> the hash32 algorithm variant.
> >>>>
> >>>> Alex
> >>>>
> >>>>> Thank you,
> >>>>> Gary
> >>>>>
> >>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote:
> >>>>>
> >>>>>> There is an error in the current Murmur3 code introduced by sign
> >>>>>> extension errors. This is documented in CODEC-264. [1]
> >>>>>>
> >>>>>> I have created a pull request to fix it. [2]
> >>>>>>
> >>>>>> While the code changes did not change any of the existing Murmur3
> >>>>>> tests, I did add new tests that failed until the changes were applied.
> >>>>>>
> >>>>>> However, I am concerned that the commons-codec Murmur3 code is in
> >>>>>> the wild and this change will impact some hashes.
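Alex's inequalities above can be run directly. A self-contained demonstration of the byte-to-int sign extension he describes (class name is illustrative):

```java
public class SignExtensionDemo {
    public static void main(String[] args) {
        byte b = -1; // bit pattern 0xff

        // Widening a negative byte to int sign-extends: the upper 24 bits
        // are filled with 1s, so the raw bit patterns differ.
        int widened = b;        // 0xffffffff, i.e. -1
        int masked  = b & 0xff; // 0x000000ff, i.e. 255

        System.out.println(widened == masked);               // false
        System.out.println((b << 8) == ((b & 0xff) << 8));   // false
        System.out.println((b << 16) == ((b & 0xff) << 16)); // false
    }
}
```

Any hash step that shifts or xors a byte without first applying the `& 0xff` mask will therefore produce different bits whenever the input contains a byte of 0x80 or above.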
> >>>>>>
> >>>>>> Therefore, I am asking for a discussion concerning how to apply any
> >>>>>> patches like CODEC-264 to hashing algorithms that are in the wild.
> >>>>>>
> >>>>>> Thx,
> >>>>>> Claude
> >>>>>>
> >>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
> >>>>>> [2] https://github.com/apache/commons-codec/pull/27
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> >>>> For additional commands, e-mail: dev-h...@commons.apache.org
> >>
> >> --
> >> I like: Like Like - The likeliest place on the web
> >> <http://like-like.xenei.com>
> >> LinkedIn: http://www.linkedin.com/in/claudewarren

--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren