There are a number of formatting issues and potential bugs in the Codec Murmur3 code (see the SpotBugs, PMD, and code style reports). The one that tripped me up was the mix of tab/space indentation. Anyway, these issues can be fixed in later pull requests. I think there is one overarching question and one specific question:
In general, what is the policy for how we deal with digest implementations that have a bug and that have been released into the wild? Specifically in this case, what do we do to solve the problem?

I was in favor of the following specific action:

1. Deprecate all affected methods and document the issue.
2. Move the deprecated methods into a new class called HiveMurmur3 and have all deprecations point to this class.
3. Create a new Murmur3 class with method names like hash128_x64 (ones that do not conflict with the old names).

But the more I think about it, the more I think that we should follow KISS and:

1. Create a new Murmur3 class with method names like hash128_x64 (ones that do not conflict with the old names).
2. Update the Javadoc for the defective methods with notes indicating they have defects. Perhaps write a README-type document explaining the issue in detail.

I think in general we should follow the same action with any digest defect: new methods, and Javadoc notes on the old ones.

Claude

On Mon, Nov 4, 2019 at 8:08 AM Alex Herbert <alex.d.herb...@gmail.com> wrote:
>
> > On 4 Nov 2019, at 02:13, sebb <seb...@gmail.com> wrote:
> >
> > On Mon, 4 Nov 2019 at 00:53, Claude Warren <cla...@xenei.com> wrote:
> >>
> >> I think the way to prove they behave correctly is to test the results
> >> against the original "C" implementation. With this in mind I proposed
> >> the changes.
> >>
> >> I agree that there should be a way to revert to the old code, as there is
> >> code in the wild that depends on the earlier implementation.
> >>
> >> I know that there are broken implementations in other Apache projects,
> >> but those are not libraries (i.e. Cassandra).
> >>
> >> I would suggest that perhaps the changes that I provided be renamed as
> >> hash128Standard and hash32Standard (or some other designator to show
> >> that it is different from the original CODEC implementation).
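For context on the defect itself, which is detailed later in the thread: the bug affects only the final 1-3 "tail" bytes of the input, where the `& 0xff` mask that prevents sign extension was omitted. Below is a minimal sketch of the tail-mixing step in the style of the reference MurmurHash3_x86_32 algorithm; the class and method names are illustrative, not the actual Codec source.

```java
public class TailMaskDemo {

    /** Tail mixing with the 0xff mask, as in the reference algorithm. */
    static int mixTailMasked(byte[] tail, int len) {
        int k1 = 0;
        switch (len & 3) {
            case 3: k1 ^= (tail[2] & 0xff) << 16; // fall through
            case 2: k1 ^= (tail[1] & 0xff) << 8;  // fall through
            case 1: k1 ^= (tail[0] & 0xff);
        }
        return k1;
    }

    /** The same step without the mask: negative bytes sign-extend into k1. */
    static int mixTailUnmasked(byte[] tail, int len) {
        int k1 = 0;
        switch (len & 3) {
            case 3: k1 ^= tail[2] << 16; // fall through
            case 2: k1 ^= tail[1] << 8;  // fall through
            case 1: k1 ^= tail[0];
        }
        return k1;
    }

    public static void main(String[] args) {
        // Non-negative tail bytes: both variants agree.
        byte[] ok = { 1, 2, 3 };
        System.out.println(mixTailMasked(ok, 3) == mixTailUnmasked(ok, 3)); // true

        // A negative (>= 0x80) tail byte exposes the difference.
        byte[] bad = { (byte) 0x80, (byte) 0xff, 1 };
        System.out.println(mixTailMasked(bad, 3) == mixTailUnmasked(bad, 3)); // false
    }
}
```

This also suggests why existing tests did not catch the bug: inputs whose length is a multiple of 4, or whose tail bytes are all below 0x80, hash identically under both variants.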
> >>
> >> Alternatively, hash128 and hash32 could be deprecated and replaced with
> >> hashx86_32Old and hashx64_128Old, and provide the corrected versions for
> >> hashx86_32 and hashx64_128. Obviously this leaves the door open to a
> >> hashx86_128 implementation.
> >>
> >> My opinion is that we should provide Murmur3 hash implementations that
> >> produce the same results as the "C"/C++ implementations for cross
> >> application compatibility.
> >
> > I agree: but the original implementation has far fewer methods.
> > So how does one check the Codec methods that have no corresponding
> > C/C++ method?
>
> AFAICT the methods that accept a short, int, or long are convenience
> methods to avoid creating a byte[] for the data. The JUnit tests
> demonstrate this by passing the same values to a ByteBuffer and then
> calling the corresponding byte[] methods. The methods do assume the endian
> order of the primitive data types as processed by a default ByteBuffer
> (big-endian). So perhaps this should be added to the Javadoc with an
> example of the equivalent call to the method using a byte[].
>
> Note that these methods will not be broken. The bug in the code is for
> input byte[] lengths that are not a multiple of 4. The fix is for the
> leftover 1, 2, or 3 bytes when computing the hash.
>
> The broken methods are:
>
> public static int hash32(final byte[] data)
> public static int hash32(final String data)
> public static int hash32(final byte[] data, final int length)
> public static int hash32(final byte[] data, final int length, final int seed)
> public static int hash32(final byte[] data, final int offset, final int length, final int seed)
>
> This class is also broken:
>
> public static class IncrementalHash32
>
> > Maybe the Codec implementation needs to be drastically pruned as well.
> >
> >> Claude
> >>
> >> On Sun, Nov 3, 2019 at 11:34 PM sebb <seb...@gmail.com> wrote:
> >>
> >>> As I see it, there are two use cases here.
> >>>
> >>> 1) Commons Codec code is used exclusively, in which case it is
> >>> essential that the output does not change for a given seed.
> >>>
> >>> 2) Commons Codec is used in conjunction with other implementations, in
> >>> which case it is essential that Codec is aligned with other
> >>> implementations.
> >>>
> >>> Both use cases seem valid to me, so I think it's important to allow
> >>> both old and new algorithms to co-exist going forward.
> >>>
> >>> However:
> >>> Should the default behaviour change (with an option to revert)?
> >>> Or would it be better to make the corrected behaviour optional?
> >>> Or deliberately break the API to force users to choose between old and
> >>> new behaviour?
> >>>
> >>> Note that Wikipedia says that the x86 and x64 versions of the
> >>> algorithm generate different results in the case of the 128-bit hash.
> >>> However Commons Codec does not have different implementations for x86
> >>> and x64; it is not clear which has been implemented.
> >>> The Javadoc also fails to mention that there is a 64-bit hash.
>
> And lacks <p> and <a> tags in the header as appropriate.
>
> >>> The class clearly needs additional test data; it's possible that there
> >>> are discrepancies in other hash sizes as well.
> >>>
> >>> The code has multiple implementations for each hash size depending on
> >>> the parameter types, which increases the number of tests needed.
> >>>
> >>> Note that the original C++ implementation only has 3 hash methods:
> >>> MurmurHash3_x86_32
> >>> MurmurHash3_x86_128
> >>> MurmurHash3_x64_128
> >>>
> >>> AFAICT these all take the same parameters:
> >>> - unsigned byte array
> >>> - length
> >>> - unsigned 32-bit seed
> >>> - pointer to an output buffer
> >>>
> >>> No idea why the Codec version allows additional input such as one (or
> >>> two) longs, etc.
> >>> Unless there is a defined standard for how these should be handled, it
> >>> will be impossible to prove that they behave correctly.
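Alex's explanation elsewhere in the thread, that the primitive overloads are convenience methods assuming default (big-endian) ByteBuffer order, can be illustrated with a short sketch. The equivalence to an actual Codec hash32(long) call is the thread's claim, not verified here; the class name is illustrative.

```java
import java.nio.ByteBuffer;

public class LongBytesDemo {
    public static void main(String[] args) {
        long value = 0x0102030405060708L;

        // A ByteBuffer writes multi-byte values big-endian by default,
        // so the most significant byte of the long comes first.
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();

        System.out.println(bytes[0]); // 1 (most significant byte)
        System.out.println(bytes[7]); // 8 (least significant byte)

        // A convenience overload taking the long would then be expected
        // to give the same result as the byte[] method on this array,
        // which is how the JUnit tests described above check them.
    }
}
```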
> >>>
> >>> On Sun, 3 Nov 2019 at 22:01, Alex Herbert <alex.d.herb...@gmail.com> wrote:
> >>>>
> >>>>> On 3 Nov 2019, at 21:45, Gary Gregory <garydgreg...@gmail.com> wrote:
> >>>>>
> >>>>> I feel like I am missing something basic in the assumption of this issue:
> >>>>> there is no such thing as an unsigned int in Java, and the ticket talks
> >>>>> about (C?) unsigned ints. Please help me understand how or why we should
> >>>>> care about C vs. Java ints. Are we comparing apples to oranges here?
> >>>>
> >>>> When a byte is converted to an int there is sign extension if it is
> >>>> negative. A negative byte will have all bits above the 8th bit set to 1.
> >>>> So if the byte is negative, then when it is converted to an int for bit
> >>>> shift and xor operations the raw bits are not the same.
> >>>>
> >>>> These are not the same:
> >>>>
> >>>> byte b = -1;
> >>>> (int) b != (b & 0xff);
> >>>> b << 8 != (b & 0xff) << 8;
> >>>> b << 16 != (b & 0xff) << 16;
> >>>>
> >>>> The original code uses the 0xff mask for most of the Murmur3 algorithm.
> >>>> It has been missed from the final steps applied to the last 3 bytes in
> >>>> the hash32 algorithm variant.
> >>>>
> >>>> Alex
> >>>>
> >>>>> Thank you,
> >>>>> Gary
> >>>>>
> >>>>> On Sun, Nov 3, 2019, 07:59 Claude Warren <cla...@xenei.com> wrote:
> >>>>>
> >>>>>> There is an error in the current Murmur3 code introduced by sign
> >>>>>> extension errors. This is documented in CODEC-264. [1]
> >>>>>>
> >>>>>> I have created a pull request to fix it. [2]
> >>>>>>
> >>>>>> While the code changes did not change any of the existing Murmur3
> >>>>>> tests, I did add new tests that failed until the changes were applied.
> >>>>>>
> >>>>>> However, I am concerned that the commons-codec Murmur3 code is in
> >>>>>> the wild and this change will impact some hashes.
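Alex's inequalities above can be run directly. A self-contained demonstration of the byte-to-int sign extension he describes (class name is illustrative):

```java
public class SignExtensionDemo {
    public static void main(String[] args) {
        byte b = -1; // bit pattern 0xff

        // Widening a negative byte to int sign-extends: the upper 24 bits
        // are filled with 1s, so the raw bit patterns differ.
        int widened = b;        // 0xffffffff, i.e. -1
        int masked  = b & 0xff; // 0x000000ff, i.e. 255

        System.out.println(widened == masked);               // false
        System.out.println((b << 8) == ((b & 0xff) << 8));   // false
        System.out.println((b << 16) == ((b & 0xff) << 16)); // false
    }
}
```

Any hash step that shifts or xors a byte without first applying the `& 0xff` mask will therefore produce different bits whenever the input contains a byte of 0x80 or above.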
> >>>>>>
> >>>>>> Therefore, I am asking for a discussion concerning how to apply any
> >>>>>> patches like CODEC-264 to hashing algorithms that are in the wild.
> >>>>>>
> >>>>>> Thx,
> >>>>>> Claude
> >>>>>>
> >>>>>> [1] https://issues.apache.org/jira/browse/CODEC-264
> >>>>>> [2] https://github.com/apache/commons-codec/pull/27
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> >>>> For additional commands, e-mail: dev-h...@commons.apache.org
> >>
> >> --
> >> I like: Like Like - The likeliest place on the web
> >> <http://like-like.xenei.com>
> >> LinkedIn: http://www.linkedin.com/in/claudewarren

--
I like: Like Like - The likeliest place on the web <http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren