On Mon, Dec 30, 2019 at 4:26 PM Alex Herbert <alex.d.herb...@gmail.com>
wrote:

>
>
> > On 30 Dec 2019, at 14:58, Gary Gregory <garydgreg...@gmail.com> wrote:
> >
> > On Mon, Dec 30, 2019 at 9:38 AM Alex Herbert <alex.d.herb...@gmail.com
> <mailto:alex.d.herb...@gmail.com>>
> > wrote:
> >
> >> On Mon, 30 Dec 2019, 14:19 Gary Gregory, <garydgreg...@gmail.com>
> wrote:
> >>
> >>> On Mon, Dec 30, 2019 at 9:10 AM Alex Herbert <alex.d.herb...@gmail.com
> >
> >>> wrote:
> >>>
> >>>> On my second review I fixed the potential overflow in the incremental
> >>> hash
> >>>> for murmur3.
> >>>
> >>>
> >>> I did note that fix flying by in a commit but I would have loved to see
> >> it
> >>> accompanied by a // comment. Since I am not sure that fix came with a
> >> test,
> >>> the fix could likely be lost in a future edit by someone wanting to
> >> improve
> >>> the readability of the code, YMMV on "readability ;-)
> >>>
> >>> Please do feel free to check for other edge case like this one.
> >>>
> >>
> >> I'll add a better comment later. Sorry.
> >>
> >>>
> >> I did not add a test as Travis may not run it. It would require a 2GB
> array
> >> to pass to the incremental. On the BaseNCodecTest I had to test for
> >> available memory to run large memory allocation tests using assume to
> skip
> >> it if there was not enough memory.
> >>
> >> I'll try a test later and see if it works on Travis. The 3gb limit may
> be
> >> enough to run this one. It will be interesting to see how long it takes
> to
> >> hash a 2gb array.
> >>
> >> The base 64 encoding of a 1gb array is a fair chunk of the test suite
> time
> >> if it runs. It requires about 5gb of memory to run.
> >>
> >> I have memory and speed improvements for BaseNCodec but not enough time
> to
> >> finish them for this release. That can wait for next year.
> >>
> >
> > That all sounds good. We can do a 1.14.1 release in January for
> performance
> > improvements.
> >
>
> I added a test for overflow in MurmurHash3Test. Using the original code
> this test throws. With the overflow safe length check it is OK. This is all
> locally with a lot of RAM to run the test.
>
> Travis did not run the test. It gets a ’Skipped 1’ in JDK 8, 11 and 13. I
> did not check the others. I presume this is due to memory requirements.
>
> I’ve had a scan through the code and this same overflow during incremental
> update is a problem in:
>
> XXHash32
>
> Other classes implementing Checksum (PureJavaCrc32, PureJavaCrc32C) use a
> different method to update from a byte array of a given length.
>
> The overflow in XXHash32 can be solved with the same fix used in
> MurmurHash3 incremental hash. There is also an overflow issue in the
> getValue() method of XXHash32:
>
>         if (totalLen > BUF_SIZE) {
>
> Here totalLen is being used to store the number of bytes processed. All
> this check requires is to know that the state array has been updated since
> the reset with the original seed. If totalLen overflows to negative then
> the hash will be incorrect as it will not use the entire state array.
>
> To really check this would require a XXHash32 implementation capable of
> hashing >2^31 bytes to get a reference answer. However a simple change to
> use a boolean flag to hold the state updated condition would fix the bug
> and pass all the current tests.
>
> WDYT?
>

Thank you for going through the code.
I just cut the RC. We can continue changes/fixes for the next release.

Gary


> > Gary
> >
> >
> >>
> >>
> >>
> >>> Gary
> >>>
> >>>
> >>>> Maybe this type of code is present elsewhere in codec for
> >>>> other incremental algorithms. Basically you have to guard against
> >> someone
> >>>> incrementally adding max length arrays which will overflow a small
> >> count
> >>> of
> >>>> unprocessed bytes held as an integer.
> >>>>
> >>>> I'm not at the computer now so can't check for other incrementals. I
> >> can
> >>> do
> >>>> this later if you want or you could have a look. It's not a major
> >> thing.
> >>>> You would have to want to break the code to find this.
> >>>>
> >>>> Alex
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Mon, 30 Dec 2019, 14:02 Gary Gregory, <garydgreg...@gmail.com>
> >> wrote:
> >>>>
> >>>>> On Sun, Dec 29, 2019 at 5:46 PM Alex Herbert <
> >> alex.d.herb...@gmail.com
> >>>>
> >>>>> wrote:
> >>>>>
> >>>>>> OK.
> >>>>>>
> >>>>>> If you are going to make the behaviour change for the string
> >> methods
> >>>> then
> >>>>>> the javadoc should be updated. It currently states:
> >>>>>>
> >>>>>> "The string is converted to bytes using the default encoding.”
> >>>>>>
> >>>>>> So perhaps switch this to:
> >>>>>>
> >>>>>> “Before 1.14 the string was converted using default encoding. Since
> >>>> 1.14
> >>>>>> the string is converted to bytes using UTF-8 encoding.”
> >>>>>>
> >>>>>> The MurmurHash3 methods should still be deprecated as they call the
> >>>>>> hash32/hash128 functions that have sign bugs. I deliberately did
> >> not
> >>>> add
> >>>>> a
> >>>>>> new method accepting a String for the fixed murmur3 algorithm. IMO
> >> it
> >>>> is
> >>>>>> unwarranted API bloat.
> >>>>>>
> >>>>>> MurmurHash2 is not deprecated. In that case the String methods are
> >>> just
> >>>>>> API bloat but the javadoc should be updated.
> >>>>>>
> >>>>>> Alex
> >>>>>>
> >>>>>
> >>>>> Thank you Alex, please review Javadoc changes in git master.
> >>>>>
> >>>>> Also, if there is anything else that can be improved in code,
> >> Javadoc,
> >>> or
> >>>>> test coverage, please let me know so I can delay creating the release
> >>>>> candidate, which will likely happen tonight (US EST) time permitting.
> >>>>>
> >>>>> Gary
> >>>>>
> >>>>>
> >>>>>>
> >>>>>>> On 29 Dec 2019, at 21:33, ggreg...@apache.org wrote:
> >>>>>>>
> >>>>>>> This is an automated email from the ASF dual-hosted git
> >> repository.
> >>>>>>>
> >>>>>>> ggregory pushed a commit to branch master
> >>>>>>> in repository
> >>> https://gitbox.apache.org/repos/asf/commons-codec.git
> >>>>>>>
> >>>>>>>
> >>>>>>> The following commit(s) were added to refs/heads/master by this
> >>> push:
> >>>>>>>    new f40005a  [CODEC-276] Reliance on default encoding in
> >>>>> MurmurHash2
> >>>>>> and MurmurHash3.
> >>>>>>> f40005a is described below
> >>>>>>>
> >>>>>>> commit f40005ad5a9c892fa8d216b12029a49062224351
> >>>>>>> Author: Gary Gregory <garydgreg...@gmail.com>
> >>>>>>> AuthorDate: Sun Dec 29 16:33:14 2019 -0500
> >>>>>>>
> >>>>>>>   [CODEC-276] Reliance on default encoding in MurmurHash2 and
> >>>>>> MurmurHash3.
> >>>>>>> ---
> >>>>>>> src/changes/changes.xml                            |  1 +
> >>>>>>> .../apache/commons/codec/digest/MurmurHash2.java   | 22
> >>>>>> ++++++++++++++++------
> >>>>>>> .../apache/commons/codec/digest/MurmurHash3.java   | 18
> >>>>>> ++++++++++++++----
> >>>>>>> .../commons/codec/digest/MurmurHash3Test.java      |  8 ++++----
> >>>>>>> 4 files changed, 35 insertions(+), 14 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/src/changes/changes.xml b/src/changes/changes.xml
> >>>>>>> index 55ba445..9e1c3c4 100644
> >>>>>>> --- a/src/changes/changes.xml
> >>>>>>> +++ b/src/changes/changes.xml
> >>>>>>> @@ -55,6 +55,7 @@ The <action> type attribute can be
> >>>>>> add,update,fix,remove.
> >>>>>>>      <action issue="CODEC-273" dev="ggregory" type="add"
> >>>> due-to="Gary
> >>>>>> Gregory">Add Path APIs to
> >> org.apache.commons.codec.digest.DigestUtils
> >>>>>> similar to File APIs.</action>
> >>>>>>>      <action issue="CODEC-274" dev="ggregory" type="add"
> >>>> due-to="Gary
> >>>>>> Gregory">Add SHA-512/224 and SHA-512/256 to DigestUtils for Java 9
> >>> and
> >>>>>> up.</action>
> >>>>>>>      <action issue="CODEC-275" dev="ggregory" type="add"
> >>>>> due-to="Claude
> >>>>>> Warren">Add missing note in javadoc when sign extension error is
> >>>> present
> >>>>>> #34.</action>
> >>>>>>> +      <action issue="CODEC-276" dev="ggregory" type="add"
> >>>> due-to="Gary
> >>>>>> Gregory">Reliance on default encoding in MurmurHash2 and
> >>>>>> MurmurHash3.</action>
> >>>>>>>    </release>
> >>>>>>>
> >>>>>>>    <release version="1.13" date="2019-07-20"
> >> description="Feature
> >>>> and
> >>>>>> fix release.">
> >>>>>>> diff --git
> >>>>>> a/src/main/java/org/apache/commons/codec/digest/MurmurHash2.java
> >>>>>> b/src/main/java/org/apache/commons/codec/digest/MurmurHash2.java
> >>>>>>> index 5b0a712..4dfeca9 100644
> >>>>>>> ---
> >>> a/src/main/java/org/apache/commons/codec/digest/MurmurHash2.java
> >>>>>>> +++
> >>> b/src/main/java/org/apache/commons/codec/digest/MurmurHash2.java
> >>>>>>> @@ -17,6 +17,9 @@
> >>>>>>>
> >>>>>>> package org.apache.commons.codec.digest;
> >>>>>>>
> >>>>>>> +import java.nio.charset.Charset;
> >>>>>>> +import java.nio.charset.StandardCharsets;
> >>>>>>> +
> >>>>>>> /**
> >>>>>>> * Implementation of the MurmurHash2 32-bit and 64-bit hash
> >>>> functions.
> >>>>>>> *
> >>>>>>> @@ -48,6 +51,13 @@ package org.apache.commons.codec.digest;
> >>>>>>> */
> >>>>>>> public final class MurmurHash2 {
> >>>>>>>
> >>>>>>> +    /**
> >>>>>>> +     * Default Charset used to convert strings into bytes.
> >>>>>>> +     *
> >>>>>>> +     * Consider private; package private for tests only.
> >>>>>>> +     */
> >>>>>>> +    static final Charset GET_BYTES_CHARSET =
> >>> StandardCharsets.UTF_8;
> >>>>>>> +
> >>>>>>>    // Constants for 32-bit variant
> >>>>>>>    private static final int M32 = 0x5bd1e995;
> >>>>>>>    private static final int R32 = 24;
> >>>>>>> @@ -132,7 +142,7 @@ public final class MurmurHash2 {
> >>>>>>>     *
> >>>>>>>     * <pre>
> >>>>>>>     * int seed = 0x9747b28c;
> >>>>>>> -     * byte[] bytes = data.getBytes();
> >>>>>>> +     * byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
> >>>>>>>     * int hash = MurmurHash2.hash32(bytes, bytes.length, seed);
> >>>>>>>     * </pre>
> >>>>>>>     *
> >>>>>>> @@ -141,7 +151,7 @@ public final class MurmurHash2 {
> >>>>>>>     * @see #hash32(byte[], int, int)
> >>>>>>>     */
> >>>>>>>    public static int hash32(final String text) {
> >>>>>>> -        final byte[] bytes = text.getBytes();
> >>>>>>> +        final byte[] bytes = text.getBytes(GET_BYTES_CHARSET);
> >>>>>>>        return hash32(bytes, bytes.length);
> >>>>>>>    }
> >>>>>>>
> >>>>>>> @@ -152,7 +162,7 @@ public final class MurmurHash2 {
> >>>>>>>     *
> >>>>>>>     * <pre>
> >>>>>>>     * int seed = 0x9747b28c;
> >>>>>>> -     * byte[] bytes = text.substring(from, from +
> >>>> length).getBytes();
> >>>>>>> +     * byte[] bytes = text.substring(from, from +
> >>>>>> length).getBytes(StandardCharsets.UTF_8);
> >>>>>>>     * int hash = MurmurHash2.hash32(bytes, bytes.length, seed);
> >>>>>>>     * </pre>
> >>>>>>>     *
> >>>>>>> @@ -243,7 +253,7 @@ public final class MurmurHash2 {
> >>>>>>>     *
> >>>>>>>     * <pre>
> >>>>>>>     * int seed = 0xe17a1465;
> >>>>>>> -     * byte[] bytes = data.getBytes();
> >>>>>>> +     * byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
> >>>>>>>     * int hash = MurmurHash2.hash64(bytes, bytes.length, seed);
> >>>>>>>     * </pre>
> >>>>>>>     *
> >>>>>>> @@ -252,7 +262,7 @@ public final class MurmurHash2 {
> >>>>>>>     * @see #hash64(byte[], int, int)
> >>>>>>>     */
> >>>>>>>    public static long hash64(final String text) {
> >>>>>>> -        final byte[] bytes = text.getBytes();
> >>>>>>> +        final byte[] bytes = text.getBytes(GET_BYTES_CHARSET);
> >>>>>>>        return hash64(bytes, bytes.length);
> >>>>>>>    }
> >>>>>>>
> >>>>>>> @@ -263,7 +273,7 @@ public final class MurmurHash2 {
> >>>>>>>     *
> >>>>>>>     * <pre>
> >>>>>>>     * int seed = 0xe17a1465;
> >>>>>>> -     * byte[] bytes = text.substring(from, from +
> >>>> length).getBytes();
> >>>>>>> +     * byte[] bytes = text.substring(from, from +
> >>>>>> length).getBytes(StandardCharsets.UTF_8);
> >>>>>>>     * int hash = MurmurHash2.hash64(bytes, bytes.length, seed);
> >>>>>>>     * </pre>
> >>>>>>>     *
> >>>>>>> diff --git
> >>>>>> a/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> >>>>>> b/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> >>>>>>> index cfeb3d1..4509a8f 100644
> >>>>>>> ---
> >>> a/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> >>>>>>> +++
> >>> b/src/main/java/org/apache/commons/codec/digest/MurmurHash3.java
> >>>>>>> @@ -17,6 +17,9 @@
> >>>>>>>
> >>>>>>> package org.apache.commons.codec.digest;
> >>>>>>>
> >>>>>>> +import java.nio.charset.Charset;
> >>>>>>> +import java.nio.charset.StandardCharsets;
> >>>>>>> +
> >>>>>>> /**
> >>>>>>> * Implementation of the MurmurHash3 32-bit and 128-bit hash
> >>>> functions.
> >>>>>>> *
> >>>>>>> @@ -56,6 +59,13 @@ package org.apache.commons.codec.digest;
> >>>>>>> public final class MurmurHash3 {
> >>>>>>>
> >>>>>>>    /**
> >>>>>>> +     * Default Charset used to convert strings into bytes.
> >>>>>>> +     *
> >>>>>>> +     * Consider private; package private for tests only.
> >>>>>>> +     */
> >>>>>>> +    static final Charset GET_BYTES_CHARSET =
> >>> StandardCharsets.UTF_8;
> >>>>>>> +
> >>>>>>> +    /**
> >>>>>>>     * A random number to use for a hash code.
> >>>>>>>     *
> >>>>>>>     * @deprecated This is not used internally and will be
> >> removed
> >>>> in a
> >>>>>> future release.
> >>>>>>> @@ -233,7 +243,7 @@ public final class MurmurHash3 {
> >>>>>>>     * <pre>
> >>>>>>>     * int offset = 0;
> >>>>>>>     * int seed = 104729;
> >>>>>>> -     * byte[] bytes = data.getBytes();
> >>>>>>> +     * byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
> >>>>>>>     * int hash = MurmurHash3.hash32(bytes, offset, bytes.length,
> >>>>> seed);
> >>>>>>>     * </pre>
> >>>>>>>     *
> >>>>>>> @@ -249,7 +259,7 @@ public final class MurmurHash3 {
> >>>>>>>     */
> >>>>>>>    @Deprecated
> >>>>>>>    public static int hash32(final String data) {
> >>>>>>> -        final byte[] bytes = data.getBytes();
> >>>>>>> +        final byte[] bytes = data.getBytes(GET_BYTES_CHARSET);
> >>>>>>>        return hash32(bytes, 0, bytes.length, DEFAULT_SEED);
> >>>>>>>    }
> >>>>>>>
> >>>>>>> @@ -751,7 +761,7 @@ public final class MurmurHash3 {
> >>>>>>>     * <pre>
> >>>>>>>     * int offset = 0;
> >>>>>>>     * int seed = 104729;
> >>>>>>> -     * byte[] bytes = data.getBytes();
> >>>>>>> +     * byte[] bytes = data.getBytes(StandardCharsets.UTF_8);
> >>>>>>>     * int hash = MurmurHash3.hash128(bytes, offset,
> >> bytes.length,
> >>>>> seed);
> >>>>>>>     * </pre>
> >>>>>>>     *
> >>>>>>> @@ -766,7 +776,7 @@ public final class MurmurHash3 {
> >>>>>>>     */
> >>>>>>>    @Deprecated
> >>>>>>>    public static long[] hash128(final String data) {
> >>>>>>> -        final byte[] bytes = data.getBytes();
> >>>>>>> +        final byte[] bytes = data.getBytes(GET_BYTES_CHARSET);
> >>>>>>>        return hash128(bytes, 0, bytes.length, DEFAULT_SEED);
> >>>>>>>    }
> >>>>>>>
> >>>>>>> diff --git
> >>>>>>
> >> a/src/test/java/org/apache/commons/codec/digest/MurmurHash3Test.java
> >>>>>>
> >> b/src/test/java/org/apache/commons/codec/digest/MurmurHash3Test.java
> >>>>>>> index 4703f9f..61df7f2 100644
> >>>>>>> ---
> >>>>> a/src/test/java/org/apache/commons/codec/digest/MurmurHash3Test.java
> >>>>>>> +++
> >>>>> b/src/test/java/org/apache/commons/codec/digest/MurmurHash3Test.java
> >>>>>>> @@ -355,7 +355,7 @@ public class MurmurHash3Test {
> >>>>>>>                pos += Character.toChars(codePoint, chars, pos);
> >>>>>>>            }
> >>>>>>>            final String text = String.copyValueOf(chars, 0,
> >> pos);
> >>>>>>> -            final byte[] bytes = text.getBytes();
> >>>>>>> +            final byte[] bytes =
> >>>>>> text.getBytes(MurmurHash3.GET_BYTES_CHARSET);
> >>>>>>>            final int h1 = MurmurHash3.hash32(bytes, 0,
> >>> bytes.length,
> >>>>>> seed);
> >>>>>>>            final int h2 = MurmurHash3.hash32(text);
> >>>>>>>            Assert.assertEquals(h1, h2);
> >>>>>>> @@ -455,7 +455,7 @@ public class MurmurHash3Test {
> >>>>>>>     */
> >>>>>>>    @Test
> >>>>>>>    public void testHash64() {
> >>>>>>> -        final byte[] origin = TEST_HASH64.getBytes();
> >>>>>>> +        final byte[] origin =
> >>>>>> TEST_HASH64.getBytes(MurmurHash3.GET_BYTES_CHARSET);
> >>>>>>>        final long hash = MurmurHash3.hash64(origin);
> >>>>>>>        Assert.assertEquals(5785358552565094607L, hash);
> >>>>>>>    }
> >>>>>>> @@ -466,7 +466,7 @@ public class MurmurHash3Test {
> >>>>>>>     */
> >>>>>>>    @Test
> >>>>>>>    public void testHash64WithOffsetAndLength() {
> >>>>>>> -        final byte[] origin = TEST_HASH64.getBytes();
> >>>>>>> +        final byte[] origin =
> >>>>>> TEST_HASH64.getBytes(MurmurHash3.GET_BYTES_CHARSET);
> >>>>>>>        final byte[] originOffset = new byte[origin.length +
> >> 150];
> >>>>>>>        Arrays.fill(originOffset, (byte) 123);
> >>>>>>>        System.arraycopy(origin, 0, originOffset, 150,
> >>>> origin.length);
> >>>>>>> @@ -627,7 +627,7 @@ public class MurmurHash3Test {
> >>>>>>>                pos += Character.toChars(codePoint, chars, pos);
> >>>>>>>            }
> >>>>>>>            final String text = String.copyValueOf(chars, 0,
> >> pos);
> >>>>>>> -            final byte[] bytes = text.getBytes();
> >>>>>>> +            final byte[] bytes =
> >>>>>> text.getBytes(MurmurHash3.GET_BYTES_CHARSET);
> >>>>>>>            final long[] h1 = MurmurHash3.hash128(bytes, 0,
> >>>>>> bytes.length, seed);
> >>>>>>>            final long[] h2 = MurmurHash3.hash128(text);
> >>>>>>>            Assert.assertArrayEquals(h1, h2);
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> >>>>>> For additional commands, e-mail: dev-h...@commons.apache.org
>
>

Reply via email to