Re: [Discuss][Java] 64-bit lengths for ValueVectors

Wes McKinney Wed, 14 Aug 2019 15:24:15 -0700

On Sun, Aug 11, 2019 at 9:40 PM Micah Kornfield <[email protected]> wrote:
>
> Hi Wes and Jacques,
> See responses below.
>
> With regards to the reference implementation point. It is a good point. I'm
> > on vacation this week. Unless you're pushing hard on this, can we pick this
> > up and discuss more next week?
>
>
> Sure thing, enjoy your vacation.  I think the only practical implications
> are it delays choices around implementing LargeList, LargeBinary,
> LargeString in Java, which in turn might push out the 0.15.0 release.
>


To copy the sentiments from the 0.15.0 release thread, I think it
would be best to decouple this discussion from the release timeline
given how many people we have relying on regular releases coming out.
We can keep continue making major 0.x releases until we're ready to
release 1.0.0.

> My stance on this is that I don't know how important it is for Java to
> > support vectors over INT32_MAX elements. The use cases enabled by
> > having very large arrays seem to be concentrated in the native code
> > world (e.g. C/C++/Rust) -- that could just be implementation-centrism
> > on my part, though.
>
>
> A data point against this view is Spark has done work to eliminate 2GB
> memory limits on its block sizes [1].  I don't claim to understand the
> implications of this. Bryan might you have any thoughts here?  I'm OK with
> INT32_MAX, as well, I think we should think about what this means for
> adding Large types to Java and implications for reference implementations.
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/SPARK-6235
>
> On Sun, Aug 11, 2019 at 6:31 PM Jacques Nadeau <[email protected]> wrote:
>
> > Hey Micah,
> >
> > Appreciate the offer on the compiling. The reality is I'm more concerned
> > about the unknowns than the compiling issue itself. Any time you've been
> > tuning for a while, changing something like this could be totally fine or
> > cause a couple of major issues. For example, we've done a very large amount
> > of work reducing heap memory footprint of the vectors. Are target is to
> > actually get it down to 24 bytes per ArrowBuf and 24 bytes heap per vector
> > (not including arrow bufs).
> >
> > With regards to the reference implementation point. It is a good point.
> > I'm on vacation this week. Unless you're pushing hard on this, can we pick
> > this up and discuss more next week?
> >
> > thanks,
> > Jacques
> >
> > On Sat, Aug 10, 2019 at 7:39 PM Micah Kornfield <[email protected]>
> > wrote:
> >
> >> Hi Jacques,
> >> I definitely understand these concerns and this change is risky because it
> >> is so large.  Perhaps, creating a new hierarchy, might be the cleanest way
> >> of dealing with this.  This could have other benefits like cleaning up
> >> some
> >> cruft around dictionary encode and "orphaned" method.   Per past e-mail
> >> threads I agree it is beneficial to have 2 separate reference
> >> implementations that can communicate fully, and my intent here was to
> >> close
> >> that gap.
> >>
> >> Trying to
> >> > determine the ramifications of these changes would be challenging and
> >> time
> >> > consuming against all the different ways we interact with the Arrow Java
> >> > library.
> >>
> >>
> >> Understood.  I took a quick look at Dremio-OSS it seems like it has a
> >> simple java build system?  If it is helpful, I can try to get a fork
> >> running that at least compiles against this PR.  My plan would be to cast
> >> any place that was changed to return a long back to an int, so in essence
> >> the Dremio algorithms would reman 32-bit implementations.
> >>
> >> I don't  have the infrastructure to test this change properly from a
> >> distributed systems perspective, so it would still take some time from
> >> Dremio to validate for regressions.
> >>
> >> I'm not saying I'm against this but want to make sure we've
> >> > explored all less disruptive options before considering changing
> >> something
> >> > this fundamental (especially when I generally hold the view that large
> >> cell
> >> > counts against massive contiguous memory is an anti pattern to scalable
> >> > analytical processing--purely subjective of course).
> >>
> >>
> >> I'm open to other ideas here, as well. I don't think it is out of the
> >> question to leave the Java implementation as 32-bit, but if we do, then I
> >> think we should consider a different strategy for reference
> >> implementations.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sat, Aug 10, 2019 at 5:09 PM Jacques Nadeau <[email protected]>
> >> wrote:
> >>
> >> > Hey Micah, I didn't have a particular path in mind. Was thinking more
> >> along
> >> > the lines of extra methods as opposed to separate classes.
> >> >
> >> > Arrow hasn't historically been a place where we're writing algorithms in
> >> > Java so the fact that they aren't there doesn't mean they don't exist.
> >> We
> >> > have a large amount of code that depends on the current behavior that is
> >> > deployed in hundreds of customer clusters (you can peruse our dremio
> >> repo
> >> > to see how extensively we leverage Arrow if interested). Trying to
> >> > determine the ramifications of these changes would be challenging and
> >> time
> >> > consuming against all the different ways we interact with the Arrow Java
> >> > library. I'm not saying I'm against this but want to make sure we've
> >> > explored all less disruptive options before considering changing
> >> something
> >> > this fundamental (especially when I generally hold the view that large
> >> cell
> >> > counts against massive contiguous memory is an anti pattern to scalable
> >> > analytical processing--purely subjective of course).
> >> >
> >> > On Sat, Aug 10, 2019, 4:17 PM Micah Kornfield <[email protected]>
> >> > wrote:
> >> >
> >> > > Hi Jacques,
> >> > > What avenue were you thinking for supporting both paths?   I didn't
> >> want
> >> > > to pursue a different class hierarchy, because I felt like that would
> >> > > effectively fork the code base, but that is potentially an option that
> >> > > would allow us to have a complete reference implementation in Java
> >> that
> >> > can
> >> > > fully interact with C++, without major changes to this code.
> >> > >
> >> > > For supporting both APIs on the same classes/interfaces, I think they
> >> > > roughly fall into three categories, changes to input parameters,
> >> changes
> >> > to
> >> > > output parameters and algorithm changes.
> >> > >
> >> > > For inputs, changing from int to long is essentially a no-op from the
> >> > > compiler perspective.  From the limited micro-benchmarking this also
> >> > > doesn't seem to have a performance impact.  So we could keep two
> >> versions
> >> > > of the methods that only differ on inputs, but it is not clear what
> >> the
> >> > > value of that would be.
> >> > >
> >> > > For outputs, we can't support methods "long getLength()" and "int
> >> > > getLength()" in the same class, so we would be forced into something
> >> like
> >> > > "long getLength(boolean dummy)" which I think is a less desirable.
> >> > >
> >> > > For algorithm changes, there did not appear to be too many places
> >> where
> >> > we
> >> > > actually loop over all elements (it is quite possible I missed
> >> something
> >> > > here), the ones that I did find I was able to mitigate performance
> >> > > penalties as noted above.  Some of the current implementation will
> >> get a
> >> > > lot slower for "large arrays", but we can likely fix those later or in
> >> > this
> >> > > PR with a nested while loop instead of 2 for loops.
> >> > >
> >> > > Thanks,
> >> > > Micah
> >> > >
> >> > >
> >> > > On Saturday, August 10, 2019, Jacques Nadeau <[email protected]>
> >> wrote:
> >> > >
> >> > >> This is a pretty massive change to the apis. I wonder how nasty it
> >> would
> >> > >> be to just support both paths. Have you evaluated how complex that
> >> > would be?
> >> > >>
> >> > >> On Wed, Aug 7, 2019 at 11:08 PM Micah Kornfield <
> >> [email protected]>
> >> > >> wrote:
> >> > >>
> >> > >>> After more investigation, it looks like Float8Benchmarks at least
> >> on my
> >> > >>> machine are within the range of noise.
> >> > >>>
> >> > >>> For BitVectorHelper I pushed a new commit [1], seems to bring the
> >> > >>> BitVectorHelper benchmarks back inline (and even with some
> >> improvement
> >> > >>> for
> >> > >>> getNullCountBenchmark).
> >> > >>>
> >> > >>> Benchmark                                        Mode  Cnt   Score
> >> > >>>  Error
> >> > >>>  Units
> >> > >>> BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5   3.821 ±
> >> > >>> 0.031
> >> > >>>  ns/op
> >> > >>> BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5  14.884 ±
> >> > >>> 0.141
> >> > >>>  ns/op
> >> > >>>
> >> > >>> I applied the same pattern to other loops that I could find, and for
> >> > any
> >> > >>> "for (long" loop on the critical path, I broke it up into two loops.
> >> > the
> >> > >>> first loop does iteration by integer, the second finishes off for
> >> any
> >> > >>> long
> >> > >>> values.  As a side note it seems like optimization for loops using
> >> long
> >> > >>> counters at least have a semi-recent open bug for the JVM [2]
> >> > >>>
> >> > >>> Thanks,
> >> > >>> Micah
> >> > >>>
> >> > >>> [1]
> >> > >>>
> >> > >>>
> >> >
> >> https://github.com/apache/arrow/pull/5020/commits/2ea2c1ae83e3baa7b9a99a6d06276d968df41797
> >> > >>> [2] https://bugs.openjdk.java.net/browse/JDK-8223051
> >> > >>>
> >> > >>> On Wed, Aug 7, 2019 at 8:11 PM Micah Kornfield <
> >> [email protected]>
> >> > >>> wrote:
> >> > >>>
> >> > >>> > Indeed, the BoundChecking and CheckNullForGet variables can make a
> >> > big
> >> > >>> > difference.  I didn't initially run the benchmarks with these
> >> turned
> >> > on
> >> > >>> > (you can see the result from above with Float8Benchmarks).  Here
> >> are
> >> > >>> new
> >> > >>> > numbers including with the flags enabled.  It looks like using
> >> longs
> >> > >>> might
> >> > >>> > be a little bit slower, I'll see what I can do to mitigate this.
> >> > >>> >
> >> > >>> > Ravindra also volunteered to try to benchmark the changes with
> >> > Dremio's
> >> > >>> > code on today's sync call.
> >> > >>> >
> >> > >>> > New
> >> > >>> >
> >> > >>> > Benchmark                                        Mode  Cnt   Score
> >> > >>>  Error
> >> > >>> > Units
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
> >>  4.176 ±
> >> > >>> 1.292
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> >> 26.102 ±
> >> > >>> 0.700
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  7.398 ± 0.084
> >> us/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.711 ± 0.057
> >> us/op
> >> > >>> >
> >> > >>> >
> >> > >>> >
> >> > >>> > old
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.allBitsNullBenchmark   avgt    5
> >>  3.828 ±
> >> > >>> 0.030
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > BitVectorHelperBenchmarks.getNullCountBenchmark  avgt    5
> >> 20.611 ±
> >> > >>> 0.188
> >> > >>> > ns/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.copyFromBenchmark   avgt    5  6.597 ± 0.462
> >> us/op
> >> > >>> >
> >> > >>> > Float8Benchmarks.readWriteBenchmark  avgt    5  2.615 ± 0.027
> >> us/op
> >> > >>> >
> >> > >>> > On Wed, Aug 7, 2019 at 7:13 PM Fan Liya <[email protected]>
> >> > wrote:
> >> > >>> >
> >> > >>> >> Hi Gonzalo,
> >> > >>> >>
> >> > >>> >> Thanks for sharing the performance results.
> >> > >>> >> I am wondering if you have turned off the flag
> >> > >>> >> BoundsChecking#BOUNDS_CHECKING_ENABLED.
> >> > >>> >> If not, the lower throughput should be expected.
> >> > >>> >>
> >> > >>> >> Best,
> >> > >>> >> Liya Fan
> >> > >>> >>
> >> > >>> >> On Wed, Aug 7, 2019 at 10:23 PM Micah Kornfield <
> >> > >>> [email protected]>
> >> > >>> >> wrote:
> >> > >>> >>
> >> > >>> >>> Hi Gonzalo,
> >> > >>> >>> Thank you for the feedback.  I wasn't aware of the JIT
> >> > >>> implications.   At
> >> > >>> >>> least on the benchmark run they don't seem to have an impact.
> >> > >>> >>>
> >> > >>> >>> If there are other benchmarks that people have that can
> >> validate if
> >> > >>> this
> >> > >>> >>> change will be problematic I would appreciate trying to run them
> >> > >>> with the
> >> > >>> >>> PR.  I will try to run the ones for zeroing/popcnt tonight to
> >> see
> >> > if
> >> > >>> >>> there
> >> > >>> >>> is a change in those.
> >> > >>> >>>
> >> > >>> >>> -Micah
> >> > >>> >>>
> >> > >>> >>>
> >> > >>> >>>
> >> > >>> >>> On Wednesday, August 7, 2019, Gonzalo Ortiz Jaureguizar <
> >> > >>> >>> [email protected]> wrote:
> >> > >>> >>>
> >> > >>> >>> > I would recommend to take care with this kind of changes.
> >> > >>> >>> >
> >> > >>> >>> > I didn't try Arrow in more than one year, but by then the
> >> > >>> performance
> >> > >>> >>> was
> >> > >>> >>> > quite bad in comparison with plain byte buffer access
> >> > >>> >>> > (see http://git.net/apache-arrow-development/msg02353.html *)
> >> > and
> >> > >>> >>> > there are several optimizations that the JVM (specifically,
> >> C2)
> >> > >>> does
> >> > >>> >>> not
> >> > >>> >>> > apply when dealing with int instead of longs. One of the
> >> > >>> >>> > most commons is the loop unrolling and vectorization.
> >> > >>> >>> >
> >> > >>> >>> > * It doesn't seem the best way to reference an old email on
> >> the
> >> > >>> list,
> >> > >>> >>> but
> >> > >>> >>> > it is the only result shown by Google
> >> > >>> >>> >
> >> > >>> >>> > El mié., 7 ago. 2019 a las 11:42, Fan Liya (<
> >> > [email protected]
> >> > >>> >)
> >> > >>> >>> > escribió:
> >> > >>> >>> >
> >> > >>> >>> >> Hi Micah,
> >> > >>> >>> >>
> >> > >>> >>> >> Thanks for your effort. The performance result looks good.
> >> > >>> >>> >>
> >> > >>> >>> >> As you indicated, ArrowBuf will take additional 12 bytes (4
> >> > bytes
> >> > >>> for
> >> > >>> >>> each
> >> > >>> >>> >> of length, write index, and read index).
> >> > >>> >>> >> Similar overheads also exist for vectors like
> >> > >>> BaseFixedWidthVector,
> >> > >>> >>> >> BaseVariableWidthVector, etc.
> >> > >>> >>> >>
> >> > >>> >>> >> IMO, such overheads are small enough to justify the change.
> >> > >>> >>> >> Let's check if there are other overheads.
> >> > >>> >>> >>
> >> > >>> >>> >> Best,
> >> > >>> >>> >> Liya Fan
> >> > >>> >>> >>
> >> > >>> >>> >> On Wed, Aug 7, 2019 at 3:30 PM Micah Kornfield <
> >> > >>> [email protected]
> >> > >>> >>> >
> >> > >>> >>> >> wrote:
> >> > >>> >>> >>
> >> > >>> >>> >> > Hi Liya Fan,
> >> > >>> >>> >> > Based on the Float8Benchmark there does not seem to be any
> >> > >>> >>> meaningful
> >> > >>> >>> >> > performance difference on my machine.  At least for me, the
> >> > >>> >>> benchmarks
> >> > >>> >>> >> are
> >> > >>> >>> >> > not stable enough to say one is faster than the other (I've
> >> > >>> pasted
> >> > >>> >>> >> results
> >> > >>> >>> >> > below).  That being said my machine isn't necessarily the
> >> most
> >> > >>> >>> reliable
> >> > >>> >>> >> for
> >> > >>> >>> >> > benchmarking.
> >> > >>> >>> >> >
> >> > >>> >>> >> > On an intuitive level, this makes sense to me,  for the
> >> most
> >> > >>> part it
> >> > >>> >>> >> seems
> >> > >>> >>> >> > like the change just moves casting from "int" to "long"
> >> > further
> >> > >>> up
> >> > >>> >>> the
> >> > >>> >>> >> > stack  for  "PlatformDepdendent" operations.  If there are
> >> > other
> >> > >>> >>> >> benchmarks
> >> > >>> >>> >> > that you think are worth running let me know.
> >> > >>> >>> >> >
> >> > >>> >>> >> > One downside performance wise I think for his change is it
> >> > >>> >>> increases the
> >> > >>> >>> >> > size of ArrowBuf objects, which I suppose could influence
> >> > cache
> >> > >>> >>> misses
> >> > >>> >>> >> at
> >> > >>> >>> >> > some level or increase the size of call-stacks, but this
> >> > doesn't
> >> > >>> >>> seem to
> >> > >>> >>> >> > show up in the benchmark..
> >> > >>> >>> >> >
> >> > >>> >>> >> > Thanks,
> >> > >>> >>> >> > Micah
> >> > >>> >>> >> >
> >> > >>> >>> >> > Sample benchmark numbers:
> >> > >>> >>> >> >
> >> > >>> >>> >> > [New Code]
> >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
> >>  Error
> >> > >>> >>> Units
> >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  15.441 ±
> >> 0.469
> >> > >>> >>> us/op
> >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.057 ±
> >> 0.115
> >> > >>> >>> us/op
> >> > >>> >>> >> >
> >> > >>> >>> >> >
> >> > >>> >>> >> > [Old code]
> >> > >>> >>> >> > Benchmark                            Mode  Cnt   Score
> >>  Error
> >> > >>> >>> Units
> >> > >>> >>> >> > Float8Benchmarks.copyFromBenchmark   avgt    5  16.248 ±
> >> 1.409
> >> > >>> >>> us/op
> >> > >>> >>> >> > Float8Benchmarks.readWriteBenchmark  avgt    5  14.150 ±
> >> 0.084
> >> > >>> >>> us/op
> >> > >>> >>> >> >
> >> > >>> >>> >> > On Tue, Aug 6, 2019 at 1:18 AM Fan Liya <
> >> [email protected]
> >> > >
> >> > >>> >>> wrote:
> >> > >>> >>> >> >
> >> > >>> >>> >> >> Hi Micah,
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> Thanks a lot for doing this.
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> I am a little concerned about if there is any negative
> >> > >>> performance
> >> > >>> >>> >> impact
> >> > >>> >>> >> >> on the current 32-bit-length based applications.
> >> > >>> >>> >> >> Can we do some performance comparison on our existing
> >> > >>> benchmarks?
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> Best,
> >> > >>> >>> >> >> Liya Fan
> >> > >>> >>> >> >>
> >> > >>> >>> >> >>
> >> > >>> >>> >> >> On Tue, Aug 6, 2019 at 3:35 PM Micah Kornfield <
> >> > >>> >>> [email protected]>
> >> > >>> >>> >> >> wrote:
> >> > >>> >>> >> >>
> >> > >>> >>> >> >>> There have been some previous discussions on the mailing
> >> > about
> >> > >>> >>> >> supporting
> >> > >>> >>> >> >>> 64-bit lengths for  Java ValueVectors (this is what the
> >> IPC
> >> > >>> >>> >> specification
> >> > >>> >>> >> >>> and C++ support).  I created a PR [1] that changes all
> >> APIs
> >> > >>> that I
> >> > >>> >>> >> could
> >> > >>> >>> >> >>> find that take an index to take an "long" instead of an
> >> > "int"
> >> > >>> (and
> >> > >>> >>> >> >>> similarly change "size/rowcount" APIs).
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> It is a big change, so I think it is worth discussing if
> >> it
> >> > is
> >> > >>> >>> >> something
> >> > >>> >>> >> >>> we
> >> > >>> >>> >> >>> still want to move forward with.  It would be nice to
> >> come
> >> > to
> >> > >>> a
> >> > >>> >>> >> >>> conclusion
> >> > >>> >>> >> >>> quickly, ideally in the next few days, to avoid a lot of
> >> > merge
> >> > >>> >>> >> conflicts.
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> The reason I did this work now is the C++ implementation
> >> has
> >> > >>> added
> >> > >>> >>> >> >>> support
> >> > >>> >>> >> >>> for LargeList, LargeBinary and LargeString arrays and
> >> based
> >> > on
> >> > >>> >>> prior
> >> > >>> >>> >> >>> discussions we need to have similar support in Java
> >> before
> >> > our
> >> > >>> >>> next
> >> > >>> >>> >> >>> release. Support 64-bit indexes means we can have full
> >> > >>> >>> compatibility
> >> > >>> >>> >> and
> >> > >>> >>> >> >>> make the most use of the types in Java.
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> Look forward to hearing feedback.
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> Thanks,
> >> > >>> >>> >> >>> Micah
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>> [1] https://github.com/apache/arrow/pull/5020
> >> > >>> >>> >> >>>
> >> > >>> >>> >> >>
> >> > >>> >>> >>
> >> > >>> >>> >
> >> > >>> >>>
> >> > >>> >>
> >> > >>>
> >> > >>
> >> >
> >>
> >

Re: [Discuss][Java] 64-bit lengths for ValueVectors

Reply via email to