Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

Wes McKinney Thu, 09 May 2019 06:39:20 -0700

It has also been previously suggested to add a get* method that
returns the value in the ArrowBuf without null checking, like
getDirty. See


https://issues.apache.org/jira/browse/ARROW-1833

Any thoughts about that?

On Thu, May 9, 2019 at 4:54 AM niki.lj <niki...@aliyun.com.invalid> wrote:
>
> +1 on this proposal.
>
>
> ------------------------------------------------------------------
> 发件人：Fan Liya <liya.fa...@gmail.com>
> 发送时间：2019年5月9日(星期四) 16:33
> 收件人：dev <dev@arrow.apache.org>
> 主　题：Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
>
> Hi all,
>
> Our previous results on micro-benchmarks show that, the original Arrow API
> is 30% slower than the unsafe API.
> After profiling, we found that, the performance overhead comes from the
> null-checking in the get method. For example, the get method of
> Float8Vector looks like this:
>
>   public double get(int index) throws IllegalStateException {
>     if (isSet(index) == 0) {
>       throw new IllegalStateException("Value at index is null");
>     }
>     return valueBuffer.getDouble(index * TYPE_WIDTH);
>   }
>
> It first makes sure the value is not null, and then retrieve the value.
>
> In some cases, the first check is redundant, because the application code
> usually do the check before calling the get method. For such cases, the
> first check can be skipped.
> Therefore, @Jacques Nadeau suggests adding another flag to enable/disable
> such check. I think this is a good suggestion, because it solves the
> performance problem, without introducing a new set of vector classes. What
> do you think?
>
> I have opened a JIRA for that (ARROW-5290
> <https://issues.apache.org/jira/browse/ARROW-5290>). Please give your
> valuable comments.
> Thanks a lot for your attention and valuable comments.
> Special thanks to @Jacques Nadeau for all the suggestions and helpful
> comments.
>
> Best,
> Liya Fan
>
>
>
>
> On Wed, May 8, 2019 at 1:05 PM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi Jacques,
> >
> > Thanks a lot for your comments.
> >
> > I have evaluated the assembly code of original Arrow API, as well as the
> > unsafe API in our PR <https://github.com/apache/arrow/pull/4212>
> > Generally, the assembly code generated by JIT for both APIs are of high
> > quality, and for most cases, the assembly code are almost the same.
> >
> > However, some checks can be further removed. The following figures give an
> > example (the figures are too big to be attached, so I have attached them in
> > a JIRA comment. Please see comment
> > <https://issues.apache.org/jira/browse/ARROW-5200?focusedCommentId=16835303&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16835303>.
> >  Sorry
> > for the inconvenience):
> >
> > The first figure shows the code of original Arrow API, while the second
> > shows the code for the unsafe API.
> > It can be observed that for the unsafe API, the amounts of the source,
> > byte and assembly code are all smaller. So it can be expected that the
> > performance of unsafe API is better.
> >
> > Concerning this particular example for the Float8Vector, I think it is
> > reasonable to further remove the check in the get method:
> > Before we call the get method, we must check if the value is null, so the
> > check in the get method is redundant. And this is a typical scenario of
> > using Arrow API (check and then get), at least for our scenario (please
> > take a glimpse of our benchmark in PR
> > <https://github.com/apache/arrow/pull/4198>).
> >
> > Concerning the other problem, about the real algorithm in our scenario. I
> > want to make two points:
> >
> > 1. SQL engines are performance critical, so 30% is a large number for us.
> > For the past year, it took our team several months just to improve the
> > performance of our runtime engine by around 15%.
> >
> > 2. The performance of engine heavily depends on the performance of Arrow.
> > Most SQL engines are memory-intensive, so the performance of get/set
> > methods is the key. To get a flavor of the algorithms in our engine, please
> > refer to PR <https://github.com/apache/arrow/pull/4198>. That is the core
> > algorithm of our operator, which is executed many times during the
> > processing of a SQL query. You can find that the computation is relatively
> > simple, and most method calls are memory accesses.
> >
> > Best,
> > Liya Fan
> >
> > On Mon, May 6, 2019 at 5:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> >> I am still asking the same question: can you please analyze the assembly
> >> the JIT is producing and look to identify why the disabled bounds checking
> >> is at 30% and what types of things we can do to address. For example, we
> >> have talked before about a bytecode transformer that simply removes the
> >> bounds checking when loading Arrow if you want that behavior. If
> >> necessary,
> >> that may be a big win from a code maintenance standpoint over having
> >> duplicate interfaces.
> >>
> >> The static block seems like a non-problem. You could simply load another
> >> class that system property before loading any Arrow code. If you're
> >> proposing a code change to solve your problem today, this seems just as
> >> feasible.
> >>
> >> The other question: in a real algorithm, how much does that 30% matter?
> >> Your benchmarks are entirely about this one call whereas real workloads
> >> are
> >> impacted by many things and the time in writing/reading vectors is
> >> miniscule versus other things.
> >>
> >> On Mon, May 6, 2019 at 1:16 PM Fan Liya <liya.fa...@gmail.com> wrote:
> >>
> >> > Hi Jacques,
> >> >
> >> > Thank you so much for your kind reminder.
> >> >
> >> > To come up with some performance data, I have set up an environment and
> >> > run some micro-benchmarks.
> >> > The server runs Linux, has 64 cores and has 256 GB memory.
> >> > The benchmarks are simple iterations over some double vectors (the
> >> source
> >> > file is attached):
> >> >
> >> >   @Benchmark
> >> >   @BenchmarkMode(Mode.AverageTime)
> >> >   @OutputTimeUnit(TimeUnit.MICROSECONDS)
> >> >   public void testSafe() {
> >> >     safeSum = 0;
> >> >     for (int i = 0; i < VECTOR_LENGTH; i++) {
> >> >       safeVector.set(i, i + 10.0);
> >> >       safeSum += safeVector.get(i);
> >> >     }
> >> >   }
> >> >
> >> >   @Benchmark
> >> >   @BenchmarkMode(Mode.AverageTime)
> >> >   @OutputTimeUnit(TimeUnit.MICROSECONDS)
> >> >   public void testUnSafe() {
> >> >     unSafeSum = 0;
> >> >     for (int i = 0; i < VECTOR_LENGTH; i++) {
> >> >       unsafeVector.set(i, i + 10.0);
> >> >       unSafeSum += unsafeVector.get(i);
> >> >     }
> >> >   }
> >> >
> >> > The safe vector in the testSafe benchmark is from the original Arrow
> >> > implementation, whereas the unsafe vector in the testUnsafe benchmark is
> >> > based on our initial implementation in PR
> >> > <https://github.com/apache/arrow/pull/4212> (This is not the final
> >> > version. However, we believe much overhead has been removed).
> >> > The evaluation is based on JMH framework (thanks to the suggestion from
> >> > Jacques Nadeau). The benchmarks are run so many times by the framework
> >> that
> >> > the effects of JIT are well considered.
> >> >
> >> > In the first experiment, we use the default configuration (boundary
> >> > checking enabled), and the original Arrow vector is about 4 times
> >> slower:
> >> >
> >> > Benchmark                       Mode  Cnt   Score   Error  Units
> >> > VectorAPIBenchmarks.testSafe    avgt    5  11.546 ± 0.012  us/op
> >> > VectorAPIBenchmarks.testUnSafe  avgt    5   2.822 ± 0.006  us/op
> >> >
> >> > In the second experiment, we disable the boundary checking by JVM
> >> options:
> >> >
> >> > -Ddrill.enable_unsafe_memory_access=true
> >> > -Darrow.enable_unsafe_memory_access=true
> >> >
> >> > This time, the original Arrow vector is about 30% slower:
> >> >
> >> > Benchmark                       Mode  Cnt  Score   Error  Units
> >> > VectorAPIBenchmarks.testSafe    avgt    5  4.069 ± 0.004  us/op
> >> > VectorAPIBenchmarks.testUnSafe  avgt    5  2.819 ± 0.005  us/op
> >> >
> >> > This is a significant improvement, about 2.84x faster compared to when
> >> > bound checking is enabled.
> >> > However, in our scenario, we would still chose to bypass Arrow APIs
> >> > without hesitation, because such memory accesses are so frequent
> >> > operations, that a 30% performance degradation will easily cause us lose
> >> > edge.
> >> >
> >> > The results can be attributed to the following factors:
> >> > 1. Although the checks have been disabled, we still need to read the
> >> flag
> >> > and check it repeatedly in the Arrow APIs, which accumulates to large
> >> > performance overhead.
> >> > 2. There is too much code in the call stacks, compared with the unsafe
> >> > API. This will lead to less efficient i-cache, even if JIT can avoids
> >> the
> >> > cost of stack frames by in-lining most method code.
> >> >
> >> > Another, maybe separate problem is that, the
> >> > flag BoundsChecking#BOUNDS_CHECKING_ENABLED is final and initialized in
> >> a
> >> > static block. That means the only reliable way to override it is to
> >> > override system properties in the JVM command line. However, for some
> >> > scenarios, we do not have access to the command line (e.g. running
> >> Flink in
> >> > Yarn). I think this deserves a separate issue.
> >> >
> >> > Best,
> >> > Liya Fan
> >> >
> >> > On Mon, May 6, 2019 at 1:23 PM Jacques Nadeau <jacq...@apache.org>
> >> wrote:
> >> >
> >> >> >
> >> >> > Maybe I need to take a closer look at how the other SQL engines are
> >> >> using
> >> >> > Arrow. To see if they are also bypassing Arrow APIs.
> >> >> > I agree that a random user should be able to protect themselves, and
> >> >> this
> >> >> > is the utmost priority.
> >> >> >
> >> >> > According to my experience in Flink, JIT cannot optimize away the
> >> >> checks,
> >> >> > and removing the checks addresses the issue.
> >> >> > I want to illustrate this from two points:
> >> >> >
> >> >> > 1. Theoretical view point: JIT makes optimizations without changing
> >> >> > semantics of the code, so it can never remove the checks without
> >> >> changing
> >> >> > code semantics. To make it simple, if the JIT has witness the engine
> >> >> > successfully processed 1,000,000 records, how can it be sure that the
> >> >> > 1,000,001th record will be successful?
> >> >> >
> >> >> > 2. Practical view point: we have evaluated our SQL engine on TPC-H
> >> 1TB
> >> >> data
> >> >> > set. This is really a large number of records. So the JIT must have
> >> done
> >> >> > all it could to improve the code. According to the performance
> >> results,
> >> >> > however, it could not eliminate the impact caused checks.
> >> >> >
> >> >>
> >> >> I don't think you're following my point. There are two different
> >> points it
> >> >> seems like you want to discuss. Let's evaluate each separately:
> >> >>
> >> >> 1) Bounds checking for safety
> >> >> 2) Supposed inefficiency of the call hierarchy.
> >> >>
> >> >> For #1 we provide a system level property that can disable these. The
> >> JVM
> >> >> should succesfully optimize away this operation if that flag is set.
> >> >> Please
> >> >> look at the JIT output to confirm whether this is true.
> >> >>
> >> >> For #2: We designed things to collapse so the call hierarchy shouldn't
> >> be
> >> >> a
> >> >> problem. Please look at the JIT output to confirm.
> >> >>
> >> >> Please come with data around #1 and #2 to make an argument for a set of
> >> >> changes.
> >> >>
> >> >> thanks
> >> >>
> >> >
> >>
> >

Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

Reply via email to