+1 on this proposal.

------------------------------------------------------------------
From: Fan Liya <liya.fa...@gmail.com>
Sent: Thursday, May 9, 2019, 16:33
To: dev <dev@arrow.apache.org>
Subject: Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow

Hi all,

Our previous micro-benchmark results show that the original Arrow API
is 30% slower than the unsafe API.
After profiling, we found that the performance overhead comes from the
null check in the get method. For example, the get method of
Float8Vector looks like this:

  public double get(int index) throws IllegalStateException {
    if (isSet(index) == 0) {
      throw new IllegalStateException("Value at index is null");
    }
    return valueBuffer.getDouble(index * TYPE_WIDTH);
  }

It first makes sure the value is not null, and then retrieves the value.

In some cases, the first check is redundant, because the application code
usually does the check before calling the get method. In such cases, the
first check can be skipped.
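To make the check-then-get pattern concrete, here is a minimal sketch. The class below is a simplified stand-in for Float8Vector (plain arrays instead of Arrow buffers); its names and layout are illustrative assumptions, not the real Arrow internals:

```java
// Simplified stand-in for Float8Vector: a value array plus a validity array.
// Illustrative only; not the real Arrow implementation.
class SimpleFloat8Vector {
  private final double[] values;
  private final boolean[] valid;

  SimpleFloat8Vector(int capacity) {
    values = new double[capacity];
    valid = new boolean[capacity];
  }

  void set(int index, double value) {
    values[index] = value;
    valid[index] = true;
  }

  boolean isNull(int index) {
    return !valid[index];
  }

  double get(int index) {
    // Mirrors the null check quoted above; this test is redundant when the
    // caller has already verified isNull(index) == false.
    if (isNull(index)) {
      throw new IllegalStateException("Value at index is null");
    }
    return values[index];
  }

  // Typical application code: check, then get. The isNull test inside get
  // repeats the test the loop has just performed.
  static double sumNonNull(SimpleFloat8Vector v, int length) {
    double sum = 0;
    for (int i = 0; i < length; i++) {
      if (!v.isNull(i)) {
        sum += v.get(i);
      }
    }
    return sum;
  }
}
```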
Therefore, @Jacques Nadeau suggested adding a flag to enable/disable
the check. I think this is a good suggestion, because it solves the
performance problem without introducing a new set of vector classes. What
do you think?
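As a rough sketch of what such a flag might look like, modeled on the existing BoundsChecking#BOUNDS_CHECKING_ENABLED pattern. The property name and class below are hypothetical; the actual naming would be settled in the JIRA:

```java
// Hypothetical sketch of a flag-gated null check. The property name
// "arrow.disable_null_check_for_get" is an assumption for illustration,
// not an actual Arrow flag.
class NullCheckingSketch {
  // Checking is on by default; disabled only when the property is "true".
  static final boolean NULL_CHECKING_ENABLED =
      !"true".equals(System.getProperty("arrow.disable_null_check_for_get"));

  private final double[] values;
  private final boolean[] valid;

  NullCheckingSketch(int capacity) {
    values = new double[capacity];
    valid = new boolean[capacity];
  }

  void set(int index, double value) {
    values[index] = value;
    valid[index] = true;
  }

  double get(int index) {
    // Because NULL_CHECKING_ENABLED is a static final constant, the JIT
    // can fold this branch away entirely when the flag is false.
    if (NULL_CHECKING_ENABLED && !valid[index]) {
      throw new IllegalStateException("Value at index is null");
    }
    return values[index];
  }
}
```

The point of making the flag a static final constant is that, unlike a mutable flag read on every call, it is constant-folded at JIT time, so the disabled path should carry no residual cost.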

I have opened a JIRA for this (ARROW-5290
<https://issues.apache.org/jira/browse/ARROW-5290>). Please share your
valuable comments.
Thanks a lot for your attention.
Special thanks to @Jacques Nadeau for all the suggestions and helpful
comments.

Best,
Liya Fan




On Wed, May 8, 2019 at 1:05 PM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Jacques,
>
> Thanks a lot for your comments.
>
> I have evaluated the assembly code of the original Arrow API, as well as
> the unsafe API in our PR <https://github.com/apache/arrow/pull/4212>.
> Generally, the assembly code generated by the JIT for both APIs is of high
> quality, and in most cases it is almost the same.
>
> However, some checks can be removed further. The following figures give an
> example (the figures are too big to attach, so I have posted them in a JIRA
> comment; please see
> <https://issues.apache.org/jira/browse/ARROW-5200?focusedCommentId=16835303&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16835303>.
> Sorry for the inconvenience):
>
> The first figure shows the code of the original Arrow API, while the second
> shows the code for the unsafe API.
> It can be observed that for the unsafe API, the amounts of source, byte,
> and assembly code are all smaller, so the unsafe API can be expected to
> perform better.
>
> Concerning this particular example for the Float8Vector, I think it is
> reasonable to further remove the check in the get method:
> before we call the get method, we must check whether the value is null, so
> the check in the get method is redundant. This check-then-get pattern is
> typical when using the Arrow API, at least in our scenario (please take a
> glance at our benchmark in PR
> <https://github.com/apache/arrow/pull/4198>).
>
> Concerning the other question, about the real algorithm in our scenario,
> I want to make two points:
>
> 1. SQL engines are performance-critical, so 30% is a large number for us.
> Over the past year, it took our team several months just to improve the
> performance of our runtime engine by around 15%.
>
> 2. The performance of the engine heavily depends on the performance of
> Arrow. Most SQL engines are memory-intensive, so the performance of the
> get/set methods is key. To get a flavor of the algorithms in our engine,
> please refer to PR <https://github.com/apache/arrow/pull/4198>. That is the
> core algorithm of our operator, which is executed many times during the
> processing of a SQL query. You can see that the computation is relatively
> simple, and most method calls are memory accesses.
>
> Best,
> Liya Fan
>
> On Mon, May 6, 2019 at 5:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
>> I am still asking the same question: can you please analyze the assembly
>> the JIT is producing and identify why the disabled bounds checking is
>> still 30% slower, and what types of things we can do to address it. For
>> example, we have talked before about a bytecode transformer that simply
>> removes the bounds checking when loading Arrow if you want that behavior.
>> If necessary, that may be a big win from a code maintenance standpoint
>> over having duplicate interfaces.
>>
>> The static block seems like a non-problem. You could simply load another
>> class that sets that system property before loading any Arrow code. If
>> you're proposing a code change to solve your problem today, this seems
>> just as feasible.
>>
>> The other question: in a real algorithm, how much does that 30% matter?
>> Your benchmarks are entirely about this one call, whereas real workloads
>> are impacted by many things, and the time spent writing/reading vectors is
>> minuscule versus other things.
>>
>> On Mon, May 6, 2019 at 1:16 PM Fan Liya <liya.fa...@gmail.com> wrote:
>>
>> > Hi Jacques,
>> >
>> > Thank you so much for your kind reminder.
>> >
>> > To come up with some performance data, I have set up an environment and
>> > run some micro-benchmarks.
>> > The server runs Linux, has 64 cores, and has 256 GB of memory.
>> > The benchmarks are simple iterations over some double vectors (the
>> > source file is attached):
>> >
>> >   @Benchmark
>> >   @BenchmarkMode(Mode.AverageTime)
>> >   @OutputTimeUnit(TimeUnit.MICROSECONDS)
>> >   public void testSafe() {
>> >     safeSum = 0;
>> >     for (int i = 0; i < VECTOR_LENGTH; i++) {
>> >       safeVector.set(i, i + 10.0);
>> >       safeSum += safeVector.get(i);
>> >     }
>> >   }
>> >
>> >   @Benchmark
>> >   @BenchmarkMode(Mode.AverageTime)
>> >   @OutputTimeUnit(TimeUnit.MICROSECONDS)
>> >   public void testUnSafe() {
>> >     unSafeSum = 0;
>> >     for (int i = 0; i < VECTOR_LENGTH; i++) {
>> >       unsafeVector.set(i, i + 10.0);
>> >       unSafeSum += unsafeVector.get(i);
>> >     }
>> >   }
>> >
>> > The safe vector in the testSafe benchmark is from the original Arrow
>> > implementation, whereas the unsafe vector in the testUnSafe benchmark is
>> > based on our initial implementation in PR
>> > <https://github.com/apache/arrow/pull/4212> (this is not the final
>> > version, but we believe much of the overhead has been removed).
>> > The evaluation is based on the JMH framework (thanks to Jacques Nadeau
>> > for the suggestion). The framework runs the benchmarks enough times that
>> > the effects of JIT are well accounted for.
>> >
>> > In the first experiment, we use the default configuration (boundary
>> > checking enabled), and the original Arrow vector is about 4 times
>> > slower:
>> >
>> > Benchmark                       Mode  Cnt   Score   Error  Units
>> > VectorAPIBenchmarks.testSafe    avgt    5  11.546 ± 0.012  us/op
>> > VectorAPIBenchmarks.testUnSafe  avgt    5   2.822 ± 0.006  us/op
>> >
>> > In the second experiment, we disable the boundary checking via JVM
>> > options:
>> >
>> > -Ddrill.enable_unsafe_memory_access=true
>> > -Darrow.enable_unsafe_memory_access=true
>> >
>> > This time, the original Arrow vector is about 30% slower:
>> >
>> > Benchmark                       Mode  Cnt  Score   Error  Units
>> > VectorAPIBenchmarks.testSafe    avgt    5  4.069 ± 0.004  us/op
>> > VectorAPIBenchmarks.testUnSafe  avgt    5  2.819 ± 0.005  us/op
>> >
>> > This is a significant improvement: about 2.84x faster than when bounds
>> > checking is enabled.
>> > However, in our scenario, we would still choose to bypass the Arrow APIs
>> > without hesitation, because such memory accesses are such frequent
>> > operations that a 30% performance degradation would easily cost us our
>> > edge.
>> >
>> > The results can be attributed to the following factors:
>> > 1. Although the checks have been disabled, we still need to read the
>> > flag and check it repeatedly in the Arrow APIs, which accumulates into a
>> > large performance overhead.
>> > 2. There is too much code in the call stack compared with the unsafe
>> > API. This leads to a less efficient i-cache, even though the JIT can
>> > avoid the cost of stack frames by inlining most method code.
>> >
>> > Another, perhaps separate, problem is that the flag
>> > BoundsChecking#BOUNDS_CHECKING_ENABLED is final and initialized in a
>> > static block. That means the only reliable way to override it is to set
>> > the system property on the JVM command line. However, in some scenarios
>> > we do not have access to the command line (e.g. running Flink on YARN).
>> > I think this deserves a separate issue.
>> >
>> > Best,
>> > Liya Fan
>> >
>> > On Mon, May 6, 2019 at 1:23 PM Jacques Nadeau <jacq...@apache.org>
>> wrote:
>> >
>> >> >
>> >> > Maybe I need to take a closer look at how the other SQL engines are
>> >> > using Arrow, to see if they are also bypassing Arrow APIs.
>> >> > I agree that a random user should be able to protect themselves, and
>> >> > this is the utmost priority.
>> >> >
>> >> > According to my experience in Flink, the JIT cannot optimize away the
>> >> > checks, and removing the checks addresses the issue.
>> >> > I want to illustrate this from two points:
>> >> >
>> >> > 1. Theoretical viewpoint: the JIT makes optimizations without
>> >> > changing the semantics of the code, so it can never remove the checks
>> >> > without changing code semantics. To put it simply, if the JIT has
>> >> > witnessed the engine successfully process 1,000,000 records, how can
>> >> > it be sure that the 1,000,001st record will also succeed?
>> >> >
>> >> > 2. Practical viewpoint: we have evaluated our SQL engine on the TPC-H
>> >> > 1 TB data set. This is a really large number of records, so the JIT
>> >> > must have done all it could to improve the code. According to the
>> >> > performance results, however, it could not eliminate the impact
>> >> > caused by the checks.
>> >> >
>> >>
>> >> I don't think you're following my point. There are two different points
>> >> it seems like you want to discuss. Let's evaluate each separately:
>> >>
>> >> 1) Bounds checking for safety
>> >> 2) Supposed inefficiency of the call hierarchy.
>> >>
>> >> For #1, we provide a system-level property that can disable these
>> >> checks. The JVM should successfully optimize away this operation if
>> >> that flag is set. Please look at the JIT output to confirm whether this
>> >> is true.
>> >>
>> >> For #2: we designed things to collapse, so the call hierarchy
>> >> shouldn't be a problem. Please look at the JIT output to confirm.
>> >>
>> >> Please come with data around #1 and #2 to make an argument for a set of
>> >> changes.
>> >>
>> >> thanks
>> >>
>> >
>>
>
