I am still asking the same question: can you please analyze the assembly the JIT is producing to identify why the case with bounds checking disabled is still 30% slower, and what kinds of things we can do to address it? For example, we have talked before about a bytecode transformer that simply removes the bounds checking when loading Arrow, if you want that behavior. If necessary, that may be a big win from a code-maintenance standpoint over having duplicate interfaces.
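The bytecode-transformer idea could be sketched as a standard `java.lang.instrument` agent. This is only a hypothetical skeleton (the class name and package filter are illustrative, not existing Arrow code); the actual removal of the check call sites would need a bytecode library such as ASM, so that step is left as a no-op stub here:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Hypothetical agent skeleton: intercepts Arrow vector classes as they are
// loaded, where a bytecode rewriter could strip the bounds-check call sites.
public final class ArrowBoundsCheckStripper implements ClassFileTransformer {

    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        // Only consider Arrow vector classes; leave everything else alone.
        if (className == null || !className.startsWith("org/apache/arrow/vector/")) {
            return null; // null tells the JVM to keep the class bytes unchanged
        }
        // The real rewrite (e.g. with ASM) would drop the bounds-check
        // invocations from classfileBuffer; stubbed out in this sketch.
        return null;
    }

    // Entry point named by the Premain-Class manifest attribute, activated
    // with -javaagent:stripper.jar on the command line.
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new ArrowBoundsCheckStripper());
    }
}
```

Returning `null` from `transform` is the documented way to say "no change", so the stub above is safe to load even before the rewrite step is written.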
The static block seems like a non-problem. You could simply load another class that sets that system property before loading any Arrow code. If you're proposing a code change to solve your problem today, this seems just as feasible.

The other question: in a real algorithm, how much does that 30% matter? Your benchmarks are entirely about this one call, whereas real workloads are impacted by many things, and the time spent writing/reading vectors is minuscule versus other things.

On Mon, May 6, 2019 at 1:16 PM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi Jacques,
>
> Thank you so much for your kind reminder.
>
> To come up with some performance data, I have set up an environment and
> run some micro-benchmarks.
> The server runs Linux, has 64 cores, and has 256 GB of memory.
> The benchmarks are simple iterations over some double vectors (the source
> file is attached):
>
> @Benchmark
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.MICROSECONDS)
> public void testSafe() {
>   safeSum = 0;
>   for (int i = 0; i < VECTOR_LENGTH; i++) {
>     safeVector.set(i, i + 10.0);
>     safeSum += safeVector.get(i);
>   }
> }
>
> @Benchmark
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.MICROSECONDS)
> public void testUnSafe() {
>   unSafeSum = 0;
>   for (int i = 0; i < VECTOR_LENGTH; i++) {
>     unsafeVector.set(i, i + 10.0);
>     unSafeSum += unsafeVector.get(i);
>   }
> }
>
> The safe vector in the testSafe benchmark is from the original Arrow
> implementation, whereas the unsafe vector in the testUnSafe benchmark is
> based on our initial implementation in PR
> <https://github.com/apache/arrow/pull/4212> (this is not the final
> version; however, we believe much of the overhead has been removed).
> The evaluation is based on the JMH framework (thanks to the suggestion
> from Jacques Nadeau). The benchmarks are run so many times by the
> framework that the effects of JIT are well accounted for.
> In the first experiment, we use the default configuration (bounds
> checking enabled), and the original Arrow vector is about 4 times slower:
>
> Benchmark                        Mode  Cnt   Score   Error  Units
> VectorAPIBenchmarks.testSafe     avgt    5  11.546 ± 0.012  us/op
> VectorAPIBenchmarks.testUnSafe   avgt    5   2.822 ± 0.006  us/op
>
> In the second experiment, we disable the bounds checking via JVM options:
>
> -Ddrill.enable_unsafe_memory_access=true
> -Darrow.enable_unsafe_memory_access=true
>
> This time, the original Arrow vector is about 30% slower:
>
> Benchmark                        Mode  Cnt  Score   Error  Units
> VectorAPIBenchmarks.testSafe     avgt    5  4.069 ± 0.004  us/op
> VectorAPIBenchmarks.testUnSafe   avgt    5  2.819 ± 0.005  us/op
>
> This is a significant improvement, about 2.84x faster compared with
> bounds checking enabled.
> However, in our scenario, we would still choose to bypass the Arrow APIs
> without hesitation, because these memory accesses are such frequent
> operations that a 30% performance degradation would easily cost us our
> edge.
>
> The results can be attributed to the following factors:
> 1. Although the checks have been disabled, we still need to read the flag
> and check it repeatedly in the Arrow APIs, which accumulates into a large
> performance overhead.
> 2. There is too much code on the call stack compared with the unsafe
> API. This leads to a less efficient i-cache, even if the JIT avoids the
> cost of stack frames by inlining most method code.
>
> Another, perhaps separate, problem is that the
> flag BoundsChecking#BOUNDS_CHECKING_ENABLED is final and initialized in a
> static block. That means the only reliable way to override it is to set
> the system property on the JVM command line. However, in some scenarios
> we do not have access to the command line (e.g. running Flink on YARN).
> I think this deserves a separate issue.
> Best,
> Liya Fan
>
> On Mon, May 6, 2019 at 1:23 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
>> > Maybe I need to take a closer look at how the other SQL engines are
>> > using Arrow, to see if they are also bypassing Arrow APIs.
>> > I agree that a random user should be able to protect themselves, and
>> > this is the utmost priority.
>> >
>> > In my experience with Flink, the JIT cannot optimize away the checks,
>> > and removing the checks addresses the issue.
>> > I want to illustrate this from two points:
>> >
>> > 1. Theoretical viewpoint: the JIT makes optimizations without changing
>> > the semantics of the code, so it can never remove the checks without
>> > changing code semantics. To put it simply, if the JIT has witnessed the
>> > engine successfully process 1,000,000 records, how can it be sure that
>> > the 1,000,001st record will also succeed?
>> >
>> > 2. Practical viewpoint: we have evaluated our SQL engine on the TPC-H
>> > 1 TB data set. That is a very large number of records, so the JIT must
>> > have done all it could to improve the code. According to the
>> > performance results, however, it could not eliminate the impact caused
>> > by the checks.
>>
>> I don't think you're following my point. There are two different points
>> it seems like you want to discuss. Let's evaluate each separately:
>>
>> 1) Bounds checking for safety
>> 2) Supposed inefficiency of the call hierarchy
>>
>> For #1, we provide a system-level property that can disable these. The
>> JVM should successfully optimize away this operation if that flag is
>> set. Please look at the JIT output to confirm whether this is true.
>>
>> For #2, we designed things to collapse, so the call hierarchy shouldn't
>> be a problem. Please look at the JIT output to confirm.
>>
>> Please come with data around #1 and #2 to make an argument for a set of
>> changes.
>>
>> thanks
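On the static-block point raised above, here is a minimal, self-contained sketch of both halves of the issue. The class and initialization logic are simplified stand-ins for `org.apache.arrow.memory.BoundsChecking` (treat the exact details as assumptions): a `static final` flag read once in a static initializer is effectively a constant to the JIT, which is what lets it fold the branch away, but it also means the value is frozen at class-initialization time. The workaround when the JVM command line is unavailable (e.g. Flink on YARN) is to set the property programmatically before the holder class is first touched:

```java
public final class FlagDemo {

    // Simplified stand-in for org.apache.arrow.memory.BoundsChecking:
    // the flag is computed once, in a static block, and never changes.
    static final class BoundsChecking {
        static final boolean BOUNDS_CHECKING_ENABLED;
        static {
            String unsafe =
                System.getProperty("arrow.enable_unsafe_memory_access", "false");
            BOUNDS_CHECKING_ENABLED = !"true".equals(unsafe);
        }
    }

    public static void main(String[] args) {
        // Set the property BEFORE the holder class is initialized. Once
        // BoundsChecking is loaded, the value is fixed for the life of
        // the JVM, which is exactly why it must happen first.
        System.setProperty("arrow.enable_unsafe_memory_access", "true");

        // First touch of BoundsChecking runs its static block now, after
        // the property is in place, so checking is disabled.
        System.out.println(BoundsChecking.BOUNDS_CHECKING_ENABLED); // prints false
    }
}
```

This is the "load another class that sets that system property before loading any Arrow code" approach: any class whose static initializer runs before the first Arrow class is loaded can do the `System.setProperty` call.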