I'm also curious which APIs are particularly problematic for performance. In ARROW-1833 [1] and some related discussions, there was a suggestion to add methods like getUnsafe, which would behave like get(i) [2] but without checking the validity bitmap.
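To make the getUnsafe idea concrete, here is a rough, self-contained sketch of the difference between a checked and an unchecked read. It does not use Arrow at all: ToyFloat8Vector, its byte[] validity bitmap, and the getUnsafe signature are assumptions made for illustration, not the actual Float8Vector code or the API discussed in ARROW-1833.

```java
// Toy stand-in for a nullable float8 vector; NOT the Arrow Float8Vector class.
final class ToyFloat8Vector {
    private final byte[] validity; // one bit per slot, 1 = value has been set
    private final double[] values;

    ToyFloat8Vector(int capacity) {
        validity = new byte[(capacity + 7) / 8];
        values = new double[capacity];
    }

    /** Stores the value and marks the slot as non-null in the bitmap. */
    void set(int index, double value) {
        validity[index >> 3] |= (byte) (1 << (index & 7));
        values[index] = value;
    }

    boolean isNull(int index) {
        return (validity[index >> 3] & (1 << (index & 7))) == 0;
    }

    /** Checked read: consults the validity bitmap before returning. */
    double get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("Value at index " + index + " is null");
        }
        return values[index];
    }

    /** Hypothetical unchecked read: the caller promises the slot is non-null. */
    double getUnsafe(int index) {
        return values[index];
    }
}
```

In this toy version, getUnsafe on a null slot silently returns whatever is stored in the value array (0.0 for a fresh vector), which is exactly the kind of behavior the caller would have to rule out beforehand.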
[1]: https://issues.apache.org/jira/browse/ARROW-1833
[2]: https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/Float8Vector.java#L99

On Mon, Apr 29, 2019 at 1:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Thanks for the design. Personally, I'm not a huge fan of creating
> parallel classes for every vector type; this ends up being confusing for
> developers and adds a lot of boilerplate. I wonder if you could use an
> approach similar to the one the memory module uses for turning bounds
> checking on/off [1].
>
> Also, I think there was a comment on the JIRA, but are there any benchmarks
> to show the expected improvements? My limited understanding is that for
> small methods the JVM's JIT should inline them anyway [2], so it is not
> clear how much this will improve performance.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> [2]
> https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f
>
> On Sun, Apr 28, 2019 at 2:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi all,
> >
> > We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> > general idea is attached below, and is also accessible from our online
> > document
> > <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>.
> > Please give your valuable comments by directly commenting in our online
> > document
> > <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>,
> > or by replying to this email thread.
> >
> > Thank you so much in advance.
> >
> > Best,
> > Liya Fan
> >
> > Support Fast/Unsafe Vector APIs for Arrow
> >
> > Background
> >
> > In our effort to support columnar data format in Apache Flink, we chose
> > Apache Arrow as the basic data structure.
> > Arrow greatly simplifies the support of the columnar data format.
> > However, for many scenarios, we find the performance unacceptable. Our
> > investigation shows the reason is that there are too many redundant
> > checks and computations in the current Arrow API.
> >
> > For example, the following figure shows that in a single call to the
> > Float8Vector.get(int) method (one of the most frequently used APIs in
> > Flink computation), there are 20+ method invocations.
> >
> > [image: image.png]
> >
> > There are many other APIs with similar problems. The redundant checks and
> > computations impact performance severely. According to our evaluation, the
> > performance may degrade by two or three orders of magnitude.
> >
> > Our Proposal
> >
> > For many scenarios, the checks can be avoided if the application
> > developers can guarantee that all checks will pass. So our proposal is to
> > provide some lightweight APIs. These APIs are also named *unsafe APIs*, in
> > the sense that they skip most of the checks (not safe) to improve
> > performance.
> >
> > In the lightweight APIs, we provide only minimal checks, or avoid checks
> > altogether. Application owners can still develop and debug their code
> > using the original safe APIs. Once all bugs have been fixed, they can
> > switch to the unsafe APIs in the final version of their products and
> > enjoy the high performance.
> >
> > Our Design
> >
> > Our goal is to include unsafe vector APIs in the Arrow code base, and to
> > allow our customers to switch to the new unsafe APIs without being aware
> > of it, except for the higher performance. To achieve this goal, we make
> > the following design choices:
> >
> > Vector Class Hierarchy
> >
> > Each unsafe vector is a subclass of the safe vector.
> > For example, the unsafe Float8Vector is a subclass of
> > org.apache.arrow.vector.Float8Vector:
> >
> > package org.apache.arrow.vector.unsafe;
> >
> > public class Float8Vector extends org.apache.arrow.vector.Float8Vector
> >
> > So the safe vector acts as a façade of the unsafe vector, and through
> > polymorphism, users may not be aware of which type of vector they are
> > working with. In addition, the common logic can be reused in the unsafe
> > vectors, and we only need to override get/set related methods.
> >
> > Vector Creation
> >
> > We use factory methods to create each type of vector. Compared with
> > vector constructors, the factory methods take one more parameter, the
> > vectorType:
> >
> > public class VectorFactory {
> >   public static Float8Vector createFloat8Vector(VectorType vectorType,
> >       String name, BufferAllocator allocator);
> > }
> >
> > VectorType is an enum to separate safe vectors from unsafe ones:
> >
> > public enum VectorType {
> >   SAFE,
> >   UNSAFE
> > }
> >
> > With the factory methods, the old way of creating vectors by constructors
> > can be gradually deprecated.
> >
> > Vector Implementation
> >
> > As discussed above, unsafe vectors mainly override get/set methods. For
> > get methods, we directly operate on the off-heap memory, without any check:
> >
> > public double get(int index) {
> >   return Double.longBitsToDouble(
> >       PlatformDependent.getLong(
> >           valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH)));
> > }
> >
> > Note that the PlatformDependent API is only 2 stack layers above the
> > underlying UNSAFE method call.
> >
> > For set methods, we still need to set the validity bit.
> > However, this is done through an unsafe method that directly sets the
> > bits without checking:
> >
> > public void set(int index, double value) {
> >   UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer, index);
> >   PlatformDependent.putLong(
> >       valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH),
> >       Double.doubleToRawLongBits(value));
> > }
> >
> > Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe version of
> > BitVectorHelper.setValidityBitToOne that avoids checks.
> >
> > Test Cases
> >
> > We can reuse existing test cases by employing parameterized test classes
> > to test both safe and unsafe vectors.
> >
> > Current Progress
> >
> > We have opened a JIRA for this work item, ARROW-5200
> > <https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
> > <https://github.com/apache/arrow/pull/4212> with an initial
> > implementation has been opened. We would appreciate it if you could give
> > some comments.
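
For readers who want to see the shape of the proposed design (safe base class, unsafe subclass that overrides only get/set, and a factory keyed on a type enum), here is a rough, self-contained sketch. All names in it (SketchVectorType, SketchFloat8Vector, UncheckedSketchFloat8Vector, SketchVectorFactory) are hypothetical stand-ins rather than the classes from the PR, and plain Java arrays replace the off-heap buffers, so the JVM's own array bounds checks still apply. It only mirrors the structure, not the performance characteristics.

```java
// Toy stand-ins for the proposal's classes; none of these are Arrow types.
enum SketchVectorType { SAFE, UNSAFE }

class SketchFloat8Vector {
    protected final boolean[] valid;  // stand-in for the validity buffer
    protected final double[] values;  // stand-in for the off-heap value buffer

    SketchFloat8Vector(int capacity) {
        valid = new boolean[capacity];
        values = new double[capacity];
    }

    /** Safe write: explicit range check, then mark the slot valid. */
    void set(int index, double value) {
        checkIndex(index);
        valid[index] = true;
        values[index] = value;
    }

    /** Safe read: range check plus null check. */
    double get(int index) {
        checkIndex(index);
        if (!valid[index]) {
            throw new IllegalStateException("Value at index " + index + " is null");
        }
        return values[index];
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= values.length) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}

/** Unsafe variant: overrides only get/set and drops the explicit checks. */
final class UncheckedSketchFloat8Vector extends SketchFloat8Vector {
    UncheckedSketchFloat8Vector(int capacity) {
        super(capacity);
    }

    @Override
    void set(int index, double value) {
        valid[index] = true; // the validity bit is still maintained
        values[index] = value;
    }

    @Override
    double get(int index) {
        return values[index]; // caller guarantees the slot is set
    }
}

/** Factory that hides which variant the caller receives. */
final class SketchVectorFactory {
    private SketchVectorFactory() {}

    static SketchFloat8Vector createFloat8Vector(SketchVectorType type, int capacity) {
        return type == SketchVectorType.UNSAFE
            ? new UncheckedSketchFloat8Vector(capacity)
            : new SketchFloat8Vector(capacity);
    }
}
```

Because callers only see the base type, switching between SAFE and UNSAFE is a one-argument change at creation time, which is also what would let a parameterized test run the same assertions against both variants.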