Thanks for the design. Personally, I'm not a huge fan of creating parallel classes for every vector type; this ends up being confusing for developers and adds a lot of boilerplate. I wonder if you could use an approach similar to the one the memory module uses for turning bounds checking on/off [1].
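To make the suggestion concrete, here is a rough sketch of what a flag-based approach could look like. It assumes the switch is a static final boolean read once from a system property; the class names, the property name "sketch.vector.unsafe", and the on-heap arrays are made up for illustration and are not Arrow APIs.

```java
// Hypothetical sketch: none of these names are Arrow APIs.
final class CheckFlags {
    // Read once when the class is initialized. The property name is invented
    // for this sketch. Checks stay on unless the property is set to "true".
    static final boolean BOUNDS_CHECKING_ENABLED =
        !Boolean.parseBoolean(System.getProperty("sketch.vector.unsafe", "false"));

    private CheckFlags() {}
}

final class SketchFloat8Vector {
    private final double[] values;  // stand-in for the off-heap value buffer
    private final boolean[] valid;  // stand-in for the validity bitmap

    SketchFloat8Vector(int capacity) {
        values = new double[capacity];
        valid = new boolean[capacity];
    }

    void set(int index, double value) {
        if (CheckFlags.BOUNDS_CHECKING_ENABLED) {
            checkIndex(index);
        }
        valid[index] = true;
        values[index] = value;
    }

    double get(int index) {
        if (CheckFlags.BOUNDS_CHECKING_ENABLED) {
            checkIndex(index);
            if (!valid[index]) {
                throw new IllegalStateException("value at index " + index + " is null");
            }
        }
        // Java arrays are still range-checked by the JVM; real off-heap
        // access would not be, which is where the flag matters.
        return values[index];
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= values.length) {
            throw new IndexOutOfBoundsException(
                "index " + index + " out of range [0, " + values.length + ")");
        }
    }
}
```

With a single class per vector type, callers do not need to pick between two hierarchies, and since the flag is a static final constant the JIT can usually drop the disabled branch. The trade-off is that the choice is made per JVM rather than per vector.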
Also, I think there was a comment on the JIRA, but are there any benchmarks to show the expected improvements? My limited understanding is that the JVM's JIT should inline small methods anyway [2], so it is not clear how much this will improve performance.

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
[2] https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f

On Sun, Apr 28, 2019 at 2:50 AM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi all,
>
> We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> general idea is attached below, and is also accessible from our online
> document
> <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>.
> Please give your valuable comments by commenting directly in our online
> document
> <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>,
> or by replying to this email thread.
>
> Thank you so much in advance.
>
> Best,
> Liya Fan
>
> Support Fast/Unsafe Vector APIs for Arrow
>
> Background
>
> In our effort to support columnar data format in Apache Flink, we chose
> Apache Arrow as the basic data structure. Arrow greatly simplifies the
> support of the columnar data format. However, for many scenarios, we find
> the performance unacceptable. Our investigation shows the reason is that
> there are too many redundant checks and computations in the current Arrow API.
>
> For example, the following figure shows that in a single call to the
> Float8Vector.get(int) method (one of the most frequently used APIs
> in Flink computation), there are 20+ method invocations.
>
> [image: image.png]
>
> There are many other APIs with similar problems. The redundant checks and
> computations impact performance severely.
> According to our evaluation, the
> performance may degrade by two or three orders of magnitude.
>
> Our Proposal
>
> For many scenarios, the checks can be avoided if the application
> developers can guarantee that all checks will pass. So our proposal is to
> provide some light-weight APIs. The APIs are also named *unsafe APIs*, in
> the sense that they skip most of the checks (not safe) to improve
> performance.
>
> In the light-weight APIs, we only provide minimum checks, or avoid checks
> altogether. The application owner can still develop and debug their code using
> the original safe APIs. Once all bugs have been fixed, they can switch to
> the unsafe APIs in the final version of their products and enjoy the high
> performance.
>
> Our Design
>
> Our goal is to include unsafe vector APIs in the Arrow code base, and allow
> our customers to switch to the new unsafe APIs without being aware of it,
> except for the higher performance. To achieve this goal, we make the
> following design choices:
>
> Vector Class Hierarchy
>
> Each unsafe vector is a subclass of the safe vector. For example, the
> unsafe Float8Vector is a subclass of org.apache.arrow.vector.Float8Vector:
>
> package org.apache.arrow.vector.unsafe;
>
> public class Float8Vector extends org.apache.arrow.vector.Float8Vector
>
> So the safe vector acts as a façade of the unsafe vector, and through
> polymorphism, users may not be aware of which type of vector they are
> working with. In addition, the common logic can be reused in the unsafe
> vectors, and we only need to override get/set related methods.
>
> Vector Creation
>
> We use factory methods to create each type of vector.
> Compared with
> vector constructors, the factory methods take one more parameter, the
> vectorType:
>
> public class VectorFactory {
>
>   public static Float8Vector createFloat8Vector(VectorType vectorType,
>       String name, BufferAllocator allocator);
>
> }
>
> VectorType is an enum to separate safe vectors from unsafe ones:
>
> public enum VectorType {
>   SAFE,
>   UNSAFE
> }
>
> With the factory methods, the old way of creating vectors by constructors
> can be gradually deprecated.
>
> Vector Implementation
>
> As discussed above, unsafe vectors mainly override get/set methods. For
> get methods, we directly operate on the off-heap memory, without any check:
>
> public double get(int index) {
>   return Double.longBitsToDouble(PlatformDependent.getLong(
>       valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH)));
> }
>
> Note that the PlatformDependent API is only 2 stack layers above the
> underlying UNSAFE method call.
>
> For set methods, we still need to set the validity bit. However, this is
> done through an unsafe method that directly sets the bits without checking:
>
> public void set(int index, double value) {
>   UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer, index);
>   PlatformDependent.putLong(
>       valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH),
>       Double.doubleToRawLongBits(value));
> }
>
> Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe version of
> BitVectorHelper.setValidityBitToOne that avoids checks.
>
> Test Cases
>
> We can reuse existing test cases by employing parameterized test classes
> to test both safe and unsafe vectors.
>
> Current Progress
>
> We have opened a JIRA for this work item, ARROW-5200
> <https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
> <https://github.com/apache/arrow/pull/4212> with an initial implementation
> has been opened. We would appreciate it if you could give some comments.
>
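For reference while reading the quoted design, the subclass-plus-factory layout it describes can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: it uses on-heap arrays instead of Arrow buffers, and every name in it (SketchVectorType, SafeDoubleVector, UncheckedDoubleVector, SketchVectorFactory) is hypothetical and not taken from the PR.

```java
// Hypothetical sketch of the quoted design; none of these names come from Arrow.
enum SketchVectorType { SAFE, UNSAFE }

class SafeDoubleVector {
    protected final double[] values;  // stand-in for the value buffer
    protected final boolean[] isSet;  // stand-in for the validity bitmap

    SafeDoubleVector(int capacity) {
        values = new double[capacity];
        isSet = new boolean[capacity];
    }

    // Checked accessor: validates the index and the validity bit.
    double get(int index) {
        checkIndex(index);
        if (!isSet[index]) {
            throw new IllegalStateException("value at index " + index + " is null");
        }
        return values[index];
    }

    void set(int index, double value) {
        checkIndex(index);
        isSet[index] = true;
        values[index] = value;
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= values.length) {
            throw new IndexOutOfBoundsException("index " + index + " out of range");
        }
    }
}

// Only the get/set methods are overridden; everything else is inherited.
final class UncheckedDoubleVector extends SafeDoubleVector {
    UncheckedDoubleVector(int capacity) {
        super(capacity);
    }

    @Override
    double get(int index) {
        return values[index];  // no explicit index or validity checks
    }

    @Override
    void set(int index, double value) {
        isSet[index] = true;
        values[index] = value;
    }
}

final class SketchVectorFactory {
    private SketchVectorFactory() {}

    // Callers receive the safe static type and need not know which variant they hold.
    static SafeDoubleVector create(SketchVectorType type, int capacity) {
        if (type == SketchVectorType.UNSAFE) {
            return new UncheckedDoubleVector(capacity);
        }
        return new SafeDoubleVector(capacity);
    }
}
```

In this layout the choice is made per vector at creation time, at the cost of one extra subclass for each vector type.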