Thanks for the design.   Personally, I'm not a huge fan of creating a
parallel classes for every vector type, this ends up being confusing for
developers and adds a lot of boiler plate.  I wonder if you could use a
similar approach that the memory module uses for turning bounds checking
on/off [1].

Also, I think there was a comment on the JIRA, but are there any benchmarks
to show the expected improvements?  My limited understanding is that for
small methods the JVM's JIT should inline them anyways [2] , so it is not
clear how much this will improve performance.


Thanks,
Micah


[1]
https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
[2]
https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f

On Sun, Apr 28, 2019 at 2:50 AM Fan Liya <liya.fa...@gmail.com> wrote:

> Hi all,
>
> We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> general ideas is attached below, and also accessible from our online
> document
> <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>.
> Please give your valuable comments by directly commenting in our online
> document
> <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>,
> or relaying this email thread.
>
> Thank you so much in advance.
>
> Best,
> Liya Fan
>
> Support Fast/Unsafe Vector APIs for Arrow Background
>
> In our effort to support columnar data format in Apache Flink, we chose
> Apache Arrow as the basic data structure. Arrow greatly simplifies the
> support of the columnar data format. However, for many scenarios, we find
> the performance unacceptable. Our investigation shows the reason is that,
> there are too many redundant checks and computations in current Arrow API.
>
>
>
> For example, the following figures shows that in a single call to
> Float8Vector.get(int) method (this is one of the most frequently used APIs
> in Flink computation),  there are 20+ method invocations.
>
>
> [image: image.png]
>
>
>
>
>
> There are many other APIs with similar problems. The redundant checks and
> computations impact performance severely. According to our evaluation, the
> performance may degrade by two or three orders of magnitude.
> Our Proposal
>
> For many scenarios, the checks can be avoided, if the application
> developers can guarantee that all checks will pass. So our proposal is to
> provide some light-weight APIs. The APIs are also named *unsafe APIs*, in
> the sense that that skip most of the checks (not safe) to improve the
> performance.
>
>
>
> In the light-weight APIs, we only provide minimum checks, or avoid checks
> at all. The application owner can still develop and debug their code using
> the original safe APIs. Once all bugs have been fixed, they can switch to
> unsafe APIs in the final version of their products and enjoy the high
> performance.
> Our Design
>
> Our goal is to include unsafe vector APIs in Arrow code base, and allow
> our customers switching to the new unsafe APIs, without being aware of it,
> except for the high performance. To achieve this goal, we make the
> following design choices:
> Vector Class Hierarchy
>
> Each unsafe vector is the subclass of the safe vector. For example, the
> unsafe Float8Vector is a subclass of org.apache.arrow.vector.Float8Vector:
>
>
>
> package org.apache.arrow.vector.unsafe;
>
>
>
> public class Float8Vector extends org.apache.arrow.vector.Float8Vector
>
>
>
> So the safe vector acts as a façade of the unsafe vector, and through
> polymorphism, the users may not be aware of which type of vector he/she is
> working with. In addition, the common logics can be reused in the unsafe
> vectors, and we only need to override get/set related methods.
> Vector Creation
>
> We use factory methods to create each type of vectors. Compared with
> vector constructors, the factory methods take one more parameter, the
> vectorType:
>
>
>
> public class VectorFactory {
>
>   public static Float8Vector createFloat8Vector(VectorType vectorType,
> String name, BufferAllocator allocator);
>
> }
>
>
>
> VectorType is an enum to separate safe vectors from unsafe ones:
>
>
>
> public enum VectorType {
>
>   SAFE,
>
>   UNSAFE
>
> }
>
>
>
> With the factory methods, the old way of creating vectors by constructors
> can be gradually depreciated.
> Vector Implementation
>
> As discussed above, unsafe vectors mainly override get/set methods. For
> get methods, we directly operate on the off-heap memory, without any check:
>
>
>
> public double get(int index) {
>
>     return
> Double.longBitsToDouble(PlatformDependent.getLong(valueBuffer.memoryAddress()
> + (index << TYPE_LOG2_WIDTH)));
>
> }
>
>
>
> Note that the PlatformDependent API is only 2 stack layers above the
> underlying UNSAFE method call.
>
>
>
> For set methods, we still need to set the validity bit. However, this is
> through an unsafe method that directly sets the bits without checking:
>
>
>
>          public void set(int index, double value) {
>
>       UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer, index);
>
> PlatformDependent.putLong(
>
>             valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH),
> Double.doubleToRawLongBits(value));
>
> }
>
>
>
> Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe version of
> BitVectorHelper.setValidityBitToOne that avoids checks.
>
>
> Test Cases
>
> We can reuse existing test cases by employing parameterized test classes
> to test both safe and unsafe vectors.
> Current Progress
>
> We have opened a JIRA for this work item FlINK-5200
> <https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
> <https://github.com/apache/arrow/pull/4212> with initial implementations
> have been opened. We would appreciate if you could give some comments.
>

Reply via email to