I'm also curious which APIs are particularly problematic for
performance. In ARROW-1833 [1] and some related discussions there was
the suggestion of adding methods like getUnsafe, so this would be like
get(i) [2] but without checking the validity bitmap

[1] : https://issues.apache.org/jira/browse/ARROW-1833
[2]: 
https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/Float8Vector.java#L99

On Mon, Apr 29, 2019 at 1:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Thanks for the design.   Personally, I'm not a huge fan of creating a
> parallel classes for every vector type, this ends up being confusing for
> developers and adds a lot of boiler plate.  I wonder if you could use a
> similar approach that the memory module uses for turning bounds checking
> on/off [1].
>
> Also, I think there was a comment on the JIRA, but are there any benchmarks
> to show the expected improvements?  My limited understanding is that for
> small methods the JVM's JIT should inline them anyways [2] , so it is not
> clear how much this will improve performance.
>
>
> Thanks,
> Micah
>
>
> [1]
> https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> [2]
> https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f
>
> On Sun, Apr 28, 2019 at 2:50 AM Fan Liya <liya.fa...@gmail.com> wrote:
>
> > Hi all,
> >
> > We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> > general ideas is attached below, and also accessible from our online
> > document
> > <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>.
> > Please give your valuable comments by directly commenting in our online
> > document
> > <https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>,
> > or relaying this email thread.
> >
> > Thank you so much in advance.
> >
> > Best,
> > Liya Fan
> >
> > Support Fast/Unsafe Vector APIs for Arrow Background
> >
> > In our effort to support columnar data format in Apache Flink, we chose
> > Apache Arrow as the basic data structure. Arrow greatly simplifies the
> > support of the columnar data format. However, for many scenarios, we find
> > the performance unacceptable. Our investigation shows the reason is that,
> > there are too many redundant checks and computations in current Arrow API.
> >
> >
> >
> > For example, the following figures shows that in a single call to
> > Float8Vector.get(int) method (this is one of the most frequently used APIs
> > in Flink computation),  there are 20+ method invocations.
> >
> >
> > [image: image.png]
> >
> >
> >
> >
> >
> > There are many other APIs with similar problems. The redundant checks and
> > computations impact performance severely. According to our evaluation, the
> > performance may degrade by two or three orders of magnitude.
> > Our Proposal
> >
> > For many scenarios, the checks can be avoided, if the application
> > developers can guarantee that all checks will pass. So our proposal is to
> > provide some light-weight APIs. The APIs are also named *unsafe APIs*, in
> > the sense that that skip most of the checks (not safe) to improve the
> > performance.
> >
> >
> >
> > In the light-weight APIs, we only provide minimum checks, or avoid checks
> > at all. The application owner can still develop and debug their code using
> > the original safe APIs. Once all bugs have been fixed, they can switch to
> > unsafe APIs in the final version of their products and enjoy the high
> > performance.
> > Our Design
> >
> > Our goal is to include unsafe vector APIs in Arrow code base, and allow
> > our customers switching to the new unsafe APIs, without being aware of it,
> > except for the high performance. To achieve this goal, we make the
> > following design choices:
> > Vector Class Hierarchy
> >
> > Each unsafe vector is the subclass of the safe vector. For example, the
> > unsafe Float8Vector is a subclass of org.apache.arrow.vector.Float8Vector:
> >
> >
> >
> > package org.apache.arrow.vector.unsafe;
> >
> >
> >
> > public class Float8Vector extends org.apache.arrow.vector.Float8Vector
> >
> >
> >
> > So the safe vector acts as a façade of the unsafe vector, and through
> > polymorphism, the users may not be aware of which type of vector he/she is
> > working with. In addition, the common logics can be reused in the unsafe
> > vectors, and we only need to override get/set related methods.
> > Vector Creation
> >
> > We use factory methods to create each type of vectors. Compared with
> > vector constructors, the factory methods take one more parameter, the
> > vectorType:
> >
> >
> >
> > public class VectorFactory {
> >
> >   public static Float8Vector createFloat8Vector(VectorType vectorType,
> > String name, BufferAllocator allocator);
> >
> > }
> >
> >
> >
> > VectorType is an enum to separate safe vectors from unsafe ones:
> >
> >
> >
> > public enum VectorType {
> >
> >   SAFE,
> >
> >   UNSAFE
> >
> > }
> >
> >
> >
> > With the factory methods, the old way of creating vectors by constructors
> > can be gradually depreciated.
> > Vector Implementation
> >
> > As discussed above, unsafe vectors mainly override get/set methods. For
> > get methods, we directly operate on the off-heap memory, without any check:
> >
> >
> >
> > public double get(int index) {
> >
> >     return
> > Double.longBitsToDouble(PlatformDependent.getLong(valueBuffer.memoryAddress()
> > + (index << TYPE_LOG2_WIDTH)));
> >
> > }
> >
> >
> >
> > Note that the PlatformDependent API is only 2 stack layers above the
> > underlying UNSAFE method call.
> >
> >
> >
> > For set methods, we still need to set the validity bit. However, this is
> > through an unsafe method that directly sets the bits without checking:
> >
> >
> >
> >          public void set(int index, double value) {
> >
> >       UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer, index);
> >
> > PlatformDependent.putLong(
> >
> >             valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH),
> > Double.doubleToRawLongBits(value));
> >
> > }
> >
> >
> >
> > Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe version of
> > BitVectorHelper.setValidityBitToOne that avoids checks.
> >
> >
> > Test Cases
> >
> > We can reuse existing test cases by employing parameterized test classes
> > to test both safe and unsafe vectors.
> > Current Progress
> >
> > We have opened a JIRA for this work item FlINK-5200
> > <https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
> > <https://github.com/apache/arrow/pull/4212> with initial implementations
> > have been opened. We would appreciate if you could give some comments.
> >

Reply via email to