Hi all,

We are proposing a new set of APIs in Arrow: unsafe vector APIs. The
general idea is described below, and is also available in our online
document
<https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>.
Please share your comments, either by commenting directly in our online
document
<https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing>,
or by replying to this email thread.

Thank you so much in advance.

Best,
Liya Fan

Support Fast/Unsafe Vector APIs for Arrow

Background

In our effort to support a columnar data format in Apache Flink, we chose
Apache Arrow as the basic data structure. Arrow greatly simplifies support
for the columnar data format. However, in many scenarios we find the
performance unacceptable. Our investigation shows that the reason is the
large number of redundant checks and computations in the current Arrow API.



For example, the following figure shows that a single call to the
Float8Vector.get(int) method (one of the most frequently used APIs in
Flink computation) involves 20+ method invocations.


[image: image.png]





There are many other APIs with similar problems. The redundant checks and
computations impact performance severely. According to our evaluation, the
performance may degrade by two or three orders of magnitude.

Our Proposal

In many scenarios, the checks can be avoided if the application
developers can guarantee that all checks will pass. So our proposal is to
provide some light-weight APIs. These APIs are also called *unsafe APIs*,
in the sense that they skip most of the checks (and are therefore not safe)
to improve performance.



In the light-weight APIs, we provide only minimal checks, or no checks at
all. Application owners can still develop and debug their code using the
original safe APIs. Once all bugs have been fixed, they can switch to the
unsafe APIs in the final version of their products and enjoy the higher
performance.

Our Design

Our goal is to include unsafe vector APIs in the Arrow code base, and to
allow our customers to switch to the new unsafe APIs without noticing any
difference other than the higher performance. To achieve this goal, we make
the following design choices:

Vector Class Hierarchy

Each unsafe vector is a subclass of the corresponding safe vector. For
example, the unsafe Float8Vector is a subclass of
org.apache.arrow.vector.Float8Vector:



package org.apache.arrow.vector.unsafe;



public class Float8Vector extends org.apache.arrow.vector.Float8Vector



So the safe vector acts as a façade for the unsafe vector, and through
polymorphism, users need not be aware of which type of vector they are
working with. In addition, the common logic can be reused in the unsafe
vectors, and we only need to override the get/set related methods.
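
To illustrate the hierarchy described above, here is a minimal,
self-contained sketch. It does not use Arrow: CheckedDoubleVector,
UncheckedDoubleVector, HierarchySketch and the checks counter are
hypothetical names invented for illustration, and java.nio.ByteBuffer
stands in for Arrow's off-heap buffers (note that ByteBuffer still performs
its own range checks, so this only shows the class structure, not the real
performance gain).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for a safe vector: every access is bounds-checked.
class CheckedDoubleVector {
    protected final ByteBuffer data;  // direct (off-heap) storage
    protected final int valueCount;
    int checks;                       // counts explicit checks, for illustration only

    CheckedDoubleVector(int valueCount) {
        this.valueCount = valueCount;
        this.data = ByteBuffer.allocateDirect(valueCount * Double.BYTES)
                .order(ByteOrder.nativeOrder());
    }

    void set(int index, double value) {
        checkIndex(index);
        data.putDouble(index * Double.BYTES, value);
    }

    double get(int index) {
        checkIndex(index);
        return data.getDouble(index * Double.BYTES);
    }

    private void checkIndex(int index) {
        checks++;
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
    }
}

// Hypothetical stand-in for the unsafe vector: it inherits everything and
// overrides only get/set, skipping the explicit index check.
class UncheckedDoubleVector extends CheckedDoubleVector {
    UncheckedDoubleVector(int valueCount) {
        super(valueCount);
    }

    @Override
    void set(int index, double value) {
        data.putDouble(index * Double.BYTES, value);
    }

    @Override
    double get(int index) {
        return data.getDouble(index * Double.BYTES);
    }
}

class HierarchySketch {
    // Written against the base type only; works with either implementation.
    static double sum(CheckedDoubleVector v, int n) {
        double total = 0;
        for (int i = 0; i < n; i++) {
            total += v.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        CheckedDoubleVector[] vectors = {
            new CheckedDoubleVector(3), new UncheckedDoubleVector(3)
        };
        for (CheckedDoubleVector v : vectors) {
            for (int i = 0; i < 3; i++) {
                v.set(i, i + 0.5);
            }
            System.out.println(v.getClass().getSimpleName()
                    + " sum=" + sum(v, 3) + " checks=" + v.checks);
        }
    }
}
```

The sum method never mentions the subclass, which is the property the
design relies on: calling code stays the same whichever vector it receives.
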

Vector Creation

We use factory methods to create each type of vector. Compared with the
vector constructors, the factory methods take one more parameter, the
vectorType:



public class VectorFactory {

  public static Float8Vector createFloat8Vector(
      VectorType vectorType, String name, BufferAllocator allocator);

}



VectorType is an enum to separate safe vectors from unsafe ones:



public enum VectorType {

  SAFE,

  UNSAFE

}



With the factory methods, the old way of creating vectors by constructors
can be gradually deprecated.
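
As a rough illustration of how such a factory might dispatch on the enum,
here is a self-contained sketch. The names VectorKind, BaseVector,
FastVector and VectorFactorySketch are hypothetical and chosen so as not to
be confused with the classes in the PR; the BufferAllocator parameter is
omitted to keep the example independent of Arrow.

```java
// Hypothetical enum mirroring the role of VectorType in the proposal.
enum VectorKind { SAFE, UNSAFE }

// Hypothetical safe vector.
class BaseVector {
    final String name;

    BaseVector(String name) {
        this.name = name;
    }

    boolean checksBounds() {
        return true;
    }
}

// Hypothetical unsafe vector, a subclass of the safe one.
class FastVector extends BaseVector {
    FastVector(String name) {
        super(name);
    }

    @Override
    boolean checksBounds() {
        return false;
    }
}

final class VectorFactorySketch {
    private VectorFactorySketch() {
    }

    // Returns the safe base type in both cases, so callers only change the
    // enum argument when switching implementations.
    static BaseVector createVector(VectorKind kind, String name) {
        switch (kind) {
            case SAFE:
                return new BaseVector(name);
            case UNSAFE:
                return new FastVector(name);
            default:
                throw new IllegalArgumentException("unknown kind: " + kind);
        }
    }

    public static void main(String[] args) {
        for (VectorKind kind : VectorKind.values()) {
            BaseVector v = createVector(kind, "price");
            System.out.println(kind + " -> " + v.getClass().getSimpleName()
                    + ", checksBounds=" + v.checksBounds());
        }
    }
}
```

With this shape, an application could read the kind from configuration, use
SAFE during development and debugging, and switch to UNSAFE for production
without touching the rest of its code.
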

Vector Implementation

As discussed above, unsafe vectors mainly override get/set methods. For get
methods, we directly operate on the off-heap memory, without any check:



public double get(int index) {

    return Double.longBitsToDouble(
        PlatformDependent.getLong(
            valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH)));

}



Note that the PlatformDependent API is only 2 stack layers above the
underlying UNSAFE method call.
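
The address arithmetic in the get method can be tried out in isolation. The
sketch below assumes TYPE_LOG2_WIDTH is 3 (a double occupies 8 = 2^3
bytes), which the text above does not state explicitly, and uses
java.nio.ByteBuffer instead of PlatformDependent so that it runs without
Netty or Arrow. OffsetSketch and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class OffsetSketch {
    // Assumed value: log2 of the width of a double in bytes (8 bytes -> 3).
    static final int TYPE_LOG2_WIDTH = 3;

    // Byte offset of the value at the given index. Shifting left by 3 is the
    // same as multiplying by 8. Widening to long first is a precaution in
    // this sketch so that very large indexes cannot overflow an int.
    static long byteOffset(int index) {
        return (long) index << TYPE_LOG2_WIDTH;
    }

    // Stores the double as its raw 64-bit pattern, as the proposed set does.
    static void write(ByteBuffer buf, int index, double value) {
        buf.putLong((int) byteOffset(index), Double.doubleToRawLongBits(value));
    }

    // Reads the 64-bit pattern back and reinterprets it as a double.
    static double read(ByteBuffer buf, int index) {
        return Double.longBitsToDouble(buf.getLong((int) byteOffset(index)));
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(8 * Double.BYTES)
                .order(ByteOrder.nativeOrder());
        write(buf, 5, 3.25);
        System.out.println("offset of index 5: " + byteOffset(5));
        System.out.println("value at index 5:  " + read(buf, 5));
    }
}
```
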



For set methods, we still need to set the validity bit. However, this is
done through an unsafe method that sets the bit directly, without checking:



public void set(int index, double value) {

    UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer, index);

    PlatformDependent.putLong(
        valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH),
        Double.doubleToRawLongBits(value));

}



Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe version of
BitVectorHelper.setValidityBitToOne that avoids checks.
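
For readers unfamiliar with the validity bitmap, the following sketch shows
the bit arithmetic that such a helper has to perform. It is not the code
from the PR: ValidityBitSketch is a hypothetical class that operates on a
plain byte[] (which Java still bounds-checks), and it assumes the Arrow
layout in which value i is tracked by bit (i % 8) of byte (i / 8), counting
from the least significant bit.

```java
class ValidityBitSketch {
    // Marks the value at the given index as non-null by setting its bit.
    static void setBitToOne(byte[] validity, int index) {
        int byteIndex = index >> 3;  // index / 8
        int bitIndex = index & 7;    // index % 8
        validity[byteIndex] |= (byte) (1 << bitIndex);
    }

    // Returns true if the bit for the given index is set.
    static boolean isSet(byte[] validity, int index) {
        return (validity[index >> 3] & (1 << (index & 7))) != 0;
    }

    public static void main(String[] args) {
        byte[] validity = new byte[2];  // room for 16 values
        setBitToOne(validity, 10);
        System.out.println("bit 10 set: " + isSet(validity, 10));
        System.out.println("bit 9 set:  " + isSet(validity, 9));
    }
}
```
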


Test Cases

We can reuse existing test cases by employing parameterized test classes to
test both safe and unsafe vectors.

Current Progress

We have opened a JIRA for this work item, ARROW-5200
<https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
<https://github.com/apache/arrow/pull/4212> with initial implementations
has been opened. We would appreciate it if you could give some comments.
