Re: Vector Computation Optimization Approaches for AsterixDB

Mike Carey Fri, 13 Jun 2025 07:51:15 -0700

That reminds me - once upon a time, plan serialization/distribution/deserialization using Java serialization was kind of an expensive part of our path, when we were trying to shave off costs for little queries. I wonder if we should look at that again sometime? (Not our most urgent problem, this just reminded me.)


Cheers,


Mike

On 6/13/25 3:30 AM, Wail Alkowaileet wrote:

Quoting Photon
<https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf>
Paper:

After query planning, DBR launches tasks to execute the stages of the
plan. In a task with Photon, the Photon execution node first serializes the
Photon part of the plan into a Protobuf [6] message. This message is passed
via the Java Native Interface (JNI) [8] to the Photon C++ library, which
deserializes the Protobuf and converts it into a Photon-internal plan.


Let's see what others have done. E.g., Photon, Velox (+ Apache Gluten to
use Velox in Spark), Apache DataFusion Comet (Apache DataFusion is written
in Rust).

On Wed, Jun 11, 2025 at 1:55 AM Calvin Dani <calvinthomas.d...@gmail.com>
wrote:

Yes, I'll look into the JNA project too and explore approach 2 with both
FFM and JNA.

I'll prototype both approaches 1 and 2 and post a status update here.

On Jun 10, 2025, at 1:50 PM, Ian Maxon <ima...@apache.org> wrote:

The Vector API is in OpenJDK, so I think the licensing should be OK:
https://openjdk.org/jeps/508

The main problem is the fact it isn't a stable API yet, and it relies
on Valhalla. It would be a judgement call on how much we expect it to
change over time, and how difficult it would be to migrate things to
follow those changes. It would also be a bet that by the time
everything is done, this set of JDK features is more or less
stabilized.

Using FFI/JNI would be a more traditional way to go about it. FFI is
new and better than JNI, so if we choose to go with that, it should be
less painful. FFI is a preview feature, which is less risky than an
incubating feature.

There is also the JNA project, which wraps JNI to make it simpler:
https://github.com/java-native-access/jna . I'm assuming most of the
libraries we might want to use are mostly computational, so they
wouldn't have many platform-specific dependencies, just architecture
specific ones. I think it also handles the build aspect of it, which
FFI doesn't directly. Assuming the libraries we would want to use
aren't in libc or otherwise can't be assumed to be present, we would
have to include them in the jar somehow.

On Tue, Jun 10, 2025 at 8:27 AM Mike Carey <dtab...@gmail.com> wrote:

Q:  Are there licensing gotchas with approach 1 (which otherwise sounds
nicer from a maintenance standpoint)? We need to be sure that everything
we use is Apache-okay in terms of licensing.  It would be fun to see
some preliminary numbers on perf, e.g., for KNN, each way, were it as
easy as changing which function(s) to call...  :-)  That would help
quantify the two options (vs. each other and vs. none) too.
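
One way to make "changing which function(s) to call" cheap is to hide the
distance computation behind a small interface, so that a scalar baseline, a
Vector API version, and a native version could all be swapped under the same
brute-force KNN loop. The following is only a minimal sketch under that
assumption: the names KnnSketch, DistanceFunction, squaredEuclidean, and
nearest are hypothetical and are not AsterixDB APIs, and only the scalar
baseline (the "none" option) is shown.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical harness for comparing distance implementations; not AsterixDB code. */
public class KnnSketch {

    /** Hypothetical plug-in point: scalar, Vector API, or native code could implement it. */
    @FunctionalInterface
    public interface DistanceFunction {
        float distance(float[] a, float[] b);
    }

    /** Scalar baseline: squared Euclidean distance (assumes equal-length inputs). */
    public static float squaredEuclidean(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Brute-force KNN: returns the indexes of the k closest rows, closest first. */
    public static int[] nearest(float[][] data, float[] query, int k, DistanceFunction fn) {
        final float[] dist = new float[data.length];
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            dist[i] = fn.distance(data[i], query);
            order[i] = i;
        }
        // Sort row indexes by their distance to the query (stable, so ties keep row order).
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i]));
        int n = Math.min(k, order.length);
        int[] result = new int[n];
        for (int j = 0; j < n; j++) {
            result[j] = order[j];
        }
        return result;
    }
}
```

A comparison run would then call nearest(data, query, k, KnnSketch::squaredEuclidean)
and the same call with each alternative implementation, timing each variant
on identical data.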

On 6/10/25 7:24 AM, Calvin Dani wrote:
Hi,

As part of adding vector functionality to AsterixDB, I have been exploring
possible optimizations for vector computations. One promising direction is
leveraging SIMD operations to accelerate these calculations. Although Java
offers autovectorization to utilize SIMD, this approach requires the
operations to be branchless (i.e., no conditional branching like if/else),
and it may not always be triggered when vector calculations get complex.
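
To illustrate the branchless constraint, here is a small sketch of two
Manhattan-distance loops that compute the same value: one with an explicit
if/else in the loop body, and one that uses Math.abs so the body has no
explicit branch. The second is the shape an autovectorizer is more likely to
handle, but whether the JIT actually emits SIMD for it (particularly for a
floating-point reduction) depends on the JVM version and CPU and is not
guaranteed. The class and method names are illustrative only.

```java
/** Illustrative only: two equivalent loops, with and without a branch in the body. */
public class BranchlessSketch {

    /** Manhattan distance with a data-dependent branch inside the loop. */
    public static float manhattanBranchy(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            if (d < 0f) {
                sum -= d;   // negative difference: subtract to add its magnitude
            } else {
                sum += d;
            }
        }
        return sum;
    }

    /** Same computation without an explicit branch in the loop body. */
    public static float manhattanBranchless(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
```

Both methods assume equal-length inputs; a real implementation would validate
dimensions before entering the loop so that the check stays out of the hot
path.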

I have considered two main options for SIMD-enabled vector computation:

1. Java Vector API: Introduced as an incubation feature since Java 17, the
Vector API is part of the long-term Project Valhalla. While it remains in
incubation and likely won’t be finalized until Project Valhalla completes,
the API already supports the basic operations needed for our distance
metrics, such as Euclidean Distance, Manhattan Distance, Cosine Similarity,
and Dot Product. It also provides a primitive Vector<E> type which could
serve as a native storage for embeddings.

2. Foreign Function & Memory API: This allows calling optimized C/C++
libraries directly from Java. We could either leverage existing
highly-optimized vector computation libraries or implement our own native
code. However, packaging and ensuring compatibility of native libraries
across different target platforms may introduce complexity.

If you are aware of other solutions or have feedback on these options, I
would appreciate your insights.

Thank you,
Calvin Dani
