That reminds me - once upon a time, plan
serialization/distribution/deserialization using Java serialization was
kind of an expensive part of our path, when we were trying to shave off
costs for little queries. I wonder if we should look at that again
sometime? (Not our most urgent problem, this just reminded me.)
Cheers,
Mike
On 6/13/25 3:30 AM, Wail Alkowaileet wrote:
Quoting the Photon paper
<https://people.eecs.berkeley.edu/~matei/papers/2022/sigmod_photon.pdf>:
"After query planning, DBR launches tasks to execute the stages of the
plan. In a task with Photon, the Photon execution node first serializes the
Photon part of the plan into a Protobuf [6] message. This message is passed
via the Java Native Interface (JNI) [8] to the Photon C++ library, which
deserializes the Protobuf and converts it into a Photon-internal plan."
Let's see what others have done. E.g., Photon, Velox (+ Apache Gluten to
use Velox in Spark), Apache DataFusion Comet (Apache DataFusion is written
in Rust).
On Wed, Jun 11, 2025 at 1:55 AM Calvin Dani <calvinthomas.d...@gmail.com>
wrote:
Yes, I’ll look into the JNA project too and explore approach 2 with both
FFM and JNA.
I’ll prototype both approaches 1 and 2 and post a status update here.
On Jun 10, 2025, at 1:50 PM, Ian Maxon <ima...@apache.org> wrote:
The Vector API is in OpenJDK, so I think the licensing should be OK:
https://openjdk.org/jeps/508
The main problem is the fact it isn't a stable API yet, and it relies
on Valhalla. It would be a judgement call on how much we expect it to
change over time, and how difficult it would be to migrate things to
follow those changes. It would also be a bet that by the time
everything is done, this set of JDK features is more or less
stabilized.
Using FFI/JNI would be a more traditional way to go about it. FFI is
new and better than JNI, so if we choose to go with that, it should be
less painful. FFI is a preview feature, which is less risky than an
incubating feature.
There is also the JNA project, which wraps JNI to make it simpler:
https://github.com/java-native-access/jna . I'm assuming most of the
libraries we might want to use are mostly computational, so they
wouldn't have many platform-specific dependencies, just
architecture-specific ones. I think JNA also handles the build aspect,
which FFI doesn't directly. Assuming the libraries we would want to use
aren't in libc or otherwise can't be assumed to be present, we would
have to include them in the jar somehow.
On Tue, Jun 10, 2025 at 8:27 AM Mike Carey <dtab...@gmail.com> wrote:
Q: Are there licensing gotchas with approach 1 (which otherwise sounds
nicer from a maintenance standpoint)? We need to be sure that everything
we use is Apache-okay in terms of licensing. It would be fun to see
some preliminary numbers on perf, e.g., for KNN, each way, were it as
easy as changing which function(s) to call... :-) That would help
quantify the two options (vs. each other and vs. none) too.
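To make the "changing which function(s) to call" idea concrete, here is a
minimal sketch, not AsterixDB code. The names KnnSketch, DistanceFunction,
euclideanScalar and knn are all hypothetical. It shows a brute-force KNN that
takes the distance metric as a parameter, so a scalar, Vector API, or
native-backed implementation could be swapped in and timed against the others.

```java
import java.util.Arrays;
import java.util.Comparator;

public class KnnSketch {
    /** Hypothetical plug-in point: scalar, Vector API, or native metrics would each implement this. */
    @FunctionalInterface
    public interface DistanceFunction {
        double distance(float[] a, float[] b);
    }

    /** Plain scalar Euclidean distance, i.e. the "none" baseline with no explicit SIMD. */
    public static double euclideanScalar(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Brute-force KNN: returns the indices of the k closest rows, nearest first. */
    public static int[] knn(float[][] data, float[] query, int k, DistanceFunction fn) {
        double[] dist = new double[data.length];
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) {
            dist[i] = fn.distance(data[i], query);
            order[i] = i;
        }
        // Sort row indices by their distance to the query (stable, so ties keep input order).
        Arrays.sort(order, Comparator.comparingDouble(i -> dist[i]));
        int n = Math.min(k, order.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = order[i];
        }
        return result;
    }
}
```

A call such as knn(data, query, 10, KnnSketch::euclideanScalar) would then
differ between the options only in the last argument. For real numbers a
benchmark harness such as JMH would be the safer choice than hand-rolled
timing, since JIT warm-up dominates short runs.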
On 6/10/25 7:24 AM, Calvin Dani wrote:
Hi,
As part of adding vector functionality to AsterixDB, I have been exploring
possible optimizations for vector computations. One promising direction is
leveraging SIMD operations to accelerate these calculations. Although Java
offers autovectorization to utilize SIMD, this approach requires the
operations to be branchless (i.e., no conditional branching like if/else),
and it may not always be triggered when vector calculations get complex.
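As a small illustration of the branchless point, here is a sketch of Manhattan
distance written two ways. The class and method names are hypothetical. The
first loop has an if/else in its body; the second expresses the same
computation with Math.abs, so the body is straight-line arithmetic. Whether
the JIT actually vectorizes either loop depends on the JVM, its version and
the hardware, and floating-point reductions in particular are not guaranteed
to be vectorized, so this only shows the shape of the code, not a promised
speedup.

```java
public class BranchlessLoops {
    /** Branching form: the if/else inside the loop is the pattern that can block autovectorization. */
    public static float manhattanBranching(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            if (d < 0f) {
                sum += -d;
            } else {
                sum += d;
            }
        }
        return sum;
    }

    /** Branchless form: Math.abs replaces the conditional, leaving a simple counted loop. */
    public static float manhattanBranchless(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
```

Both methods compute the same result, so any difference seen in a benchmark
would come from how the JIT compiles the loop rather than from the arithmetic.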
I have considered two main options for SIMD-enabled vector computation:
1. Java Vector API: An incubating feature since Java 16, the Vector API
depends on the long-term Project Valhalla. While it remains in incubation
and likely won’t be finalized until the necessary Valhalla features land,
the API already supports the basic operations needed for our distance
metrics, such as Euclidean Distance, Manhattan Distance, Cosine Similarity,
and Dot Product. It also provides a primitive Vector<E> type which could
serve as native storage for embeddings.
2. Foreign Function & Memory API: This allows calling optimized C/C++
libraries directly from Java. We could either leverage existing
highly-optimized vector computation libraries or implement our own native
code. However, packaging and ensuring compatibility of native libraries
across different target platforms may introduce complexity.
If you are aware of other solutions or have feedback on these options, I
would appreciate your insights.
Thank you,
Calvin Dani