(incubator-graphar-website) branch main updated: doc: add new blog post on integrating graphs into Open Lakehouse (#47)

xiaokang Sat, 18 Oct 2025 10:56:24 -0700

This is an automated email from the ASF dual-hosted git repository.

xiaokang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-graphar-website.git



The following commit(s) were added to refs/heads/main by this push:
     new a047189  doc: add new blog post on integrating graphs into Open 
Lakehouse (#47)
a047189 is described below

commit a047189c996dd7a265bb1824c1667ffcd209e70e
Author: Sem <[email protected]>
AuthorDate: Mon Sep 29 10:27:58 2025 +0200

    doc: add new blog post on integrating graphs into Open Lakehouse (#47)
    
    * Add new blog post on integrating graphs into Open Lakehouse
    
    This commit introduces a blog post discussing the potential and methodology 
of integrating property graphs into the Open Lakehouse ecosystem. It covers 
storage standards like Apache GraphAr, tools such as GraphFrames, KuzuDB, and 
HugeGraph, and highlights ongoing efforts to align these with existing graph 
processing frameworks. Supporting author details and images are also included.
    
    * truncate
---
 blog/2025-09-03-graphs-openlakehouse.mdx           | 137 +++++++++++++++++++++
 blog/authors.yml                                   |   6 +
 static/img/blog/graph-lakehouse/gf-overview.png    | Bin 0 -> 289437 bytes
 static/img/blog/graph-lakehouse/gf-sql.png         | Bin 0 -> 303222 bytes
 .../img/blog/graph-lakehouse/graphs-lakehouse.png  | Bin 0 -> 991589 bytes
 .../img/blog/graph-lakehouse/pregel-pagerank.png   | Bin 0 -> 921826 bytes
 static/img/blog/graph-lakehouse/property-graph.png | Bin 0 -> 477845 bytes
 7 files changed, 143 insertions(+)

diff --git a/blog/2025-09-03-graphs-openlakehouse.mdx 
b/blog/2025-09-03-graphs-openlakehouse.mdx
new file mode 100644
index 0000000..78f838d
--- /dev/null
+++ b/blog/2025-09-03-graphs-openlakehouse.mdx
@@ -0,0 +1,137 @@
+---
+slug: graphs-openlakehouse
+title: Dreaming of Graphs in the Open Lakehouse
+authors: [ssinchenko]
+tags: [graphar, graphframes, kuzu, lakehouse]
+image: /img/blog/graph-lakehouse/graphs-lakehouse.png
+description: While Open Lakehouse platforms now natively support tables, 
geospatial data, vectors, and more, property graphs are still missing. In the 
age of AI and growing interest in Graph RAG, graphs are becoming especially 
relevant – there’s a need to deliver Knowledge Graphs to RAG systems, with 
standards, ETL, and frameworks for different scenarios. There’s a young 
project, Apache GraphAr (incubating), that aims to define a storage standard. 
For processing, there is a good tooling  [...]
+---
+
+*Initialy published in [Sem Sinchenko's 
blog](https://semyonsinchenko.github.io/ssinchenko/post/dreams-about-graph-in-lakehouse/)*
+
+# TLDR;
+
+While Open Lakehouse platforms now natively support tables, geospatial data, 
vectors, and more, property graphs are still missing. In the age of AI and 
growing interest in Graph RAG, graphs are becoming especially relevant – 
there’s a need to deliver Knowledge Graphs to RAG systems, with standards, ETL, 
and frameworks for different scenarios. There’s a young project, Apache GraphAr 
(incubating), that aims to define a storage standard. For processing, there is 
a good tooling already. Grap [...]
+
+<!--truncate-->
+
+# Introduction
+
+## Open Lakehouse
+
+This isn’t an official definition – just my personal view of the Open 
Lakehouse concept. Open Lakehouse is an approach where data lives in shared 
storage accessible to many systems. The storage can be cloud-based, like Amazon 
S3 or Azure Blob Storage. It can also be on-premises, such as HDFS, Ceph, or 
MinIO. There are even solutions aimed at building hybrid cloud + on-prem 
setups, like IOMETE.
+
+Different data types use their own open storage standards. For tabular data, 
there’s Apache Iceberg, Delta Lake, and Apache Hudi. For vectors, Lance can be 
used. For geospatial data, GeoParquet is one option. Metadata about tables and 
datasets is kept in a central catalog, like Unity Catalog or Hive Metastore. 
This allows tools and engines discover where data lives and how it’s structured.
+
+Different systems are used for different tasks. Apache Spark is great for 
distributed batch processing of terabytes of data – it’s not the fastest, but 
it’s reliable and fault-tolerant. DuckDB is a strong choice for fast, in-memory 
analytics and is also excellent for single-node ETL tasks. For example, if you 
have both huge tables (a good case for Spark) and small ones (where spinning up 
a cluster is overkill), DuckDB is very efficient. Apache Doris and ClickHouse 
are super solutions for [...]
+
+All these tools interact with the catalog. They locate, read, process, and 
write data in standard formats. The catalog also helps with governance and 
access management, keeping metadata up to date and ensuring everything works 
together smoothly.
+
+This concept makes it possible to seamlessly connect different tools, always 
choosing the best one for the job. You don’t have to spin up a cluster just to 
process a 100 MB table, or rent massive single-node instances (with limited 
capacity) to crunch terabytes of data by a tool designed for single-node 
processing. There’s no need to force vector search into pure SQL, either. The 
Open Lakehouse approach lets you mix and match systems, using each where it 
fits best, and keeps everything i [...]
+
+## Property Graphs
+
+This isn’t a textbook definition – just how I personally see the concept of a 
property graph. At its core, a graph is a set of **vertices (nodes)** and 
**edges (links)** connecting them. Edges can be **directed** or **undirected**. 
They can also be **weighted** or unweighted. Both vertices and edges can have 
their own **properties** – key-value pairs with extra information. In practice, 
we often have **different types of vertices and edges**, each with their own 
set of properties.
+
+This leads us to the **Property Graph** model – a graph that contains multiple 
types of vertices and edges, each of which may have different properties.
+
+![Move Fan Social Network](/img/blog/graph-lakehouse/property-graph.png)
+
+To make this concrete, I like the example of a “movie fan social network.” The 
diagram above shows how this looks as a property graph. There are **people** 
and **movies** – two types of vertices, each with their own properties. People 
can **like** movies (undirected edges), **send messages** to each other 
(directed edges), and **follow** directors. There are also **actors** and 
**directors** as separate vertex types. Movies can be connected as sequels. All 
these relationships and entitie [...]
+
+The property graph model is very universal. Take a **banking payments 
network**: there are legal entities, government services, individuals, 
exchanges, and goods. Each is a different vertex type. Payments are directed, 
weighted edges with properties like date and amount. Two legal entities sharing 
a board member form an undirected, unweighted edge with the director’s details 
as properties. This structure is great for compliance, KYC, and anti-fraud. It 
helps to see who is connected to wh [...]
+
+Or consider an **online marketplace**. There are buyers, sellers, and 
products. Buyers purchase products from sellers. Sellers offer many products. 
All of this fits naturally into a property graph. This structure works well for 
recommendation systems. Recommending a product is basically a link prediction 
problem in the graph.
+
+Another example is **Organizational Network Analysis (ONA)**. Companies have 
departments, teams, and people, all connected in different ways. Teams assign 
tasks to each other. People have both formal and informal relationships. There 
are official and real hierarchies. ONA can reveal key employees, process 
bottlenecks, and even predict conflicts. It also helps improve knowledge 
sharing across the organization.
+
+In short, the property graph model is flexible and expressive. It works well 
for many real-world domains.
+
+# Property Graphs in the Open Lakehouse
+
+To make property graphs part of the Open Lakehouse, we need three things: a 
storage standard, a metadata catalog, and tools for different graph tasks. This 
is just my personal view. With tools, everything is fine—there are plenty of 
options for ETL-like processing, analytics, ML/AI, and visualization. But with 
storage standards and catalogs for graphs, things are much less developed. Most 
of the ecosystem is still focused on tables, and there’s no widely adopted open 
format or catalog fo [...]
+
+## Graph Processing tools
+
+Let’s look at three open source tools for working with property graphs. First, 
GraphFrames. For me, this is like Spark for Apache Iceberg. It scales well and 
handles huge, long-running batch jobs with reliable distributed processing. 
Second, Apache HugeGraph (incubating). This is like ClickHouse or Apache Doris 
for Apache Iceberg. You need a separate server, but you get a great tool for 
analysts. They can run interactive graph queries using Gremlin or OpenCypher. 
These are standard query [...]
+
+### GraphFrames
+
+Since I’m currently the most active maintainer of GraphFrames, I know this 
framework best from the inside. I’ve prepared a diagram to show how it works.
+
+![GraphFrames top-level overview](/img/blog/graph-lakehouse/gf-overview.png)
+
+GraphFrames gives users an API and abstractions for working with graphs, 
pattern matching, and running algorithms. Under the hood, all these operations 
are translated into standard relational operations – select, join, group by, 
aggregate – over DataFrames. DataFrames are just data in tabular form. The 
translated logical plan runs on an Apache Spark cluster. The user always gets 
results back as a DataFrame, which is simply a table.
+
+Let’s look at a concrete example – PageRank. This algorithm became famous for 
powering Google Search (fun fact: “Page” is actually the last name of Google 
co-founder Larry Page, not just about web pages). PageRank helps find the most 
“important” nodes in a graph, like ranking web pages by relevance.
+
+![Page Rank in Pregel](/img/blog/graph-lakehouse/pregel-pagerank.png)
+
+In GraphFrames, most algorithms – including PageRank – are built on the Pregel 
framework ([*Malewicz, Grzegorz, et al. "Pregel: a system for large-scale graph 
processing." Proceedings of the 2010 ACM SIGMOD International Conference on 
Management of data. 
2010.*](https://blog.lavaplanets.com/wp-content/uploads/2023/12/p135-malewicz.pdf)).
 We represent the graph as two DataFrames, which you can think of as tables: 
one for edges and one for vertices. The PageRank table is initialized by ass 
[...]
+
+Each iteration of PageRank works like a series of SQL operations. The process 
starts by joining the edges table with the current PageRank values for each 
vertex. This creates a triplets table, where each row contains a source, 
destination, and their current ranks. Next, we generate messages: each source 
sends its rank to its destination. These messages are grouped by destination 
and summed up. Finally, we join the results back to the PageRank table and 
update the rank using a simple form [...]
+
+This whole process is repeated – each step is just a combination of joins, 
group by, and aggregates over tables – until the ranks stop changing much. The 
algorithm converges quickly, usually in about 15–20 iterations. Since it relies 
entirely on SQL operations, running PageRank on an Apache Spark cluster gives 
you excellent horizontal scalability. As long as your tables fit in Spark, you 
can compute PageRank using Pregel. In practice, this means you can almost 
infinitely scale just by ad [...]
+
+### Kuzu DB
+
+I should mention up front that I know KuzuDB only as a user, so everything 
here is based on their documentation and public materials.
+
+KuzuDB is an embedded, in-process graph database designed for speed and 
analytics on a single machine. It stores graph data using a columnar format, 
including a columnar adjacency list—basically, a fast and compressed way to 
represent which nodes are connected. This approach enables extremely fast joins 
and efficient analytical queries, even for complex graph workloads. All data is 
stored on disk in a columnar layout, which helps with both speed and 
compression. As a result, KuzuDB deliv [...]
+
+### Apache HugeGraph (incubating)
+
+I should state upfront that I'm only superficially familiar with Apache 
HugeGraph, and most of what I know about its architecture comes from the 
documentation. Apache HugeGraph is not like Kuzu or GraphFrames, as it requires 
a standalone server to run on.
+
+![Apache HugeGraph 
Architecture](https://hugegraph.apache.org/docs/images/design/architectural-revised.png)
+
+HugeGraph has three main layers. The **Application Layer** includes tools like 
Hubble for visualization, Loader for importing data, command-line tools, and 
Computer – a distributed OLAP engine based on Pregel (yes, that’s the same 
Pregel framework I described in the GraphFrames section). There’s also a client 
library for developers.
+
+The **Graph Engine Layer** is the core server. It exposes a REST API and 
supports both Gremlin and OpenCypher queries. This layer handles both 
transactional (OLTP) and analytical (OLAP) workloads.
+
+The **Storage Layer** is flexible. You can choose different backends, like 
RocksDB, MySQL, or HBase, depending on your needs. This modular design lets you 
scale from embedded setups to the distributed storage. Overall, HugeGraph is 
built for both interactive and analytical graph workloads, with full support 
for Gremlin and OpenCypher.
+
+## Storage: Apache GraphAr (incubating)
+
+The only standard for Property Graph storage that I know of today is **Apache 
GraphAr (incubating)** (GraphAr means Graph Archive). I should add that I’m a 
member of the GraphAr PPMC committee, but I’ve honestly searched for other 
attempts to create a similar standard and haven’t found any.
+
+The core idea behind GraphAr is simple: you can think of it as Delta Lake, but 
for graphs. Data is stored in a columnar format—using Apache Parquet or Apache 
ORC files. Alongside the data, there are human-readable YAML metadata files. 
GraphAr represents Property Graphs as logical **Property Groups** for both 
vertices and edges. Each group has its own schema (properties), keys for 
vertices, and attributes like edge directionality. Internally, there are unique 
LONG indices for vertices and [...]
+
+![Apache GraphAr Property Graph model example](/img/docs/property_graph.png)
+
+The optimization principle is based on the fact that real-world graphs are 
usually clustered. By sorting data by vertex, we achieve excellent compression 
in Parquet, since properties and their values tend to be similar within 
clusters. Metadata, grouping, and data sorting open up a lot of room for query 
optimization. The optimizer can push down entire groups and skip reading files 
using min-max metadata in Parquet headers.
+
+![Apache GraphAr Vertex Storage Model](/img/docs/vertex_physical_table.png)
+
+Right now, there’s a reference C++ API for GraphAr. There’s also an Apache 
Spark and PySpark API, which I’m actively helping to develop. A standalone Java 
API (not tied to Apache Spark) is in progress. There’s even a CLI tool, modeled 
after Parquet Inspector, built on top of the C++ API.
+
+![Apache GraphAr Edges Storage Model](/img/docs/edge_logical_table.png)
+
+** Brining all together
+
+Bringing everything together, the architecture looks surprisingly clean and 
modular.
+
+![Possible integration of Property Graphs to the Open 
Lakehouse](/img/blog/graph-lakehouse/graphs-lakehouse.png)
+
+At the center is the storage standard - GraphAr - which acts as the “Delta 
Lake for graphs.” All graph data, whether from batch analytics or interactive 
workloads, is stored in a unified, open format. Around this core, different 
tools plug in depending on your needs. GraphFrames provides scalable, 
distributed analytics and ETL on Spark. KuzuDB offers fast, in-process 
analytics for single-node workloads. HugeGraph covers the interactive, OLAP, 
and OLTP scenarios with full support for Grem [...]
+
+This modular approach means you can mix and match tools as your requirements 
change. For example, you might use Spark and GraphFrames for heavy ETL and 
analytics, then switch to KuzuDB for fast, local exploration, or to HugeGraph 
for interactive graph queries and visualization. The open storage format 
ensures that your data remains portable and future-proof, no matter which 
engine you choose.
+
+Now for the current status. After several years in maintenance mode, I and 
other enthusiasts have revived GraphFrames and started releasing new versions. 
The brand new []Property Graph 
model](https://github.com/graphframes/graphframes/blob/master/core/src/main/scala/org/graphframes/propertygraph/PropertyGraphFrame.scala)
 that is fully aligned with GraphAr is already merged and will be available on 
the next release. After that it will be easy to add a support of reading and 
writing to Gra [...]
+
+For KuzuDB, the prototype is already 
[exists](https://github.com/kuzudb/kuzu/issues/5903)
+
+As for HugeGraph, the team is waiting for GraphAr’s Java API to stabilize. 
Once that happens, they’re interested in integrating with the format as well.
+
+# Bonus Part 1: catalog integration
+
+In theory, you could also add a catalog layer to this picture. Options like 
Apache Polaris or other Iceberg Catalogs probably won’t work – they’re focused 
only on tables and Apache Iceberg. But solutions like Unity Catalog seem much 
more promising and could likely be extended to support graphs. I even [asked 
about this in the Unity 
repo](https://github.com/unitycatalog/unitycatalog/discussions/252), though I 
haven’t received a response yet. Since Unity is written in Java, I could 
potenti [...]
+
+# Bonus Part 2: Apache DataFusion integration
+
+Right now, I’m also working on bringing graphs into the Apache DataFusion 
ecosystem by rewriting GraphFrames in Rust. I’ve just started, but the [Core 
Pregel](https://github.com/SemyonSinchenko/graphframes-rs/blob/main/src/pregel.rs)
 engine is already done – and it even passes tests. The next step is to 
implement the graph algorithms themselves.
+
+# Conclusion
+
+In the end, I dream of seeing graphs become a natural part of the Open 
Lakehouse ecosystem. I believe this would unlock a whole new set of 
possibilities – especially now, in the age of AI, Graph-RAG, and Knowledge 
Graphs. And it’s not just a dream – I’m actively working to make it real.
+I hope it was interesting to read and I will be happy to hear any feedback!
diff --git a/blog/authors.yml b/blog/authors.yml
index 9c738a6..81c3026 100644
--- a/blog/authors.yml
+++ b/blog/authors.yml
@@ -20,3 +20,9 @@ acezen:
   title: Maintainer of GraphAr
   url: https://github.com/acezen
   image_url: https://github.com/acezen.png
+
+ssinchenko:
+  name: Sem Sinchenko
+  title: Maintainer of GraphAr
+  url: https://semyonsinchenko.github.io/ssinchenko/
+  image_url: https://avatars.githubusercontent.com/u/29755009
diff --git a/static/img/blog/graph-lakehouse/gf-overview.png 
b/static/img/blog/graph-lakehouse/gf-overview.png
new file mode 100644
index 0000000..c94ebe4
Binary files /dev/null and b/static/img/blog/graph-lakehouse/gf-overview.png 
differ
diff --git a/static/img/blog/graph-lakehouse/gf-sql.png 
b/static/img/blog/graph-lakehouse/gf-sql.png
new file mode 100644
index 0000000..e057927
Binary files /dev/null and b/static/img/blog/graph-lakehouse/gf-sql.png differ
diff --git a/static/img/blog/graph-lakehouse/graphs-lakehouse.png 
b/static/img/blog/graph-lakehouse/graphs-lakehouse.png
new file mode 100644
index 0000000..a139047
Binary files /dev/null and 
b/static/img/blog/graph-lakehouse/graphs-lakehouse.png differ
diff --git a/static/img/blog/graph-lakehouse/pregel-pagerank.png 
b/static/img/blog/graph-lakehouse/pregel-pagerank.png
new file mode 100644
index 0000000..9f4dd1a
Binary files /dev/null and 
b/static/img/blog/graph-lakehouse/pregel-pagerank.png differ
diff --git a/static/img/blog/graph-lakehouse/property-graph.png 
b/static/img/blog/graph-lakehouse/property-graph.png
new file mode 100644
index 0000000..82b0515
Binary files /dev/null and b/static/img/blog/graph-lakehouse/property-graph.png 
differ


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

(incubator-graphar-website) branch main updated: doc: add new blog post on integrating graphs into Open Lakehouse (#47)

Reply via email to