Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

via GitHub Mon, 07 Jul 2025 09:07:15 -0700


comphead commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190494148



##########
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##########
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+It’s a common misconception that [Apache Parquet] files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed user-defined index structures within Parquet files without breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, explain the mechanism for storing user-defined indexes, and finally show how to read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and [production grade libraries for high‑performance analytics]. Features like efficient encodings, column pruning, and predicate pushdown work well for many common query patterns. DataFusion includes a [highly optimized Parquet implementation] and has excellent performance in general. However, some production query patterns require more than the statistics included in the Parquet format itself<sup>[1](#footnote1)</sup>.
+
+[production grade libraries for high‑performance analytics]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] uses metadata stored in separate files or an in memory cache, and the [parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion repository use external files for Parquet pruning (skipping).

Review Comment:
   We can expand this section if needed, including some more examples like:

   | System / Project | Index Type | Description |
   |------------------|------------|-------------|
   | **Apache Iceberg** | Hidden partitioning, metadata tables, external indexes via integrations | Supports partition pruning and metadata filtering. External indexing possible via tools like Nessie or OpenMetadata. |
   | **Apache Hudi** | Bloom filter index, Column stats index, Metadata table index | Uses internal/external indexes, such as Bloom filters for key lookups and metadata table for faster file indexing. |
   | **Delta Lake** | Data skipping with min/max, Z-order indexing, custom via OSS | No native general indexing, but Z-ordering and external tools like Hyperspace enable indexing. |
   | **Microsoft Hyperspace** | Covering indexes, Z-order, sorted indexes | Spark-based library for building and maintaining secondary indexes on Parquet datasets. |
   | **ClickHouse (w/ Parquet)** | Skip indexes, minmax, bloom filter | Supports indexing on Parquet input via native skip indexes for faster query performance. |
   | **DuckDB** | Automatic statistics, zone maps, experimental indexing | Maintains internal stats and supports some persistent indexing for Parquet reads. |
   | **Dremio** | Reflections (materializations), internal column stats | Builds external materialized views (Reflections) over Parquet for acceleration. |
   | **Lucene / Elasticsearch / OpenSearch** | Inverted index, range, spatial | External systems that can index Parquet content or extracted metadata for fast search. |
   | **Varada (Starburst)** | Bitmaps, adaptive indexes on Presto over Parquet | Built adaptive indexes for Parquet datasets to accelerate selective queries. |
   | **Starburst Galaxy / Trino** | Connector-level support, custom index cache (roadmap) | Some support via caching and pruning; external indexing under active development. |
   | **LakeSoul** | Z-order, data skipping | Supports column-aware skipping and optional Z-ordering for efficient reads. |
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

