Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

via GitHub Tue, 08 Jul 2025 11:23:56 -0700


kevinjqliu commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2193162144



##########
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##########
@@ -0,0 +1,578 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+
+**Example scenario:**  
+Imagine your data is partitioned by a `Nation` column (dozens of distinct 
values) across thousands of Parquet files. You execute:
+
+```sql
+  SELECT AVG(sales_amount)
+  FROM sales
+  WHERE nation = 'Singapore'
+  GROUP BY year;
+```
+
+Relying on min/max statistics alone isn’t very selective when a file’s Nation 
range spans “Argentina” through “Zimbabwe,” and Bloom filters still incur 
nontrivial I/O to load per file. Instead, you can store—in each file’s footer 
metadata—a compact list of every distinct nation value present. At query time, 
your engine reads just that tiny list to determine which files cannot contain 
'Singapore' and skips them entirely. This yields dramatically better 
file‑pruning performance, all while preserving full compatibility with standard 
Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, 
explain the mechanism for storing user-defined indexes, and finally show how to 
read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and 
[production grade libraries for high‑performance analytics]. Features like 
efficient encodings, column pruning, and predicate pushdown work well for many 
common query patterns. Apache DataFusion includes a [highly optimized Parquet 
implementation] and has excellent performance in general. However, some 
production query patterns require more than the statistics included in the 
Parquet format itself<sup>[1](#footnote1)</sup>.
+
+[production grade libraries for high‑performance analytics]: 
https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: 
https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other 
metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] 
uses metadata stored in separate files or an in memory cache, and the 
[parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion 
repository use external files for Parquet pruning (skipping).
+
+External indexes are powerful and widespread, but have some drawbacks:
+
+* **Increased Cost and Operational Complexity:** Additional files and systems 
are needed as well as the original Parquet. 
+* **Synchronization Risks:** The external index may become out of sync with 
the Parquet data if not managed carefully.
+
+These drawbacks have even been cited as justification for new file formats, 
such as Microsoft’s 
[Amudai](https://github.com/microsoft/amudai/blob/main/docs/spec/src/what_about_parquet.md).
+
+**However, Parquet is extensible with user-defined indexes**: Parquet 
tolerates unknown bytes within the file body and permits arbitrary key/value 
pairs in its footer metadata. These two features enable **embedding** 
user-defined indexes directly in the file—no extra files, no format forks, and 
no compatibility breakage. 
+
+[Scan Planning]: 
https://iceberg.apache.org/docs/latest/performance/#scan-planning
+[parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/parquet_index.rs
+[advanced_parquet_index.rs]: 
https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs
+
+## Parquet File Anatomy & Standard Index Structures
+
+---
+
+Logically, Parquet files contain row groups, each with column chunks, which in 
turn contain data pages. Physically, a Parquet file is a sequence of bytes with 
a Thrift-encoded footer metadata containing metadata about the file structure. 
The footer metadata includes the schema, row groups, column chunks, and other 
metadata required to read the file.
+
+The Parquet format includes three main types<sup>[2](#footnote2)</sup> of 
optional index structures:
+
+1. **[Min/Max/Null Count Statistics]** for each chunk in a row group. Used to 
quickly skip row groups that do not match a query predicate. 
+2. **[Page Index]**: Offsets, sizes, and statistics for each data page. Used 
to quickly locate data pages without scanning all pages for a column chunk.
+3. **[Bloom Filters]**: Data structure to quickly determine if a value is 
present in a column chunk without scanning any data pages. Particularly useful 
for equality and `IN` predicates.
+
+[Page Index]: https://parquet.apache.org/docs/file-format/pageindex/
+[Bloom Filters]: https://parquet.apache.org/docs/file-format/bloomfilter/
+[Min/Max/Null Count Statistics]: 
https://github.com/apache/parquet-format/blob/819adce0ec6aa848e56c56f20b9347f4ab50857f/src/main/thrift/parquet.thrift#L263-L266
+
+<!-- Source: 
https://docs.google.com/presentation/d/1aFjTLEDJyDqzFZHgcmRxecCvLKKXV2OvyEpTQFCNZPw
 -->
+
+<img 
src="/blog/images/user-defined-parquet-indexes/standard_index_structures.png" 
width="80%" class="img-responsive" alt="Parquet File layout with standard index 
structures."/>
+
+**Figure 1**: Parquet file layout with standard index structures (as written 
by arrow-rs).
+
+Only the Min/Max/Null Count Statistics are stored inline in the Parquet footer 
metadata. The Page Index and Bloom Filters are stored in the file body before 
the Thrift-encoded footer metadata. The locations of these index structures are 
recorded in the footer metadata, as shown in Figure 1. Parquet readers that do 
not understand these structures simply ignore them.
+
+Modern Parquet writers create these indexes automatically and provide APIs for 
their generation and placement. For example, the [Apache Arrow Rust library] 
provides [Parquet WriterProperties], [EnabledStatistics], and 
[BloomFilterPosition].
+
+[Apache Arrow Rust library]: https://docs.rs/parquet/latest/parquet/file/index/

Review Comment:
   do you mean
   ```suggestion
   [Apache Arrow Rust library]: 
https://docs.rs/parquet/latest/parquet/file/properties/index.html
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

Reply via email to