comphead commented on code in PR #79: URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2190494148
##########
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##########
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+It’s a common misconception that [Apache Parquet] files are limited to basic Min/Max/Null Count statistics and Bloom filters, and that adding more advanced indexes requires changing the specification or creating a new file format. In fact, footer metadata and offset-based addressing already provide everything needed to embed user-defined index structures within Parquet files without breaking compatibility with other Parquet readers.
+
+In this post, we review how indexes are stored in the Apache Parquet format, explain the mechanism for storing user-defined indexes, and finally show how to read and write a user-defined index using [Apache DataFusion].
+
+[Apache DataFusion]: https://datafusion.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+
+## Introduction
+
+---
+
+Apache Parquet is a popular columnar file format with well understood and [production grade libraries for high‑performance analytics]. Features like efficient encodings, column pruning, and predicate pushdown work well for many common query patterns. DataFusion includes a [highly optimized Parquet implementation] and has excellent performance in general. However, some production query patterns require more than the statistics included in the Parquet format itself<sup>[1](#footnote1)</sup>.
+
+[production grade libraries for high‑performance analytics]: https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/
+[highly optimized Parquet implementation]: https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/
+
+Many systems improve query performance using *external* indexes or other metadata in addition to Parquet. For example, Apache Iceberg's [Scan Planning] uses metadata stored in separate files or an in memory cache, and the [parquet_index.rs] and [advanced_parquet_index.rs] examples in the DataFusion repository use external files for Parquet pruning (skipping).

Review Comment:
We can expand this section if needed, including some more examples like:

| System / Project | Index Type | Description |
|---|---|---|
| **Apache Iceberg** | Hidden partitioning, metadata tables, external indexes via integrations | Supports partition pruning and metadata filtering. External indexing possible via tools like Nessie or OpenMetadata. |
| **Apache Hudi** | Bloom filter index, Column stats index, Metadata table index | Uses internal/external indexes, such as Bloom filters for key lookups and metadata table for faster file indexing. |
| **Delta Lake** | Data skipping with min/max, Z-order indexing, custom via OSS | No native general indexing, but Z-ordering and external tools like Hyperspace enable indexing. |
| **Microsoft Hyperspace** | Covering indexes, Z-order, sorted indexes | Spark-based library for building and maintaining secondary indexes on Parquet datasets. |
| **ClickHouse (w/ Parquet)** | Skip indexes, minmax, bloom filter | Supports indexing on Parquet input via native skip indexes for faster query performance. |
| **DuckDB** | Automatic statistics, zone maps, experimental indexing | Maintains internal stats and supports some persistent indexing for Parquet reads. |
| **Dremio** | Reflections (materializations), internal column stats | Builds external materialized views (Reflections) over Parquet for acceleration. |
| **Lucene / Elasticsearch / OpenSearch** | Inverted index, range, spatial | External systems that can index Parquet content or extracted metadata for fast search. |
| **Varada (Starburst)** | Bitmaps, adaptive indexes on Presto over Parquet | Built adaptive indexes for Parquet datasets to accelerate selective queries. |
| **Starburst Galaxy / Trino** | Connector-level support, custom index cache (roadmap) | Some support via caching and pruning; external indexing under active development. |
| **LakeSoul** | Z-order, data skipping | Supports column-aware skipping and optional Z-ordering for efficient reads. |

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org