Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

via GitHub Mon, 07 Jul 2025 21:23:45 -0700


2010YOUY01 commented on code in PR #79:
URL: https://github.com/apache/datafusion-site/pull/79#discussion_r2191443341



##########
content/blog/2025-07-14-user-defined-parquet-indexes.md:
##########
@@ -0,0 +1,545 @@
+---
+layout: post
+title: Embedding User-Defined Indexes in Apache Parquet Files
+date: 2025-07-14
+author: Qi Zhu, Jigao Luo, and Andrew Lamb
+categories: [features]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+It’s a common misconception that [Apache Parquet] files are limited to basic 
Min/Max/Null Count statistics and Bloom filters, and that adding more advanced 
indexes requires changing the specification or creating a new file format. In 
fact, footer metadata and offset-based addressing already provide everything 
needed to embed user-defined index structures within Parquet files without 
breaking compatibility with other Parquet readers.
+

Review Comment:
   I think adding a concrete example here, specifically one tied to the custom
distinct-value (DV) index code example featured in this blog, would help keep
readers engaged.
   
   ----
   
   Example scenario:
   
   Suppose you have a dataset roughly partitioned by the `Nation` column, which
has a cardinality of several dozen, and the dataset consists of thousands of
partitioned files.
   We have an analytical query with a selective predicate on the `Nation` column:
   ```sql
   SELECT AVG(sales_amount)
   FROM sales
   WHERE nation = 'Singapore'
   GROUP BY year;
   ```
   
   Ideally, you’d like to skip most of those files entirely. However, Parquet’s
built-in min/max statistics might not help when each partition covers a wide
range of values on the predicate column, and Bloom filters can still incur
substantial overhead.
   
   In this post, we’ll introduce a custom distinct-value index, with a code
example, that lets you efficiently prune irrelevant files.
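
To make the scenario concrete, here is a minimal Python sketch of the pruning idea. It is not the code from the blog post: the file names, the index contents, and the helper functions `files_to_scan` and `min_max_may_contain` are all hypothetical, and the index is held in a plain dictionary rather than embedded in a Parquet file.

```python
from typing import Dict, List, Set

# Hypothetical per-file distinct-value index for the `nation` column.
# File names and values are made up for illustration only.
distinct_nation_index: Dict[str, Set[str]] = {
    "sales_000.parquet": {"Brazil", "Canada", "Singapore"},
    "sales_001.parquet": {"France", "Germany"},
    "sales_002.parquet": {"Japan", "Singapore"},
    "sales_003.parquet": {"Argentina", "Vietnam"},
}


def min_max_may_contain(values: Set[str], value: str) -> bool:
    """Mimic min/max pruning: keep the file if `value` is inside [min, max]."""
    return min(values) <= value <= max(values)


def files_to_scan(index: Dict[str, Set[str]], value: str) -> List[str]:
    """Keep only the files whose distinct-value set contains `value`."""
    return sorted(name for name, values in index.items() if value in values)


# Files kept for the predicate nation = 'Singapore'.
candidates = files_to_scan(distinct_nation_index, "Singapore")
print(candidates)

# sales_003 spans "Argentina".."Vietnam", so a min/max check cannot rule it
# out, even though the file holds no 'Singapore' rows.
print(min_max_may_contain(distinct_nation_index["sales_003.parquet"], "Singapore"))
```

In this toy data set, the min/max check keeps `sales_003.parquet` because its range spans the predicate value, while the distinct-value lookup drops it.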



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

