adrian-thurston opened a new issue, #16200: URL: https://github.com/apache/datafusion/issues/16200
### Is your feature request related to a problem or challenge?

`ParquetOpener` loads all page metadata for a file, in every task that concurrently accesses that file. This can be costly for Parquet files with a large number of rows, a large number of columns, or both. In testing at Influx we have seen page metadata load times on the order of tens of milliseconds for some customer scenarios, measured directly on customer Parquet files. We estimate the contribution to query time to be about 83% of those load times.

Some individual page metadata load times:

| Write Load | File Size | Row Groups | Columns | Rows | Row Group Compression | Rows of Page Metadata | Page Metadata Load Time | Estimated Query Savings |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Telegraf | 110 MB | 11 | 65 | 10,523,008 | 10.3 / 17.4 MB | 67,862 | 9ms | 6ms / 36ms |
| Random Datagen | 283 MB | 5 | 11 | 4,620,000 | 61.1 / 66.7 MB | 5,016 | 0.7ms | nil |
| Cust A | 144 MB | 50 | 26 | 51,521,481 | 2.9 / 4.5 MB | 132,864 | 16.9ms | 14.1ms / ? |
| Cust B | 104 MB | 70 | 19 | 73,158,554 | 1.2 / 2.7 MB | 137,864 | 23.3ms | 19.4ms / ? |
| Cust C | 122 MB | 11 | 199 | 10,530,204 | 10.8 / 40.3 MB | 208,156 | 25.4ms | 21.1ms / ? |

Note: for the Telegraf and Random Datagen datasets we were able to measure query time savings with our prototype. For the customer scenarios we can only estimate.

### Describe the solution you'd like

Rather than always loading all page metadata, load just the file metadata first, prune as much as possible, and then load only the page metadata needed to execute the query:

1. Read file metadata
2. Prune row groups by the range the task is targeting (the file-group breakdown of the file)
3. Prune row groups by testing the predicate against row-group statistics
4. Read page metadata only for the needed row groups and columns
5. Prune the access plan using the minimally loaded page metadata

Pseudo-code looks something like this (a sketch of how phase 1 maps onto the current parquet-rs API follows at the end of this issue):

```rust
// Step 1: read the file metadata only, with no page indexes yet;
// `load_async_no_page_metadata` (and `load_async_reduced_page_metadata`
// below) are the proposed APIs.
let metadata = ArrowReaderMetadata::load_async_no_page_metadata(&mut reader, …)?;
let access_plan = create_initial_plan( … )?;
let mut row_groups = RowGroupAccessPlanFilter::new(access_plan);

// Steps 2-3: prune row groups using footer-level information only.
row_groups.prune_by_range(rg_metadata, range);
row_groups.prune_by_statistics( … );

// Step 4: load page metadata only for surviving row groups and predicate columns.
let rg_accessed = row_groups.rg_needed();
let cols_accessed = predicate.columns_needed();
metadata.load_async_reduced_page_metadata(&mut reader, rg_accessed, cols_accessed, …)?;

// Step 5: prune the access plan with the partially loaded page index.
access_plan = p.prune_plan_with_page_index( … );
```

In our prototype we created a sparse page-metadata array: row-group/column index entries that we don't need are left as `Index::NONE` (a sketch of attaching such a sparse index back onto the metadata also follows below). Pseudo-code:

```rust
// Build a sparse column index: decode entries only for the row groups and
// columns the query needs; every other slot stays `Index::NONE`.
let index = metadata
    .row_groups()
    .iter()
    .map(|x| {
        if self.rg_accessed.as_ref().unwrap()[x.ordinal().unwrap() as usize] {
            x.columns()
                .iter()
                .enumerate()
                .map(|(index, c)| {
                    if self.col_accessed.as_ref().unwrap()[index] {
                        match c.column_index_range() {
                            Some(r) => decode_column_index( … ),
                            None => Ok(Index::NONE),
                        }
                    } else {
                        Ok(Index::NONE)
                    }
                })
                .collect::<Result<Vec<_>>>()
        } else {
            // Row group not accessed: emit a placeholder for every column.
            x.columns()
                .iter()
                .map(|_| Ok(Index::NONE))
                .collect::<Result<Vec<_>>>()
        }
    })
    .collect::<Result<Vec<_>>>()?;
```

### Describe alternatives you've considered

_No response_

### Additional context

_No response_
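Addendum: phase 1 of the scheme above can already be approximated with today's parquet-rs API, since `ArrowReaderOptions::with_page_index(false)` loads the footer without the column/offset indexes while leaving row-group statistics available for pruning. A minimal sketch, assuming a local file read via tokio (the file path and printed fields are illustrative; the phase-2 `load_async_reduced_page_metadata` call has no existing equivalent):

```rust
// Requires the `parquet` crate with the "async" feature, plus tokio.
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use tokio::fs::File;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open("data.parquet").await?;

    // Phase 1: footer only. `with_page_index(false)` skips reading the
    // column index and offset index entirely.
    let options = ArrowReaderOptions::new().with_page_index(false);
    let metadata = ArrowReaderMetadata::load_async(&mut file, options).await?;

    // Row-group statistics come from the footer alone, so steps 2-3
    // (range and statistics pruning) can run before any page metadata
    // is fetched from the file.
    for rg in metadata.metadata().row_groups() {
        println!(
            "row group: {} rows, {} compressed bytes",
            rg.num_rows(),
            rg.compressed_size()
        );
    }
    Ok(())
}
```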
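Similarly, one way to plumb the sparse array back into the normal page-pruning path is to attach it to the already-loaded `ParquetMetaData`, assuming parquet-rs's `ParquetMetaDataBuilder` is available. A sketch under that assumption (`with_sparse_column_index` is a hypothetical helper; `index` is the `Vec<Vec<Index>>` built in the prototype pseudo-code above):

```rust
use parquet::file::metadata::ParquetMetaData;
use parquet::file::page_index::index::Index;

// Hypothetical helper: attach a sparse column index to already-loaded
// metadata so downstream page-level pruning can consume it unchanged.
// Row-group/column slots that were skipped hold `Index::NONE`.
fn with_sparse_column_index(
    metadata: ParquetMetaData,
    index: Vec<Vec<Index>>,
) -> ParquetMetaData {
    metadata
        .into_builder()
        .set_column_index(Some(index))
        .build()
}
```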