adrian-thurston opened a new issue, #16200:
URL: https://github.com/apache/datafusion/issues/16200

   ### Is your feature request related to a problem or challenge?
   
   The `ParquetOpener` will load all page metadata for a file, in every task concurrently accessing that file. This can be costly for parquet files with a large number of rows, a large number of columns, or both.
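   
   For context, a minimal sketch of how the page index is loaded eagerly today through the parquet crate's reader API (the exact call site inside `ParquetOpener` differs; `reader` stands in for any `AsyncFileReader`):
   
   ```rust
   use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
   
   // with_page_index(true) makes the reader fetch and decode the column
   // index and offset index for every row group and every column in the
   // file, up front, before any pruning has happened.
   let options = ArrowReaderOptions::new().with_page_index(true);
   let metadata = ArrowReaderMetadata::load_async(&mut reader, options).await?;
   ```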
   
   In testing at Influx we have noticed page metadata load times on the order of tens of milliseconds for some customer scenarios; we timed this directly on customer parquet files. We estimate the contribution to query time to be about 83% of those load times.
   
   Some individual page metadata load times:
   
   | Write Load | File Size | Row Groups | Columns | Rows | Row Group Compression | Rows of Page Metadata | Page Metadata Load Time | Estimated Query Savings |
   | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
   | Telegraf | 110 MB | 11 | 65 | 10,523,008 | 10.3 / 17.4 MB | 67,862 | 9ms | 6ms / 36ms |
   | Random Datagen | 283 MB | 5 | 11 | 4,620,000 | 61.1 / 66.7 MB | 5,016 | 0.7ms | nil |
   | Cust A | 144 MB | 50 | 26 | 51,521,481 | 2.9 / 4.5 MB | 132,864 | 16.9ms | 14.1ms / ? |
   | Cust B | 104 MB | 70 | 19 | 73,158,554 | 1.2 / 2.7 MB | 137,864 | 23.3ms | 19.4ms / ? |
   | Cust C | 122 MB | 11 | 199 | 10,530,204 | 10.8 / 40.3 MB | 208,156 | 25.4ms | 21.1ms / ? |
   
   Note: for the Telegraf and Random Datagen datasets we were able to measure 
query time savings with our prototype. For customer scenarios we can only 
estimate.
   
   ### Describe the solution you'd like
   
   Rather than always loading all page metadata, load just the file metadata, prune as much as possible, and then load only the page metadata needed to execute the query:
   
   1. Read the file metadata.
   2. Prune row groups by the range the task is targeting (the file-group breakdown of the file).
   3. Prune row groups by testing the predicate against row-group statistics.
   4. Read page metadata only for the needed row groups and columns.
   5. Prune the access plan using the minimally loaded page metadata.
   
   Pseudo-code looks something like this:
   
   ```rust
   // 1. Read file metadata only (proposed API: skips the page index).
   let metadata = ArrowReaderMetadata::load_async_no_page_metadata(&mut reader, …).await?;
   let access_plan = create_initial_plan( … )?;
   let mut row_groups = RowGroupAccessPlanFilter::new(access_plan);
   // 2. Prune row groups outside the range this task is targeting.
   row_groups.prune_by_range(rg_metadata, range);
   // 3. Prune row groups whose statistics cannot satisfy the predicate.
   row_groups.prune_by_statistics( … );
   // 4. Load page metadata only for the surviving row groups and the
   //    columns the predicate needs (proposed API).
   let rg_accessed = row_groups.rg_needed();
   let cols_accessed = predicate.columns_needed();
   metadata.load_async_reduced_page_metadata(&mut reader, rg_accessed, cols_accessed, …).await?;
   // 5. Prune the access plan using the page index that was actually loaded.
   let access_plan = page_filter.prune_plan_with_page_index( … );
   ```
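   
   The `load_async_no_page_metadata` / `load_async_reduced_page_metadata` calls above are proposed, not existing, APIs. Step 1 is already expressible today via `ArrowReaderOptions::new().with_page_index(false)`; for step 4, here is a hedged sketch of what a reduced load might do internally, built only on the existing `AsyncFileReader::get_byte_ranges` and `parquet::file::page_index::index_reader::decode_column_index`. It assumes `metadata` is the `ParquetMetaData` loaded in step 1 and `rg_accessed` / `cols_accessed` are the boolean masks from the pseudo-code above; the offset index would need the same treatment.
   
   ```rust
   use parquet::arrow::async_reader::AsyncFileReader;
   use parquet::file::page_index::index_reader::decode_column_index;
   
   // Collect the byte ranges of only the column indexes still needed
   // after row-group pruning.
   let mut ranges = Vec::new();
   let mut slots = Vec::new();
   for (rg_idx, rg) in metadata.row_groups().iter().enumerate() {
       if !rg_accessed[rg_idx] {
           continue;
       }
       for (col_idx, col) in rg.columns().iter().enumerate() {
           if !cols_accessed[col_idx] {
               continue;
           }
           if let Some(range) = col.column_index_range() {
               ranges.push(range);
               slots.push((rg_idx, col_idx, col.column_type()));
           }
       }
   }
   // Fetch everything in one vectored read (the reader implementation may
   // coalesce adjacent ranges), then decode each buffer into the sparse
   // [row group][column] array described below.
   let buffers = reader.get_byte_ranges(ranges).await?;
   for ((rg_idx, col_idx, col_type), data) in slots.into_iter().zip(buffers) {
       let column_index = decode_column_index(&data, col_type)?;
       // store column_index at index[rg_idx][col_idx]
   }
   ```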
   
   In our prototype we created a sparse page-metadata array: row-group/column entries that we don't need are left as `Index::NONE`. Pseudo-code:
   
   ```rust
   // Build a sparse [row group][column] index: decode the column index
   // only where both the row group and the column survived pruning; all
   // other entries are left as the Index::NONE placeholder.
   let index = metadata.row_groups().iter()
       .map(|rg| {
           if self.rg_accessed.as_ref().unwrap()[rg.ordinal().unwrap() as usize] {
               rg.columns().iter().enumerate()
                   .map(|(col, c)| {
                       if self.col_accessed.as_ref().unwrap()[col] {
                           match c.column_index_range() {
                               Some(r) => decode_column_index( … ),
                               None => Ok(Index::NONE),
                           }
                       } else {
                           Ok(Index::NONE)
                       }
                   })
                   .collect::<Result<Vec<_>>>()
           } else {
               // The whole row group was pruned: placeholders only.
               rg.columns().iter()
                   .map(|_| Ok(Index::NONE))
                   .collect::<Result<Vec<_>>>()
           }
       })
       .collect::<Result<Vec<_>>>()?;
   ```
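   
   Keeping the array dense in shape, with `Index::NONE` placeholders rather than omitted entries, means downstream consumers that address page metadata by row-group and column ordinal (such as the access-plan pruning in step 5) can work on the sparse result unchanged; the savings come from skipping both the byte-range reads and the decode work for pruned entries. The offset index would likely need the same sparse treatment, since page-level pruning needs page locations as well.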
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_

