[I] Blog post about parquet vs custom file formats [datafusion]

via GitHub Thu, 22 May 2025 06:29:28 -0700


alamb opened a new issue, #16149:
URL: https://github.com/apache/datafusion/issues/16149


   ### Is your feature request related to a problem or challenge?
   
   https://x.com/andrewlamb1111/status/1925537738360504663
   
   
   > ClickBench keeps me convinced that  Parquet can be quite fast. There is 
only a 2.3x performance difference vs  [@duckdb](https://x.com/duckdb) 's own 
format and unoptimized parquet:  
[https://tinyurl.com/5aexvsfw](https://t.co/NOXq3AAFlk). I am surprised that 
the (closed source) Umbra only reports 3.3x faster than DuckDB on parquet
   
   
   
   
   
   ### Describe the solution you'd like
   
   I would love to make a blog post about how much faster/slower custom file 
formats are compared to parquet. I am typing this ticket now that it is on my 
mind so I don't forget it.
   
   The basic thesis is that 
   * Custom file formats only get you XX% more performance than parquet
   * Many of the historic performance differences are due to engineering 
investment rather than format
   * Parquet has many other benefits (like a very large ecosystem)
   
   ==> therefore parquet is the format that really matters
   
   ### Describe alternatives you've considered
   
   The core of the post would be to compare
   1. A proprietary format (like DuckDB/Umbra)
   2. normal parquet
   3. "optimized parquet"
   
   I think we could basically use the https://github.com/ClickHouse/ClickBench 
dataset and queries (and results from the proprietary systems)
   
   The thing that is needed is to generate "optimized parquet" numbers. 
   
   The [partitioned parquet 
files](https://github.com/ClickHouse/ClickBench?tab=readme-ov-file#data-loading)
 from ClickBench are not optimized. Specifically they:
   1. Are not sorted in any way
   2. Do not have a page index (Offset index)
   3. Use snappy compression
   
   A fun experiment might be to "fix" the ClickBench partitioned dataset by 
   1. re-sorting and writing with page indexes (could use a bunch of DataFusion 
`COPY` commands pretty easily to do this). The sort order should be some subset 
of the predicate columns, perhaps EventTime and then maybe SearchPhrase / URL. 
   2. disabling compression
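
A rough sketch of how one of those `COPY` statements might be put together, written as a small Python helper that only builds the SQL text. The table name `hits`, the output path `hits_sorted/`, and the helper itself are hypothetical. The `'format.compression'` option key is my assumption about how DataFusion spells this writer option and should be checked against the docs for the version in use. Whether page indexes get written depends on the Parquet writer's statistics settings, which are not shown here.

```python
def build_copy_statement(source_table, output_path, sort_columns,
                         compression="uncompressed"):
    """Build a DataFusion-style COPY statement that rewrites a table
    sorted by the given columns.

    Column names are double-quoted because the ClickBench columns are
    mixed case (EventTime, SearchPhrase, URL).
    NOTE: the 'format.compression' option key is an assumption; verify it
    against the DataFusion documentation before running the statement.
    """
    order_by = ", ".join(f'"{col}"' for col in sort_columns)
    return (
        f"COPY (SELECT * FROM {source_table} ORDER BY {order_by}) "
        f"TO '{output_path}' "
        f"STORED AS PARQUET "
        f"OPTIONS ('format.compression' '{compression}')"
    )


# Hypothetical table name and output directory, for illustration only.
stmt = build_copy_statement("hits", "hits_sorted/", ["EventTime", "SearchPhrase"])
print(stmt)
```

The resulting string could then be pasted into `datafusion-cli` or passed to whatever SQL entry point is handy, once per candidate sort order, so that each "optimized parquet" variant can be benchmarked separately.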
   
   
   
   
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

