clflushopt commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2029924176
##########
content/blog/2025-04-10-fastest-tpch-generator.md:
##########
@@ -0,0 +1,614 @@
+---
+layout: post
+title: tpchgen-rs World’s fastest open source TPC-H data generator, written in Rust
+date: 2025-04-10
+author: Andrew Lamb, Achraf B, and Sean Smith
+categories: [performance]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<style>
+/* Table borders */
+table, th, td {
+  border: 1px solid black;
+  border-collapse: collapse;
+}
+th, td {
+  padding: 3px;
+}
+</style>
+
+3 members of the [Apache DataFusion] community used Rust and open source
+development to build [tpchgen-rs], a fully open TPC-H data generator over 10x
+faster than any other implementation we know of.
+
+It is now possible to create the TPC-H SF=100 dataset in 72.23 seconds (1.4 GB/s
+🚀) on a Macbook Air M3 with 16GB of memory, compared to the classic `dbgen`
+which takes 30 minutes<sup>1</sup> (0.05GB/sec). On the same machine, it takes less than
+2 minutes to create all 3.6 GB of SF=100 in [Apache Parquet] format.

Review Comment:
A less detailed, higher-level overview of the timing, obtained by running it as a script:

```
jmp@comet ~/G/P/tpchgen-rs (main) [1]> time duckdb -init bench.sql -no-stdin
-- Loading resources from bench.sql
100% ▕████████████████████████████████████████████████████████████▏
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
Run Time (s): real 717.838 user 1787.775069 sys 83.976228
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 5.471 user 11.592328 sys 12.049458
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 1015.519 user 456.350229 sys 104.793712
Run Time (s): real 0.010 user 0.002420 sys 0.002541
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 921.422 user 75.619220 sys 21.138167
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 2.647 user 12.597744 sys 1.699293
100% ▕████████████████████████████████████████████████████████████▏
Run Time (s): real 17.637 user 38.532114 sys 52.235873
Run Time (s): real 0.109 user 0.000459 sys 0.000680
Run Time (s): real 0.300 user 0.610369 sys 0.809258

________________________________________________________
Executed in   44.71 mins    fish           external
   usr time   39.72 mins    0.16 millis   39.72 mins
   sys time    4.63 mins    1.44 millis    4.63 mins
```
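
As a rough cross-check of the log above, the per-statement `real` times can be summed and compared against the 44.71 minutes that fish reports for the whole run. The following is a minimal sketch in Python, with the timing lines copied verbatim from the output; the variable names and the regular expression are illustrative and not part of `bench.sql` or of DuckDB.

```python
import re

# "Run Time" lines copied verbatim from the DuckDB output above.
LOG = """\
Run Time (s): real 717.838 user 1787.775069 sys 83.976228
Run Time (s): real 5.471 user 11.592328 sys 12.049458
Run Time (s): real 1015.519 user 456.350229 sys 104.793712
Run Time (s): real 0.010 user 0.002420 sys 0.002541
Run Time (s): real 921.422 user 75.619220 sys 21.138167
Run Time (s): real 2.647 user 12.597744 sys 1.699293
Run Time (s): real 17.637 user 38.532114 sys 52.235873
Run Time (s): real 0.109 user 0.000459 sys 0.000680
Run Time (s): real 0.300 user 0.610369 sys 0.809258
"""

# Hypothetical helper pattern: captures the real, user and sys seconds of each line.
PATTERN = re.compile(r"Run Time \(s\): real ([\d.]+) user ([\d.]+) sys ([\d.]+)")

# One (real, user, sys) tuple per timed statement.
timings = [tuple(float(x) for x in m.groups()) for m in PATTERN.finditer(LOG)]

total_real = sum(t[0] for t in timings)
total_user = sum(t[1] for t in timings)
total_sys = sum(t[2] for t in timings)

print(f"statements timed: {len(timings)}")
print(f"real: {total_real:.3f} s ({total_real / 60:.2f} min)")
print(f"user: {total_user:.3f} s ({total_user / 60:.2f} min)")
print(f"sys:  {total_sys:.3f} s ({total_sys / 60:.2f} min)")

# Share of wall-clock time spent in the statements that took over 700 s each.
slow = sum(t[0] for t in timings if t[0] > 700)
print(f"statements over 700 s: {slow / total_real:.1%} of real time")
```

The summed `real` time comes to roughly 44.7 minutes, consistent with the wall-clock total reported by fish, and the three statements that each ran for more than 700 seconds account for nearly all of it.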