This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/sedona-spatialbench.git

commit 256fd197952b3007dce937645dc6b86cbf6fb19a
Author: Pranav Toggi <[email protected]>
AuthorDate: Mon Jul 7 15:42:26 2025 -0700

    update readme for tpchgen-rs
---
 tpchgen-rs-readme.md | 145 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 145 insertions(+)

diff --git a/tpchgen-rs-readme.md b/tpchgen-rs-readme.md
new file mode 100644
index 0000000..9bc0807
--- /dev/null
+++ b/tpchgen-rs-readme.md
@@ -0,0 +1,145 @@
+# tpchgen-rs
+
+[![Apache licensed][license-badge]][license-url]
+[![Build Status][actions-badge]][actions-url]
+
+[license-badge]: https://img.shields.io/badge/license-Apache%20v2-blue.svg
+[license-url]: https://github.com/clflushopt/tpchgen-rs/blob/main/LICENSE
+[actions-badge]: https://github.com/clflushopt/tpchgen-rs/actions/workflows/rust.yml/badge.svg
+[actions-url]: https://github.com/clflushopt/tpchgen-rs/actions?query=branch%3Amain
+
+Blazing fast [TPCH] benchmark data generator, in pure Rust with zero dependencies.
+
+[TPCH]: https://www.tpc.org/tpch/
+
+## Features
+
+1. Blazing Speed 🚀
+2. Obsessively Tested 📋
+3. Fully parallel, streaming, constant memory usage 🧠
+
+## Try it now!
+
+### Install Using Python
+Install this tool with Python:
+```shell
+pip install tpchgen-cli
+```
+
+```shell
+# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
+tpchgen-cli -s 10 --format=parquet
+```
+
+### Install Using Rust
+[Install Rust](https://www.rust-lang.org/tools/install) and this tool:
+
+```shell
+curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
+cargo install tpchgen-cli
+```
+
+```shell
+# create Scale Factor 10 (3.6GB, 8 files, 60M rows in lineitem) in 5 seconds on a modern laptop
+tpchgen-cli -s 10 --format=parquet
+```
+
+Or watch this [awesome demo](https://www.youtube.com/watch?v=UYIC57hlL14) recorded by [@alamb](https://github.com/alamb)
+and the companion blog post in the [DataFusion blog](https://datafusion.apache.org/blog/2025/04/10/fastest-tpch-generator/).
+
+### Examples
+
+```shell
+
+# Create a scale factor 10 dataset in the native table format.
+tpchgen-cli -s 10 --output-dir sf10
+
+# Create a scale factor 1 dataset in Parquet format.
+tpchgen-cli -s 1 --output-dir sf1-parquet --format=parquet
+
+# Create a scale factor 1 (default) partitioned dataset for the region, nation, orders
+# and customer tables.
+tpchgen-cli --tables region,nation,orders,customer --output-dir sf1-partitioned --parts 10 --part 2
+
+# Create a scale factor 1 dataset partitioned into separate folders.
+#
+# Each folder will have a single partition of rows; the partition size depends on the scale
+# factor. For tables that have fewer rows than the minimum partition size, like "nation" or "region",
+# the generator will produce the same file in each part.
+#
+# $ md5sum part-*/{nation,region}.tbl
+# 2f588e0b7fa72939b498c2abecd9fbbe  part-1/nation.tbl
+# 2f588e0b7fa72939b498c2abecd9fbbe  part-2/nation.tbl
+# c235841b00d29ad4f817771fcc851207  part-1/region.tbl
+# c235841b00d29ad4f817771fcc851207  part-2/region.tbl
+for PART in `seq 1 2`; do
+  mkdir part-$PART
+  tpchgen-cli --tables region,nation,orders,customer --output-dir part-$PART --parts 10 --part $PART
+done
+```
+
+## Performance
+
+| Scale Factor | `tpchgen-cli` | DuckDB     | DuckDB (proprietary) |
+| ------------ | ------------- | ---------- | -------------------- |
+| 1            | `0:02.24`     | `0:12.29`  | `0:10.68`            |
+| 10           | `0:09.97`     | `1:46.80`  | `1:41.14`            |
+| 100          | `1:14.22`     | `17:48.27` | `16:40.88`           |
+| 1000         | `10:26.26`    | N/A (OOM)  | N/A (OOM)            |
+
+- DuckDB (proprietary) is the time required to create TPCH data using the
+  proprietary DuckDB format.
+- Creating Scale Factor 1000 data in DuckDB [requires 647 GB of memory],
+  which is why it is listed as N/A (OOM) in the table above.
+
+[requires 647 GB of memory]: https://duckdb.org/docs/stable/extensions/tpch.html#resource-usage-of-the-data-generator
+
+Times to create TPCH tables in Parquet format using `tpchgen-cli` and `duckdb` for various scale factors.
+
+![Parquet Generation Performance](parquet-performance.png)
+
+[`tpchgen-cli`](./tpchgen-cli/README.md) is more than 10x faster than the next
+fastest TPCH generator we know of. On a 2023 Mac M3 Max laptop, it easily
+generates data faster than can be written to SSD. See
+[BENCHMARKS.md](./benchmarks/BENCHMARKS.md) for more details on performance and
+benchmarking.
+
+## Testing
+
+This crate has extensive tests to ensure correctness and produces exactly the
+same, byte-for-byte output as the original [`dbgen`] implementation. We compare
+the output of this crate with [`dbgen`] as part of every checkin. See
+[TESTING.md](TESTING.md) for more details on testing methodology.
+
+## Crates
+
+- [`tpchgen`](tpchgen): the core data generator logic for TPC-H. It has no
+  dependencies and is easy to embed in other Rust projects.
+
+- [`tpchgen-arrow`](tpchgen-arrow) generates TPC-H data in [Apache Arrow]
+  format. It depends on the arrow-rs library.
+
+- [`tpchgen-cli`](tpchgen-cli) is a [`dbgen`]-compatible CLI tool that generates
+  benchmark datasets using multiple processes.
+
+[Apache Arrow]: https://arrow.apache.org/
+[`dbgen`]: https://github.com/electrum/tpch-dbgen
+
+## Contributing
+
+Pull requests are welcome. For major changes, please open an issue first for
+discussion. See our [contributors guide](CONTRIBUTING.md) for more details.
+
+## Architecture
+
+Please see the [architecture guide](ARCHITECTURE.md) for details on how the code
+is structured.
+
+## License
+
+The project is licensed under the [Apache 2.0](LICENSE) license.
+
+## References
+
+- The TPC-H specification: see the specification [page](https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp).
+- The original `dbgen` implementation: you must submit an official request to access the `dbgen` software at the TPC's official [website](https://www.tpc.org/tpch/).
