andygrove commented on issue #1583:
URL: 
https://github.com/apache/datafusion-comet/issues/1583#issuecomment-2772820500

   Here are some rough notes on creating the dataset in S3.
   
   - Create an S3 bucket
   - Create an `m7i.4xlarge` EC2 instance with a 2 TB EBS volume
   
   ```
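   # Install Docker, git, and pip
   sudo yum install -y docker git python3-pip
   
   # Start Docker and let ec2-user run it without sudo
   sudo systemctl start docker
   sudo systemctl enable docker
   sudo usermod -aG docker ec2-user
   newgrp docker
   
   docker pull ghcr.io/scalytics/tpch-docker:main
   
   pip3 install datafusion
   
   git clone https://github.com/apache/datafusion-benchmarks.git
   cd datafusion-benchmarks/tpch
   
   # Generate the raw data in the background
   nohup python3 tpchgen.py generate --scale-factor 1000 --partitions 64 &
   
   # Monitor progress while generation runs
   docker ps
   du -h -d 1 data
   
   # Once generation has finished, take ownership of the output and convert it
   sudo chown -R ec2-user:docker data
   nohup python3 convert.py convert --scale-factor 1000 --partitions 64 &
   
   # Once conversion has finished, delete the raw .tbl files and upload the rest
   cd data
   rm *.tbl.*
   aws s3 cp . s3://your-bucket-name/top-level-folder/ --recursive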
   ```
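   
   Because both the generate and convert steps run in the background, it is easy to start the upload too early or to upload leftover raw files. The following is a rough sketch of a pre-upload check, not part of the original notes. The function names `count_tbl_files` and `ready_for_upload` are hypothetical, and it assumes the raw chunk files sit at the top level of `data` and match `*.tbl.*`, as the `rm` command above implies.
   
   ```shell
   # Hypothetical helpers; these names are illustrative only.
   
   # Print the number of raw .tbl chunk files at the top level of a directory.
   count_tbl_files() {
       find "$1" -maxdepth 1 -type f -name '*.tbl.*' | wc -l | tr -d ' '
   }
   
   # Return non-zero (and print a warning) if any raw .tbl files remain.
   ready_for_upload() {
       leftover="$(count_tbl_files "$1")"
       if [ "$leftover" -ne 0 ]; then
           echo "found $leftover leftover .tbl files in $1" >&2
           return 1
       fi
       return 0
   }
   
   # Example usage (bucket name and prefix are placeholders):
   #   ready_for_upload data && \
   #     aws s3 cp data s3://your-bucket-name/top-level-folder/ --recursive --dryrun
   ```
   
   The `--dryrun` flag makes `aws s3 cp` list the operations it would perform without transferring anything, which is a cheap way to review the file list before committing to a large upload.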
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

