Smotrov opened a new issue, #10897:
URL: https://github.com/apache/datafusion/issues/10897
I'm using Rust and am new to DataFusion.
I need to repartition a large dataset, hundreds of GB in size. It is stored on
S3 as multiple compressed packet files.
It should be partitioned by the value of a column. Here is what I'm doing:
```rust
// Define the partitioned listing table
let listing_options = ListingOptions::new(file_format)
    .with_table_partition_cols(part)
    .with_target_partitions(1)
    .with_file_extension(".ndjson.zst");

ctx.register_listing_table(
    "data",
    format!("s3://{BUCKET_NAME}/data_lake/data_warehouse"),
    listing_options,
    Some(schema),
    None,
)
.await?;

let df = ctx
    .sql(
        r#"
        SELECT
            SUBSTRING("OriginalRequest", 9, 3) as dep, *
        FROM data
        WHERE
            /*partitions predicates here*/
        "#,
    )
    .await?;

let s3 = AmazonS3Builder::new()
    .with_bucket_name(save_bucket_name)
    .with_region(REGION)
    .build()?;

// Register the S3 store in the DataFusion context
let path = format!("s3://{save_bucket_name}");
let s3_url = Url::parse(&path).unwrap();
let arc_s3 = Arc::new(s3);
ctx.runtime_env()
    .register_object_store(&s3_url, arc_s3.clone());

// Write the data as JSON partitioned by `dep`
let output_path = "s3://my_bucket/output/json/";

// Write as JSON to S3
let options = DataFrameWriteOptions::new()
    .with_partition_by(vec!["dep".to_string()]);
let mut json_options = JsonOptions::default();
json_options.compression = CompressionTypeVariant::ZSTD;

df
    .write_json(&output_path, options, Some(json_options))
    .await?;
```
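To make the intended partitioning concrete, here is a minimal plain-Rust sketch of how the `dep` key comes out of `SUBSTRING("OriginalRequest", 9, 3)`. SQL `SUBSTRING` positions are 1-based, so this takes characters 9 through 11. The helper names `sql_substring` and `partition_dir` and the sample request string are hypothetical. The Hive-style `dep=<value>/` directory layout is my assumption about what the partitioned write produces.

```rust
/// Mimics SQL `SUBSTRING(s, start, len)` for a 1-based `start >= 1`,
/// counting characters rather than bytes.
fn sql_substring(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start.saturating_sub(1)).take(len).collect()
}

/// Builds the directory a row would land in, assuming a Hive-style
/// `dep=<value>/` layout under `base` (an assumption, not confirmed here).
fn partition_dir(base: &str, original_request: &str) -> String {
    let dep = sql_substring(original_request, 9, 3);
    format!("{}/dep={}/", base.trim_end_matches('/'), dep)
}

fn main() {
    // Hypothetical value of the "OriginalRequest" column.
    let request = "REQ-2024ABC-000123";
    println!("{}", partition_dir("s3://my_bucket/output/json/", request));
}
```

With this sample value, characters 9 to 11 are `ABC`, so the row would be expected under `dep=ABC/`.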
Will this consume all available memory and fail, or will it run in a
streaming fashion?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]