GitHub user gwik closed a discussion: High latency loading parquet file on GCS.
Hi,
First, let me thank you for this amazing project.
I was experimenting with reading parquet from GCS and the performance looks
very poor compared to downloading the file and loading it from disk.
In the example below, loading my 10MB parquet file takes around 10s, although
downloading the file takes less than a second. It is therefore much faster to
download the file and then load it from disk (1s + ~70ms).
I added tracing to the object store and I see more than 506 `get_opts` calls.
I was wondering why there were so many range requests, and what determines the
chunk size. Is it controlled by some parameters, or is it a consequence of how
the file was written (record batches)?
Thanks for your help.
```rust
use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, TimeUnit};
use datafusion::execution::object_store::{DefaultObjectStoreRegistry, ObjectStoreRegistry};
use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
use datafusion::prelude::*;
use object_store::gcp::GoogleCloudStorageBuilder;

// `BUCKET_NAME` and `args.input` come from the surrounding program;
// `timeit!` is a local macro that prints how long the wrapped expression takes.

// Register a GCS object store for gs://<bucket>/ URLs.
let store = Arc::new(
    GoogleCloudStorageBuilder::new()
        .with_bucket_name(BUCKET_NAME)
        .build()?,
);
let url = format!("gs://{BUCKET_NAME}/").parse()?;
let registry = DefaultObjectStoreRegistry::new();
registry.register_store(&url, store);

let runtime_config =
    RuntimeConfig::default().with_object_store_registry(Arc::new(registry));
let runtime_env = RuntimeEnv::new(runtime_config)?;
let ctx = SessionContext::new_with_config_rt(
    SessionConfig::default(),
    Arc::new(runtime_env),
);

// Open the parquet file straight from GCS.
let df = timeit!(
    "read parquet",
    ctx.read_parquet(
        args.input,
        ParquetReadOptions {
            // file_sort_order: vec![vec![col("time").sort(true, true)]],
            parquet_pruning: true.into(),
            ..Default::default()
        },
    )
    .await?
);
println!("{schema}", schema = df.schema());

// Project a few columns and materialize the result with cache().
let df = timeit!(
    "projection",
    df.select(vec![
        cast(
            col("time"),
            DataType::Timestamp(TimeUnit::Millisecond, None),
        )
        .alias("time"),
        col("asset_id"),
        col("asset_type"),
        col("value"),
    ])?
    .cache()
    .await?
);
```
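In case it matters, this is the kind of knob I was hoping to find. I am not sure it is the right one, but DataFusion seems to expose a `metadata_size_hint` option for the parquet reader; a minimal sketch, assuming the config path below exists in the version I am on:

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

// Sketch only: ask the parquet reader to optimistically fetch the last N bytes
// of the file so the footer/metadata can be read with fewer range requests.
// The 512 KiB value here is an arbitrary guess, not a recommendation.
let mut config = SessionConfig::new();
config.options_mut().execution.parquet.metadata_size_hint = Some(512 * 1024);
let ctx = SessionContext::new_with_config(config);
```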
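To check whether the layout of the file itself explains the number of requests, I can inspect the footer of the downloaded copy; a rough sketch using the `parquet` crate directly (the path is a placeholder):

```rust
use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

// Count row groups and columns in the locally downloaded copy. If the writer
// flushed many small record batches, the file can contain many row groups,
// and each projected column chunk of each row group is typically fetched as
// its own byte range.
let file = File::open("local_copy.parquet")?; // placeholder path
let reader = SerializedFileReader::new(file)?;
let meta = reader.metadata();
println!(
    "row groups: {}, columns: {}",
    meta.num_row_groups(),
    meta.file_metadata().schema_descr().num_columns()
);
```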
GitHub link: https://github.com/apache/datafusion/discussions/8058