GitHub user gwik closed a discussion: High latency loading parquet file on GCS.

Hi,

First, let me thank you for this amazing project.

I was experimenting with reading Parquet from GCS, and the performance looks 
very poor compared to downloading the file and loading it from disk.

In the example below, loading my 10 MB Parquet file takes around 10 s, although 
downloading the file takes less than a second. It is therefore much faster to 
download the file and then load it from disk (1 s + ~70 ms).

I added tracing to the object store and I see more than 506 `get_opts` calls.
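
A rough back-of-the-envelope check of these numbers, assuming the requests run strictly one after another and account for the whole 10 s (both are assumptions; the requests may overlap in practice). The helper name `avg_latency_per_call` is made up for this illustration:

```rust
use std::time::Duration;

/// Average time per request if `calls` requests run strictly sequentially
/// and together account for all of `total`. Both are simplifying assumptions.
fn avg_latency_per_call(total: Duration, calls: u32) -> Duration {
    total / calls
}

fn main() {
    // Figures from the measurements above: ~10 s total, 506 `get_opts` calls.
    let total = Duration::from_secs(10);
    let calls = 506;

    let per_call = avg_latency_per_call(total, calls);
    println!("~{:.1} ms per call", per_call.as_secs_f64() * 1e3);
}
```

Under those assumptions each call costs roughly 20 ms, which would be consistent with the load time being dominated by per-request round trips rather than by the amount of data transferred.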

I was wondering why there are so many range requests, and what determines the 
chunk size.
Is it controlled by some parameters, or is it a consequence of how the file 
was written (record batches)?

Thanks for your help.

```rust
    let store = Arc::new({
        GoogleCloudStorageBuilder::new()
            .with_bucket_name(BUCKET_NAME)
            .build()?
    });
    let url = format!("gs://{BUCKET_NAME}/").parse()?;
    let registry = DefaultObjectStoreRegistry::new();
    registry.register_store(&url, store);
    let runtime_config =
        RuntimeConfig::default().with_object_store_registry(Arc::new(registry));
    let runtime_env = RuntimeEnv::new(runtime_config)?;

    let ctx = SessionContext::new_with_config_rt(
        SessionConfig::default(),
        Arc::new(runtime_env),
    );
    let df = timeit!(
        "read parquet",
        ctx.read_parquet(
            args.input,
            ParquetReadOptions {
                // file_sort_order: vec![vec![col("time").sort(true, true)]],
                parquet_pruning: true.into(),
                ..Default::default()
            },
        )
        .await?
    );

    println!("{schema}", schema = df.schema());

    let df = timeit!(
        "projection",
        df.select(vec![
            cast(
                col("time"),
                DataType::Timestamp(TimeUnit::Millisecond, None),
            )
            .alias("time"),
            col("asset_id"),
            col("asset_type"),
            col("value"),
        ])?
        .cache()
        .await?
    );

```

GitHub link: https://github.com/apache/datafusion/discussions/8058
