That 1073.3 MiB isn't much bigger than spark.driver.maxResultSize (1024.0 MiB);
can't you just increase that config to a larger value?
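
For example, here is a minimal sketch of raising that limit when the session is
created; the builder call, the app name, and the 4g value below are only
illustrative assumptions, not taken from your setup:

```python
from pyspark.sql import SparkSession

# Illustrative only: raise the driver-side result size cap before the stream
# starts. "4g" is an arbitrary example value; "0" removes the limit entirely.
spark_session = (
    SparkSession.builder
    .appName("json-to-iceberg-stream")  # hypothetical app name
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)
```

The same property can also go into spark-defaults.conf or be passed with
--conf on spark-submit; as far as I know it generally has to be in place when
the driver starts rather than changed on a running session.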

/ Wei

On Wed, Apr 16, 2025 at 03:37, Anastasiia Sokhova
<anastasiia.sokh...@honic.eu.invalid> wrote:

> Dear Spark Community,
>
>
>
> I run a Structured Streaming query to read JSON files from S3 into an Iceberg
> table. This is my query:
>
>
>
> ```python
> import uuid
>
> from pyspark.sql.functions import input_file_name
>
> # Read the JSON files as a stream and tag each row with its source file path.
> stream_reader = (
>     spark_session.readStream.format("json")
>     .schema(schema)
>     .option("maxFilesPerTrigger", 256_000)
>     .option("basePath", "s3a://test-bucket/root_dir/")
>     .load("s3a://test-bucket/root_dir/2025/04/")
>     .coalesce(8)
>     .withColumn("object_key", input_file_name())
> )
>
> # Append each micro-batch to the Iceberg table, triggering every 10 seconds.
> stream = (
>     stream_reader.writeStream.queryName("test_stream")
>     .format("iceberg")
>     .outputMode("append")
>     .option("checkpointLocation", f"s3a://test-bucket/checkpoints/{uuid.uuid4()}/")
>     .trigger(processingTime="10 seconds")
>     .toTable(target_table_full_name)
> )
> ```
>
>
>
> My data on S3 has this structure:
>
> ```
> root_dir/
> └── 2025/
>     └── 04/
>         ├── 15/
>         │   ├── 123e4567-e89b-12d3-a456-426614174003.json
>         │   └── 123e4567-e89b-12d3-a456-426614174004.json
>         ├── 16/
>         │   ├── 123e4567-e89b-12d3-a456-426614174000.json
>         │   ├── 123e4567-e89b-12d3-a456-426614174001.json
>         │   └── 123e4567-e89b-12d3-a456-426614174002.json
>         └── 17/
>             ├── 123e4567-e89b-12d3-a456-426614174005.json
>             └── 123e4567-e89b-12d3-a456-426614174006.json
> ```
>
>
>
> There are millions of these files, each roughly 1.5 KB in size.
>
>
>
> I encounter issues with the initial listing: when I start the stream, I see
> this log:
>
> ```
> Total size of serialized results of 51 tasks (1073.3 MiB) is bigger than spark.driver.maxResultSize (1024.0 MiB)
> ```
>
>
>
> 51 is the number of subdirectories in my test setup. It seems that Spark
> recognises the subdirectories as partitions and does the listing per
> partition, but in the end still aggregates everything on the driver. This
> error happens for a total of 5.5 million files.
>
> Setting maxFilesPerTrigger does not help to limit this initial listing
> either.
>
>
>
> Please give me a hint on how to handle this initial listing for potentially
> billions of files.
>
>
>
> My setup is a standalone Spark 3.5.1 cluster with Spark Connect.
>
>
>
> Best regards,
>
> Anastasiia
>
