On Sat, Feb 17, 2024 at 3:04 AM Рамик И wrote:
>
> Hi
> I'm using Spark Streaming to read from Kafka and write to S3. Sometimes the
> write fails with org.apache.hadoop.fs.FileAlreadyExistsException.
>
> Spark version: 3.5.0
> Scala version: 2.13.8
> Cluster: k8s
>
> libraryDep
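
For reference, a minimal sketch of the kind of job described, assuming
Structured Streaming with the spark-sql-kafka-0-10 connector. The broker
address, topic, and s3a paths are hypothetical placeholders, since the
poster's actual code is cut off above:

import org.apache.spark.sql.SparkSession

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-to-s3").getOrCreate()

    // Read raw Kafka records as a streaming DataFrame.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
      .option("subscribe", "events")                    // hypothetical topic
      .load()

    // Write to S3 as Parquet. FileAlreadyExistsException often shows up when
    // a retried task collides with a file left behind by a failed attempt, so
    // the committer settings and checkpoint location are worth checking first.
    stream.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3a://my-bucket/output")                    // hypothetical
      .option("checkpointLocation", "s3a://my-bucket/checkpoints") // hypothetical
      .start()
      .awaitTermination()
  }
}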
Hi Shay,
Maybe this is related to the small number of output rows (1,250) of the
last exchange step that consumes those 60GB of shuffle data.
It looks like your outer transformation is something like
df.groupBy($"id").agg(collect_list($"prop_name")).
Have you tried adding a repartition on the grouping key as an attempt to
spread that shuffle across more tasks?
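
A hedged sketch of what that could look like, assuming spark is the active
SparkSession; df, id, and prop_name are the names from your snippet, the
input path is a stand-in, and 400 partitions is purely illustrative:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // enables the $"col" column syntax

val df = spark.read.parquet("s3a://my-bucket/input") // stand-in for your real source

// Repartition by the grouping key so the 60GB shuffle is spread across more
// tasks before the aggregation collapses it to ~1,250 rows.
val result = df
  .repartition(400, $"id")
  .groupBy($"id")
  .agg(collect_list($"prop_name"))

Note that collect_list still pulls every value for a key into a single task,
so if a few ids dominate the data, the largest tasks will stay slow regardless
of the partition count.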