Hi Raghavendra,
Yes, we are trying to reduce the number of files in Delta as well (the
small file problem [0][1]).
We already have a scheduled app to compact files, but the number of
files is still large, at 14K files per day.
[0]: https://docs.delta.io/latest/optimizations-oss.html#language-pyt
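For illustration, a minimal sketch of a scheduled compaction job along these
lines, assuming the Python OPTIMIZE API in delta-spark >= 2.0 and a
hypothetical table path:

# Sketch only: bin-packing compaction of small files in a Delta table.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-compaction")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical path; replace with the real table location.
table = DeltaTable.forPath(spark, "s3://bucket/tables/events")

# Rewrite many small files into fewer large ones.
table.optimize().executeCompaction()

# Optionally drop files no longer referenced, keeping 7 days of history.
table.vacuum(168)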
Hi all on user@spark:
We are looking for advice and suggestions on how to tune the
.repartition() parameter.
We are using Spark Streaming in our data pipeline to consume messages
and persist them to a Delta Lake table
(https://delta.io/learn/getting-started/).
We read messages from a Kafka topic, then
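For illustration, a rough sketch of a streaming job of this shape (Kafka
source, Delta sink), with hypothetical broker, topic, and path names; the
repartition() value is the parameter we are unsure how to choose:

# Sketch only: read from Kafka, repartition, write to a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-delta").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
    .option("subscribe", "events")                      # hypothetical topic
    .load()
)

messages = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")

query = (
    messages
    .repartition(16)  # roughly controls the number of files written per micro-batch
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://bucket/checkpoints/events")  # hypothetical
    .start("s3://bucket/tables/events")                              # hypothetical
)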
Hi,
What is the purpose for which you want to use repartition()? Is it to reduce
the number of files in Delta?
Also note that there is an alternative option of using coalesce() instead
of repartition().
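For illustration, a minimal sketch of the difference on a batch DataFrame:
repartition() performs a full shuffle and produces evenly sized partitions,
while coalesce() only merges existing partitions and avoids a shuffle when
reducing the count:

# Sketch only: comparing repartition() and coalesce() on a toy DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

df = spark.range(1_000_000)  # stand-in for the real data

shuffled = df.repartition(8)  # full shuffle, evenly sized partitions
merged = df.coalesce(8)       # no shuffle, partitions may be uneven

print(shuffled.rdd.getNumPartitions(), merged.rdd.getNumPartitions())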
--
Raghavendra
On Thu, Oct 5, 2023 at 10:15 AM Shao Yang Hong wrote:
> Hi all on user@spark:
>
>
Hello,
Due to the way Spark implements shuffle, the loss of an executor sometimes
results in the recomputation of the partitions that were lost.
The definition of a *partition* is the tuple (RDD-ids, partition id),
where RDD-ids is a sequence of RDD ids.
In our system, we define the unit of work performed fo
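For illustration, a rough PySpark sketch of inspecting the (RDD-ids,
partition id) identity described above on a toy RDD: the id of the final RDD
in the lineage, the range of partition ids, and the chain of RDDs Spark would
walk when recomputing a lost partition:

# Sketch only: inspecting RDD ids and partition counts on a toy RDD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-inspect").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100), 4).map(lambda x: x * 2)

print(rdd.id())                # id of the final RDD in the lineage
print(rdd.getNumPartitions())  # partition ids run from 0 to this value - 1
print(rdd.toDebugString())     # the lineage Spark would recompute from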