FYI, SPARK-30294 [1] is merged and will be available in Spark 3.1.0. It halves
the number of temp files the state store creates when you use streaming
aggregation.
1. https://issues.apache.org/jira/browse/SPARK-30294
On Thu, Oct 8, 2020 at 11:55 AM Jungtaek Lim wrote:
I can't spend too much time on explaining these one by one. I strongly
encourage you to do a deep dive instead of just looking around, since you want
to know the "details" - that's how open source works.
I'll go through a general explanation instead of replying inline; I'd probably
write a blog post if the [...]
Hi Jungtaek,
*> I meant the subdirectory inside the directory you're providing as
"checkpointLocation", as there're several directories in that directory...*
There are two:
*my-spark-checkpoint-dir/MainApp*
created by sparkSession.sparkContext().setCheckpointDir();
contains only an empty subdir [...]
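For readers following along, here is a minimal sketch of the two different
"checkpoint" settings involved. setCheckpointDir() affects only RDD
checkpointing, while each streaming query writes its offsets, commits and
state under its own "checkpointLocation". The "writer" path and the
`aggregated` dataset are illustrative stand-ins, not the original code:

    // RDD checkpoint directory -- the "MainApp" subdir mentioned above.
    // Spark creates a randomly named (UUID) subdirectory under it immediately,
    // which stays empty unless rdd.checkpoint()/dataset.checkpoint() is called.
    sparkSession.sparkContext().setCheckpointDir("my-spark-checkpoint-dir/MainApp");

    // Streaming query checkpoint -- offsets/, commits/, state/ and metadata are
    // written under this per-query directory on every micro-batch.
    // "writer" is a stand-in name; `aggregated` is a hypothetical streaming Dataset.
    aggregated.writeStream()
        .option("checkpointLocation", "my-spark-checkpoint-dir/writer")
        .format("console")
        .start();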
Replied inline.
On Tue, Oct 6, 2020 at 6:07 AM Sergey Oboguev wrote:
Hi Jungtaek,
Thanks for your response.
*> you'd want to dive inside the checkpoint directory and have separate
numbers per top-subdirectory*
All the checkpoint store numbers are solely for the subdirectory set by
option("checkpointLocation", .. checkpoint dir for writer ... )
Other subdirectories [...]
First of all, you'd want to divide these numbers by the number of
micro-batches, as file creations in the checkpoint directory occur similarly
for every micro-batch.
Second, you'd want to dive inside the checkpoint directory and get separate
numbers per top-level subdirectory.
After that we can see whether [...]
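One hypothetical way to collect such per-subdirectory numbers (a sketch using
only standard java.nio; note it sees just the files that currently exist, so
temp files created and deleted within a micro-batch would need OS-level IO
tracing instead):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    // Counts files under each top-level entry of a streaming checkpoint
    // location (typically offsets/, commits/, state/, sources/ and a
    // "metadata" file). Divide the counts by the number of micro-batches run.
    public class CheckpointStats {
        public static void main(String[] args) throws IOException {
            Path root = Paths.get(args[0]); // the "checkpointLocation" directory
            try (DirectoryStream<Path> top = Files.newDirectoryStream(root)) {
                for (Path sub : top) {
                    try (Stream<Path> files = Files.walk(sub)) {
                        long count = files.filter(Files::isRegularFile).count();
                        System.out.println(sub.getFileName() + ": " + count + " file(s)");
                    }
                }
            }
        }
    }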
I am trying to run a Spark structured streaming program simulating a basic
scenario of ingesting events and calculating aggregates on a window with a
watermark, and I am observing an inordinate amount of disk IO performed by
Spark.
The basic structure of the program is like this:
sparkSession = SparkSes[...]
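A minimal sketch of a program with the structure described above; the rate
source and all names/paths are illustrative stand-ins, not the original code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.window;

    public class MainApp {
        public static void main(String[] args) throws Exception {
            SparkSession sparkSession = SparkSession.builder()
                .appName("MainApp")
                .master("local[*]")
                .getOrCreate();

            // Stand-in event source; the original program ingests its own events.
            Dataset<Row> events = sparkSession.readStream()
                .format("rate")
                .option("rowsPerSecond", 1000)
                .load();

            // Windowed aggregation with a watermark, as described above.
            Dataset<Row> aggregated = events
                .withWatermark("timestamp", "10 seconds")
                .groupBy(window(col("timestamp"), "1 minute"))
                .count();

            StreamingQuery query = aggregated.writeStream()
                .outputMode("update")
                .option("checkpointLocation", "my-spark-checkpoint-dir/writer") // stand-in path
                .format("console")
                .start();

            query.awaitTermination();
        }
    }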