Re: Excessive disk IO with Spark structured streaming

2020-11-05 Thread Jungtaek Lim
FYI, SPARK-30294 is merged and will be available in Spark 3.1.0. This cuts the number of temp files for the state store in half when you use streaming aggregation. 1. https://issues.apache.org/jira/browse/SPARK-30294 On Thu, Oct 8, 2020 at 11:55 AM Jungtaek Lim wrote: > I can't spend too much …
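For context, here is a minimal Java sketch of the kind of streaming aggregation the fix affects (the app name, source, and checkpoint path are hypothetical): any stateful groupBy over a stream persists running state to the state store on every micro-batch.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import static org.apache.spark.sql.functions.col;

    public class StreamingAggSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("StreamingAggSketch")   // hypothetical name
                    .master("local[2]")
                    .getOrCreate();

            // Built-in "rate" test source: emits (timestamp, value) rows.
            Dataset<Row> events = spark.readStream().format("rate").load();

            // A plain groupBy over a stream is a stateful streaming aggregation;
            // its running counts live in the state store, checkpointed per micro-batch.
            Dataset<Row> counts = events.groupBy(col("value").mod(10)).count();

            StreamingQuery query = counts.writeStream()
                    .outputMode("update")
                    .format("console")
                    .option("checkpointLocation", "/tmp/agg-checkpoint") // hypothetical path
                    .start();
            query.awaitTermination();
        }
    }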

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Jungtaek Lim
I can't spend too much time on explaining things one by one. I strongly encourage you to do a deep dive yourself instead of just looking around, since you want to know the "details" - that's how open source works. I'll go through a general explanation instead of replying inline; probably I'd write a blog doc if the …

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Sergey Oboguev
Hi Jungtaek, > I meant the subdirectory inside the directory you're providing as "checkpointLocation", as there are several directories in that directory... There are two: my-spark-checkpoint-dir/MainApp, created by sparkSession.sparkContext().setCheckpointDir(), contains only an empty subdir with …
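To make the distinction concrete, a minimal Java sketch of the two unrelated checkpoint settings being discussed (directory names are hypothetical): sparkContext().setCheckpointDir() covers batch RDD/Dataset checkpointing only, while the writer's "checkpointLocation" option is where structured streaming keeps its offsets/, commits/, and state/ subdirectories.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class CheckpointDirsSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("MainApp").master("local[2]").getOrCreate();

            // 1) Batch RDD/Dataset checkpoint dir (lineage truncation only);
            //    structured streaming state does NOT go here.
            spark.sparkContext().setCheckpointDir("/tmp/my-spark-checkpoint-dir/MainApp");

            // 2) Streaming checkpoint: offsets/, commits/, state/ (and metadata)
            //    for this query all live under the writer's checkpointLocation.
            Dataset<Row> stream = spark.readStream().format("rate").load();
            StreamingQuery q = stream.writeStream()
                    .format("console")
                    .option("checkpointLocation", "/tmp/my-spark-checkpoint-dir/writer")
                    .start();
            q.awaitTermination();
        }
    }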

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
Replied inline. On Tue, Oct 6, 2020 at 6:07 AM Sergey Oboguev wrote: > Hi Jungtaek, thanks for your response. > you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory > All the checkpoint store numbers are solely for the subdirectory set by …

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Sergey Oboguev
Hi Jungtaek, Thanks for your response. > you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory All the checkpoint store numbers are solely for the subdirectory set by option("checkpointLocation", ... checkpoint dir for writer ...). Other subdirectories …

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Jungtaek Lim
First of all, you'd want to divide these numbers by the number of micro-batches, as file creations in the checkpoint directory occur similarly per micro-batch. Second, you'd want to dive inside the checkpoint directory and have separate numbers per top-subdirectory. After that we can see whether …
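A small standalone Java sketch (paths are hypothetical) for producing those per-top-subdirectory numbers by counting files under each top-level directory of the checkpoint location:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class CheckpointFileCounts {
        public static void main(String[] args) throws IOException {
            // Hypothetical default; pass the real checkpointLocation as argv[0].
            Path root = Paths.get(args.length > 0 ? args[0] : "/tmp/my-checkpoint");
            try (Stream<Path> tops = Files.list(root)) {
                tops.filter(Files::isDirectory).forEach(dir -> {
                    try (Stream<Path> files = Files.walk(dir)) {
                        long n = files.filter(Files::isRegularFile).count();
                        System.out.printf("%-10s %d files%n", dir.getFileName(), n);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }

Dividing each count by the number of completed micro-batches (e.g. the number of files under commits/, which gains one file per committed batch) gives a per-batch rate.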

Excessive disk IO with Spark structured streaming

2020-10-04 Thread Sergey Oboguev
I am trying to run a Spark structured streaming program simulating a basic scenario of ingesting events and calculating aggregates on a window with a watermark, and I am observing an inordinate amount of disk IO from Spark. The basic structure of the program is like this: sparkSession = SparkSession…
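The full program is truncated above; below is a hedged Java reconstruction of the described structure. The real event source, schema, window size, watermark delay, and paths are not in the original, so the "rate" source and the names here stand in for them.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;
    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.window;

    public class WindowedAggSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("WindowedAggSketch").master("local[2]").getOrCreate();

            // Stand-in source; the post's real event source is not shown.
            Dataset<Row> events = spark.readStream().format("rate").load();

            // Windowed aggregation with a watermark, as the post describes.
            Dataset<Row> agg = events
                    .withWatermark("timestamp", "10 seconds")
                    .groupBy(window(col("timestamp"), "1 minute"))
                    .count();

            StreamingQuery query = agg.writeStream()
                    .outputMode("append")   // valid once a watermark is defined
                    .format("console")
                    .option("checkpointLocation", "/tmp/windowed-agg-checkpoint") // hypothetical
                    .start();
            query.awaitTermination();
        }
    }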