If you are asking about the total number of objects the state can hold, that depends on the executor memory available on your cluster, beyond the memory needed for regular processing. By default the state is kept in executor memory and checkpointed to HDFS, from where it is recovered when processing resumes. If you maintain a million objects of 20 bytes each, that is about 20 MB, which is quite reasonable for an executor allocated a few GB of memory. But if you need to store heavy objects, you need to do the math. There is also a cost in transferring this state back and forth to the HDFS checkpoint location.
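To make "do the math" concrete, here is a minimal back-of-envelope sketch of that estimate. The function and its inputs are illustrative, not part of any Spark API; note the real JVM footprint will be larger than the raw payload because of object headers and state-store bookkeeping.

```python
# Rough estimate of raw streaming-state size, as in the example above.
# This is only the payload size; actual executor memory use will be higher.

def state_size_mb(num_objects: int, bytes_per_object: int) -> float:
    """Approximate raw state size in megabytes (1 MB = 10**6 bytes)."""
    return num_objects * bytes_per_object / 1e6

# One million state objects of 20 bytes each is about 20 MB of payload.
print(state_size_mb(1_000_000, 20))  # → 20.0
```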
Regards
Srini

On Tue, May 12, 2020 at 2:48 AM tleilaxu <tleil...@gmail.com> wrote:
> Hi,
> I am tracking states in my Spark streaming application with
> MapGroupsWithStateFunction described here:
> https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/streaming/GroupState.html
> Which are the limiting factors on the number of states a job can track at
> the same time? Is it memory? Could it be a bounded data structure in the
> internal implementation? Anything else ...
> You might have valuable input on this while I am trying to set up and test
> this.
>
> Thanks,
> Arnold