If you are asking about the total number of objects the state can hold, that
depends on the executor memory available on your cluster, beyond the memory
required for the rest of the processing. With the default state store, state
is kept in executor memory between micro-batches and checkpointed to HDFS,
and it is retrieved again while processing subsequent events.
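
To make that concrete, here is a minimal sketch in Java against the API you
linked. The socket source, key scheme, CountState class, and checkpoint path
are all hypothetical, just to show where the per-key state lives and where
it gets checkpointed:

import java.io.Serializable;
import java.util.Iterator;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.GroupState;
import org.apache.spark.sql.streaming.GroupStateTimeout;

public class StateSizeSketch {

  // Deliberately small per-key state: two longs, a few dozen bytes
  // on-heap once object headers are counted.
  public static class CountState implements Serializable {
    private long count;
    private long lastSeenMs;
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }
    public long getLastSeenMs() { return lastSeenMs; }
    public void setLastSeenMs(long lastSeenMs) { this.lastSeenMs = lastSeenMs; }
  }

  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("state-size-sketch")
        .getOrCreate();

    // Hypothetical source: one key per line from a local socket.
    Dataset<String> keys = spark.readStream()
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
        .as(Encoders.STRING());

    // Count events per key; a CountState instance for every distinct
    // key is held in executor memory between micro-batches.
    MapGroupsWithStateFunction<String, String, CountState, String> countFn =
        (String key, Iterator<String> values, GroupState<CountState> state) -> {
          CountState s = state.exists() ? state.get() : new CountState();
          while (values.hasNext()) {
            values.next();
            s.setCount(s.getCount() + 1);
          }
          s.setLastSeenMs(System.currentTimeMillis());
          state.update(s); // persisted to the checkpoint on commit
          return key + " -> " + s.getCount();
        };

    Dataset<String> counts = keys
        .groupByKey((MapFunction<String, String>) k -> k, Encoders.STRING())
        .mapGroupsWithState(countFn,
            Encoders.bean(CountState.class),
            Encoders.STRING(),
            GroupStateTimeout.NoTimeout());

    counts.writeStream()
        .outputMode("update")
        .format("console")
        // State deltas/snapshots land under this path.
        .option("checkpointLocation", "hdfs:///tmp/checkpoints/state-size-sketch")
        .start()
        .awaitTermination();
  }
}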
If you maintain a million objects at 20 bytes each, that is about 20 MB of
state, which is quite reasonable to hold in an executor allocated a few GB
of memory. But if you need to store heavy objects, you need to do the math.
There is also a cost in transferring this state back and forth to the HDFS
checkpoint location.
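
The back-of-envelope math, with illustrative numbers only (plug in your own
key count and per-state size):

public class StateSizing {
  public static void main(String[] args) {
    long keys = 1_000_000L;    // distinct groups you expect to track
    long smallState = 20L;     // bytes per key: a couple of numeric fields
    long heavyState = 2_048L;  // bytes per key: e.g. a buffered event list

    System.out.printf("small: ~%d MB%n", keys * smallState / 1_000_000); // ~20 MB
    System.out.printf("heavy: ~%.1f GB%n", keys * heavyState / 1e9);     // ~2.0 GB
  }
}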

Regards
Srini

On Tue, May 12, 2020 at 2:48 AM tleilaxu <tleil...@gmail.com> wrote:

> Hi,
> I am tracking states in my Spark streaming application with
> MapGroupsWithStateFunction described here:
> https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/streaming/GroupState.html
> What are the limiting factors on the number of states a job can track at
> the same time? Is it memory? Could it be a bounded data structure in the
> internal implementation? Anything else?
> You might have valuable input on this while I am trying to set up and test
> this.
>
> Thanks,
> Arnold
>
