For now, you may want to consider using a third-party implementation of a RocksDB state store (either an open source implementation, or a commercial one if you use Databricks or Qubole) if the state doesn't fit in executor memory.
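Swapping in a custom provider is just a configuration change. As a minimal sketch (the provider class name below is an assumption based on the chermenin/spark-states project; verify the exact class name against that project's README for the version you use):

```shell
# Point Structured Streaming at a custom state store provider.
# The config key is standard Spark; the class name is project-specific.
spark-submit \
  --conf spark.sql.streaming.stateStore.providerClass=ru.chermenin.spark.sql.execution.streaming.state.RocksDbStateStoreProvider \
  --jars spark-states.jar \
  your-streaming-app.jar
```

Note that the state store provider cannot be changed between restarts of the same query against an existing checkpoint, so plan the switch before the query first runs (or see the note on state migration below).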
- https://github.com/chermenin/spark-states
- https://github.com/qubole/spark-state-store

Fortunately, the Spark community has had a discussion about providing a RocksDB state store out of the box, and the discussion went positively. Worth noting that even in the happy case the community would introduce it in 3.2.0, which is months away (the release phase for 3.1 is still ongoing, so probably add 6+ months from now). So if you're encountering the problem at production level, or waiting for Spark 3.2 is not an option, you still have no choice but to try out the third-party implementations.

In the meantime, I'm planning to look into "state migration", which would let users migrate their state from state store provider A to provider B. The hopeful plan is to support migration between any two arbitrary providers.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Feb 11, 2021 at 5:01 PM Kuttaiah Robin <kutta...@gmail.com> wrote:
> Hello,
>
> I have a use case where I need to read events (non-correlated) from a Kafka
> topic, then correlate them and push them forward to another topic.
>
> I use Spark Structured Streaming with FlatMapGroupsWithStateFunction along
> with GroupStateTimeout.ProcessingTimeTimeout(). After each timeout, I do
> some correlation on the group's events and push the correlated events
> forward to another topic. Non-correlated events are stored in the state
> until they are correlated by a future set of events.
>
> At times there are many non-correlated events, and the size of the
> non-correlated events held in the state grows too big.
>
> Does anyone know how to handle this use case, or will Spark take care of
> handling the state when it grows big?
>
> Thanks in advance.
>
> regards,
> Robin Kuttaiah
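For reference, the pattern described in the quoted message can be sketched in Scala roughly as follows. This is a hedged, minimal outline rather than a runnable program (it assumes a live SparkSession and stream); `canCorrelate` is a hypothetical predicate standing in for the application's correlation logic. The relevant point for state size is that the function trims the buffer via `state.update` or `state.remove()` whenever events are correlated, so only genuinely pending events stay in the store:

```scala
import java.sql.Timestamp
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

case class Event(key: String, payload: String, ts: Timestamp)
case class Correlated(key: String, payloads: Seq[String])

// Hypothetical application-specific predicate: which buffered events
// can be correlated and emitted now.
def canCorrelate(e: Event): Boolean = ???

def correlate(
    key: String,
    events: Iterator[Event],
    state: GroupState[Seq[Event]]): Iterator[Correlated] = {
  if (state.hasTimedOut) {
    // On timeout the batch iterator is empty; work from buffered state.
    val buffered = state.getOption.getOrElse(Seq.empty)
    val (done, pending) = buffered.partition(canCorrelate)
    if (pending.isEmpty) {
      state.remove()               // drop state entirely once fully correlated
    } else {
      state.update(pending)        // keep only the still-pending events
      state.setTimeoutDuration("1 minute")
    }
    Iterator.single(Correlated(key, done.map(_.payload)))
  } else {
    // Accumulate new events into the buffer and (re)arm the timeout.
    state.update(state.getOption.getOrElse(Seq.empty) ++ events)
    state.setTimeoutDuration("1 minute")
    Iterator.empty
  }
}

// Wiring (assumes `events: Dataset[Event]` and spark.implicits._ in scope):
// events.groupByKey(_.key)
//   .flatMapGroupsWithState(OutputMode.Append,
//     GroupStateTimeout.ProcessingTimeTimeout)(correlate)
```

Even with this trimming, if the pending set legitimately grows beyond executor memory, only a disk-backed provider such as the RocksDB implementations above will help.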