Hi, I have a Spark Streaming app that uses updateStateByKey to maintain a cache as a JavaPairDStream<String, CacheData>, in which the String is the key and CacheData is a type containing three fields: timestamp, index, and value.
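For reference, CacheData is shaped roughly like this (the field types are my simplification here; the real class has more to it):

  // Simplified sketch; state objects passed through updateStateByKey
  // must be serializable.
  public class CacheData implements java.io.Serializable {
      public long timestamp; // last-update time, used for LRU ordering
      public long index;     // where I'd like to put a global sequence number
      public String value;   // the cached payload (actual type simplified)
  }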
I want to restrict the total number of objects in the cache and prune it based on LRU, i.e., only keep the most recent 50K objects. I've been struggling to come up with a viable solution so far. One approach I thought might work is:

1. Transform the cache DStream by sorting on CacheData.timestamp.
2. Transform the sorted cache DStream by setting each CacheData.index to a contiguous sequence of numbers starting from 0.
3. Transform the updated cache DStream by filtering out any CacheData whose index is 50K or greater.

Steps 1 and 3 are straightforward, but for step 2, how can I map over the CacheData in a DStream and set a "global" sequence number on the index field of each? (A rough sketch of what I have in mind is below.) If this is not a viable solution, how in general would you limit the total number of objects in a DStream maintained by updateStateByKey?

Thanks in advance.
Frank
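P.S. Here's the sketch mentioned above. It leans on rdd.zipWithIndex() inside transformToPair to get a global rank per batch (stateDStream is a placeholder name, and I'm assuming timestamp is a long); I don't know yet whether this is efficient or even the right approach, so treat it as a rough idea rather than working code:

  import org.apache.spark.api.java.JavaPairRDD;
  import org.apache.spark.streaming.api.java.JavaPairDStream;
  import scala.Tuple2;

  JavaPairDStream<String, CacheData> pruned = stateDStream.transformToPair(rdd ->
      rdd
          // Step 1: key each entry by its timestamp and sort, newest first.
          .mapToPair(kv -> new Tuple2<>(kv._2().timestamp, kv))
          .sortByKey(false)
          .values()
          // Step 2: zipWithIndex assigns a global rank (0, 1, 2, ...) that
          // follows the sorted order across partitions -- the "global
          // sequence number" I was after (here I never actually write it
          // back into CacheData.index, since the filter only needs the rank).
          .zipWithIndex()
          // Step 3: keep only the 50K most recent entries.
          .filter(t -> t._2() < 50000L)
          .mapToPair(t -> t._1())
  );

One thing I'm unsure about is the cost: sortByKey plus zipWithIndex run over the whole state RDD on every batch, which seems heavy for what is conceptually just an LRU eviction.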