Hi,

I have a Spark Streaming app that uses updateStateByKey to maintain a cache
of type JavaPairDStream<String, CacheData>, in which the "String" is the key
and CacheData is a type that contains three fields: timestamp, index, and
value.
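For reference, the value type looks roughly like this (a sketch only; the
field types here are my assumption):

```java
// Sketch of the CacheData state value described above; field types are assumptions.
public class CacheData implements java.io.Serializable {
    public long timestamp;  // last-update time, used for LRU ordering
    public long index;      // position after sorting (see the steps below)
    public String value;    // the cached payload; the actual type may differ
}
```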

I want to restrict the total number of objects in the cache and prune it
based on LRU, i.e., keep only the most recent 50K objects.

I've been struggling to come up with a viable solution so far. One approach
I think might work is:

    1. Transform the cache DStream by sorting on CacheData.timestamp, most
recent first.
    2. Transform the sorted DStream by setting each CacheData.index to a
consecutive sequence of numbers starting from 0.
    3. Filter the updated DStream, dropping every CacheData whose index is
50K or greater.
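Outside of Spark, the pruning I have in mind would look roughly like the
following plain-Java sketch (not DStream code; the CacheData fields and the
50K limit are as described above, and the prune method is illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java sketch of the three pruning steps, outside Spark.
public class LruPrune {
    public static class CacheData {
        public long timestamp;
        public long index;
        public String value;
        public CacheData(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    public static List<CacheData> prune(List<CacheData> cache, int limit) {
        // Step 1: sort by timestamp, most recent first.
        List<CacheData> sorted = new ArrayList<>(cache);
        sorted.sort(Comparator.comparingLong((CacheData d) -> d.timestamp).reversed());
        // Step 2: assign a consecutive index starting from 0.
        for (int i = 0; i < sorted.size(); i++) {
            sorted.get(i).index = i;
        }
        // Step 3: drop everything at or beyond the limit.
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

The hard part, of course, is that on a DStream step 2 needs a sequence
number that is global across partitions, which is exactly what I don't know
how to do.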

Steps 1 and 3 are straightforward. For step 2, however, how could I map over
the CacheData objects in a DStream and assign a "global" sequence number to
the index field of each one?

If this is not a viable solution, how, in general, would you limit the total
number of objects in a DStream maintained by updateStateByKey?

Thanks in advance.
Frank



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-limit-the-total-number-of-objects-in-a-DStream-maintained-by-updateStateByKey-tp23118.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

