This should not be happening :| Can you please reproduce this in a smaller program and a dataset that I can use to debug. Cant imaging how this could be happening.
TD On Wed, Aug 6, 2014 at 7:51 AM, Yana Kadiyska <[email protected]> wrote: > Hi folks, hoping someone who works with Streaming can help me out. > > I have the following snippet: > > val stateDstream = > data.map(x => (x, 1)) > .updateStateByKey[State](updateFunc) > > stateDstream.saveAsTextFiles(checkpointDirectory, "partitions_test") > > where data is a RDD of > > case class StateKey(host:String,hour:String,customer:String) > > when I dump out the stream, I see duplicate values in the same partition > (I've bolded the keys that are identical): > > (StateKey(foo.com.br,2014-07-22-18,16),State(43,2014-08-06T14:05:29.831Z)) > (*StateKey*(www.abcd.com > ,2014-07-22-22,25),State(2564,2014-08-06T14:05:29.831Z)) > (StateKey(bar.com,2014-07-04-20,29),State(77,2014-08-06T14:05:29.831Z)) > (*StateKey*(www.abcd.com > ,2014-07-22-22,25),State(1117,2014-08-06T14:05:29.831Z)) > > > I was under the impression that on each batch, the stream will contain a > single RDD with Key-Value pairs, reflecting the latest state of each key. > Am I misunderstanding this? Or is the key equality somehow failing? > > Any tips on this appreciated... > > PS. For completeness State is > case class State(val count:Integer,val update_date:DateTime) > > > >
