That's definitely a good supplement to the current Spark Streaming; I've heard from many people who want to process their data by log time. Looking forward to the code.
Thanks
Jerry

-----Original Message-----
From: Tathagata Das [mailto:tathagata.das1...@gmail.com]
Sent: Thursday, January 29, 2015 10:33 AM
To: Tobias Pfeiffer
Cc: YaoPau; user
Subject: Re: reduceByKeyAndWindow, but using log timestamps instead of clock seconds

Ohhh nice! It would be great if you could share some code soon. It is indeed a very complicated problem, and there is probably no single solution that fits all use cases, so having one way of doing things would be a great reference. Looking forward to that!

On Wed, Jan 28, 2015 at 4:52 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> On Thu, Jan 29, 2015 at 1:54 AM, YaoPau <jonrgr...@gmail.com> wrote:
>>
>> My thinking is to maintain state in an RDD and update and persist it
>> with each 2-second pass, but this also seems like it could get messy.
>> Any thoughts or examples that might help me?
>
> I have just implemented some timestamp-based windowing on DStreams
> (I can't share the code now, but it will be published in a couple of
> months), although under the assumption that items arrive in the
> correct order. The main challenge (a rather technical one) was to keep
> proper state across RDD boundaries and to tell the state "you can mark
> this partial window from the last interval as 'complete' now" without
> shuffling too much data around. For example, if there are some empty
> intervals, you don't know when the next item for the partial window
> will arrive, or whether there will be one at all. I guess if you want
> out-of-order tolerance, it becomes even trickier, as you need to
> define and think about some timeout for partial windows in your
> state...
>
> Tobias
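[Editor's note: to make the approach Tobias describes concrete, here is a minimal sketch of event-time (log-timestamp) windowing on a DStream using updateStateByKey. This is not his implementation (his code was not shared in the thread); the window length, the allowed-lateness timeout, the input shape (key, (eventTimeMillis, count)), and all names below are assumptions made for illustration.]

// A sketch of event-time windowing on a DStream, not Tobias's actual code.
// Assumptions: tumbling windows, per-key watermark, (key, (timestamp, count)) input.
import org.apache.spark.streaming.StreamingContext._ // pair-DStream implicits (Spark 1.x)
import org.apache.spark.streaming.dstream.DStream

// Per-key state: open partial windows (window start -> partial sum), windows
// closed in the last batch, and the largest event timestamp seen so far,
// which serves as a crude per-key watermark.
case class WindowState(open: Map[Long, Long],
                       closed: Seq[(Long, Long)],
                       maxEventTime: Long)

object EventTimeWindows {
  // events: (key, (eventTimeMillis, count)), timestamps parsed from log lines.
  def countsByEventTime(events: DStream[(String, (Long, Long))])
      : DStream[(String, (Long, Long))] = {
    val windowMs   = 10 * 1000L // tumbling window length (assumed)
    val latenessMs = 5 * 1000L  // out-of-order tolerance (assumed)

    val state = events.updateStateByKey[WindowState] {
      (batch: Seq[(Long, Long)], prev: Option[WindowState]) =>
        val old = prev.getOrElse(WindowState(Map.empty, Seq.empty, 0L))
        // Fold this micro-batch into the per-window partial sums.
        val open = batch.foldLeft(old.open) { case (acc, (ts, v)) =>
          val w = ts - (ts % windowMs)
          acc.updated(w, acc.getOrElse(w, 0L) + v)
        }
        val maxTs = (old.maxEventTime +: batch.map(_._1)).max
        // A window is "complete" once the watermark passes its end. With no
        // new data the watermark never advances; that is exactly the
        // empty-interval problem described in the message above.
        val watermark = maxTs - latenessMs
        val (done, stillOpen) =
          open.partition { case (w, _) => w + windowMs <= watermark }
        Some(WindowState(stillOpen, done.toSeq, maxTs))
    }

    // Each completed window is emitted exactly once, in the batch that closed it.
    state.flatMap { case (key, st) =>
      st.closed.map { case (w, sum) => (key, (w, sum)) }
    }
  }
}

Note that updateStateByKey requires checkpointing to be enabled via ssc.checkpoint(...). Also, because the watermark here is tracked per key, a key that stops receiving data never closes its last partial window; a production version would need a global watermark or a processing-time timeout, which is the trade-off Tobias alludes to at the end of his message.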