Yeah... I haven't tried it, but if you set slideDuration == windowDuration, that should prevent overlaps.
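For what it's worth, here's a rough Scala sketch of that idea (the socket source, 1-second batch interval, sixty-second durations, and output path are just placeholders for illustration, not your actual app):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TumblingWindowExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TumblingWindowExample")
    // 1-second batches, as discussed below
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source; the real app would use the Twitter receiver
    val lines = ssc.socketTextStream("localhost", 9999)

    // windowDuration == slideDuration gives non-overlapping "tumbling" windows:
    // each record lands in exactly one 60-second window.
    val batched = lines.window(Seconds(60), Seconds(60))

    // Archive each 60-second "superbatch" as one HDFS output instead of one per DStream batch
    batched.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        rdd.saveAsTextFile(s"hdfs:///archive/tweets-${System.currentTimeMillis()}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}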
Gino B.

> On Jun 8, 2014, at 8:25 AM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>
> I read it more carefully, and window() might actually work for some other
> stuff like logs. (Assuming I can have multiple windows with entirely
> different attributes on a single stream...)
>
> Thanks for that!
>
>
>> On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>>
>> Yes.. but from what I understand that's a "sliding window", so for a window
>> of (60) over (1)-second DStreams, that would save the entire last minute of
>> data once per second. That's more than I need.
>>
>> I think what I'm after is probably updateStateByKey... I want to mutate data
>> structures (probably even graphs) as the stream comes in, but I also want
>> that state to be persistent across restarts of the application (or a parallel
>> version of the app, if possible), so I'd have to save that structure
>> occasionally and reload it as the "primer" on the next run.
>>
>> I was almost going to use HBase or Hive, but they seem to have been
>> deprecated in 1.0.0? Or just late to the party?
>>
>> Also, I've been having trouble deleting Hadoop directories.. the old "two
>> line" examples don't seem to work anymore. I actually managed to fill up the
>> worker instances (I gave them tiny EBS volumes) and I think I crashed them.
>>
>>
>>
>>> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo <lbust...@gmail.com> wrote:
>>>
>>> Have you thought of using window?
>>>
>>> Gino B.
>>>
>>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>>> >
>>> > It's going well enough that this is a "how should I in 1.0.0" rather than
>>> > a "how do I" question.
>>> >
>>> > So I've got data coming in via Streaming (tweets) and I want to
>>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file
>>> > for each DStream, but I also want to guard against data loss from crashes.
>>> >
>>> > I suppose what I want is to let things build up into "superbatches" over
>>> > a few minutes, and then serialize those to Parquet files, or similar? Or
>>> > do I?
>>> >
>>> > Do I count down the number of DStreams, or does Spark have a preferred
>>> > way of scheduling cron-like events?
>>> >
>>> > What's the best practice for keeping persistent data for a streaming app
>>> > (across restarts)? And for cleaning up on termination?
>>> >
>>> >
>>> > --
>>> > Jeremy Lee  BCompSci(Hons)
>>> > The Unorthodox Engineers
>>
>>
>>
>> --
>> Jeremy Lee  BCompSci(Hons)
>> The Unorthodox Engineers
>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
> The Unorthodox Engineers
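P.S. On the updateStateByKey / persistence question quoted above: a minimal Scala sketch of the checkpoint-based approach is below. The word-count state, socket source, and checkpoint path are placeholder assumptions; the real app would keep its own state type and use the Twitter receiver. The point is that checkpointed state survives a driver restart via StreamingContext.getOrCreate.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCountExample {
  // Placeholder path; it must be on storage that survives restarts (e.g. HDFS/S3)
  val checkpointDir = "hdfs:///checkpoints/tweet-state"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("StatefulCountExample")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)

    // Placeholder source; the real app would use the Twitter stream
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // updateStateByKey keeps a running value per key across batches;
    // the state is checkpointed, so it is recovered after a restart.
    val counts = words.map((_, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum)
    }

    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild from the checkpoint if one exists, otherwise start fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}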