Yes... but from what I understand that's a "sliding window", so a window of 60 seconds over 1-second DStreams would save the entire last minute of data once per second. That's more than I need.
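(Just to be concrete, this is roughly the shape I mean; a sketch only, with a socket source standing in for the Twitter receiver:)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batches

    // Placeholder source; in my case it's the Twitter receiver.
    val lines = ssc.socketTextStream("localhost", 9999)

    // 60-second window sliding every 1 second: every batch re-emits the
    // whole last minute of data, which is more than I want to persist.
    val lastMinute = lines.window(Seconds(60), Seconds(1))
    lastMinute.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}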
I think what I'm after is probably updateStateByKey... I want to mutate data structures (probably even graphs) as the stream comes in, but I also want that state to persist across restarts of the application (or parallel versions of the app, if possible). So I'd have to save that structure occasionally and reload it as the "primer" on the next run.

I was almost going to use HBase or Hive, but they seem to have been deprecated in 1.0.0? Or are they just late to the party?

Also, I've been having trouble deleting Hadoop directories... the old "two line" examples don't seem to work anymore. I actually managed to fill up the worker instances (I gave them tiny EBS volumes) and I think I crashed them.


On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo <lbust...@gmail.com> wrote:

> Have you thought of using window?
>
> Gino B.
>
> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
> >
> > It's going well enough that this is a "how should I in 1.0.0" rather than a "how do I" question.
> >
> > So I've got data coming in via Streaming (twitters) and I want to archive/log it all. It seems a bit wasteful to generate a new HDFS file for each DStream, but I also want to guard against data loss from crashes.
> >
> > I suppose what I want is to let things build up into "superbatches" over a few minutes, and then serialize those to Parquet files, or similar? Or do I?
> >
> > Do I count down the number of DStreams, or does Spark have a preferred way of scheduling cron events?
> >
> > What's the best practice for keeping persistent data for a streaming app? (Across restarts) And how do I clean up on termination?
> >
> >
> > --
> > Jeremy Lee BCompSci(Hons)
> > The Unorthodox Engineers


-- 
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
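P.S. For reference, here's a rough sketch of the updateStateByKey + checkpoint pattern I'm describing above. The checkpoint path, the socket source, and the word-count state are just placeholders for my real tweet structures; I gather StreamingContext.getOrCreate is how you reload the state after a restart.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object StateSketch {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///tmp/tweet-state"   // made-up path

    // getOrCreate rebuilds the context (and the updateStateByKey state)
    // from the checkpoint directory after a restart.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => {
      val conf = new SparkConf().setAppName("state-sketch")
      val ctx = new StreamingContext(conf, Seconds(1))
      ctx.checkpoint(checkpointDir)

      // Placeholder source and key/value shape; in my case it's tweets.
      val words = ctx.socketTextStream("localhost", 9999)
        .flatMap(_.split(" ")).map((_, 1))

      // Running count per word, carried forward across batches.
      val state = words.updateStateByKey[Int] { (newVals: Seq[Int], old: Option[Int]) =>
        Some(old.getOrElse(0) + newVals.sum)
      }
      state.print()
      ctx
    })

    ssc.start()
    ssc.awaitTermination()
  }
}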