I read it more carefully, and window() might actually work for some other stuff, like logs (assuming I can have multiple windows with entirely different attributes on a single stream).
Thanks for that!

On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:

> Yes.. but from what I understand that's a "sliding window" so for a window
> of (60) over (1) second DStreams, that would save the entire last minute of
> data once per second. That's more than I need.
>
> I think what I'm after is probably updateStateByKey... I want to mutate
> data structures (probably even graphs) as the stream comes in, but I also
> want that state to be persistent across restarts of the application, (Or
> parallel version of the app, if possible) So I'd have to save that
> structure occasionally and reload it as the "primer" on the next run.
>
> I was almost going to use HBase or Hive, but they seem to have been
> deprecated in 1.0.0? Or just late to the party?
>
> Also, I've been having trouble deleting hadoop directories.. the old "two
> line" examples don't seem to work anymore. I actually managed to fill up
> the worker instances (I gave them tiny EBS) and I think I crashed them.
>
>
>
> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo <lbust...@gmail.com> wrote:
>
>> Have you thought of using window?
>>
>> Gino B.
>>
>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>> >
>> >
>> > It's going well enough that this is a "how should I in 1.0.0" rather
>> > than "how do i" question.
>> >
>> > So I've got data coming in via Streaming (twitters) and I want to
>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file for
>> > each DStream, but also I want to guard against data loss from crashes,
>> >
>> > I suppose what I want is to let things build up into "superbatches"
>> > over a few minutes, and then serialize those to parquet files, or similar?
>> > Or do i?
>> >
>> > Do I count-down the number of DStreams, or does Spark have a preferred
>> > way of scheduling cron events?
>> >
>> > What's the best practise for keeping persistent data for a streaming
>> > app? (Across restarts) And to clean up on termination?
>> >
>> >
>> > --
>> > Jeremy Lee BCompSci(Hons)
>> > The Unorthodox Engineers
>
>
>
> --
> Jeremy Lee BCompSci(Hons)
> The Unorthodox Engineers
>

--
Jeremy Lee BCompSci(Hons)
The Unorthodox Engineers
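
A footnote on the window() question in this thread. Below is a rough plain-Python model (not Spark code) of the difference between a window that slides by one batch and one that slides by its own length. The helper name `windows` and the batch labels are made up for illustration. In Spark Streaming the corresponding call is `DStream.window(windowDuration, slideDuration)`; as far as I understand it, each call returns a new DStream, so several windows with different settings can hang off the same source stream.

```python
from typing import List, Sequence


def windows(batches: Sequence[str], window_len: int, slide: int) -> List[List[str]]:
    """Hypothetical model of window semantics over a list of batches.

    Every `slide` batches, emit the most recent `window_len` batches
    (fewer at the start, before a full window has accumulated).
    """
    out = []
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)
        out.append(list(batches[start:end]))
    return out


# Six 1-second batches from one source stream (labels are illustrative).
batches = [f"b{i}" for i in range(6)]

# Two independent windows over the same source, with different settings.
sliding = windows(batches, window_len=3, slide=1)   # overlapping windows
tumbling = windows(batches, window_len=3, slide=3)  # non-overlapping "superbatches"

# How many batch copies each approach would hand to the archiving step.
written_sliding = sum(len(w) for w in sliding)
written_tumbling = sum(len(w) for w in tumbling)

print("sliding :", sliding, "->", written_sliding, "batch copies")
print("tumbling:", tumbling, "->", written_tumbling, "batch copies")
```

With the slide equal to the window length, each batch lands in exactly one "superbatch", which is the archiving behaviour asked about; with a slide of 1, each batch is rewritten up to `window_len` times.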