Yes... but from what I understand that's a "sliding window", so for a window
of 60 seconds over 1-second DStreams, that would save the entire last minute
of data once per second. That's more than I need.
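
For reference, I gather that a window whose slide interval equals the
window length (a "tumbling" window) would only write each record once per
minute, rather than re-saving the whole last minute every second. A rough
sketch of what I mean, with a plain socket stream standing in for the
Twitter receiver and made-up paths:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch only -- the app name, socket source, and HDFS path are placeholders.
    val conf = new SparkConf().setAppName("TweetArchiveSketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    val tweets = ssc.socketTextStream("localhost", 9999) // stand-in for the Twitter stream

    // Tumbling window: slide interval == window length, so each record is saved exactly once.
    val batched = tweets.window(Seconds(60), Seconds(60))
    batched.saveAsTextFiles("hdfs:///archive/tweets")

    ssc.start()
    ssc.awaitTermination()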

I think what I'm after is probably updateStateByKey... I want to mutate
data structures (probably even graphs) as the stream comes in, but I also
want that state to be persistent across restarts of the application (or a
parallel version of the app, if possible). So I'd have to save that
structure occasionally and reload it as the "primer" on the next run.
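
Something like the following is what I'm picturing, with a simple running
word count standing in for the real graph structures (the checkpoint
directory and socket source are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._ // pair-DStream ops like updateStateByKey

    // Sketch only -- paths and the socket source are placeholders.
    val conf = new SparkConf().setAppName("StatefulSketch")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("hdfs:///checkpoints/stateful-sketch") // required for updateStateByKey

    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val pairs = words.map(w => (w, 1))

    // Carry a running count per key across batches; the real app would carry a graph instead.
    val counts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)
    }
    counts.print()

    ssc.start()
    ssc.awaitTermination()

If I'm reading the docs right, restart recovery would then come from
building the context with StreamingContext.getOrCreate pointed at the same
checkpoint directory, rather than saving and reloading the structure by
hand.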

I was almost going to use HBase or Hive, but they seem to have been
deprecated in 1.0.0? Or just late to the party?

Also, I've been having trouble deleting Hadoop directories... the old
"two-line" examples don't seem to work anymore. I actually managed to fill
up the worker instances (I gave them tiny EBS volumes) and I think I
crashed them.
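
The sort of "two-line" thing I mean is the plain Hadoop FileSystem call,
roughly along these lines (the namenode URI and path here are made up):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Sketch only -- the namenode URI and path are placeholders.
    val fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration())
    fs.delete(new Path("/archive/tweets-old"), true) // second argument = recursive delete
    fs.close()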



On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo <lbust...@gmail.com> wrote:

> Have you thought of using window?
>
> Gino B.
>
> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee <unorthodox.engine...@gmail.com>
> wrote:
> >
> >
> > It's going well enough that this is a "how should I in 1.0.0" rather
> than "how do I" question.
> >
> > So I've got data coming in via Streaming (twitters) and I want to
> archive/log it all. It seems a bit wasteful to generate a new HDFS file for
> each DStream, but I also want to guard against data loss from crashes.
> >
> > I suppose what I want is to let things build up into "superbatches" over
> a few minutes, and then serialize those to Parquet files, or similar? Or do
> I?
> >
> > Do I count-down the number of DStreams, or does Spark have a preferred
> way of scheduling cron events?
> >
> > What's the best practise for keeping persistent data for a streaming
> app? (Across restarts) And to clean up on termination?
> >
> >
> > --
> > Jeremy Lee  BCompSci(Hons)
> >   The Unorthodox Engineers
>



-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers
