Yeah... I haven't tried it, but if you set slideDuration == windowDuration, that should prevent overlaps.
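For what it's worth, here's a rough Scala sketch of that idea (the socket source, 1-second batch interval, sixty-second durations, and output path are just placeholders for illustration, not your actual app):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TumblingWindowExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TumblingWindowExample")
    // 1-second batches, as discussed below
    val ssc = new StreamingContext(conf, Seconds(1))

    // Placeholder source; the real app would use the Twitter receiver
    val lines = ssc.socketTextStream("localhost", 9999)

    // windowDuration == slideDuration gives non-overlapping "tumbling" windows:
    // each record lands in exactly one 60-second window.
    val batched = lines.window(Seconds(60), Seconds(60))

    // Archive each 60-second "superbatch" as one HDFS output instead of one per DStream batch
    batched.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        rdd.saveAsTextFile(s"hdfs:///archive/tweets-${System.currentTimeMillis()}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}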
Gino B.

> On Jun 8, 2014, at 8:25 AM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>
> I read it more carefully, and window() might actually work for some other
> stuff like logs. (Assuming I can have multiple windows with entirely
> different attributes on a single stream...)
>
> Thanks for that!
>
>
>> On Sun, Jun 8, 2014 at 11:11 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>>
>> Yes.. but from what I understand that's a "sliding window", so for a window
>> of (60) over (1)-second DStreams, that would save the entire last minute of
>> data once per second. That's more than I need.
>>
>> I think what I'm after is probably updateStateByKey... I want to mutate data
>> structures (probably even graphs) as the stream comes in, but I also want
>> that state to be persistent across restarts of the application (or a parallel
>> version of the app, if possible), so I'd have to save that structure
>> occasionally and reload it as the "primer" on the next run.
>>
>> I was almost going to use HBase or Hive, but they seem to have been
>> deprecated in 1.0.0? Or just late to the party?
>>
>> Also, I've been having trouble deleting Hadoop directories.. the old "two
>> line" examples don't seem to work anymore. I actually managed to fill up the
>> worker instances (I gave them tiny EBS volumes) and I think I crashed them.
>>
>>
>>
>>> On Sat, Jun 7, 2014 at 10:23 PM, Gino Bustelo <lbust...@gmail.com> wrote:
>>>
>>> Have you thought of using window?
>>>
>>> Gino B.
>>>
>>> > On Jun 6, 2014, at 11:49 PM, Jeremy Lee <unorthodox.engine...@gmail.com> wrote:
>>> >
>>> > It's going well enough that this is a "how should I in 1.0.0" rather than
>>> > a "how do I" question.
>>> >
>>> > So I've got data coming in via Streaming (tweets) and I want to
>>> > archive/log it all. It seems a bit wasteful to generate a new HDFS file
>>> > for each DStream, but I also want to guard against data loss from crashes.
>>> >
>>> > I suppose what I want is to let things build up into "superbatches" over
>>> > a few minutes, and then serialize those to Parquet files, or similar? Or
>>> > do I?
>>> >
>>> > Do I count down the number of DStreams, or does Spark have a preferred
>>> > way of scheduling cron-like events?
>>> >
>>> > What's the best practice for keeping persistent data for a streaming app
>>> > (across restarts)? And for cleaning up on termination?
>>> >
>>> >
>>> > --
>>> > Jeremy Lee  BCompSci(Hons)
>>> > The Unorthodox Engineers
>>
>>
>>
>> --
>> Jeremy Lee  BCompSci(Hons)
>> The Unorthodox Engineers
>
>
>
> --
> Jeremy Lee  BCompSci(Hons)
> The Unorthodox Engineers
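P.S. On the updateStateByKey / persistence question quoted above: a minimal Scala sketch of the checkpoint-based approach is below. The word-count state, socket source, and checkpoint path are placeholder assumptions; the real app would keep its own state type and use the Twitter receiver. The point is that checkpointed state survives a driver restart via StreamingContext.getOrCreate.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCountExample {
  // Placeholder path; it must be on storage that survives restarts (e.g. HDFS/S3)
  val checkpointDir = "hdfs:///checkpoints/tweet-state"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("StatefulCountExample")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)

    // Placeholder source; the real app would use the Twitter stream
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    // updateStateByKey keeps a running value per key across batches;
    // the state is checkpointed, so it is recovered after a restart.
    val counts = words.map((_, 1)).updateStateByKey[Int] {
      (newValues: Seq[Int], state: Option[Int]) =>
        Some(state.getOrElse(0) + newValues.sum)
    }

    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild from the checkpoint if one exists, otherwise start fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}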