If I may add my contribution to this discussion, assuming I understand your
question correctly...

A DStream is a discretized stream: it discretizes the data stream over
windows of time (according to the project code and the paper I've read). So
when you write:

JavaStreamingContext stcObj = new JavaStreamingContext(confObj,
        new Duration(60 * 60 * 1000)); // 1 hour

it means you are discretizing over a 1-hour window. Each batch, and hence
each RDD of the DStream, will collect data for 1 hour before moving on to
the next RDD. So if you want more RDDs, you should reduce the batch
size/duration, as in the sketch below...
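
For example, here is a minimal sketch (assuming the same confObj as in your
snippet and an illustrative input path) where a 5-minute batch duration
splits one hour of input across roughly twelve RDDs instead of one:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Same setup as above, but with a 5-minute batch duration: one RDD is
// produced per batch interval, so one hour of input spans ~12 RDDs.
JavaStreamingContext stcObj = new JavaStreamingContext(confObj,
        new Duration(5 * 60 * 1000)); // 5 minutes
JavaDStream<String> lines = stcObj.textFileStream("/Users/path/to/Input");
// Print how many records each batch's RDD collected.
lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) {
        System.out.println("Records in this batch: " + rdd.count());
        return null;
    }
});
stcObj.start();
stcObj.awaitTermination();

Each sysout line then corresponds to one batch/RDD, so over an hour of
input you should see about twelve of them rather than one.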

Pascal


On Thu, Mar 20, 2014 at 7:51 AM, Tathagata Das
<tathagata.das1...@gmail.com> wrote:

> That is a good question. If I understand correctly, you need multiple RDDs
> from a DStream in *every batch*. Can you elaborate on why you need
> multiple RDDs every batch?
>
> TD
>
>
> On Wed, Mar 19, 2014 at 10:20 PM, Sanjay Awatramani
> <sanjay_a...@yahoo.com> wrote:
>
>> Hi,
>>
>> As I understand it, a DStream consists of one or more RDDs, and foreachRDD
>> will run a given function on each and every RDD inside the DStream.
>>
>> I created a simple program which reads log files from a folder every hour:
>> JavaStreamingContext stcObj = new JavaStreamingContext(confObj,
>>         new Duration(60 * 60 * 1000)); // 1 hour
>> JavaDStream<String> obj = stcObj.textFileStream("/Users/path/to/Input");
>>
>> When the interval is reached, Spark reads all the files and creates one
>> and only one RDD (as I verified from a sysout inside foreachRDD).
>>
>> The streaming doc in several places indicates that many operations
>> (e.g. flatMap) on a DStream are applied individually to each RDD, and the
>> resulting DStream consists of the same number of mapped RDDs as the input
>> DStream.
>> ref:
>> https://spark.apache.org/docs/latest/streaming-programming-guide.html#dstreams
>>
>> If that is the case, how can I generate a scenario in which I have
>> multiple RDDs inside a DStream in my example?
>>
>> Regards,
>> Sanjay
>>
>
>
