Thanks TD.
BTW - If I have input file ~ 250 GBs - Is there any guideline on whether to use:
* a single input (250 GB) (in this case is there any max upper bound)
or
* split into 1000 files each of 250 MB (hdfs block size is 250 MB) or
* a multiple of hdfs block size.
Mans
On Friday, July 11, 2014 4:38 PM, Tathagata Das <[email protected]>
wrote:
The model for file stream is to pick up and process new files written
atomically (by move) into a directory. So your file is being processed in a
single batch, and then its waiting for any new files to be written into that
directory.
TD
On Fri, Jul 11, 2014 at 11:46 AM, M Singh <[email protected]> wrote:
So, is it expected for the process to generate stages/tasks even after
processing a file ?
>
>
>Also, is there a way to figure out the file that is getting processed and when
>that process is complete ?
>
>
>Thanks
>
>
>
>
>On Friday, July 11, 2014 1:51 PM, Tathagata Das <[email protected]>
>wrote:
>
>
>
>Whenever you need to do a shuffle=based operation like reduceByKey,
>groupByKey, join, etc., the system is essentially redistributing the data
>across the cluster and it needs to know how many parts should it divide the
>data into. Thats where the default parallelism is used.
>
>
>TD
>
>
>
>On Fri, Jul 11, 2014 at 3:16 AM, M Singh <[email protected]> wrote:
>
>Hi TD:
>>
>>
>>The input file is on hdfs.
>>
>>
>>The file is approx 2.7 GB and when the process starts, there are 11 tasks
>>(since hdfs block size is 256M) for processing and 2 tasks for reduce by key.
>> After the file has been processed, I see new stages with 2 tasks that
>>continue to be generated. I understand this value (2) is the default value
>>for spark.default.parallelism but don't quite understand how is the value
>>determined for generating tasks for reduceByKey, how is it used besides
>>reduceByKey and what should be the optimal value for this.
>>
>>
>>Thanks.
>>
>>
>>
>>On Thursday, July 10, 2014 7:24 PM, Tathagata Das
>><[email protected]> wrote:
>>
>>
>>
>>How are you supplying the text file?
>>
>>
>>
>>On Wed, Jul 9, 2014 at 11:51 AM, M Singh <[email protected]> wrote:
>>
>>Hi Folks:
>>>
>>>
>>>
>>>I am working on an application which uses spark streaming (version 1.1.0
>>>snapshot on a standalone cluster) to process text file and save counters in
>>>cassandra based on fields in each row. I am testing the application in two
>>>modes:
>>>
>>> * Process each row and save the counter in cassandra. In this scenario
>>> after the text file has been consumed, there is no task/stages seen in the
>>> spark UI.
>>>
>>> * If instead I use reduce by key before saving to cassandra, the spark
>>> UI shows continuous generation of tasks/stages even after processing the
>>> file has been completed.
>>>
>>>I believe this is because the reduce by key requires merging of data from
>>>different partitions. But I was wondering if anyone has any
>>>insights/pointers for understanding this difference in behavior and how to
>>>avoid generating tasks/stages when there is no data (new file) available.
>>>
>>>
>>>Thanks
>>>
>>>Mans
>>
>>
>>
>
>
>