That depends on your requirements. If you want to process the 250 GB input
file as a "stream" to emulate a stream of data, then it should be split
into files (such that event ordering is maintained across those splits, if
necessary), and then those splits should be moved one-by-one into the
monitored directory.
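A minimal sketch of that staged-move approach might look like the following;
the paths, file layout, and pacing interval are placeholders, not values from
this thread. It relies on an HDFS rename being atomic, so each split becomes
visible to the file stream only once it is complete.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FeedSplits {
  def main(args: Array[String]): Unit = {
    val staging   = new Path("hdfs:///data/staging")      // hypothetical staging dir for the splits
    val monitored = new Path("hdfs:///data/stream-input") // directory watched by the file stream

    val fs: FileSystem = staging.getFileSystem(new Configuration())

    // Move the pre-written splits one at a time; the rename is atomic on HDFS,
    // so the streaming job never sees a partially written file.
    for (status <- fs.listStatus(staging).sortBy(_.getPath.getName)) {
      fs.rename(status.getPath, new Path(monitored, status.getPath.getName))
      Thread.sleep(10000L) // pacing between "events"; the interval is arbitrary
    }
  }
}

Running this after the splits have been written to the staging directory feeds
them to the streaming job one at a time, which is what emulates the stream.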
Thanks TD.
BTW - if I have an input file of ~250 GB, is there any guideline on whether to use:
* a single input file (250 GB) - and in this case, is there any maximum upper bound?
* a split into 1000 files of 250 MB each (the HDFS block size is 250 MB), or
* a multiple of the HDFS block size?
The model for the file stream is to pick up and process new files written
atomically (by a move) into a directory. So your file is being processed in a
single batch, and then it's waiting for any new files to be written into
that directory.
TD
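For reference, a bare-bones file-stream job along these lines could look like
the sketch below; the directory path and batch interval are placeholders, not
values from this thread.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("file-stream-example")
    val ssc  = new StreamingContext(conf, Seconds(30)) // batch interval is arbitrary

    // Each batch picks up files that were moved into this directory since the
    // last batch; files already present when the job starts are not reprocessed.
    val lines = ssc.textFileStream("hdfs:///data/stream-input")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}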
On Fri, Jul 11, 2014 at 11:46 AM, M Singh wrote:
So, is it expected for the process to generate stages/tasks even after it has
finished processing a file?
Also, is there a way to figure out which file is being processed and when
that processing is complete?
Thanks
On Friday, July 11, 2014 1:51 PM, Tathagata Das wrote:
Whenever you need to do a shuffle-based operation like reduceByKey,
groupByKey, join, etc., the system is essentially redistributing the data
across the cluster, and it needs to know how many parts it should divide the
data into. That's where the default parallelism is used.
TD
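As a concrete illustration of where that partition count comes from, here is a
small sketch (the configuration value, path, and partition counts are arbitrary
examples): reduceByKey without an explicit count falls back to
spark.default.parallelism, while passing a count overrides it for that shuffle.

import org.apache.spark.{SparkConf, SparkContext}

object ParallelismExample {
  def main(args: Array[String]): Unit = {
    // spark.default.parallelism is what a shuffle falls back to when no
    // explicit partition count is supplied (the value 8 is arbitrary).
    val conf = new SparkConf()
      .setAppName("parallelism-example")
      .set("spark.default.parallelism", "8")
    val sc = new SparkContext(conf)

    val pairs = sc.textFile("hdfs:///data/input.txt") // placeholder path
      .map(line => (line.split(",")(0), 1L))

    // Uses the default parallelism for the number of reduce-side tasks.
    val countsDefault = pairs.reduceByKey(_ + _)

    // Overrides the default with an explicit partition count for this shuffle.
    val countsExplicit = pairs.reduceByKey(_ + _, 16)

    println(countsDefault.partitions.length)  // 8, from spark.default.parallelism
    println(countsExplicit.partitions.length) // 16
    sc.stop()
  }
}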
On Fri, Jul 11, 2014, M Singh wrote:
Hi TD:
The input file is on HDFS.
The file is approx 2.7 GB, and when the process starts there are 11 tasks
(since the HDFS block size is 256 MB) for processing and 2 tasks for reduceByKey.
After the file has been processed, I see new stages with 2 tasks that continue
to be generated. I underst
How are you supplying the text file?
On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote:
Hi Folks:
I am working on an application which uses Spark Streaming (version 1.1.0
snapshot on a standalone cluster) to process a text file and save counters in
Cassandra based on fields in each row. I am testing the application in two
modes:
* Process each row and save the counter in Cassandra
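A rough sketch of the kind of pipeline described above, assuming
comma-delimited rows and a made-up field layout; the Cassandra write is left as
a stand-in (e.g. done via the spark-cassandra-connector), since the actual
application code is not part of this thread.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FieldCounters {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("field-counters")
    val ssc  = new StreamingContext(conf, Seconds(60)) // batch interval is arbitrary

    // Placeholder input directory; files are moved here to be picked up.
    val rows = ssc.textFileStream("hdfs:///data/stream-input")

    // Count rows per value of the first field (a hypothetical layout).
    val counters = rows
      .map(row => (row.split(",")(0), 1L))
      .reduceByKey(_ + _)

    counters.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Stand-in for persisting the counters to Cassandra; the real write
        // would go here, one connection/batch per partition.
        partition.foreach { case (field, count) => println(s"$field -> $count") }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}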