Re: Processing multiple files in parallel

2014-08-19 Thread SK
Without the sc.union, my program crashes with the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndInde
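For context, a minimal sketch of the pattern being described, assuming each file is read with its own sc.textFile call and the results are combined with sc.union (the paths and the final count step are hypothetical, not from the original message):

import org.apache.spark.{SparkConf, SparkContext}

object UnionExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("UnionExample"))

    // Hypothetical input paths: one RDD per file, combined with sc.union.
    val paths = Seq("hdfs:///data/part1.txt", "hdfs:///data/part2.txt")
    val perFile = paths.map(p => sc.textFile(p))
    val combined = sc.union(perFile)

    println(combined.count())  // e.g. total number of lines across all files
    sc.stop()
  }
}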

Re: Processing multiple files in parallel

2014-08-19 Thread Sean Owen
sc.textFile already returns just one RDD for all of your files, so the sc.union is unnecessary, although I don't know whether it adds any overhead. The data is certainly processed in parallel; how it is parallelized depends on where the data is, i.e. how many InputSplits Hadoop produces for it. If y
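To illustrate the point above: a single sc.textFile call can cover all the files directly, since it accepts comma-separated paths and glob patterns, which makes the explicit union step unnecessary. A minimal sketch, with hypothetical paths:

import org.apache.spark.{SparkConf, SparkContext}

object SingleRDDExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SingleRDDExample"))

    // One textFile call covers all files; Hadoop decides the InputSplits,
    // which in turn determine the RDD's partitions and the parallelism.
    val lines = sc.textFile("hdfs:///data/part1.txt,hdfs:///data/part2.txt")
    // A glob works too: sc.textFile("hdfs:///data/part*.txt")

    println(lines.count())
    sc.stop()
  }
}

The number of partitions in the resulting RDD follows the number of InputSplits Hadoop produces for the input, which is what drives the parallelism described above.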