Without the sc.union, my program crashes with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndInde

sc.textFile already returns a single RDD covering all of your files, so
the sc.union is unnecessary, although I don't know whether it adds any
overhead. The data is certainly processed in parallel; how it is
parallelized depends on where the data is, that is, on how many
InputSplits Hadoop produces for the files.
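
As a rough sketch of what I mean (not your actual program): the paths, the
app name, and the local master below are made up for illustration. It
relies on sc.textFile accepting a comma-separated list of paths and globs,
and prints how many partitions the resulting single RDD has.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextFileSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, used here only so the sketch is self-contained.
    val conf = new SparkConf().setAppName("textfile-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical paths. One call yields one RDD spanning every matched file,
      // so no sc.union over per-file RDDs is needed.
      val lines = sc.textFile("data/part-a.txt,data/logs/*.txt")

      // Each partition corresponds to an InputSplit computed by Hadoop's input format.
      println(s"partitions: ${lines.partitions.length}")
      println(s"lines: ${lines.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

The partition count it prints is the upper bound on how many tasks can read
the input at once, so that is the number to look at when checking
parallelism.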
If y