Collect all your RDDs from the single files into a List of RDDs, then call
context.union(context.emptyRDD(), YOUR_LIST). Otherwise, with a larger number
of elements to union, you will get a stack overflow error from the deeply
nested lineage that pairwise unions build up.
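In Scala the same idea is a single sc.union over a Seq of RDDs; a minimal
sketch (the file names here are hypothetical):

val paths = Seq("medline15n0001.xml", "medline15n0002.xml")  // hypothetical paths
val rdds = paths.map(sc.textFile(_))        // one RDD per file
val all = sc.union(rdds)                    // one UnionRDD over the whole list,
                                            // not a deep chain of pairwise unions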
On Wed, Sep 16, 2015 at 10:17 PM, Shawn Carroll wrote:
Your loop is deciding the files to process and then you are unioning the
data on each iteration. If you change it to load all the files at the same
time and let Spark sort it out, you should be much faster.
Untested:
val rdd = sc.textFile("medline15n00*.xml")
val words = rdd.flatMap(x => x.split(" "))  // split each line into whitespace-delimited tokens (delimiter assumed)
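Also untested: sc.textFile accepts a directory or a comma-separated list of
paths as well as a glob, so (with hypothetical paths) either of these loads
everything as one RDD with no union loop at all:

val byDir = sc.textFile("/data/medline/")                            // whole directory
val byList = sc.textFile("medline15n0001.xml,medline15n0002.xml")    // explicit path list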