Re: problem with a very simple word count program

2015-09-16 Thread Alexander Krasheninnikov
Collect all your RDDs from the individual files into a List<RDD<String>>, then call context.union(context.emptyRDD(), YOUR_LIST). Otherwise, with a larger number of elements to union, you will get a stack overflow exception: each pairwise union nests the previous result one level deeper in the RDD lineage, whereas a single union over the whole list creates one UnionRDD with many parents.

On Wed, Sep 16, 2015 at 10:17 PM, Shawn Carroll wrote:
> Your loop is deciding the files to process
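A minimal Scala sketch of the single-union approach, assuming an existing SparkContext named sc and a hypothetical list of input paths (Scala's SparkContext.union(Seq[RDD[T]]) is the list-taking equivalent of the Java-style call above):

    import org.apache.spark.rdd.RDD

    // Hypothetical input paths; sc is an existing SparkContext.
    val paths: Seq[String] = Seq("medline15n0001.xml", "medline15n0002.xml")
    val perFile: Seq[RDD[String]] = paths.map(p => sc.textFile(p))

    // One union over the whole list builds a single UnionRDD with many parents.
    val all: RDD[String] = sc.union(perFile)

    // By contrast, folding pairwise unions nests the lineage one level per file
    // and can overflow the stack when the plan is traversed:
    // val nested = perFile.foldLeft(sc.emptyRDD[String])(_ union _)   // avoid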

Re: problem with a very simple word count program

2015-09-16 Thread Shawn Carroll
Your loop is deciding which files to process, and then you are unioning the data on each iteration. If you change it to load all the files at the same time and let Spark sort it out, it should be much faster. Untested:

    val rdd = sc.textFile("medline15n00*.xml")
    val words = rdd.flatMap(x => x.split(" "))
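For reference, a hedged sketch of the complete word count over the globbed input; the whitespace delimiter and the output path are assumptions, not from the original thread:

    val rdd = sc.textFile("medline15n00*.xml")   // one RDD spanning all matching files
    val counts = rdd
      .flatMap(line => line.split(" "))          // tokenize; delimiter is assumed
      .map(word => (word, 1))
      .reduceByKey(_ + _)                        // sum the counts per word
    counts.saveAsTextFile("wordcounts")          // hypothetical output path

Because textFile accepts a glob, Spark plans a single read over all matching files, so no explicit union is needed.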