Steve, something like this should do it:

    sc.parallelize(1 to 1000, 1000).flatMap(x => 1 to 100000)

The above launches 1000 tasks (one per partition), with each task generating
10^5 numbers, for a total of 100 million elements. Only the 1000 seed values
live in driver memory; the expansion happens inside the executors.
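Since your question is specifically about the Java API, here is a rough Java
equivalent of the same trick (an untested sketch against the 1.x Java API,
assuming sc is an already-constructed JavaSparkContext):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    // 1000 seed integers in the driver, one per partition -- this is the
    // only part that has to fit in a java.util.List
    List<Integer> seeds = new ArrayList<Integer>();
    for (int i = 0; i < 1000; i++) {
      seeds.add(i);
    }

    // assumes sc is your existing JavaSparkContext
    JavaRDD<Integer> big = sc.parallelize(seeds, 1000).flatMap(
        new FlatMapFunction<Integer, Integer>() {
          // expand each seed into 10^5 elements inside the task, so the
          // full 100 million elements never touch driver memory
          public Iterable<Integer> call(Integer seed) {
            List<Integer> out = new ArrayList<Integer>(100000);
            for (int j = 1; j <= 100000; j++) {
              out.add(j);
            }
            return out;
          }
        });

Each task still materializes its own 10^5-element list before Spark iterates
it, but 100k Integers per task is modest; returning a lazy Iterable would
avoid even that if you need bigger per-task expansions.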
On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> Assume I don't care about values which may be created in a later map. In
> Scala I can say
>
>   val rdd = sc.parallelize(1 to 1000000000, numSlices = 1000)
>
> but in Java JavaSparkContext can only parallelize a List, which is limited
> to Integer.MAX_VALUE elements and is required to exist in memory; the best
> I can do on memory is to build my own List based on a BitSet.
> Is there a JIRA asking for JavaSparkContext.parallelize to take an
> Iterable or an Iterator?
> I am trying to make an RDD with at least 100 million elements, and if
> possible several billion, to test performance issues on a large
> application.