Steve, something like this should do it:

sc.parallelize(1 to 1000, 1000).flatMap(x => 1 to 100000)

The above launches 1000 tasks (one per partition), with each task generating
10^5 numbers, for a total of 10^8 (100 million) elements.
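The same expansion pattern can be sketched with plain java.util.stream (no cluster
needed) just to sanity-check the arithmetic; the outer range plays the role of the
partitions, and flatMap fans each element out into many. This is an illustration of
the counting, not Spark code:

```java
import java.util.stream.IntStream;

public class ExpandDemo {
    public static void main(String[] args) {
        // 1000 outer elements, each expanded into 100,000 values,
        // mirroring sc.parallelize(1 to 1000, 1000).flatMap(x => 1 to 100000)
        long count = IntStream.rangeClosed(1, 1000)
                .flatMap(x -> IntStream.rangeClosed(1, 100_000))
                .count();
        System.out.println(count); // 1000 * 100,000 = 100,000,000
    }
}
```

On a cluster the same idea applies: keep the parallelized seed collection tiny, and
let flatMap materialize the bulk of the data inside each task, so nothing near 10^8
elements ever has to exist in driver memory.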


On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

>  Assume I don't care about values which may be created in a later map - in
> Scala I can say
> val rdd = sc.parallelize(1 to 1000000000, numSlices = 1000)
> but in Java, JavaSparkContext can only parallelize a List - limited to
> Integer.MAX_VALUE elements and required to exist in memory - the best I can
> do on memory is to build my own List backed by a BitSet.
> Is there a JIRA asking for JavaSparkContext.parallelize to take an
> Iterable or an Iterator?
> I am trying to make an RDD with at least 100 million elements, and if
> possible several billion, to test performance issues on a large application
>
