I'm not sure whether this is helpful, but could you use https://github.com/dfdx/Spark.jl?
There is also a related write-up: css.csail.mit.edu/6.824/2014/projects/dennisw.pdf

I'm not familiar with PySpark or with either of the above, so I'm not sure what the scalability problem is or whether this helps. Below the quoted message is a rough sketch of how I read the setup you describe.

On Thursday, November 3, 2016 at 1:45:31 AM UTC, Harish Kumar wrote:
>
> I have an RDD with 10K columns and 70 million rows; the 70 MM rows will be
> grouped into 2000-3000 groups based on a key attribute. I followed the
> steps below:
>
> 1. Julia and PySpark are linked using the pyjulia package
> 2. The 70 MM row RDD is grouped with groupByKey:
>
>     def juliaCall(x):
>         <<convert x (list of rows) to list of list inputdata>>
>         j = julia.Julia()
>         jcode = """ """
>         calc = j.eval(jcode)
>         result = calc(inputdata)
>         return result
>
>     RDD.groupBy(key).map(lambda x: juliaCall(x))
>
> It works fine for a key (or group) with 50K records, but each of my groups
> has 100K to 3M records. In such cases there is more shuffle and it fails.
> Can anyone guide me to overcome this issue?
>
> I have a cluster of 10 nodes; each node has 116 GB and 16 cores. Standalone
> mode, and I allocated only 10 cores per node.
>
> Any help?
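For what it's worth, here is a self-contained sketch of the pattern you describe, written out so it runs end to end on a tiny synthetic RDD. I don't use PySpark or pyjulia myself, so treat everything that isn't in your message as an assumption: the SparkContext setup, the toy data, the r[0] key extraction, and the placeholder calc function are all stand-ins for your real job, not a tested implementation.

    import julia
    from pyspark import SparkContext

    # Assumption: the real master URL / memory settings come from spark-submit.
    sc = SparkContext(appName="julia-per-group")

    # Tiny synthetic stand-in for the 70M-row, 10K-column RDD:
    # each element is a row whose first field is the grouping key.
    rdd = sc.parallelize([(k, i, i * 2) for k in range(3) for i in range(5)])

    def julia_call(key_and_rows):
        # groupBy below yields (key, iterable-of-rows) pairs.
        key, rows = key_and_rows
        # Convert the iterable of rows into a list of lists for Julia.
        inputdata = [list(r) for r in rows]
        # This mirrors the original code: a Julia runtime is started inside
        # the call, once per group, which is expensive for many/large groups.
        j = julia.Julia()
        jcode = """
        function calc(inputdata)
            # placeholder for the real per-group computation
            return length(inputdata)
        end
        """
        calc = j.eval(jcode)
        return (key, calc(inputdata))

    # r[0] stands in for the real key attribute.
    results = rdd.groupBy(lambda r: r[0]).map(julia_call)
    print(results.collect())

As written, each group's rows are pulled onto a single task by groupBy, which matches the behaviour you describe: fine at ~50K rows per key, but heavy on shuffle and memory once a group reaches 100K-3M rows.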