I'm not sure this is helpful, but could you use one of these?:

https://github.com/dfdx/Spark.jl

css.csail.mit.edu/6.824/2014/projects/dennisw.pdf

I'm not familiar with PySpark or the above, so I'm not sure what the 
problem with scalability is or if this helps.
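In case it's useful anyway, the Spark.jl README shows usage roughly along 
these lines. I haven't run this myself, and the input path is just the 
README's illustrative example, so treat the exact calls as something to 
verify against the current README:

    using Spark
    Spark.init()
    sc = SparkContext(master="local")

    # illustrative input -- substitute your own data source
    txt = text_file(sc, "file:///var/log/syslog")

    # run a plain Julia function over each record of the RDD
    lens = map(txt, line -> length(split(line)))

    # pull the results back to the driver
    collect(lens)

Whether it supports the kind of group-by-key operation you describe is 
something you'd have to check.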


On Thursday, November 3, 2016 at 1:45:31 AM UTC, Harish Kumar wrote:
>
> I have an RDD with 10K columns and 70 million rows; the 70 MM rows will be 
> grouped into 2,000-3,000 groups based on a key attribute. I followed the 
> steps below: 
>
> 1. Julia and PySpark are linked using the pyjulia package.
> 2. The 70 MM-row RDD is grouped by key and each group is passed to a 
>    Julia function: 
>
>     import julia
>
>     def juliaCall(x):
>         # <<convert x (list of rows) to a list of lists, inputdata>>
>         j = julia.Julia()
>         jcode = """     """
>         calc = j.eval(jcode)
>         return calc(inputdata)
>
>     result = RDD.groupBy(key).map(lambda x: juliaCall(x))
>
> It works fine for a key (group) with 50K records, but each of my groups has 
> 100K to 3M records; in such cases the shuffle grows and the job fails. 
> Can anyone guide me on how to overcome this issue? 
> I have a cluster of 10 nodes, each with 116 GB of memory and 16 cores. Spark 
> runs in standalone mode and I allocated only 10 cores per node. 
>
> Any help?
>
