Spark Random Forest Memory issues

2016-02-19 Thread Ewan Higgs
Hi all, Back in September there was a bunch of machine learning profile results published here: https://github.com/szilard/benchm-ml/ Spark's Random Forest seemed to fall down with memory issues at about 10m entries: https://github.com/szilard/benchm-ml/blob/master/2-rf/5c-spark-crash.txt It

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-08 Thread Ewan Higgs
when it is done. On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs wrote: Jonathan, Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quota. A 1.5G

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Ewan Higgs
Jonathan, Did you ever get to the bottom of this? I have some users working with Spark in a classroom setting and our example notebooks run into problems where there is so much spilled to disk that they run out of quota. A 1.5G input set becomes >30G of spilled data on disk. I looked into how I

Re: Checkpointing not removing shuffle files from local disk

2015-12-03 Thread Ewan Higgs
Hi all, We are running a class with PySpark notebooks for data analysis. Some of the notebooks are fairly long and have a lot of operations. Through the course of the notebook, the shuffle storage expands considerably and often exceeds quota (e.g. 1.5GB input expands to 24GB in shuffle files). Closing a

Re: How to increase the Json parsing speed

2015-08-28 Thread Ewan Higgs
Hi Gavin, You can increase the speed by choosing a better encoding. A little bit of ETL goes a long way. e.g. As you're working with Spark SQL you probably have a tabular format. So you could use CSV so you don't need to parse the field names on each entry (and it will also reduce the file s

Re: RECEIVED SIGNAL 15: SIGTERM

2015-07-13 Thread Ewan Higgs
cess small files. I do have 32 threads per executor on some tasks but 32meg for stack & thread overhead should do. Maybe the issue is sockets or some mem leak of network communication. On 13/07/15 09:15, Ewan Higgs wrote: It depends on how large the xml files are and how you're processing t

Re: Spark TeraSort source request

2015-04-13 Thread Ewan Higgs
ote: Thank you for your response Ewan. I quickly looked yesterday and it was there, but today at work I tried to open it again to start working on it, but it appears to be removed. Is this correct? Thanks, Tom On 12 April 2015 at 06:58, Ewan Higgs <ewan.hi...@ugent.be> wrote:

Re: Spark TeraSort source request

2015-04-12 Thread Ewan Higgs
Hi all. The code is linked from my repo: https://github.com/ehiggs/spark-terasort "This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch, but it is not the same TeraSort program that curren