Re: Problem with take vs. takeSample in PySpark

2015-08-10 Thread Davies Liu
I tested this in master (1.5 release); it worked as expected (after changing spark.driver.maxResultSize to 10m):

>>> len(sc.range(10).map(lambda i: '*' * (1<<23)).take(1))
1
>>> len(sc.range(10).map(lambda i: '*' * (1<<24)).take(1))
15/08/10 10:45:55 ERROR TaskSetManager: Total size of serialized resul
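The session above behaves that way because each task's serialized result is checked against spark.driver.maxResultSize before being returned to the driver. A minimal sketch of the arithmetic (assuming Spark parses "10m" as 10 MiB, i.e. 1024-based units):

```python
# Why take(1) succeeds for a 1<<23-byte string but fails for 1<<24
# when spark.driver.maxResultSize is set to "10m".
limit = 10 * 1024 * 1024          # assumed: "10m" -> 10 MiB

small = len('*' * (1 << 23))      # 8 MiB string, under the limit
large = len('*' * (1 << 24))      # 16 MiB string, over the limit

print(small, small < limit)       # 8388608 True  -> take(1) returns it
print(large, large < limit)       # 16777216 False -> TaskSetManager error
```

So the error in the log is the expected guard firing, not a bug in take itself.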

Problem with take vs. takeSample in PySpark

2015-08-10 Thread David Montague
Hi all, I am getting some strange behavior with the RDD take function in PySpark while doing some interactive coding in an IPython notebook. I am running PySpark on Spark 1.2.0 in yarn-client mode, if that is relevant. I am using sc.wholeTextFiles and pandas to load a collection of .csv files int
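The loading approach the post describes can be sketched roughly as follows. This is an assumption about the poster's setup, not their actual code; the function name and glob pattern are hypothetical.

```python
# Hypothetical sketch: sc.wholeTextFiles yields (path, contents) pairs,
# and pandas parses each file's contents from an in-memory buffer.
import io

import pandas as pd


def parse_csv(path_and_text):
    """Turn one (path, file contents) pair into a pandas DataFrame."""
    path, text = path_and_text
    return pd.read_csv(io.StringIO(text))


# On a live SparkContext this would look something like:
# dfs = sc.wholeTextFiles("data/*.csv").map(parse_csv)
# first = dfs.take(1)
```

Because each element of such an RDD is an entire parsed file, a take over it can easily produce results large enough to trip spark.driver.maxResultSize, which is consistent with the behavior discussed in the reply above.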