Hi,
Affects Version/s: 1.6.0
Component/s: PySpark
I faced the exception below when I tried to run the samples at:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=filter#pyspark.sql.SQLContext.jsonRDD
Exception: Python in worker has different version 2.7 than that in driver
3.5, PySpark cannot run with different minor versions
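This error means the executors picked up a different Python interpreter than the driver. Below is a hedged sketch of one common fix, not necessarily the reporter's exact setup: point the workers at the same interpreter as the driver before the SparkContext is created. The "python3.5" binary name, the app name, and the sample JSON are assumptions for illustration.

    # Assumed fix: make workers use the same interpreter as the driver.
    # "python3.5" must exist on every node; adjust to your environment.
    # (PYSPARK_DRIVER_PYTHON can likewise be exported in the shell before
    # launching spark-submit or pyspark.)
    import os
    os.environ["PYSPARK_PYTHON"] = "python3.5"

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="jsonRDD-sample")      # app name is illustrative
    sqlContext = SQLContext(sc)

    # The jsonRDD sample from the linked docs should now run without the
    # version-mismatch error.
    df = sqlContext.jsonRDD(sc.parallelize(['{"name": "spark"}']))
    df.show()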
Have you built the Spark jars? Can you run the Spark Scala shell?
--Hossein
On Tuesday, October 6, 2015, Khandeshi, Ami wrote:
> > Sys.setenv(SPARKR_SUBMIT_ARGS="--verbose sparkr-shell")
> > Sys.setenv(SPARK_PRINT_LAUNCH_COMMAND=1)
> >
> > sc <- sparkR.init()
Why not let SparkSQL deal with parallelism? When using SparkSQL data
sources, you can control parallelism by specifying mapred.min.split.size
and mapred.max.split.size in your Hadoop configuration. You can then
repartition your data as you wish and save it as Parquet.
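A minimal PySpark sketch of this approach, assuming the spark.hadoop.* prefix to copy settings into the Hadoop configuration; the split sizes (in bytes), partition count, and HDFS paths are illustrative assumptions:

    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Assumed split sizes: min 64 MB, max 128 MB per input split.
    conf = (SparkConf()
            .setAppName("split-size-example")
            .set("spark.hadoop.mapred.min.split.size", str(64 * 1024 * 1024))
            .set("spark.hadoop.mapred.max.split.size", str(128 * 1024 * 1024)))
    sc = SparkContext(conf=conf)
    sqlContext = SQLContext(sc)

    # Let the data source derive partitions from the split sizes, then
    # repartition as desired and save as Parquet. Paths are hypothetical.
    df = sqlContext.read.json("hdfs:///data/input.json")
    df.repartition(32).write.parquet("hdfs:///data/output.parquet")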
--Hossein
On Thu, May 28
You can use SparkContext.wholeTextFiles().
Please note that the documentation suggests: "Small files are preferred,
large file is also allowable, but may cause bad performance."
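For reference, a minimal sketch of using wholeTextFiles(), which yields one (path, content) pair per file; the input directory and the line-count transformation are illustrative assumptions:

    from pyspark import SparkContext

    sc = SparkContext(appName="whole-text-files-example")

    # Hypothetical directory of many small text files.
    pairs = sc.wholeTextFiles("hdfs:///data/small-files/")

    # Example use: count lines per file, keeping the file path as the key.
    line_counts = pairs.mapValues(lambda content: len(content.splitlines()))
    print(line_counts.take(5))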
--Hossein
On Tue, Jul 29, 2014 at 9:21 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: