Why does ShuffleManager.registerShuffle take shuffleId when ShuffleDependency has it too?

2016-12-28 Thread Jacek Laskowski
Hi, While reviewing ShuffleManager I've noticed that the registerShuffle method [1] takes both a shuffleId and a ShuffleDependency, which seems like code duplication since ShuffleDependency already has shuffleId. Is there any reason for having shuffleId specified explicitly? [1] https://github.com/apache/spark/blob/master/core/
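For reference, a simplified paraphrase of the signature in question (parameters other than the two under discussion are elided, and the exact list may differ across Spark versions):

trait ShuffleManager {
  // shuffleId is passed explicitly even though dependency.shuffleId
  // carries the same value (the duplication being asked about)
  def registerShuffle[K, V, C](
      shuffleId: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle
}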

Why is spark.shuffle.sort.bypassMergeThreshold 200?

2016-12-28 Thread Jacek Laskowski
Hi, I'm wondering what's so special about 200 that makes it the default value of spark.shuffle.sort.bypassMergeThreshold? Is this an arbitrary number? Is there any theory behind it? Is the default number of partitions in Spark SQL, i.e. 200, somehow related to spark.shuffle.sort.bypassMergeThreshold? scala>
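A quick way to look at both defaults from the shell (a sketch; the output shown assumes stock settings):

scala> spark.conf.get("spark.sql.shuffle.partitions")
res0: String = 200

scala> sc.getConf.get("spark.shuffle.sort.bypassMergeThreshold", "200")
res1: String = 200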

Apache Hive with Spark Configuration

2016-12-28 Thread Chetan Khatri
Hello Users / Developers, I am using Hive 2.0.1 with MySQL as a metastore. Can you tell me which Hive version is most compatible with Spark 2.0.2? Thanks
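For what it's worth, Spark 2.0.x builds against Hive 1.2.1 by default. A sketch of pointing a SparkSession at an existing metastore follows; the version string here is an assumption to verify against your setup, not a confirmed answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("Hive metastore check")
  .config("spark.sql.hive.metastore.version", "1.2.1")  // assumed; match your metastore
  .config("spark.sql.hive.metastore.jars", "maven")
  .enableHiveSupport()
  .getOrCreate()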

Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
Hello Spark Community, I am reading an HBase table from Spark and getting an RDD, but now I want to convert that RDD of Spark Rows to a DataFrame. *Source Code:*
bin/spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --conf spark.hbase.host=127.0.0.1
import it.nerdamme
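A minimal sketch of the RDD-of-Rows plus schema pattern being attempted (field names and values here are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// build a schema of string fields from a comma-separated header
val schema = StructType("rowid,name,email".split(",").map(fieldName =>
  StructField(fieldName, StringType, nullable = true)))

// an RDD of Rows whose values match the schema types
val rowRDD = sc.parallelize(Seq(Row("1", "Alice", "a@example.com")))

val df = spark.createDataFrame(rowRDD, schema)
df.show()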

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
I resolved the above error by creating a SparkSession:
val spark = SparkSession.builder().appName("Hbase - Spark POC").getOrCreate()
The error now occurs later:
spark.sql("SELECT * FROM student").show()
Doing the show() action on the DataFrame throws the error below:
scala> sqlContext.sql("select * from student").show(

ml word2vec findSynonyms return type

2016-12-28 Thread Asher Krim
Hey all, I would like to propose changing the return type of `findSynonyms` in ml's Word2Vec:
def findSynonyms(word: String, num: Int): DataFrame = {
  val spark = SparkSes
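For context, a sketch of the current behavior in spark-shell (the training data is made up): today the ml method returns a DataFrame with "word" and "similarity" columns, in contrast to mllib's Word2VecModel, which returns Array[(String, Double)].

import org.apache.spark.ml.feature.Word2Vec

// tiny toy corpus: one array of tokens per document
val docs = spark.createDataFrame(Seq(
  "spark shuffle sort".split(" "),
  "spark sql dataframe".split(" ")
).map(Tuple1.apply)).toDF("text")

val model = new Word2Vec().setInputCol("text").setOutputCol("vec")
  .setVectorSize(3).setMinCount(0).fit(docs)

model.findSynonyms("spark", 2).show()  // a DataFrame, not a typed collection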

RE: Shuffle intermediate results not being cached

2016-12-28 Thread Liang-Chi Hsieh
The shuffle data can be reused only if you use the same RDD. When you union x1's RDD and x2's RDD in the first iteration, and then union x1's RDD, x2's RDD, and x3's RDD in the second iteration, how can they be the same RDD? I just used the previous example code to show that you should not recompute a
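To illustrate the point with a sketch (x1, x2, x3 are hypothetical DataFrames): the two iterations build different lineages, so the second cannot reuse the first's shuffle output.

import org.apache.spark.sql.functions.{col, sum}

// iteration 1: a shuffle over x1 ++ x2
val it1 = x1.union(x2).groupBy("cat1", "cat2").agg(sum(col("v")).alias("v"))
// iteration 2: a brand-new shuffle over x1 ++ x2 ++ x3
val it2 = x1.union(x2).union(x3).groupBy("cat1", "cat2").agg(sum(col("v")).alias("v"))
// reuse requires building on the previous result instead:
val it2Reused = it1.union(x3).groupBy("cat1", "cat2").agg(sum(col("v")).alias("v"))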

RE: Shuffle intermediate results not being cached

2016-12-28 Thread Liang-Chi Hsieh
var totalTime: Long = 0
var allDF: DataFrame = null
for { x <- dataframes } {
  val timeLen = time {
    allDF = if (allDF == null) x else {
      allDF.union(x).groupBy("cat1", "cat2").agg(sum($"v").alias("v"))
    }
  }
  println(s"Took $timeLen milliseconds")
  totalTime
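The snippet assumes a timing helper along these lines (a hypothetical sketch; it is not shown in the thread):

def time[T](body: => T): Long = {
  val start = System.currentTimeMillis()
  body                                // force the work being measured
  System.currentTimeMillis() - start  // elapsed wall-clock milliseconds
}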

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Liang-Chi Hsieh
Your schema declares all fields as string:
> val stdSchema = StructType(stdSchemaString.split(",").map(fieldName =>
>   StructField(fieldName, StringType, true)))
But it looks like you have integer columns in the RDD?
Chetan Khatri wrote
> Resolved above error by creating SparkSession
>
> val spark = Sp
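A sketch of the fix implied here: declare the schema types to match the actual Row values (the column names are hypothetical).

import org.apache.spark.sql.types._

val stdSchema = StructType(Seq(
  StructField("rowid", IntegerType, nullable = true), // matches an Int in the Row
  StructField("name",  StringType,  nullable = true)
))
// With an all-StringType schema over mixed-type Rows, the mismatch only
// surfaces when an action such as show() forces evaluation.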

Re: Dependency Injection and Microservice development with Spark

2016-12-28 Thread Miguel Morales
Hi, Not sure about Spring Boot, but trying to use DI libraries you'll run into serialization issues. I've had luck using an old version of Scaldi. Recently, though, I've been passing the class types as arguments with default values. Then in the Spark code they get instantiated. So you're basic
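A sketch of the pattern Miguel describes: pass the class as a parameter with a default, and instantiate it inside the executor-side closure so the DI container never needs to be serialized. All names here are hypothetical.

trait Enricher { def enrich(s: String): String }
class DefaultEnricher extends Enricher { def enrich(s: String) = s.toUpperCase }

def run(rdd: org.apache.spark.rdd.RDD[String],
        enricherClass: Class[_ <: Enricher] = classOf[DefaultEnricher]) =
  rdd.mapPartitions { it =>
    // instantiated on the executor, once per partition; only the
    // (serializable) Class object crosses the wire
    val enricher = enricherClass.newInstance()
    it.map(enricher.enrich)
  }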

Re: Why is spark.shuffle.sort.bypassMergeThreshold 200?

2016-12-28 Thread Liang-Chi Hsieh
This PR, https://github.com/apache/spark/pull/1799, seems to be the first to introduce the number, but it offers no explanation of why 200 was chosen.
Jacek Laskowski wrote
> Hi,
>
> I'm wondering what's so special about 200 to have it the default value
> of spark.shuffle.sort.bypassMergeThreshold?
>
> Is
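For context, the check this setting guards (paraphrased and simplified from SortShuffleWriter.shouldBypassMergeSort, not verbatim):

import org.apache.spark.{ShuffleDependency, SparkConf}

def shouldBypassMergeSort(conf: SparkConf, dep: ShuffleDependency[_, _, _]): Boolean =
  if (dep.mapSideCombine) {
    false // bypass is impossible when map-side aggregation is required
  } else {
    val threshold = conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
    dep.partitioner.numPartitions <= threshold
  }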