If you have used the spark-sas7bdat package to transform SAS data sets to Spark, please be aware

2016-10-27 Thread Shi Yu
I found some major issues and wrote them up on my blog: https://eilianyu.wordpress.com/2016/10/27/be-aware-of-hidden-data-errors-using-spark-sas7bdat-pacakge-to-ingest-sas-datasets-to-spark/
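Since the thread warns about silent data errors during ingestion, one practical safeguard is to verify what Spark read against what SAS reports. Below is a minimal PySpark sketch of such a check, assuming a pyspark shell where sqlContext is already defined; the file path, the SAS-reported row count, and the com.github.saurfang.sas.spark format string are assumptions to confirm against the spark-sas7bdat version you actually run:

    # Hypothetical sanity check after ingesting a SAS dataset with spark-sas7bdat.
    # The path and the SAS-reported row count are placeholders.
    df = (sqlContext.read
          .format("com.github.saurfang.sas.spark")   # format registered by the package; verify for your version
          .load("/data/events.sas7bdat"))

    sas_reported_rows = 1000000   # placeholder: take this from SAS metadata, e.g. PROC CONTENTS
    spark_rows = df.count()

    if spark_rows != sas_reported_rows:
        print("Row count mismatch: Spark read %d rows, SAS reports %d"
              % (spark_rows, sas_reported_rows))

    # Spot-check the schema and a few rows as well; a matching count alone
    # will not catch silently shifted or misparsed fields.
    df.printSchema()
    df.show(5)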

Best practices for complicated SQL queries in Spark/Hive

2016-10-06 Thread Shi Yu
Hello, I wonder what the state-of-the-art best practice is for getting good performance from complicated SQL queries today, in 2016? I am new to this topic and have read about Hive on Tez, Spark on Hive, and Spark SQL 2.0 (it seems Spark 2.0 supports complicated nested queries). The documentation I read sugge
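Since the thread singles out nested-query support in Spark SQL 2.0, here is a small hedged example of running one and inspecting its plan; the sqlContext handle and the orders table are assumptions that do not appear in the thread:

    # Hypothetical nested query: per-customer totals filtered by a scalar subquery.
    query = """
    SELECT customer_id, total
    FROM (
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) t
    WHERE total > (SELECT AVG(amount) FROM orders)
    """
    df = sqlContext.sql(query)
    df.explain()   # inspect the plan the optimizer produces before tuning anything
    df.show()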

Spark Beginner Question

2016-07-26 Thread Shi Yu
Hello, *Question 1:* I am new to Spark. I am trying to train a classification model on a Spark DataFrame. I am using PySpark, and I created a Spark DataFrame object in df: from pyspark.sql.types import * query = """select * from table""" df = sqlContext.sql(query) My question is how
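For training on a DataFrame, the usual tool is the DataFrame-based pyspark.ml API. A sketch follows; the feature columns f1 and f2 and the numeric label column are invented, since the thread never shows the schema of df:

    # Hypothetical training pipeline on the DataFrame `df` from the post.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # Assemble the (assumed) feature columns into a single vector column.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    train_df = assembler.transform(df).select("features", "label")

    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    model = lr.fit(train_df)

    # Predictions come back as a DataFrame with a "prediction" column.
    model.transform(train_df).select("label", "prediction").show(5)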

Re: obtain cluster assignment in K-means

2015-02-12 Thread Shi Yu
a single Vector, 2nd is RDD[Vector]. Robin. On 12 Feb 2015, at 06:37, Shi Yu wrote: > Hi there, I am new to Spark. When training a model with K-means using the following code, how do I obtain the cluster assignment in the next step? > val cluster

obtain cluster assignment in K-means

2015-02-11 Thread Shi Yu
Hi there, I am new to Spark. When training a model with K-means using the following code, how do I obtain the cluster assignment in the next step? val clusters = KMeans.train(parsedData, numClusters, numIterations) I searched around many examples but they mostly calculate the WSSSE. I am sti
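The reply in the thread above points to predict, which accepts either a single Vector or an RDD[Vector]. A PySpark sketch of the same idea follows (the thread's own code is Scala, the sample points here are invented, and a live SparkContext sc is assumed):

    # Hypothetical PySpark equivalent of the thread's KMeans.train(...) call.
    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    parsedData = sc.parallelize([Vectors.dense([0.0, 0.0]),
                                 Vectors.dense([9.0, 9.0]),
                                 Vectors.dense([0.1, 0.1])])

    model = KMeans.train(parsedData, 2, maxIterations=10)

    # Cluster assignments rather than WSSSE: predict() returns the index of
    # the nearest centre, for a single vector or for a whole RDD of vectors.
    assignments = parsedData.map(lambda p: (p, model.predict(p)))
    print(assignments.collect())                     # [(point, clusterIndex), ...]
    print(model.predict(Vectors.dense([8.5, 9.2])))  # assignment for one new point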