Pyspark not running the sqlContext in Pycharm

2018-03-01 Thread rhettbutler
I hope someone can help with a problem I am having. I had previously set up a VM on Windows using CentOS, with Hadoop and Spark (all single-node), and it was working perfectly. I am now running a multi-node setup with another computer, both running CentOS standalone. I have installed hadoop s

Can I get my custom spark strategy to run last?

2018-03-01 Thread Keith Chapman
Hi, I'd like to write a custom Spark strategy that runs after all the existing Spark strategies are run. Looking through the Spark code it seems like the custom strategies are prepended to the list of strategies in Spark. Is there a way I could get it to run last? Regards, Keith. http://keith-ch
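For reference, a minimal Scala sketch of the registration point in question (the strategy body is a placeholder): strategies added through spark.experimental.extraStrategies are prepended to the built-in strategies, so they are tried first, not last, and a strategy that returns Nil simply falls through to the next one in the list.

    import org.apache.spark.sql.{SparkSession, Strategy}
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.execution.SparkPlan

    // Placeholder strategy: match only the operators you care about and
    // return Nil for everything else, so the planner falls through to
    // the remaining (built-in) strategies.
    object MyStrategy extends Strategy {
      override def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
        case _ => Nil
      }
    }

    val spark = SparkSession.builder().appName("custom-strategy").getOrCreate()
    // Extra strategies are prepended to the planner's strategy list,
    // so this one runs before, not after, the built-in strategies.
    spark.experimental.extraStrategies = Seq(MyStrategy)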

Re: K Means Clustering Explanation

2018-03-01 Thread Christoph Brücke
Hi Matt, I see. You could use the trained model to predict the cluster id for each training point. You should then be able to create a dataset with your original input data plus the associated cluster id for each data point. Then you can group this dataset by cluster id and aggregate the feature values per cluster.
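For concreteness, a sketch of that approach in Scala with spark.ml, assuming a DataFrame `data` whose numeric columns f1 and f2 (hypothetical names) have already been assembled into a "features" vector column:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.sql.functions.avg

    // Fit the model; transform() appends a "prediction" column holding
    // the cluster id assigned to each input row.
    val model = new KMeans().setK(3).setFeaturesCol("features").fit(data)
    val clustered = model.transform(data)

    // Average the original columns per cluster; the per-cluster means
    // show which value ranges characterize each cluster.
    clustered.groupBy("prediction")
      .agg(avg("f1"), avg("f2"))
      .show()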

Re: parquet vs orc files

2018-03-01 Thread Sushrut Ikhar
To add, schema evolution is better in Parquet than in ORC (at the cost of being a bit slower), as ORC is truly index-based; this is especially useful in case you want to delete some column later. Regards, Sushrut Ikhar about.me/sushrutikhar
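As a concrete illustration of Parquet-side schema evolution (paths are hypothetical): Spark can reconcile Parquet files whose schemas differ when the mergeSchema option is enabled, which is how an added or dropped column is handled at read time.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-schema").getOrCreate()
    import spark.implicits._

    // Two partitions written with different schemas.
    Seq((1, 1)).toDF("value", "square").write.parquet("/tmp/test_table/key=1")
    Seq((2, 8)).toDF("value", "cube").write.parquet("/tmp/test_table/key=2")

    // Schema merging is off by default (it is relatively expensive), so
    // enable it per read. The merged schema has value, square, cube, and
    // the partition column key; fields absent from a file read as null.
    val merged = spark.read.option("mergeSchema", "true").parquet("/tmp/test_table")
    merged.printSchema()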

Re: Using Thrift with Dataframe

2018-03-01 Thread Sushrut Ikhar
https://github.com/airbnb/airbnb-spark-thrift Regards, Sushrut Ikhar about.me/sushrutikhar On Thu, Mar 1, 2018 at 6:05 AM, Nikhil Goyal wrote: > Hi guys, > > I have an RDD of Thrift structs. I want to convert it into a DataFrame.
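Besides that library, a hand-rolled sketch of the general approach: Thrift-generated classes are not Scala Products, so toDF() cannot derive a schema; instead, map each struct to a Row against an explicit schema. The ExampleStruct class and its fields below are hypothetical stand-ins for a generated Thrift class.

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    // Stand-in for a Thrift-generated class (a plain class, not a case class).
    class ExampleStruct(val id: Long, val name: String)

    val spark = SparkSession.builder().appName("thrift-to-df").getOrCreate()

    // Declare the schema explicitly, mirroring the Thrift field types.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)))

    val rdd = spark.sparkContext.parallelize(
      Seq(new ExampleStruct(1L, "a"), new ExampleStruct(2L, "b")))

    // Convert each struct to a Row in schema field order.
    val df = spark.createDataFrame(rdd.map(s => Row(s.id, s.name)), schema)
    df.show()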

K Means Clustering Explanation

2018-03-01 Thread Matt Hicks
I'm using K-Means clustering for a project right now, and it's working very well. However, I'd like to determine which feature distinctions define each cluster, so I can explain the "reasons" data fits into a specific cluster. Is there a proper way to do this in Spark ML?
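One direct starting point (a self-contained sketch with synthetic data, not from the thread): the fitted KMeansModel exposes its cluster centers, and since each center is the per-feature mean of the cluster's members, comparing the centers shows which feature values set the clusters apart. This complements the per-cluster aggregation shown in the reply above.

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kmeans-centers").getOrCreate()
    import spark.implicits._

    // Tiny synthetic dataset: two obvious groups in one dimension.
    val data = Seq(0.0, 0.1, 0.2, 9.0, 9.1, 9.2)
      .map(x => Tuple1(Vectors.dense(x)))
      .toDF("features")

    val model = new KMeans().setK(2).setSeed(1L).fit(data)

    // Each center is the mean feature vector of its cluster.
    model.clusterCenters.zipWithIndex.foreach { case (c, i) =>
      println(s"cluster $i center: $c")
    }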