Re: SparkSQL + Tableau Connector

2015-02-11 Thread Silvio Fiorito
Hey Todd, I don’t have an app to test against the thrift server. Are you able to define custom SQL without using Tableau’s schema query? I guess it’s not possible to just use SparkSQL temp tables; you may have to use permanent Hive tables that are actually in the metastore so Tableau can discover them.

RE: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Yang, Yuhao
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala. It can be used through sliding(windowSize: Int) in spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala. Yuhao From: Mark Hamstra [mailto:m...@clearstorydata.com] Sent: Thursday, February 12, 2015 7:0
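As a hedged illustration (not from the thread, and assuming a SparkContext sc as in the shell), sliding() becomes available on an ordinary RDD via the implicit conversions in RDDFunctions:

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(1 to 10)
// sliding(3) yields overlapping windows: Array(1,2,3), Array(2,3,4), ...
// note these are windows, not the disjoint chunks Guava's Iterables.partition
// produces -- for true chunking you would still need to step over the result
val windows = rdd.sliding(3).collect()
windows.foreach(w => println(w.mkString(",")))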

feeding DataFrames into predictive algorithms

2015-02-11 Thread Sandy Ryza
Hey All, I've been playing around with the new DataFrame and ML pipelines APIs and am having trouble accomplishing what seems like it should be a fairly basic task. I have a DataFrame where each column is a Double. I'd like to turn this into a DataFrame with a features column and a label column that

Re: Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Ah, nevermind, I just saw http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language integrated queries), which looks quite similar to what I was thinking about. I'll give that a whirl... On Wed, Feb 11, 2015 at 7:40 PM, jay vyas wrote: > Hi spark. Is there anything in the works for

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Michael Armbrust
It sounds like you probably want to do a standard Spark map that results in a tuple with the structure you are looking for. You can then just assign names to turn it back into a DataFrame. Assuming the first column is your label and the rest are features, you can do something like this: val df =

Re: feeding DataFrames into predictive algorithms

2015-02-11 Thread Patrick Wendell
I think there is a minor error here in that the first example needs a "tail" after the seq: df.map { row => (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double])) }.toDataFrame("label", "features") On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust wrote: > It sounds like you probably want
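Assembled into a fuller sketch of the thread's corrected snippet (toDataFrame is the pre-1.3 name, later renamed toDF; the implicits import is an assumption for the tuple-to-DataFrame conversion):

import sqlContext.implicits._

val training = df.map { row =>
  // .tail drops the label column; without it the label would leak into the features
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")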

Re: Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-11 Thread nitin
I was able to resolve this use case (thanks Cheng Lian), where I wanted to launch an executor on just the specific partition while also getting the batch pruning optimisations of Spark SQL, by doing the following: val query = sql("SELECT * FROM cachedTable WHERE key = 1") val plannedRDD = query.queryExecution
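A hedged sketch of the full pattern, assuming Spark 1.2 internals (queryExecution.toRdd and SparkContext.runJob are expert-level APIs, and the partition index here is illustrative -- you would derive it from how `key` is partitioned):

val query = sql("SELECT * FROM cachedTable WHERE key = 1")
val plannedRDD = query.queryExecution.toRdd  // planned RDD[Row], pruning applied

val targetPartition = 0  // hypothetical: the one partition that can hold key = 1
val rows = sc.runJob(
  plannedRDD,
  (it: Iterator[org.apache.spark.sql.catalyst.expressions.Row]) => it.toArray,
  Seq(targetPartition),
  allowLocal = false)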

Re: how to debug this kind of error, e.g. "lost executor"?

2015-02-11 Thread Praveen Garg
Try increasing the value of spark.yarn.executor.memoryOverhead. Its default value is 384 MB in Spark 1.1. This error generally comes when your process usage exceeds your max allocation. Use the following property to increase the memory overhead. From: Yifan LI <iamyifa...@gmail.com> Date: Friday,
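For example, as a minimal sketch (the 1024 MB value is illustrative, not a recommendation -- size it to your observed off-heap usage; the same property can also be passed via --conf on spark-submit):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-yarn-app")  // hypothetical app name
  // extra off-heap headroom YARN grants on top of the executor heap
  .set("spark.yarn.executor.memoryOverhead", "1024")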

Re: Datastore HDFS vs Cassandra

2015-02-11 Thread Mike Trienis
Thanks everyone for your responses. I'll definitely think carefully about the data models, querying patterns and fragmentation side-effects. Cheers, Mike. On Wed, Feb 11, 2015 at 1:14 AM, Franc Carter wrote: > > I forgot to mention that if you do decide to use Cassandra I'd highly > recommend j

how to avoid Spark and Hive log from Application log

2015-02-11 Thread sachin Singh
Hi, please can somebody help me avoid Spark and Hive logs in my application log? Both Spark and Hive use a log4j properties file. I have configured the log4j.properties file for my application as below, but it is still printing Spark and Hive console logging as well. Please suggest; it's urgent for me,
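One hedged workaround (not from the thread): alongside log4j.properties entries such as log4j.logger.org.apache.spark=WARN, the levels can be forced programmatically, which wins even when a bundled log4j.properties is picked up ahead of your own file:

import org.apache.log4j.{Level, Logger}

// silence Spark and Hive internals while keeping the application's own loggers intact
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("org.apache.hadoop.hive").setLevel(Level.WARN)
Logger.getLogger("org.apache.hive").setLevel(Level.WARN)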

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-11 Thread fightf...@163.com
Hi, I still have no adequate solution for this issue. Any applicable analysis rules or hints would be appreciated. Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-09 11:56 To: user; dev Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Michael Nazario
Hi Spark users, I seem to be having this consistent error which I have been trying to reproduce and narrow down the problem. I've been running a PySpark application on Spark 1.2 reading avro files from Hadoop. I was consistently seeing the following error: py4j.protocol.Py4JJavaError: An error

RE: PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Michael Nazario
I also forgot some other information. I have made this error go away by making my PySpark application use spark-1.1.1-bin-cdh4 for the driver while communicating with a Spark 1.2 master and worker. It's not a good workaround, so I would like the driver to also be Spark 1.2. Michael

Unable to query hive tables from spark

2015-02-11 Thread kundan kumar
I want to create/access Hive tables from Spark. I have placed hive-site.xml inside the spark/conf directory. Even so, it creates a local metastore in the directory where I run the spark shell and exits with an error. I am getting this error when I try to create a new Hive table. Even

Extract hour from Timestamp in Spark SQL

2015-02-11 Thread Wush Wu
Dear all, I am new to Spark SQL and have no experience with Hive. I tried to use the built-in Hive function to extract the hour from a timestamp in Spark SQL, but got: "java.util.NoSuchElementException: key not found: hour". How should I extract the hour from a timestamp? And I am very confused about
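A hedged sketch of one likely cause and fix: hour() is a Hive UDF, so it resolves under HiveContext, while a plain SQLContext in 1.2 has no built-in hour -- consistent with the "key not found: hour" error. Table and column names below are hypothetical:

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
hiveCtx.sql("SELECT hour(event_time) FROM events").collect().foreach(println)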

Re: Strongly Typed SQL in Spark

2015-02-11 Thread Felix C
As far as I can tell from my tests, language integrated query in Spark isn't type safe, i.e. query.where('cost == "foo") would compile and return nothing. If you want type safety, perhaps you want to map the SchemaRDD to an RDD of Product (your type, not scala.Product). --- Original Message --- From: "jay
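A minimal sketch of that workaround (table and fields hypothetical): map each row into a case class, after which field names and types are checked at compile time:

case class Sale(item: String, cost: Double)

val typed = sqlContext.sql("SELECT item, cost FROM sales")
  .map(r => Sale(r.getString(0), r.getDouble(1)))

// typed.filter(_.cost < "foo") no longer compiles; comparisons must be against Double:
val cheap = typed.filter(_.cost < 10.0)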

Spark SQL release

2015-02-11 Thread Agarwal, Shagun
Looks like the latest Spark SQL (1.2.1) release is still alpha. Any idea when a stable release is planned? Thanks Shagun

obtain cluster assignment in K-means

2015-02-11 Thread Shi Yu
Hi there, I am new to Spark. When training a K-means model using the following code, how do I obtain the cluster assignment in the next step? val clusters = KMeans.train(parsedData, numClusters, numIterations) I searched around many examples but they mostly calculate the WSSSE. I am still
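A minimal sketch (not from the thread) of the next step: KMeansModel.predict maps each vector to its cluster index:

val clusters = KMeans.train(parsedData, numClusters, numIterations)

// pair each point with the id of the cluster it was assigned to
val assignments = parsedData.map(v => (clusters.predict(v), v))
assignments.take(5).foreach(println)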

Re: iteratively modifying an RDD

2015-02-11 Thread Rok Roskar
Yes, sorry I wasn't clear -- I still have to trigger the calculation of the RDD at the end of each iteration. Otherwise all of the lookup tables are shipped to the cluster at the same time, resulting in memory errors. Therefore this becomes several map jobs instead of one, and each consecutive map
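A hedged sketch of that pattern (names hypothetical): broadcast one lookup table per iteration and force the map with an action so each table ships -- and can be released -- before the next one:

var rdd = initialRdd
for (table <- lookupTables) {
  val bcast = sc.broadcast(table)
  rdd = rdd.map(x => bcast.value.getOrElse(x, x)).persist()
  rdd.count()       // trigger this map job now rather than one giant lineage
  bcast.unpersist() // drop the table from executors before the next iteration
}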

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Felix C
What kind of data do you have? Kafka is a popular source to use with Spark Streaming, but Spark Streaming also supports reading from a file; this is called a basic source: https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers --- Original Message --- From: "

Re: PySpark 1.2 Hadoop version mismatch

2015-02-11 Thread Akhil Das
Did you have a look at http://spark.apache.org/docs/1.2.0/building-spark.html? I think you can simply download the source and build it for your Hadoop version: mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package Thanks Best Regards On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario

Re: Can spark job server be used to visualize streaming data?

2015-02-11 Thread Su She
Hello Felix, I am already streaming in very simple data using Kafka (few messages / second, each record only has 3 columns...really simple, but looking to scale once I connect everything). I am processing it in Spark Streaming and am currently writing word counts to hdfs. So the part where I am co

Streaming scheduling delay

2015-02-11 Thread Tim Smith
On Spark 1.2 (have been seeing this behaviour since 1.0), I have a streaming app that consumes data from Kafka and writes it back to Kafka (different topic). My big problem has been Total Delay. While execution time is usually https://github.com/apache/spark/blob/master/core/src/main/scala/org/apac

Re: Can't access remote Hive table from spark

2015-02-11 Thread guxiaobo1982
Hi Zhan, Yes, I found there is an hdfs account, which was created by Ambari, but what's the password for this account? How can I log in under this account? Can I just change the password for the hdfs account? Regards, -- Original -- From: "Zhan Zhang" Send

Re: How to do broadcast join in SparkSQL

2015-02-11 Thread Dima Zhiyanov
Thank you! The Hive solution seemed more like a workaround. I was wondering if native Spark SQL support for computing statistics for Parquet files would be available. Dima Sent from my iPhone > On Feb 11, 2015, at 3:34 PM, Ted Yu wrote: > > See earlier thread: > http://search-hadoop.com/
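A hedged sketch of the Hive-side workaround being discussed (table name hypothetical): populate size statistics so the planner can see the table fits under the broadcast threshold:

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)

// noscan derives size statistics from file metadata without reading the data
hiveCtx.sql("ANALYZE TABLE small_dim COMPUTE STATISTICS noscan")

// tables below this byte size are planned as broadcast joins (10 MB is the default)
hiveCtx.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)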

Re: Streaming scheduling delay

2015-02-11 Thread Tim Smith
Just read the thread "Are these numbers abnormal for spark streaming?" and I think I am seeing similar results - that is - increasing the window seems to be the trick here. I will have to monitor for a few hours/days before I can conclude (there are so many knobs/dials). On Wed, Feb 11, 2015 at
