Hey Todd,
I don’t have an app to test against the thrift server. Are you able to define
custom SQL without using Tableau’s schema query? I guess it’s not possible to
just use SparkSQL temp tables; you may have to use permanent Hive tables that
are actually in the metastore so Tableau can discov
Check spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala
It can be used through sliding(windowSize: Int) in
spark/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala
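For illustration, a minimal sketch of using that helper through the implicit conversion in RDDFunctions (the input data and window size are made up):
import org.apache.spark.mllib.rdd.RDDFunctions._

val values = sc.parallelize(1 to 10)
// Each element of the result is one window: Array(1,2,3), Array(2,3,4), ..., Array(8,9,10)
val windows = values.sliding(3)
windows.collect().foreach(w => println(w.mkString(", ")))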
Yuhao
From: Mark Hamstra [mailto:m...@clearstorydata.com]
Sent: Thursday, February 12, 2015 7:0
Hey All,
I've been playing around with the new DataFrame and ML pipelines APIs and
am having trouble accomplishing what seems like it should be a fairly basic
task.
I have a DataFrame where each column is a Double. I'd like to turn this
into a DataFrame with a features column and a label column tha
Ah, nevermind, I just saw
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language
integrated queries) which looks quite similar to what i was thinking
about. I'll give that a whirl...
On Wed, Feb 11, 2015 at 7:40 PM, jay vyas
wrote:
> Hi spark. is there anything in the works fo
It sounds like you probably want to do a standard Spark map that results
in a tuple with the structure you are looking for. You can then just
assign names to turn it back into a dataframe.
Assuming the first column is your label and the rest are features you can
do something like this:
val df =
I think there is a minor error here in that the first example needs a
"tail" after the seq:
df.map { row =>
  (row.getDouble(0), row.toSeq.tail.map(_.asInstanceOf[Double]))
}.toDataFrame("label", "features")
On Wed, Feb 11, 2015 at 7:46 PM, Michael Armbrust
wrote:
> It sounds like you probably w
I was able to resolve this use case (thanks, Cheng Lian), where I wanted to
launch an executor on just a specific partition while also getting the batch
pruning optimisations of Spark SQL, by doing the following:
val query = sql("SELECT * FROM cachedTable WHERE key = 1")
val plannedRDD = query.queryExe
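A hedged sketch of that approach, mirroring the truncated snippet above (the table, key, and target partition index are assumptions; queryExecution and toRdd are Spark 1.2-era developer APIs, and sql(...) assumes the SQLContext's sql method is in scope):
import org.apache.spark.rdd.PartitionPruningRDD

val query = sql("SELECT * FROM cachedTable WHERE key = 1")
// The planned RDD keeps the in-memory batch pruning chosen by the planner.
val plannedRDD = query.queryExecution.toRdd
// Restrict the job to the partition(s) of interest instead of scanning all of them.
val prunedRDD = PartitionPruningRDD.create(plannedRDD, partitionId => partitionId == 0)
prunedRDD.count()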
Try increasing the value of spark.yarn.executor.memoryOverhead. Its default
value is 384 MB in Spark 1.1. This error generally comes when your process usage
exceeds your max allocation. Use the following property to increase the memory overhead.
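For example, on the spark-submit command line (the 1024 MB value is only illustrative; size it to your workload):
spark-submit --conf spark.yarn.executor.memoryOverhead=1024 ...
The same property can also be set in spark-defaults.conf.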
From: Yifan LI <iamyifa...@gmail.com>
Date: Friday,
Thanks everyone for your responses. I'll definitely think carefully about
the data models, querying patterns and fragmentation side-effects.
Cheers, Mike.
On Wed, Feb 11, 2015 at 1:14 AM, Franc Carter
wrote:
>
> I forgot to mention that if you do decide to use Cassandra I'd highly
> recommend j
Hi,
Please can somebody help with how to keep the Spark and Hive logs out of the
application log?
I mean both Spark and Hive are using the log4j properties file.
I have configured the log4j.properties file for my application as below, but it
is still printing the Spark and Hive console logging as well. Please suggest;
it's urgent for me.
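A hedged log4j.properties sketch for this: keep the root/application loggers at INFO while turning the Spark and Hive logger hierarchies down to WARN (logger names and levels are the usual ones; adjust to your setup):
log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
# Quiet down Spark and Hive internals
log4j.logger.org.apache.spark=WARN
log4j.logger.org.apache.hadoop.hive=WARN
log4j.logger.hive=WARN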
Hi,
We still have no adequate solution for this issue. Any available analytical
rules or hints would be appreciated.
Thanks,
Sun.
fightf...@163.com
From: fightf...@163.com
Date: 2015-02-09 11:56
To: user; dev
Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for
large data sets
Hi Spark users,
I seem to be hitting a consistent error which I have been trying to reproduce
and narrow down. I've been running a PySpark application on Spark 1.2 reading
Avro files from Hadoop. I was consistently seeing the following error:
py4j.protocol.Py4JJavaError: An error
I also forgot some other information. I have made this error go away by making
my pyspark application use spark-1.1.1-bin-cdh4 for the driver, but communicate
with a spark 1.2 master and worker. It's not a good workaround, so I would like
to have the driver also be spark 1.2
Michael
___
I want to create/access the hive tables from spark.
I have placed the hive-site.xml inside the spark/conf directory. Even so, it
still creates a local metastore in the directory where I run the spark shell
and exits with an error.
I am getting this error when I try to create a new hive table. Even
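If the shell keeps creating a local Derby metastore, hive-site.xml is likely not being picked up or does not point at a shared metastore service. A hedged sketch of the relevant hive-site.xml entry (host and port are placeholders for your environment):
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://your-metastore-host:9083</value>
</property>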
Dear all,
I am new to Spark SQL and have no experience of Hive.
I tried to use the built-in Hive function to extract the hour from a
timestamp in Spark SQL, but got: "java.util.NoSuchElementException: key
not found: hour"
How should I extract the hour from a timestamp?
And I am very confused abou
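A hedged sketch of one way to do this: hour() is a Hive built-in UDF, so it is typically resolved through a HiveContext rather than the plain SQLContext (the table and column names below are assumptions):
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// hour() comes from Hive's UDF library and takes a timestamp or timestamp string.
hiveContext.sql("SELECT hour(event_time) AS event_hour FROM events").collect()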
As far as I can tell from my tests, language integrated query in Spark isn't type safe, i.e.
query.where('cost == "foo")
would compile and return nothing.
If you want type safety, perhaps you want to map the SchemaRDD to a RDD of
Product (your type, not scala.Product)
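A minimal sketch of that suggestion, assuming a SchemaRDD whose first column is a string and second a double (the Cost case class and column positions are made up for illustration):
import org.apache.spark.rdd.RDD

case class Cost(item: String, cost: Double)

val typed: RDD[Cost] = schemaRDD.map { row =>
  Cost(row.getString(0), row.getDouble(1))
}
// From here on the compiler checks field names and types.
val expensive = typed.filter(_.cost > 100.0)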
--- Original Message ---
From: "jay
Looks like the latest Spark SQL (1.2.1) release is still alpha.
Any idea about a stable release?
Thanks
Shagun
Hi there,
I am new to Spark. When training a model with K-means using the
following code, how do I obtain the cluster assignments in the next
step?
val clusters = KMeans.train(parsedData, numClusters, numIterations)
I searched around many examples but they mostly calculate the WSSSE.
I am sti
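A hedged sketch, reusing the names from the snippet above: KMeansModel.predict returns the index of the cluster a point is assigned to, so mapping over the input pairs every point with its assignment (the printed format is just illustrative):
import org.apache.spark.mllib.clustering.KMeans

val clusters = KMeans.train(parsedData, numClusters, numIterations)
// Pair each input vector with the id of the cluster it was assigned to.
val assignments = parsedData.map(point => (point, clusters.predict(point)))
assignments.take(5).foreach { case (point, cluster) =>
  println(s"$point -> cluster $cluster")
}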
Yes, sorry I wasn't clear -- I still have to trigger the calculation of the RDD
at the end of each iteration. Otherwise all of the lookup tables are shipped to
the cluster at the same time, resulting in memory errors. Therefore this becomes
several map jobs instead of one, and each consecutive map
What kind of data do you have? Kafka is a popular source to use with Spark
Streaming.
But Spark Streaming also supports reading from a file. It's called a basic source:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers
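A hedged sketch of that basic file source (the directory, batch interval, and what is done with the lines are all assumptions):
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
// Picks up new files as they are atomically moved into the monitored directory.
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.count().print()
ssc.start()
ssc.awaitTermination()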
--- Original Message ---
From: "
Did you have a look at
http://spark.apache.org/docs/1.2.0/building-spark.html
I think you can simply download the source and build it for your Hadoop
version as:
mvn -Dhadoop.version=2.0.0-mr1-cdh4.7.0 -DskipTests clean package
Thanks
Best Regards
On Thu, Feb 12, 2015 at 11:45 AM, Michael Nazario
Hello Felix,
I am already streaming in very simple data using Kafka (few messages /
second, each record only has 3 columns...really simple, but looking to
scale once I connect everything). I am processing it in Spark Streaming and
am currently writing word counts to hdfs. So the part where I am co
On Spark 1.2 (have been seeing this behaviour since 1.0), I have a
streaming app that consumes data from Kafka and writes it back to Kafka
(different topic). My big problem has been Total Delay. While execution
time is usually https://github.com/apache/spark/blob/master/core/src/main/scala/org/apac
Hi Zhan,
Yes, I found there is an hdfs account, which was created by Ambari, but what's
the password for this account? How can I log in under this account?
Can I just change the password for the hdfs account?
Regards,
-- Original --
From: "Zhan Zhang";;
Send
Thank you!
The Hive solution seemed more like a workaround. I was wondering if native
Spark SQL support for computing statistics for Parquet files would be available
Dima
Sent from my iPhone
> On Feb 11, 2015, at 3:34 PM, Ted Yu wrote:
>
> See earlier thread:
> http://search-hadoop.com/
Just read the thread "Are these numbers abnormal for spark streaming?" and
I think I am seeing similar results - that is - increasing the window seems
to be the trick here. I will have to monitor for a few hours/days before I
can conclude (there are so many knobs/dials).
On Wed, Feb 11, 2015 at