Hi all,
Recently I've run into a scenario where I need to conduct two-sample tests
between all paired combinations of columns of an RDD. But the networking
load and the generation of the pair-wise computations are too
time-consuming. That has puzzled me for a long time. I want to conduct the
Wilcoxon rank-sum test
(http://en
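A rough sketch of the pair generation plus a driver-side rank-sum statistic
(untested; assumes rows are Array[Double], each column fits on the driver,
and ties are ignored):

import org.apache.spark.rdd.RDD

// W statistic: sum of the ranks of the first sample in the pooled ordering.
def rankSumStatistic(x: Array[Double], y: Array[Double]): Double = {
  val pooled = (x.map((_, true)) ++ y.map((_, false))).sortBy(_._1)
  pooled.zipWithIndex.collect { case ((_, fromX), idx) if fromX => idx + 1.0 }.sum
}

// Pull each column to the driver once, instead of shuffling once per pair,
// then score every unordered column pair locally.
def rankSumPairs(data: RDD[Array[Double]], numCols: Int): Seq[((Int, Int), Double)] = {
  val cols = (0 until numCols).map(i => data.map(_(i)).collect())
  for {
    i <- 0 until numCols
    j <- (i + 1) until numCols
  } yield ((i, j), rankSumStatistic(cols(i), cols(j)))
}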
Any resolution to this? I am having the same problem.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/zip-files-submitted-with-py-files-disappear-from-hdfs-after-a-while-on-EMR-tp22342p22919.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Any resolution to this? I'm having the same problem.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-YARN-on-AWS-EMR-Issues-finding-file-on-hdfs-tp10214p22918.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---
ah, that explains it, many thanks!
On Sat, May 16, 2015 at 7:41 PM, Yana Kadiyska
wrote:
> oh...metastore_db location is not controlled by
> hive.metastore.warehouse.dir -- one is the location of your metastore DB,
> the other is the physical location of your stored data. Check out this SO
> thre
Is anyone else having issues when building Spark from Git?
I created a JIRA ticket with a Dockerfile that reproduces the issue.
The error:
/spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56:
error: not found: type Type
protected Type type() { retu
oh...metastore_db location is not controlled by
hive.metastore.warehouse.dir -- one is the location of your metastore DB,
the other is the physical location of your stored data. Check out this SO
thread:
http://stackoverflow.com/questions/13624893/metastore-db-created-wherever-i-run-hive
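(The usual fix from that thread is to pin Derby to an explicit path via the
javax.jdo.option.ConnectionURL property in hive-site.xml, e.g.
jdbc:derby:;databaseName=/some/fixed/path/metastore_db;create=true, where
the path here is an assumed placeholder.)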
On Sat, M
Hi Ayan and Helena,
I've considered using Cassandra/HBase but ended up opting to save to the
workers' HDFS because I want to take advantage of the data locality, since
the data will often be loaded into Spark for further processing. I was also
under the impression that saving to the filesystem (instead of a DB)
Hi Justin
The CrossValidatorExample here
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala
is a good example of how to set up an ML Pipeline for extracting a model
with the best parameter set.
You set up the pipeline as in
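For instance, a minimal sketch along those lines (assuming the Spark 1.x
ml.tuning API; `training` is a placeholder DataFrame with label/features
columns):

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// Grid of candidate parameters for CrossValidator to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.maxIter, Array(10, 50))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// The model fitted with the winning parameter combination; inspecting its
// stages exposes the chosen values (the exact accessor differs across 1.x
// releases).
val best = cvModel.bestModel.asInstanceOf[PipelineModel]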
I am not a Hive expert and don't know enough about this. Give it a shot and
post back the result. I am sure the experts will help.
Best
Ayan
On 17 May 2015 02:40, "Sourav Mazumder" wrote:
> Hi Ayan,
>
> Thanks for your response.
>
> In my case the constraint is I have to use Hive 0.14 for some other
> use cases.
Hi
First up, this is probably not a good idea, because you are not getting any
extra information, but you are binding yourself with a fixed schema (ie you
must need to know how many countries you are expecting, and of course,
additional country means change in code)
Having said that, this is a SQ
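For illustration, a hedged sketch of such a fixed-schema pivot; the table
events(country, amount) and the two hard-coded countries are assumptions:

// One output column per expected country; adding a country means editing
// this query, which is exactly the rigidity described above.
val pivoted = sqlContext.sql("""
  SELECT
    SUM(CASE WHEN country = 'US' THEN amount ELSE 0 END) AS us_total,
    SUM(CASE WHEN country = 'UK' THEN amount ELSE 0 END) AS uk_total
  FROM events
""")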
Hi Ayan,
Thanks for your response.
In my case the constraint is that I have to use Hive 0.14 for some other
use cases.
I believe the incompatibility is at the thrift server level (the HiveServer2
which comes with Hive). If I use the Hive 0.13 HiveServer2 on the same node
as the Spark master, should that
Here is what the documentation says:
Spark SQL is designed to be compatible with the Hive Metastore, SerDes and
UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1.
On Sun, May 17, 2015 at 1:48 AM, ayan guha wrote:
> Hi
>
> Try with Hive 0.13. If I am not wrong, Hive 0.14 is not supported yet,
Hi
Try with Hive 0.13. If I am not wrong, Hive 0.14 is not supported yet,
definitely not with 1.2.1 :)
On Sun, May 17, 2015 at 1:14 AM, smazumder
wrote:
> HI,
>
> I'm trying to execute a simple SQL statement from the spark-shell
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) - Thi
Try it like this:
SELECT name, CASE WHEN ts > 0 THEN price ELSE 0 END FROM table
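e.g., from the shell (same query spelled out; `sctx` as in your snippet):

sctx.sql("SELECT name, CASE WHEN ts > 0 THEN price ELSE 0 END AS price FROM table").collect()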
On Sun, May 17, 2015 at 12:21 AM, Antony Mayi
wrote:
> Hi,
>
> is it expected that I can't reference a column inside an IF statement like this:
>
> sctx.sql("SELECT name, IF(ts>0, price, 0) FROM table").collect()
>
> I get an
HI,
I'm trying to execute a simple SQL statement from the spark-shell.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) - This one
executes properly.
Next I'm trying -
sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
This keeps on trying to connect to the Metastore but c
Your parentheses don't look right: you're applying the filter to the result
of Row.fromSeq().
Try this:
val trainRDD = rawTrainData
  .filter(!_.isEmpty)
  .map(rawRow => Row.fromSeq(rawRow.split(",")))
  .filter(_.length == 15)
  .map(_.toString).map(_.trim)
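A variant (untested sketch, assuming each line carries 15 comma-separated
fields) that validates the field count before building the Row:

val trainRDD = rawTrainData
  .filter(_.nonEmpty)                // drop blank lines
  .map(_.split(",", -1).map(_.trim)) // -1 keeps trailing empty fields
  .filter(_.length == 15)            // keep only complete records
  .map(fields => Row.fromSeq(fields))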
-Don
On Fr
Hi,
is it expected that I can't reference a column inside an IF statement like this:
sctx.sql("SELECT name, IF(ts>0, price, 0) FROM table").collect()
I get an error:
org.apache.spark.sql.AnalysisException: unresolved operator 'Project [name#0,if
((CAST(ts#1, DoubleType) > CAST(0, DoubleType))) price#2
Gave it another try - it seems that it picks up the variable and prints out
the correct value, but still puts the metastore_db folder in the current
directory, regardless.
On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor wrote:
> Thank you for the reply.
>
> I have tried your experiment, it seems th
Consider using Cassandra with Spark Streaming for time series; Cassandra
has been doing time series for years.
Here are some snippets with Kafka streaming and writing/reading the data back:
https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrwea
Thank you for the reply.
I have tried your experiment, it seems that it does not print the settings
out in spark-shell (I'm using 1.3 by the way).
Strangely I have been experimenting with an SQL connection instead, which
works after all (still if I go to spark-shell and try to print out the SQL
s
Hi
If you asked any DB developer, s/he would tell you to use this construct:
SELECT * FROM (
  SELECT userid, time, state,
         RANK() OVER (PARTITION BY userid ORDER BY time DESC) r
  FROM event
) t WHERE r = 1
I am not sure if DataFrame supports it, though I am sure we can extend the
functions to implement it.
But here is one not us
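For example, a rough non-SQL sketch (assuming an RDD[(String, Long, String)]
of (userId, time, state) tuples; keeps the latest-timestamped row per user):

val latestPerUser = events
  .map { case (userId, time, state) => (userId, (time, state)) }
  .reduceByKey((a, b) => if (a._1 >= b._1) a else b) // max by time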
Hello,
I am using ML Pipelines. I would like to extract the best parameters found
by CrossValidator, but I cannot find much documentation about how to do it.
Can anyone give me some pointers?
Thanks.
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Getting
What Spark release are you using?
Can you check the driver log to see if there is some clue there?
Thanks
On Sat, May 16, 2015 at 12:01 AM, xiaohe lan wrote:
> Hi,
>
> I have a 5-node YARN cluster. I used spark-submit to submit a simple app.
>
> spark-submit --master yarn target/scala-2.10/sim
All - this issue showed up when I was tearing down a Spark context and
creating a new one. Often, I was then unable to write to HDFS due to this
error. I subsequently switched to a different implementation: instead of
tearing down and re-initializing the Spark context, I'd submit a
sepa
For the Spark Streaming app, if you want a particular action inside a
foreachRDD to go to a particular pool, then make sure you set the pool
within the foreachRDD function. E.g.
dstream.foreachRDD { rdd =>
  rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1") // set the pool
  r
Thanks Ayan. Can we rebroadcast after updating in the driver?
Thanks
NB.
On Fri, May 15, 2015 at 6:40 PM, ayan guha wrote:
> Hi
>
> broadcast variables are shipped the first time they are accessed in a
> transformation to the executors used by the transformation. They will NOT
> be updated subsequ
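(One common pattern, sketched with an assumed loadTable() refresh hook: a
broadcast cannot be mutated in place, but the driver can drop the old one
and ship a fresh value under a new handle.)

var lookup = sc.broadcast(loadTable())
// ... later, in the driver, when the underlying data changes:
lookup.unpersist()                 // remove stale copies from the executors
lookup = sc.broadcast(loadTable()) // re-broadcast the updated value
// transformations created after this point pick up the new broadcast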
Hi,
I have a 5-node YARN cluster. I used spark-submit to submit a simple app.
spark-submit --master yarn target/scala-2.10/simple-project_2.10-1.0.jar
--class scala.SimpleApp --num-executors 5
I have set the number of executors to 5, but from the Spark UI I could see
only two executors and it ran ve
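(Note: spark-submit treats everything after the application jar as arguments
to the application itself, so flags placed after the jar are silently
ignored. Reordering the same command should request all five executors from
YARN:)

spark-submit --master yarn --class scala.SimpleApp --num-executors 5 \
  target/scala-2.10/simple-project_2.10-1.0.jar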