How can I do pair-wise computation between RDD feature columns?

2015-05-16 Thread yaochunnan
Hi all, Recently I've run into a scenario where I need to conduct two-sample tests between all paired combinations of columns of an RDD. But the network load and the generation of the pair-wise computations are too time-consuming; that has puzzled me for a long time. I want to conduct the Wilcoxon rank-sum test (http://en
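As an illustration of the per-pair computation (outside Spark), here is a minimal pure-Python sketch of generating all column pairs and computing a Wilcoxon rank-sum (Mann-Whitney U) statistic for each. The column names and values are made up; in a real job each pair would come from something like a driver-side `combinations` over collected columns or an `rdd.cartesian`:

```python
from itertools import combinations

def rank_sum_statistic(xs, ys):
    """Mann-Whitney U statistic for two samples, using average ranks for ties."""
    pooled = sorted((v, i) for i, v in enumerate(xs + ys))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average 1-based rank over the tie run
        for k in range(i, j + 1):
            ranks[pooled[k][1]] = avg
        i = j + 1
    r1 = sum(ranks[:len(xs)])                # rank sum of the first sample
    return r1 - len(xs) * (len(xs) + 1) / 2.0  # U statistic for sample 1

# Toy feature columns keyed by (hypothetical) name.
columns = {"f0": [1.0, 2.0, 3.0], "f1": [2.0, 2.0, 5.0], "f2": [0.5, 0.1, 0.2]}
results = {(a, b): rank_sum_statistic(columns[a], columns[b])
           for a, b in combinations(sorted(columns), 2)}
```

The number of pairs grows quadratically with the column count, which is why the shuffle/network cost the poster mentions dominates at scale.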

Re: zip files submitted with --py-files disappear from hdfs after a while on EMR

2015-05-16 Thread jaredtims
Any resolution to this? I am having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zip-files-submitted-with-py-files-disappear-from-hdfs-after-a-while-on-EMR-tp22342p22919.html Sent from the Apache Spark User List mailing list archive at N

RE: Running Spark/YARN on AWS EMR - Issues finding file on hdfs?

2015-05-16 Thread jaredtims
Any resolution to this? I'm having the same problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-YARN-on-AWS-EMR-Issues-finding-file-on-hdfs-tp10214p22918.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
ah, that explains it, many thanks! On Sat, May 16, 2015 at 7:41 PM, Yana Kadiyska wrote: > oh...metastore_db location is not controlled by > hive.metastore.warehouse.dir -- one is the location of your metastore DB, > the other is the physical location of your stored data. Checkout this SO > thre

Problem building master on 2.11

2015-05-16 Thread Fernando O.
Is anyone else having issues when building spark from git? I created a jira ticket with a Docker file that reproduces the issue. The error: /spark/network/shuffle/src/main/java/org/apache/spark/network/shuffle/protocol/UploadBlock.java:56: error: not found: type Type protected Type type() { retu

Re: store hive metastore on persistent store

2015-05-16 Thread Yana Kadiyska
oh...metastore_db location is not controlled by hive.metastore.warehouse.dir -- one is the location of your metastore DB, the other is the physical location of your stored data. Checkout this SO thread: http://stackoverflow.com/questions/13624893/metastore-db-created-wherever-i-run-hive On Sat, M
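For reference, the location of the embedded (Derby) metastore DB is controlled by `javax.jdo.option.ConnectionURL` in hive-site.xml, separately from `hive.metastore.warehouse.dir`. A sketch of the relevant fragment (the path below is an example, not a recommendation):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <!-- example path; point this at a persistent location -->
  <value>jdbc:derby:;databaseName=/var/lib/hive/metastore_db;create=true</value>
</property>
```

Without this setting, Derby creates `metastore_db` in whatever directory the shell was launched from, which matches the behavior described in this thread.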

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Nisrina Luthfiyati
Hi Ayan and Helena, I've considered using Cassandra/HBase but ended up opting to save to worker hdfs because I want to take advantage of the data locality since the data will often be loaded to Spark for further processing. I was also under the impression that saving to filesystem (instead of db)

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-16 Thread Ram Sriharsha
Hi Justin The CrossValidatorExample here https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/CrossValidatorExample.scala is a good example of how to set up an ML Pipeline for extracting a model with the best parameter set. You set up the pipeline as in
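Conceptually, extracting the best parameter set is just an argmax over average validation metrics. A plain-Python sketch of what cross-validation does internally (the `evaluate` function and the fold/parameter values here are entirely made up for illustration):

```python
def cross_validate(params, folds, evaluate):
    """Return (best_param, best_score) by averaging scores across folds."""
    best_param, best_score = None, float("-inf")
    for p in params:
        avg = sum(evaluate(p, train, test) for train, test in folds) / len(folds)
        if avg > best_score:
            best_param, best_score = p, avg
    return best_param, best_score

# Toy example: the "score" rewards parameters close to each fold's held-out value.
folds = [(None, 0.3), (None, 0.5), (None, 0.4)]
evaluate = lambda p, train, test: -abs(p - test)
best, score = cross_validate([0.1, 0.4, 1.0], folds, evaluate)
```

In Spark's ML API, the fitted `CrossValidatorModel` exposes the winning model via `bestModel`, from which the chosen parameters can be read.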

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
I am not a Hive expert and don't know enough on this. Give it a shot and post back the result. I am sure the experts will help. Best Ayan On 17 May 2015 02:40, "Sourav Mazumder" wrote: > Hi Ayan, > > Thanks for your response. > > In my case the constraint is I have to use Hive 0.14 for some other > usecase

Re: How to reshape RDD/Spark DataFrame

2015-05-16 Thread ayan guha
Hi First up, this is probably not a good idea, because you are not getting any extra information, but you are binding yourself to a fixed schema (i.e. you must know how many countries you are expecting, and of course an additional country means a change in code). Having said that, this is a SQ

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread Sourav Mazumder
Hi Ayan, Thanks for your response. In my case the constraint is I have to use Hive 0.14 for some other use cases. I believe the incompatibility is at the Thrift server level (the HiveServer2 which comes with Hive). If I use the Hive 0.13 HiveServer2 on the same node as the Spark master, should that

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
Here is from documentation: Spark SQL is designed to be compatible with the Hive Metastore, SerDes and UDFs. Currently Spark SQL is based on Hive 0.12.0 and 0.13.1. On Sun, May 17, 2015 at 1:48 AM, ayan guha wrote: > Hi > > Try with Hive 0.13. If I am not wrong, Hive 0.14 is not supported yet,

Re: Spark SQL is not able to connect to hive metastore

2015-05-16 Thread ayan guha
Hi Try with Hive 0.13. If I am not wrong, Hive 0.14 is not supported yet, definitely not with 1.2.1 :) On Sun, May 17, 2015 at 1:14 AM, smazumder wrote: > HI, > > I'm trying to execute simple sql statement from spark-shell > > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) - Thi

Re: IF in SQL statement

2015-05-16 Thread ayan guha
Try like this SELECT name, case when ts>0 then price else 0 end from table On Sun, May 17, 2015 at 12:21 AM, Antony Mayi wrote: > Hi, > > is it expected I can't reference a column inside of IF statement like this: > > sctx.sql("SELECT name, IF(ts>0, price, 0) FROM table").collect() > > I get an
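The CASE WHEN form suggested here is standard SQL, so it can be sanity-checked against any SQL engine. A quick illustration using Python's built-in sqlite3 with a toy table (names and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, ts INTEGER, price REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [("a", 5, 10.0), ("b", 0, 20.0), ("c", -1, 30.0)])

# Portable equivalent of IF(ts > 0, price, 0):
rows = conn.execute(
    "SELECT name, CASE WHEN ts > 0 THEN price ELSE 0 END FROM t").fetchall()
# Only row "a" has ts > 0, so it keeps its price; the others get 0.
```

Because CASE WHEN is in the SQL standard, it avoids the dialect-specific IF() parsing issue reported in the original question.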

Spark SQL is not able to connect to hive metastore

2015-05-16 Thread smazumder
Hi, I'm trying to execute a simple SQL statement from spark-shell: val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) - This one executes properly. Next I'm trying - sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)") This keeps on trying to connect to the Metastore but c

Re: Spark sql and csv data processing question

2015-05-16 Thread Don Drake
Your parentheses don't look right as you're embedding the filter on the Row.fromSeq(). Try this: val trainRDD = rawTrainData .filter(!_.isEmpty) .map(rawRow => Row.fromSeq(rawRow.split(","))) .filter(_.length == 15) .map(_.toString).map(_.trim) -Don On Fr
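The fix is about where each filter sits in the chain: drop empty lines first, then split, then filter on field count. A plain-Python sketch of the same ordering over a list of CSV lines (using a field count of 3 instead of the thread's 15, just to keep the example small):

```python
raw_train_data = [
    "a,1,x",  # valid: 3 fields
    "",       # empty line: dropped by the first filter
    "b,2",    # wrong field count: dropped by the second filter
    "c,3,y",
]

rows = [fields for fields in
        (line.split(",") for line in raw_train_data if line)  # drop empties, then split
        if len(fields) == 3]                                  # keep well-formed rows only
```

The original code filtered on length before splitting, so the length check was testing the wrong thing; ordering the steps as above checks the field count of the split result.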

IF in SQL statement

2015-05-16 Thread Antony Mayi
Hi, is it expected I can't reference a column inside of IF statement like this: sctx.sql("SELECT name, IF(ts>0, price, 0) FROM table").collect() I get an error: org.apache.spark.sql.AnalysisException: unresolved operator 'Project [name#0,if ((CAST(ts#1, DoubleType) > CAST(0, DoubleType))) price#2

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
Gave it another try - it seems that it picks up the variable and prints out the correct value, but still puts the metastore_db folder in the current directory, regardless. On Sat, May 16, 2015 at 1:13 PM, Tamas Jambor wrote: > Thank you for the reply. > > I have tried your experiment, it seems th

Re: Grouping and storing unordered time series data stream to HDFS

2015-05-16 Thread Helena Edelson
Consider using Cassandra with Spark Streaming for time series; Cassandra has been doing time series for years. Here are some snippets with Kafka streaming and writing/reading the data back: https://github.com/killrweather/killrweather/blob/master/killrweather-app/src/main/scala/com/datastax/killrwea

Re: store hive metastore on persistent store

2015-05-16 Thread Tamas Jambor
Thank you for the reply. I have tried your experiment, it seems that it does not print the settings out in spark-shell (I'm using 1.3 by the way). Strangely I have been experimenting with an SQL connection instead, which works after all (still if I go to spark-shell and try to print out the SQL s

Re: Custom Aggregate Function for DataFrame

2015-05-16 Thread ayan guha
Hi If you asked any DB developer, s/he would tell you the construct: select * from (select userid, time, state, rank() over (partition by userId order by time desc) r from event) where r=1 I am not sure if DataFrame supports it, though I am sure we can extend functions to implement it. But here is one not us
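The window-function query above picks, for each user, the row with the latest timestamp. The same selection can be sketched in plain Python (toy events; field names taken from the query, with ties on time kept as a single arbitrary row, whereas rank() would return all tied rows):

```python
events = [
    {"userid": 1, "time": 10, "state": "start"},
    {"userid": 1, "time": 30, "state": "stop"},
    {"userid": 2, "time": 20, "state": "start"},
]

# Equivalent of rank() over (partition by userid order by time desc) ... where r = 1:
# keep the max-time row per user.
latest = {}
for e in events:
    cur = latest.get(e["userid"])
    if cur is None or e["time"] > cur["time"]:
        latest[e["userid"]] = e
```

This single-pass max-per-key form is also how the same aggregation would typically be expressed as a `reduceByKey` if a window function were unavailable.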

Getting the best parameter set back from CrossValidatorModel

2015-05-16 Thread Justin Yip
Hello, I am using MLPipeline. I would like to extract the best parameter found by CrossValidator. But I cannot find much documentation about how to do it. Can anyone give me some pointers? Thanks. Justin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Getting

Re: number of executors

2015-05-16 Thread Ted Yu
What Spark release are you using ? Can you check driver log to see if there is some clue there ? Thanks On Sat, May 16, 2015 at 12:01 AM, xiaohe lan wrote: > Hi, > > I have a 5 nodes yarn cluster, I used spark-submit to submit a simple app. > > spark-submit --master yarn target/scala-2.10/sim

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-16 Thread Ilya Ganelin
All - this issue showed up when I was tearing down a Spark context and creating a new one. Often, I was unable to then write to HDFS due to this error. I subsequently switched to a different implementation where, instead of tearing down and re-initializing the Spark context, I'd instead submit a sepa

Re: Spark Fair Scheduler for Spark Streaming - 1.2 and beyond

2015-05-16 Thread Tathagata Das
For the Spark Streaming app, if you want a particular action inside a foreachRDD to go to a particular pool, then make sure you set the pool within the foreachRDD function. E.g. dstream.foreachRDD { rdd => rdd.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1") // set the pool r

Re: Broadcast variables can be rebroadcast?

2015-05-16 Thread N B
Thanks Ayan. Can we rebroadcast after updating in the driver? Thanks NB. On Fri, May 15, 2015 at 6:40 PM, ayan guha wrote: > Hi > > broadcast variables are shipped for the first time it is accessed in a > transformation to the executors used by the transformation. It will NOT > updated subsequ

number of executors

2015-05-16 Thread xiaohe lan
Hi, I have a 5-node YARN cluster and I used spark-submit to submit a simple app. spark-submit --master yarn target/scala-2.10/simple-project_2.10-1.0.jar --class scala.SimpleApp --num-executors 5 I have set the number of executors to 5, but from the Spark UI I could see only two executors and it ran ve