Re: yarn does not accept job in cluster mode

2014-09-28 Thread Akhil Das
Can you try running the spark-shell in yarn-client mode? ./bin/spark-shell --master yarn-client Read more over here http://spark.apache.org/docs/1.0.0/running-on-yarn.html Thanks Best Regards On Sun, Sep 28, 2014 at 7:08 AM, jamborta wrote: > hi all, > > I have a job that works ok in yarn-cl

Re: Using one sql query's result inside another sql query

2014-09-28 Thread Cheng Lian
This workaround looks good to me. In this way, all queries are still executed lazily within a single DAG, and Spark SQL is able to optimize the query plan as a whole. On 9/29/14 11:26 AM, twinkle sachdeva wrote: Thanks Cheng. For the time being, as a workaround, I had applied the schema

Re: Using one sql query's result inside another sql query

2014-09-28 Thread twinkle sachdeva
Thanks Cheng. For the time being, as a workaround, I applied the schema to Queryresult1 and then registered the result as a temp table. Although that works, I was not sure of the performance impact, as it might block some optimisations in some scenarios. This flow (on Spark 1.1) works: r

Re: Kinesis receiver & spark streaming partition

2014-09-28 Thread Wei Liu
Chris, I thought I would check back with you to see if you have made progress on this issue. Any good news so far? Thanks. Once again, I really appreciate you looking into this issue. Thanks, Wei On Thu, Aug 28, 2014 at 4:44 PM, Chris Fregly wrote: > great question, wei. this is very important to understa

Re: spark multi-node cluster

2014-09-28 Thread codeoedoc
Figured this out. I documented it here in the hope that it helps others: http://koobehub.wordpress.com/2014/09/29/spark-the-standalone-cluster-deployment/ On Sun, Sep 28, 2014 at 12:36 AM, codeoedoc wrote: > Hi guys, > > This is a spark fresh user... > > I'm trying to setup a spark cluster with multiple no

Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
You might consider instead storing the data using saveAsParquetFile and then querying that after running sqlContext.parquetFile(...).registerTempTable(...). On Sun, Sep 28, 2014 at 6:43 PM, Michael Armbrust wrote: > This is not possible until https://github.com/apache/spark/pull/2501 is > merged

Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
This is not possible until https://github.com/apache/spark/pull/2501 is merged. On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang wrote: > Thanks for the response. From Spark Web-UI's Storage tab, I do see > cached RDD there. > > > > But the storage level is "Memory Deserialized 1x Replicated". How

Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Haopu Wang
Thanks for the response. From Spark Web-UI's Storage tab, I do see cached RDD there. But the storage level is "Memory Deserialized 1x Replicated". How can I change the storage level? Because I have a big table there. Thanks! From: Cheng Lian [mailto:l

Re: driver memory management

2014-09-28 Thread Reynold Xin
The storage fraction only limits the amount of memory used for storage. It doesn't actually limit anything else, i.e. you can use all the memory in collect if you want. On Sunday, September 28, 2014, Brad Miller wrote: > Hi All, > > I am interested to collect() a large RDD so that I can run a lea
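The point about the storage fraction can be made concrete with a little arithmetic. What follows is a toy model in plain Python, not Spark code: the function name storage_cap_bytes and the example values (a 4 GiB driver heap and a storage fraction of 0.6) are assumptions picked for illustration and do not come from the thread. It also ignores JVM overhead.

```python
GIB = 1024 ** 3

def storage_cap_bytes(heap_bytes, storage_fraction):
    """Upper bound on the memory that cached blocks may occupy (toy model)."""
    return int(heap_bytes * storage_fraction)

heap = 4 * GIB      # assumed driver heap size
fraction = 0.6      # assumed storage fraction setting
cap = storage_cap_bytes(heap, fraction)

# The fraction caps cached blocks only. It does not reserve the rest of the
# heap, so with nothing cached a collect() result can grow into all of it.
cached_now = 0
available_for_collect = heap - cached_now

print(cap, available_for_collect)
```

Read this way, the fraction is a ceiling on cached data rather than a partition of the heap: whatever is not currently occupied by cached blocks remains usable by a collected result.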

Spark meetup on Oct 15 in NYC

2014-09-28 Thread Reynold Xin
Hi Spark users and developers, Some of the most active Spark developers (including Matei Zaharia, Michael Armbrust, Joseph Bradley, TD, Paco Nathan, and me) will be in NYC for Strata NYC. We are working with the Spark NYC meetup group and Bloomberg to host a meetup event. This might be the event w

Re: view not supported in spark thrift server?

2014-09-28 Thread Du Li
Thanks, Michael, for your quick response. View is critical for my project that is migrating from shark to spark SQL. I have implemented and tested everything else. It would be perfect if view could be implemented soon. Du From: Michael Armbrust mailto:mich...@databricks.com>> Date: Sunday, Se

Re: view not supported in spark thrift server?

2014-09-28 Thread Michael Armbrust
Views are not supported yet. It's not currently on the near-term roadmap, but that can change if there is sufficient demand or someone in the community is interested in implementing them. I do not think it would be very hard. Michael On Sun, Sep 28, 2014 at 11:59 AM, Du Li wrote: > > Can anyb

view not supported in spark thrift server?

2014-09-28 Thread Du Li
Can anybody confirm whether or not view is currently supported in spark? I found “create view translate” in the blacklist of HiveCompatibilitySuite.scala and also the following scenario threw NullPointerException on beeline/thriftserver (1.1.0). Any plan to support it soon? > create table src

driver memory management

2014-09-28 Thread Brad Miller
Hi All, I would like to collect() a large RDD so that I can run a learning algorithm on it. I've noticed that when I don't increase SPARK_DRIVER_MEMORY I can run out of memory. I've also noticed that it looks like the same fraction of memory is reserved for storage on the driver as on the work

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-28 Thread Du Li
It turned out to be a bug in my code. In the select clause, the list of fields was misaligned with the schema of the target table. As a consequence, the map data couldn’t be cast to some other type in the schema. Thanks anyway. On 9/26/14, 8:08 PM, "Cheng Lian" wrote: >Would you mind to provide the DDL

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
Yes, it looks like it can only be controlled by the parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird to me. How am I supposed to know the exact size in bytes of a table? I think it would be preferable to let me specify the join algorithm. Jianshi On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu wrote
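To illustrate why a byte threshold is awkward to work with, here is a minimal sketch in plain Python of a size-versus-threshold decision. It is not Spark's implementation: the helper names, the table names, the row-count-times-row-width estimate, the inclusive comparison, and the 10 MiB threshold are all hypothetical, chosen only to show the shape of the problem.

```python
def estimate_bytes(row_count, avg_row_bytes):
    """Crude table size estimate (hypothetical helper, not a Spark API)."""
    return row_count * avg_row_bytes

def should_broadcast(estimated_table_bytes, threshold_bytes):
    """True if the table is estimated to fit under the broadcast threshold."""
    return 0 <= estimated_table_bytes <= threshold_bytes

threshold = 10 * 1024 * 1024  # hypothetical threshold value, in bytes

# Hypothetical dimension tables with guessed row counts and row widths.
dims = {
    "dim_country": estimate_bytes(250, 64),
    "dim_user": estimate_bytes(50_000_000, 128),
}

for name, size in dims.items():
    print(name, size, should_broadcast(size, threshold))
```

The decision is only as good as the size estimate feeding it, which is the difficulty raised in the message: the user has to guess a byte count per table instead of naming the join strategy directly.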

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Ted Yu
Have you looked at SPARK-1800 ? e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala Cheers On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang wrote: > I cannot find it in the documentation. And I have a dozen dimension tables > to (left) join... > > > Cheers, > -- > Jianshi Huang

[SF Machine Learning meetup] talk by Prof. C J Lin, large-scale linear classification: status and challenges

2014-09-28 Thread Chester At Work
All, Sorry this is not Spark related, but I thought some of you in San Francisco might be interested in this talk. We announced this talk recently; it will be at the end of next month (Oct) http://www.meetup.com/sfmachinelearning/events/208078582/ Prof CJ Lin is famous for his work on libsvm an

[MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-28 Thread Yanbo Liang
Hi, We have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and test

Re: Build spark with Intellij IDEA 13

2014-09-28 Thread Yi Tian
Hi, If you want IDEA to compile your Spark project (version 1.0.0 and above), you should follow these steps. 1. Clone the Spark project. 2. Use mvn to compile your Spark project (because you need the generated Avro source file in the flume-sink module). 3. Open spark/pom.xml with IDEA. 4. Check profiles

How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: spark multi-node cluster

2014-09-28 Thread codeoedoc
BTW, I'm using the standalone deployment. (The name "standalone deployment" for a cluster is kind of misleading... I think the doc needs to be updated. It's not really standalone, but a plain Spark-only deployment.) Thx, cody On Sun, Sep 28, 2014 at 12:36 AM, codeoedoc wrote: > Hi guys, > > This is a spa

Re: Re: problem with partitioning

2014-09-28 Thread qinwei
Thank you for your reply. Your tips on code refactoring are helpful; after a second look at the code, the casts and null checks really are unnecessary. qinwei From: Sean Owen Date: 2014-09-28 15:03 To: qinwei CC: user Subject: Re: problem with partitioning (Most of this code is not relevant t

spark multi-node cluster

2014-09-28 Thread codeoedoc
Hi guys, This is a spark fresh user... I'm trying to setup a spark cluster with multiple nodes, starting with 2. With one node, it is working fine. When I get a slave node, slave is able to register to the master node. However when I launch a spark shell, and when the executor is launched on the

Re: problem with partitioning

2014-09-28 Thread Sean Owen
(Most of this code is not relevant to the question and can be refactored too. The casts and null checks look unnecessary.) You are unioning RDDs so you have a result with the sum of their partitions. The number of partitions is really a hint to Hadoop only so it is not even necessarily 3 x 1920.
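The partition arithmetic of a union can be shown with a toy model in plain Python. This is not Spark code: make_rdd and union are hypothetical stand-ins that treat an RDD as a list of partitions, and the figure of 1920 partitions per input is taken from the message. As the message notes, the requested partition count is only a hint to Hadoop, so the real per-input count may differ.

```python
def make_rdd(num_partitions):
    """Toy stand-in for an RDD: a list of partitions, each a list of records."""
    return [[] for _ in range(num_partitions)]

def union(*rdds):
    """Concatenate the partition lists; nothing is merged or dropped."""
    return [partition for rdd in rdds for partition in rdd]

inputs = [make_rdd(1920) for _ in range(3)]
combined = union(*inputs)

# The result carries every partition of every input.
print(len(combined))
```

Because the union keeps every input partition, the result has as many partitions as all inputs combined, whatever each input actually ended up with.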