Is Spark streaming suitable for our architecture?

2014-10-23 Thread Albert Vila
Hi, I'm evaluating Spark Streaming to see if it fits to scale our current architecture. We are currently downloading and processing 6M documents per day from online and social media. We have a different workflow for each type of document, but some of the steps are keyword extraction, language detec

Dynamically loaded Spark-stream consumer

2014-10-23 Thread Jianshi Huang
I have a use case that I need to continuously ingest data from Kafka stream. However apart from ingestion (to HBase), I also need to compute some metrics (i.e. avg for last min, etc.). The problem is that it's very likely I'll continuously add more metrics and I don't want to restart my spark prog

How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Localhost shell
Hey All, I am unable to access objects declared and initialized outside the call() method of JavaRDD. In the below code snippet, call() makes a fetch call to C*, but since javaSparkContext is defined outside the call() method's scope, the compiler gives a compilation error. stringRdd.foreach(new

RE: About "Memory usage" in the Spark UI

2014-10-23 Thread Haopu Wang
Patrick, thanks for the response. May I ask more questions? I'm running a Spark Streaming application which receives data from a socket and does some transformations. The event injection rate is too high, so the processing duration is larger than the batch interval. So I see "Could not compu

Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-23 Thread Jianshi Huang
The Kafka stream has 10 topics and the data rate is quite high (~100K/s per topic). Which configuration do you recommend? - 1 Spark app consuming all Kafka topics - 10 separate Spark apps, each consuming one topic Assuming they have the same resource pool. Cheers, -- Jianshi Huang LinkedIn: jia

NoClassDefFoundError on ThreadFactoryBuilder in Intellij

2014-10-23 Thread Stephen Boesch
After having checked out master/head, the following error occurs when attempting to run any test in IntelliJ: Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/util/concurrent/ThreadFactoryBuilder at org.apache.spark.util.Utils$.(Utils.scala:648) There appears to be

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Sean Owen
In Java, javaSparkContext would have to be declared final in order for it to be accessed inside an inner class like this. But this would still not work as the context is not serializable. You should rewrite this so you are not attempting to use the Spark context inside an RDD. On Thu, Oct 23, 20
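A minimal Scala sketch of the point Sean makes, assuming a plain standalone app (the identifiers are made up): the context stays on the driver, and the closure captures only serializable values.

import org.apache.spark.{SparkConf, SparkContext}

object NoContextInClosure {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("no-context-in-closure"))
    val rdd = sc.parallelize(1 to 100)

    // Anti-pattern: referencing sc inside the closure, e.g.
    //   rdd.foreach(x => sc.parallelize(Seq(x)).count())
    // fails because SparkContext is not serializable and cannot be shipped to executors.

    // Instead, capture only plain, serializable values in the closure.
    val threshold = 10
    rdd.filter(_ > threshold).foreach(x => println(x))

    sc.stop()
  }
}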

Re: Solving linear equations

2014-10-23 Thread Sean Owen
The 0 vector is a trivial solution. Is the data big, such that it can't be computed on one machine? If so, I assume this system is over-determined. You can use a decomposition to find a least-squares solution, but the SVD is overkill, and in any event distributed decompositions don't exist in the pro

Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Jayant Shekhar
Hi Albert, I have a couple of questions: - You mentioned near real-time. What exactly is your SLA for processing each document? - Which crawler are you using, and are you looking to bring Hadoop into your overall workflow? You might want to read up on how network traffic is minimiz

SparkSQL and columnar data

2014-10-23 Thread Marius Soutier
Hi guys, another question: what's the approach to working with column-oriented data, i.e. data with more than 1000 columns? Using Parquet for this should be fine, but how well does SparkSQL handle such a large number of columns? Is there a limit? Should we use standard Spark instead? Thanks for any

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Jianshi Huang
Upvote for the multitenancy requirement. I'm also building a data analytics platform and there'll be multiple users running queries and computations simultaneously. One of the pain points is control of resource size. Users don't really know how many nodes they need; they always use as much as possi

Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
I'm looking to use spark for some ETL, which will mostly consist of "update" statements (a column is a set, that'll be appended to, so a simple insert is likely not going to work). As such, it seems like issuing CQL queries to import the data is the best option. Using the Spark Cassandra Connect

Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Albert Vila
Hi Jayant, On 23 October 2014 11:14, Jayant Shekhar wrote: > Hi Albert, > > Have a couple of questions: > >- You mentioned near real-time. What exactly is your SLA for >processing each document? > > As low as possible :). Right now it's between 30s and 5m, but I would like to have someth

what's the best way to initialize an executor?

2014-10-23 Thread Darin McBeath
I have some code that only needs to be executed once per executor in my Spark application. My current approach is to do something like the following: scala> xmlKeyPair.foreachPartition(i => XPathProcessor.init("ats", "Namespaces/NamespaceContext")) So, if I understand correctly, the XPathProces

RE: Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
Hi Gerard, Thanks for the response. Here's the scenario: The target cassandra schema looks like this: create table foo ( id text primary key, bar int, things set ) The source in question is a Sql Server source providing the necessary data. The source goes over the same "id" multiple ti

Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi, I got a $TreeNodeException; a few questions: Q1) How should I do aggregation in Spark? Can I use aggregation directly in SQL? or Q2) Should I use SQL to load the data to form an RDD, then use Scala to do the aggregation? Regards Arthur MySQL (good one, without aggregation): sqlContext.sql("SELEC

Re: Spark Streaming Applications

2014-10-23 Thread Saiph Kappa
What is the application about? I couldn't find any proper description regarding the purpose of killrweather (I mean, other than just integrating Spark with Cassandra). Do you know if the slides of that tutorial are available somewhere? Thanks! On Wed, Oct 22, 2014 at 6:58 PM, Sameer Farooqui wr

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread Yin Huai
Hello Arthur, You can do aggregations in SQL. How did you create LINEITEM? Thanks, Yin On Thu, Oct 23, 2014 at 8:54 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I got $TreeNodeException, few questions: > Q1) How should I do aggregation in SparK? Can I use aggre

Re: Spark Cassandra Connector proper usage

2014-10-23 Thread Gerard Maas
Hi Ashic, At the moment I see two options: 1) You could use the CassandraConnector object to execute your specialized query. The recommended pattern is to do that within an rdd.foreachPartition(...) in order to amortize DB connection setup over the number of elements in one partition. Something
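A rough sketch of option 1, assuming the DataStax spark-cassandra-connector is on the classpath; the keyspace, table, and CQL below are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector.cql.CassandraConnector

object CqlUpdatePerPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("cql-update")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed host
    val sc = new SparkContext(conf)
    val connector = CassandraConnector(conf) // serializable, safe to use inside closures

    val updates = sc.parallelize(Seq(("id-1", 1), ("id-2", 2)))

    // One session per partition, reused for every row in that partition,
    // so connection setup cost is amortized.
    updates.foreachPartition { rows =>
      connector.withSessionDo { session =>
        rows.foreach { case (id, thing) =>
          // placeholder schema: foo(id text PRIMARY KEY, things set<int>)
          session.execute(s"UPDATE ks.foo SET things = things + {$thing} WHERE id = '$id'")
        }
      }
    }
    sc.stop()
  }
}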

RE: Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
Hi Gerard, I've gone with option 1, and it seems to be working well. Option 2 is also quite interesting. Thanks for your help with this. Regards, Ashic.

Re: what's the best way to initialize an executor?

2014-10-23 Thread Sean Owen
It sounds like your code already does its initialization at most once per JVM, and that's about as good as it gets. Each partition asks for init in a thread-safe way and the first request succeeds. On Thu, Oct 23, 2014 at 1:41 PM, Darin McBeath wrote: > I have some code that I only need to be exe
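One way to express that once-per-JVM initialization in Scala; XPathProcessor is the poster's own class, and the guard shown here is only an assumption about how such a helper might be written:

import java.util.concurrent.atomic.AtomicBoolean

object ExecutorInit {
  private val initialized = new AtomicBoolean(false)

  // Every partition calls this, but the body runs at most once per executor JVM.
  def ensure(): Unit = {
    if (initialized.compareAndSet(false, true)) {
      // expensive one-time setup goes here, e.g.
      // XPathProcessor.init("ats", "Namespaces/NamespaceContext")
    }
  }
}

// driver side:
// xmlKeyPair.foreachPartition(_ => ExecutorInit.ensure())

A lazy val inside an object works equally well, since Scala object and lazy val initialization is already thread-safe.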

Re: How to set hadoop native library path in spark-1.1

2014-10-23 Thread Christophe Préaud
Hi, Try the --driver-library-path option of spark-submit, e.g.: /opt/spark/bin/spark-submit --driver-library-path /opt/hadoop/lib/native (...) Regards, Christophe. On 21/10/2014 20:44, Pradeep Ch wrote: > Hi all, > > Can anyone tell me how to set the native library path in Spark. > > Right not

Re: Is Spark streaming suitable for our architecture?

2014-10-23 Thread Jayant Shekhar
Hi Albert, Since your latency requirements are around 1-2m, Spark Streaming should be a good solution. You may also want to check whether streaming and processing in Flume, writing the results out to HDFS, would suffice. > crawling, keyword extraction, language detection, indexation > On each st

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2014-10-23 Thread Michael Campbell
Can you list what your fix was so others can benefit? On Wed, Oct 22, 2014 at 8:15 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I have managed to resolve it because a wrong setting. Please ignore this . > > Regards > Arthur > > On 23 Oct, 2014, at 5:14 am, arthur.hk.c

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Marcelo Vanzin
You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174. On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang wrote: > Upvote for the multitanency requirement. > > I'm also building a data analytic platform and there'll be multiple users > running queries and computations simult

Re: Setting only master heap

2014-10-23 Thread Andrew Or
Yeah, as Sameer commented, there is unfortunately not an equivalent `SPARK_MASTER_MEMORY` that you can set. You can work around this by starting the master and the slaves separately with different settings of SPARK_DAEMON_MEMORY each time. AFAIK there haven't been any major changes in the standalo

Re: Solving linear equations

2014-10-23 Thread Shivaram Venkataraman
We have been working on large scale QR decompositions which would fit this problem well -- TSQR [https://github.com/amplab/ml-matrix/blob/master/src/main/scala/edu/berkeley/cs/amplab/mlmatrix/TSQR.scala] for instance can be used to solve the least squares system if you have more equations than vari

Re: Shuffle issues in the current master

2014-10-23 Thread Andrew Or
To add to Aaron's response, `spark.shuffle.consolidateFiles` only applies to hash-based shuffle, so you shouldn't have to set it for sort-based shuffle. And yes, since you changed neither `spark.shuffle.compress` nor `spark.shuffle.spill.compress` you can't possibly have run into what #2890 fixes.
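For context, a sketch of the settings being discussed, spelled out on a SparkConf; the manager line selects the sort-based shuffle, and the compression values are simply the defaults made explicit as an assumption of what was being run:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in 1.1

object ShuffleSettings {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("shuffle-settings")
      .set("spark.shuffle.manager", "sort")        // sort-based shuffle
      .set("spark.shuffle.compress", "true")       // default
      .set("spark.shuffle.spill.compress", "true") // default
      // spark.shuffle.consolidateFiles only matters for the hash-based shuffle:
      // .set("spark.shuffle.consolidateFiles", "true")
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 1000).map(x => (x % 10, x)).groupByKey().count()
    sc.stop()
  }
}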

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Localhost shell
Bang on, Sean. Before sending the issue mail, I was able to remove the compilation error by making it final, but then got Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext (as you mentioned). Now regarding your suggestion of changing the business logic: 1.

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Jayant Shekhar
+1 to Sean. Is it possible to rewrite your code to not use the SparkContext in the RDD? Or why does javaFunctions() need the SparkContext? On Thu, Oct 23, 2014 at 10:53 AM, Localhost shell < universal.localh...@gmail.com> wrote: > Bang On Sean > > Before sending the issue mail, I was able to remove the

Re: Sharing spark context across multiple spark sql cli initializations

2014-10-23 Thread Sadhan Sood
Thanks Michael, you saved me a lot of time! On Wed, Oct 22, 2014 at 6:04 PM, Michael Armbrust wrote: > The JDBC server is what you are looking for: > http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server > > On Wed, Oct 22, 2014 at 11:10 AM, Sadhan Sood >

Re: SparkSQL , best way to divide data into partitions?

2014-10-23 Thread Michael Armbrust
Spark SQL now supports Hive style dynamic partitioning: https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions This is a new feature so you'll have to build master or wait for 1.2. On Wed, Oct 22, 2014 at 7:03 PM, raymond wrote: > Hi > > I have a json file that can be load b
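For reference, a sketch of what the Hive-style syntax looks like from a HiveContext; the table and column names are invented, and this needs a build where the feature is present (master or 1.2):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DynamicPartitionInsert {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dyn-part"))
    val hc = new HiveContext(sc)

    hc.sql("SET hive.exec.dynamic.partition=true")
    hc.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // events_by_day is assumed to be partitioned by dt; the partition value
    // comes from the data itself rather than being spelled out per INSERT.
    hc.sql("INSERT OVERWRITE TABLE events_by_day PARTITION (dt) " +
           "SELECT ts, payload, dt FROM raw_events")

    sc.stop()
  }
}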

unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Jaonary Rabarisoa
Hi all, I have the following case class that I want to use as a key in a key-value RDD. I defined the equals and hashCode methods but it's not working. What am I doing wrong? case class PersonID(id: String) { override def hashCode = id.hashCode override def equals(other: Any) = o

Re: streaming join sliding windows

2014-10-23 Thread Tathagata Das
Can you define neighbor sliding windows with an example? On Wed, Oct 22, 2014 at 12:29 PM, Josh J wrote: > Hi, > > How can I join neighbor sliding windows in spark streaming? > > Thanks, > Josh >

Re: Setting only master heap

2014-10-23 Thread Nan Zhu
Hmm… my observation is that the master in Spark 1.1 has a higher frequency of GC…… Also, before 1.1 I never encountered a GC overtime in the Master; after upgrading to 1.1 I have hit it twice (we upgraded soon after the 1.1 release)…. Best, -- Nan Zhu On Thursday, October 23, 2014 at 1:08 PM, Andre

Re: Aggregation Error: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-23 Thread arthur.hk.c...@gmail.com
Hi, My steps to create LINEITEM: $HADOOP_HOME/bin/hadoop fs -mkdir /tpch/lineitem $HADOOP_HOME/bin/hadoop fs -copyFromLocal lineitem.tbl /tpch/lineitem/ Create external table lineitem (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DI

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread Localhost shell
Hey Jayant, In my previous mail I mentioned a GitHub gist, https://gist.github.com/rssvihla/6577359860858ccb0b33, which does something very similar to what I want to do, but it's using Scala for Spark. Hence my question (reiterating f

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-23 Thread Tathagata Das
Hey Gerard, This is a very good question! TL;DR: The performance should be the same, except in the case of shuffle-based operations where the number of reducers is not explicitly specified. Let me answer in more detail by dividing the set of DStream operations into three categories. 1. Map-like oper
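As an illustration of the two styles being compared, a small sketch; the socket source and port are just placeholders for any input DStream:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamVsPerRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-vs-per-rdd")
    val ssc = new StreamingContext(conf, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999)

    // 1) Transforming the DStream directly:
    val upper1 = lines.map(_.toUpperCase)

    // 2) Transforming each RDD inside the DStream:
    val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))

    upper1.print()
    upper2.print()
    ssc.start()
    ssc.awaitTermination()
  }
}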

Re: Spark Streaming Applications

2014-10-23 Thread Tathagata Das
Cc'ing Helena for more information on this. TD On Thu, Oct 23, 2014 at 6:30 AM, Saiph Kappa wrote: > What is the application about? I couldn't find any proper description > regarding the purpose of killrweather ( I mean, other than just integrating > Spark with Cassandra). Do you know if the sl

JavaHiveContext class not found error. Help!!

2014-10-23 Thread nitinkak001
I am trying to run the below Hive query on Yarn. I am using Cloudera 5.1. What can I do to make this work? "SELECT * FROM table_name DISTRIBUTE BY GEO_REGION, GEO_COUNTRY SORT BY IP_ADDRESS, COOKIE_ID"; Below is the stack trace: Exception in thread "Thread-4" java.lang.reflect.InvocationTarget

Re: How to access objects declared and initialized outside the call() method of JavaRDD

2014-10-23 Thread lordjoe
What I have been doing is building a JavaSparkContext the first time it is needed and keeping it as a ThreadLocal - all my code uses SparkUtilities.getCurrentContext(). On a slave machine you build a new context and don't have to serialize it. The code is in a large project at https://code.google.c

Re: JavaHiveContext class not found error. Help!!

2014-10-23 Thread Marcelo Vanzin
Hello there, This is more of a question for the cdh-users list, but in any case... In CDH 5.1 we skipped packaging of the Hive module in SparkSQL. That has been fixed in CDH 5.2, so if it's possible for you I'd recommend upgrading. On Thu, Oct 23, 2014 at 2:53 PM, nitinkak001 wrote: > I am tryin

spark is running extremely slow with larger data set, like 2G

2014-10-23 Thread xuhongnever
my spark version is 1.1.0, pre-built with hadoop 1.x. my code is implemented in python, trying to convert a graph data set from an edge list to an adjacency list. spark is running in standalone mode. It runs well with a small data set like soc-liveJournal1, about 1G. Then I run it on the 25G twitter graph, one tas

Re: spark is running extremely slow with larger data set, like 2G

2014-10-23 Thread xuhongnever
my code is here:

from pyspark import SparkConf, SparkContext

def Undirect(edge):
    vector = edge.strip().split('\t')
    if(vector[0].isdigit()):
        return [(vector[0], vector[1])]
    return []

conf = SparkConf()
conf.setMaster("spark://compute-0-14:7077")
conf.setAppName("adjacency

Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread Arthur . hk . chan
Hi, My Spark is 1.1.0 and Hive is 0.12. I tried to run the same query in both Hive 0.12.0 and Spark 1.1.0; the HiveQL works while SparkSQL fails. hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer c join orders o on c.c_mktsegment

Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) Hi, My Spark is 1.1.0 and Hive is 0.12. I tried to run the same query in both Hive 0.12.0 and Spark 1.1.0; the HiveQL works while SparkSQL fails. hive> select l_orderkey, sum(l_extendedprice*(1-l_discount)) as revenue, o_orderdate, o_shippriority from customer c

Re: unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Niklas Wilcke
Hi Jao, I don't really know why this doesn't work, but I have two hints. You don't need to override hashCode and equals; the case modifier does that for you. Writing case class PersonID(id: String) would be enough to get the class you want, I think. If I change the type of the id param to Int
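A small self-contained sketch of the case class used as a pair-RDD key; this is for a compiled application, and the REPL case mentioned later in this thread may behave differently:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions in 1.1

case class PersonID(id: String) // equals/hashCode generated by the case modifier

object CaseClassKey {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("case-class-key"))
    val pairs = sc.parallelize(Seq(PersonID("a") -> 1, PersonID("a") -> 2, PersonID("b") -> 3))
    pairs.reduceByKey(_ + _).collect().foreach(println)
    // expected output: (PersonID(a),3) and (PersonID(b),3)
    sc.stop()
  }
}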

Exceptions not caught?

2014-10-23 Thread ankits
Hi, I'm running a spark job and encountering an exception related to thrift. I wanted to know where this is being thrown, but the stack trace is completely useless. So I started adding try/catch blocks, to the point where my whole main method that does everything is surrounded with a try/catch. Even the

Re: Exceptions not caught?

2014-10-23 Thread Ted Yu
Can you show the stack trace? Also, how do you catch exceptions? Did you specify TProtocolException? Cheers On Thu, Oct 23, 2014 at 3:40 PM, ankits wrote: > Hi, I'm running a spark job and encountering an exception related to > thrift. > I wanted to know where this is being thrown, but the

Re: Exceptions not caught?

2014-10-23 Thread ankits
I am simply catching all exceptions (like case e:Throwable => println("caught: "+e) ) Here is the stack trace: 2014-10-23 15:51:10,766 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) java.io.IOException: org.apache.thrift.protocol.TProtocolException: Required field 'X' is unset! Struct:Y(id:

Re: Spark 1.1.0 and Hive 0.12.0 Compatibility Issue

2014-10-23 Thread Michael Armbrust
Can you show the DDL for the table? It looks like the SerDe might be saying it will produce a decimal type but is actually producing a string. On Thu, Oct 23, 2014 at 3:17 PM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi > > My Spark is 1.1.0 and Hive is 0.12, I tried to run

Re: Exceptions not caught?

2014-10-23 Thread ankits
Also, everything is running locally on my box, driver and workers.

Re: Exceptions not caught?

2014-10-23 Thread Ted Yu
bq. Required field 'X' is unset! Struct:Y Can you check your class Y and fix the above? Cheers On Thu, Oct 23, 2014 at 3:55 PM, ankits wrote: > I am simply catching all exceptions (like case e:Throwable => > println("caught: "+e) ) > > Here is the stack trace: > > 2014-10-23 15:51:10,766 ERR

Re: Exceptions not caught?

2014-10-23 Thread ankits
> Can you check your class Y and fix the above? I can, but this is about catching the exception should it be thrown by any class in the spark job. Why is the exception not being caught?

Re: Exceptions not caught?

2014-10-23 Thread Marcelo Vanzin
On Thu, Oct 23, 2014 at 3:40 PM, ankits wrote: > 2014-10-23 15:39:50,845 ERROR [] Exception in task 1.0 in stage 1.0 (TID 1) > java.io.IOException: org.apache.thrift.protocol.TProtocolException: This looks like an exception that's happening on an executor and just being reported in the driver's

Re: spark-ec2: deploy customized spark version

2014-10-23 Thread freedafeng
I was working on a 'hack': I modified spark-ec2.py (to point to my own spark-ec2 repo on GitHub), built a customized spark-ec2 repo (copied from GitHub), and then modified init.sh under the spark folder so that the master node downloads the customized Spark from S3. The Spark was successfully ins

Re: Exceptions not caught?

2014-10-23 Thread Nan Zhu
Hi, Spark dispatches your tasks to the distributed (remote) executors. When a task is terminated due to an exception, it reports to the driver with the reason (the exception). So, on the driver side, you see the reason for the task failure, which actually happened at the remote end… so you cann
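In other words, to handle the failure you have to catch it where it is thrown, inside the closure running on the executor; a small sketch (the bad-record example is invented):

import org.apache.spark.{SparkConf, SparkContext}

object CatchOnTheExecutor {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("catch-on-executor"))
    val rdd = sc.parallelize(Seq("1", "2", "oops"))

    // A try/catch around collect() on the driver only sees the task failure
    // that Spark reports back after retries; the original throw happened remotely.
    // Handling the bad record inside the closure keeps the job alive:
    val parsed = rdd.flatMap { s =>
      try Some(s.toInt)
      catch { case _: NumberFormatException => None } // handled where it is thrown
    }
    println(parsed.collect().toSeq) // List(1, 2)
    sc.stop()
  }
}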

Re: About "Memory usage" in the Spark UI

2014-10-23 Thread Tathagata Das
The memory usage of blocks of data received through Spark Streaming is not reflected in the Spark UI. It only shows the memory usage due to cached RDDs. I didn't find a JIRA for this, so I opened a new one. https://issues.apache.org/jira/browse/SPARK-4072 TD On Thu, Oct 23, 2014 at 12:47 AM, Hao

Spark using HDFS data [newb]

2014-10-23 Thread matan
Hi, I would like to verify or correct my understanding of Spark at the conceptual level. As I understand, ignoring streaming mode for a minute, Spark takes some input data (which can be an hdfs file), lets your code transform the data, and ultimately dispatches some computation over the data acr

Re: Spark using HDFS data [newb]

2014-10-23 Thread Marcelo Vanzin
Your assessment is mostly correct. I think the only thing I'd reword is the comment about splitting the data, since Spark itself doesn't do that, but read on. On Thu, Oct 23, 2014 at 6:12 PM, matan wrote: > In case I nailed it, how then does it handle a distributed hdfs file? does > it pull all of
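A tiny sketch of what this looks like from the Spark side (the HDFS path is a placeholder): Hadoop's InputFormat decides the splits, usually one per HDFS block, each split becomes one RDD partition, and tasks are scheduled, where possible, on nodes that hold the corresponding block.

import org.apache.spark.{SparkConf, SparkContext}

object HdfsRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-read"))
    val lines = sc.textFile("hdfs://namenode:8020/data/input.txt") // placeholder path
    // roughly one partition per HDFS block of the file
    println(lines.partitions.length)
    sc.stop()
  }
}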

Problem packing spark-assembly jar

2014-10-23 Thread Yana Kadiyska
Hi folks, I'm trying to deploy the latest from the master branch and having some trouble with the assembly jar. In the spark-1.1 official distribution (I use the cdh version), I see the following jars, where spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar contains a ton of stuff: datanucleus-api-jdo-3.2

RE: About "Memory usage" in the Spark UI

2014-10-23 Thread Haopu Wang
TD, thanks for the clarification. From the UI, it looks like the driver also allocates memory to store blocks; what's the purpose of that, since I think the driver doesn't need to run tasks? From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: 20

Re: unable to make a custom class as a key in a pairrdd

2014-10-23 Thread Prashant Sharma
Are you doing this in the REPL? Then there is a bug filed for this; I just can't recall the bug ID at the moment. Prashant Sharma On Fri, Oct 24, 2014 at 4:07 AM, Niklas Wilcke < 1wil...@informatik.uni-hamburg.de> wrote: > Hi Jao, > > I don't really know why this doesn't work but I have two hint

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-10-23 Thread Svend
Hi all, (Waking up an old thread just for future reference.) We've had a very similar issue just a couple of days ago: executing a Spark driver on the same host as where the Mesos master runs succeeds, but executing it on our remote dev station hangs/fails after Mesos reports the Spark driver dis

Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread firemonk9
Hi, I am facing the same problem. My spark-env.sh has the below entries, yet I see the YARN container with only 1G and YARN only spawns two workers. SPARK_EXECUTOR_CORES=1 SPARK_EXECUTOR_MEMORY=3G SPARK_EXECUTOR_INSTANCES=5 Please let me know if you are able to resolve this issue. Thank you

large benchmark sets for MLlib experiments

2014-10-23 Thread Chih-Jen Lin
Hi MLlib users, In August when I gave a talk at Databricks, Xiangrui mentioned the need for large public data for the development of MLlib. At this moment many people use problems from the libsvm data sets for experiments. The file size of the larger ones (e.g., kddb) is about 20-30G. To fulfill the need, we have

Memory requirement of using Spark

2014-10-23 Thread jian.t
Hello, I am new to Spark. I have a basic question about the memory requirements of using Spark. I need to join multiple data sets coming from multiple data sources. The join is not a straightforward join. The logic is more like: first join T1 on column A with T2, then for all the records that couldn'

Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread Andrew Or
Did you `export` the environment variables? Also, are you running in client mode or cluster mode? If it still doesn't work, you can try to set these through the spark-submit command-line options --num-executors, --executor-cores, and --executor-memory. 2014-10-23 19:25 GMT-07:00 firemonk9 : > Hi, > >

Re: Spark 1.0.0 on yarn cluster problem

2014-10-23 Thread firemonk9
Yes, export worked. Thank you.

Re: Multitenancy in Spark - within/across spark context

2014-10-23 Thread Evan Chan
Ashwin, I would say the strategies in general are: 1) Have each user submit a separate Spark app (each with its own SparkContext), with its own resource settings, and share data through HDFS or something like Tachyon for speed. 2) Share a single SparkContext amongst multiple users, using fair schedul
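For option 2, a sketch of routing different users' jobs to fair-scheduler pools on one shared context; the pool names are made up, and the pools themselves would normally be defined in fairscheduler.xml:

import org.apache.spark.{SparkConf, SparkContext}

object SharedContextPools {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("shared-context")
      .set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)

    // Jobs submitted from a thread run in the pool set as a local property,
    // so each user (or request) can be isolated into its own pool.
    sc.setLocalProperty("spark.scheduler.pool", "userA")
    sc.parallelize(1 to 1000).count()

    sc.setLocalProperty("spark.scheduler.pool", "userB")
    sc.parallelize(1 to 1000).count()

    sc.stop()
  }
}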

submit query to spark cluster using spark-sql

2014-10-23 Thread tridib
I want to submit a query to the Spark cluster using spark-sql. I am using the Hive metastore; it's in HDFS. But when I query, it does not look like the query gets submitted to the Spark cluster. I don't see any entry in the master web UI. How can I confirm the behavior?

Re: submit query to spark cluster using spark-sql

2014-10-23 Thread tridib
Figured it out. spark-sql --master spark://sparkmaster:7077

Re: Spark Bug? job fails to run when given options on spark-submit (but starts and fails without)

2014-10-23 Thread Akhil Das
You can open the application UI (which runs on port 4040) and see how much memory is being allocated to the executors in the Executors tab, and the settings in the Environment tab. Thanks Best Regards On Wed, Oct 22, 2014 at 9:55 PM, Holden Karau wrote: > Hi Michael Campbell, > > Are you deploying against yarn or standalone mod