Re: submitted uber-jar not seeing spark-assembly.jar at worker

2014-10-15 Thread Sean Owen
How did you recompile and deploy Spark to your cluster? It sounds like a problem with not getting the assembly deployed correctly, rather than with your app. On Tue, Oct 14, 2014 at 10:35 PM, Tamas Sandor wrote: > Hi, > > I'm a rookie in Spark, but hope someone can help me out. I'm writing an app > that

Re: system.out.println with "--master yarn-cluster"

2014-10-15 Thread vishnu86
Examine the output (replace $YARN_APP_ID in the following with the "application identifier" output by the previous command) (Note: YARN_APP_LOGS_DIR is usually /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version.) $ cat $YARN_APP_LOGS_DIR/$YARN_APP_ID/container*_01/stdout.

Re: Spark can't find jars

2014-10-15 Thread Christophe Préaud
Hi Jimmy, Did you try my patch? The problem on my side was that the hadoop.tmp.dir (in hadoop core-site.xml) was not handled properly by Spark when set to multiple partitions/disks, i.e.: hadoop.tmp.dir file:/d1/yarn/local,file:/d2/yarn/local,file:/d3/yarn/local,file:/d4/yarn/local,

Unit testing jar request

2014-10-15 Thread Jean Charles Jabouille
Hi, we are Spark users and we use some of Spark's test classes for our own application unit tests. We use LocalSparkContext and SharedSparkContext, but these classes are not included in the spark-core library. This is a good option, as it's not a good idea to include test classes in the runtime ja

Spark on secure HDFS

2014-10-15 Thread Erik van oosten
Hi, We really would like to use Spark but we can’t because we have a secure HDFS environment (Cloudera). I understood https://issues.apache.org/jira/browse/SPARK-2541 contains a patch. Can one of the committers please take a look? Thanks! Erik. — Erik van Oosten http://www.day-to-day-stu

Spark Concepts

2014-10-15 Thread nsareen
Hi, I'm pretty new to both Big Data and Spark. I've just started POC work on Spark, and my team and I are evaluating it against other in-memory computing tools such as GridGain, BigMemory, Aerospike and some others, specifically to solve two sets of problems. 1) Data Storage: Our current application ru

Re: Spark output to s3 extremely slow

2014-10-15 Thread Rafal Kwasny
Hi, How large is the dataset you're saving into S3? Saving to S3 is actually done in two steps: 1) writing temporary files 2) committing them to the proper directory. Step 2) can be slow because S3 does not have a quick atomic "move" operation; you have to copy (server-side, but it still takes time) and the

Re: Spark in cluster and errors

2014-10-15 Thread nsareen
Did you manage to solve this issue ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-in-cluster-and-errors-tp16249p16479.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: "Initial job has not accepted any resources" when launching SparkPi example on a worker.

2014-10-15 Thread Akhil Das
Is this the Spark URI (spark://host1:7077) that you are seeing in your cluster's web UI (http://master-host:8080), at the top left of the page? Thanks Best Regards On Wed, Oct 15, 2014 at 12:18 PM, Theodore Si wrote: > Can anyone help me, please? > > On 10/14/2014 9:58 PM, Theodore Si wrote: > > Hi al

Spark dev environment best practices

2014-10-15 Thread poiuytrez
Hi, I have been working with Spark for a few weeks. I do not yet understand how I should organize my dev and production environments. Currently, I am using the IPython Notebook; I usually write test scripts on my mac with some very small data. Then when I am ready, I launch my script on servers

[SparkSQL] Convert JavaSchemaRDD to SchemaRDD

2014-10-15 Thread Earthson
I don't know why JavaSchemaRDD.baseSchemaRDD is private[sql]. And I found that DataTypeConversions is protected[sql]. Finally I found this solution: jrdd.registerTempTable("transform_tmp") jrdd.sqlContext.sql("select * from transform_tmp") Could anyone tell me: Is it
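
A minimal sketch of the workaround described in the post, assuming Spark 1.1-era APIs (jrdd is the JavaSchemaRDD in question, and the contexts are assumed to share the same temp-table catalog):

    // register the Java-side RDD as a temp table, then query it back out
    jrdd.registerTempTable("transform_tmp")
    val srdd = jrdd.sqlContext.sql("SELECT * FROM transform_tmp")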

Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Chitturi Padma
which means the details are not persisted, and hence any failures in workers and master wouldn't start the daemons normally... right? On Wed, Oct 15, 2014 at 12:17 PM, Prashant Sharma [via Apache Spark User List] wrote: > [Removing dev lists] > > You are absolutely correct about that. > > Prashant

Re: Default spark.deploy.recoveryMode

2014-10-15 Thread Prashant Sharma
So if you need those features you can go ahead and set up one of the Filesystem or ZooKeeper options. Please take a look at: http://spark.apache.org/docs/latest/spark-standalone.html. Prashant Sharma On Wed, Oct 15, 2014 at 3:25 PM, Chitturi Padma < learnings.chitt...@gmail.com> wrote: > which mean
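
For reference, a sketch of the ZooKeeper option from the standalone docs (set in spark-env.sh on the master; the ZooKeeper hosts below are placeholders):

    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"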

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread Akhil Das
I just ran the same code and it is running perfectly fine on my machine. These are the things on my end: - Spark version: 1.1.0 - Gave full path to the negative and positive files - Set twitter auth credentials in the environment. And here's the code: import org.apache.spark.SparkContext > impor

Re: "Initial job has not accepted any resources" when launching SparkPi example on a worker.

2014-10-15 Thread Malte
Besides the host1 question, what can also happen is that you give the worker more memory than is available (for example, try a value 1G below the available memory just to be sure) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Initial-job-has-not-accepted-any-re

Re: How to make operation like cogrop() , groupbykey() on pair RDD = [ [ ], [ ] , [ ] ]

2014-10-15 Thread Gen
What results do you want? If your pair is like (a, b), where "a" is the key and "b" is the value, you can try rdd1 = rdd1.flatMap(lambda l: l) and then use cogroup. Best Gen -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-make-operation-like-cogrop
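
A Scala rendering of Gen's Python suggestion (an assumption about shapes: each element of rdd1 is a collection of (key, value) pairs, and sc is an existing SparkContext):

    import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.1

    val rdd1 = sc.parallelize(Seq(Seq(("a", 1), ("b", 2)), Seq(("a", 3))))
    val rdd2 = sc.parallelize(Seq(("a", 9)))

    val flat = rdd1.flatMap(identity)  // RDD[(String, Int)]
    val grouped = flat.cogroup(rdd2)   // RDD[(String, (Iterable[Int], Iterable[Int]))]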

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread S Krishna
Hi, I am using 1.1.0. I did set my twitter credentials and I am using the full path. I did not paste this in the public post. I am running on a cluster and getting the exception. Are you running in local or standalone mode? Thanks On Oct 15, 2014 3:20 AM, "Akhil Das" wrote: > I just ran the sam

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread Akhil Das
I ran it in both local and standalone, and it worked for me. It does throw a bind exception, which is normal since we are using both SparkContext and StreamingContext. Thanks Best Regards On Wed, Oct 15, 2014 at 5:25 PM, S Krishna wrote: > Hi, > > I am using 1.1.0. I did set my twitter credentials

Re: jsonRDD: NoSuchMethodError

2014-10-15 Thread Michael Campbell
How did you resolve it? On Tue, Jul 15, 2014 at 3:50 AM, SK wrote: > The problem is resolved. Thanks. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/jsonRDD-NoSuchMethodError-tp9688p9742.html > Sent from the Apache Spark User List mailing list ar

How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Fengyun RAO
We use Spark 1.1 and HBase 0.98.1-cdh5.1.0, and need to read and write an HBase table in a Spark program. I notice there are spark.driver.extraClassPath and spark.executor.extraClassPath properties to manage extra classpath entries, or even the deprecated SPARK_CLASSPATH. The problem is what classpath or jar

SparkSQL: set hive.metastore.warehouse.dir in CLI doesn't work

2014-10-15 Thread Hao Ren
Hi, The following query in the Spark SQL 1.1.0 CLI doesn't work. SET hive.metastore.warehouse.dir=/home/spark/hive/warehouse ; create table test as select v1.*, v2.card_type, v2.card_upgrade_time_black, v2.card_upgrade_time_gold from customer v1 left join customer_loyalty v2 on v1.account_id = v2.ac

Problem executing Spark via JBoss application

2014-10-15 Thread Mehdi Singer
Hi, I have a Spark standalone example application which is working fine. I'm now trying to integrate this application into a J2EE application, deployed on JBoss 7.1.1 and accessed via a web service. The JBoss server is installed on my local machine (Windows 7) and the Spark master is remote (Lin

How to close resources shared in executor?

2014-10-15 Thread Fengyun RAO
In order to share an HBase connection pool, we create an object: object Util { val HBaseConf = HBaseConfiguration.create val Connection = HConnectionManager.createConnection(HBaseConf) } which would be shared among tasks on the same executor. e.g. val result = rdd.map(line => { val table

Re: How to create Track per vehicle using spark RDD

2014-10-15 Thread manasdebashiskar
It is wonderful to see some ideas. Now the questions: 1) What is a track segment? Ans) It is the line that contains two adjacent points when all points are arranged by time. Say a vehicle moves (t1, p1) -> (t2, p2) -> (t3, p3). Then the segments are (p1, p2), (p2, p3) when the time ordering is (t1

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Fengyun RAO
+user@hbase 2014-10-15 20:48 GMT+08:00 Fengyun RAO : > We use Spark 1.1, and HBase 0.98.1-cdh5.1.0, and need to read and write an > HBase table in Spark program. > > I notice there are: > spark.driver.extraClassPath > spark.executor.extraClassPathproperties to manage extra ClassPath, over > even

Re: A question about streaming throughput

2014-10-15 Thread danilopds
Ok, I understand. But in both cases the data are in the same processing node. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/A-question-about-streaming-throughput-tp16416p16501.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to close resources shared in executor?

2014-10-15 Thread Ted Yu
Have you tried the following ? val result = rdd.map(line => { val table = Util.Connection.getTable("user") ... Util.Connection.close() } On Wed, Oct 15, 2014 at 6:09 AM, Fengyun RAO wrote: > In order to share an HBase connection pool, we create an object > > Object Util { > val HBaseConf =

Re: How to close resources shared in executor?

2014-10-15 Thread Ted Yu
Pardon me - there was a typo in my previous email. Calling table.close() is the recommended approach. HConnectionManager does reference counting. When all references to the underlying connection are gone, the connection would be released. Cheers On Wed, Oct 15, 2014 at 7:13 AM, Ted Yu wrote: > Have you
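
A sketch of that pattern, assuming the HBase 0.98-era client API, with rdd and Util.Connection as in the thread above (the "user" table and the Get-based lookup are illustrative):

    import org.apache.hadoop.hbase.client.Get
    import org.apache.hadoop.hbase.util.Bytes

    val result = rdd.map { line =>
      val table = Util.Connection.getTable("user")
      try {
        !table.get(new Get(Bytes.toBytes(line))).isEmpty  // example per-record lookup
      } finally {
        table.close()  // returns this table; the shared connection stays open via reference counting
      }
    }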

Re: How to add HBase dependencies and conf with spark-submit?

2014-10-15 Thread Soumitra Kumar
I am writing to HBase, following are my options: export SPARK_CLASSPATH=/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar spark-submit \ --jars /opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar,/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH/lib/hbase

Re: Spark Worker crashing and Master not seeing recovered worker

2014-10-15 Thread Malte
This is still happening to me on mesos. Any workarounds? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Worker-crashing-and-Master-not-seeing-recovered-worker-tp2312p16506.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread Sean Owen
It looks like you're making the StreamingContext and SparkContext separately from the same conf. Instead, how about passing the SparkContext to the StreamingContext constructor? It seems like better practice and is a guess at the cause of the problem. On Tue, Oct 14, 2014 at 9:13 PM, SK wrote: > Hi, > >
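
A minimal sketch of that suggestion, assuming Spark 1.1 (the app name and batch interval are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("TwitterSentiment")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(10))  // reuse sc rather than building a second context from conf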

matrix operations?

2014-10-15 Thread ll
hi there... are there any other matrix operations in addition to multiply()? like addition or dot product? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/matrix-operations-tp16508.html Sent from the Apache Spark User List mailing list archive at Nabble.co

RowMatrix.multiply() ?

2014-10-15 Thread ll
hi.. it looks like RowMatrix.multiply() takes a local Matrix as a parameter and returns the result as a distributed RowMatrix. how do you perform this series of multiplications if A, B, C, and D are all RowMatrix? (((A x B) x C) x D) thanks! -- View this message in context: http://apache-sp
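
For context, the one multiplication RowMatrix does support in 1.1, sketched with toy data (sc is an existing SparkContext; the dimensions are illustrative):

    import org.apache.spark.mllib.linalg.{Matrices, Vectors}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val A = new RowMatrix(sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(3.0, 4.0))))
    val B = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))  // local 2x2 identity, column-major
    val AB = A.multiply(B)  // distributed RowMatrix result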

Re: SparkSQL IndexOutOfBoundsException when reading from Parquet

2014-10-15 Thread Terry Siu
Hi Yin, pqt_rdt_snappy has 76 columns. These two parquet tables were created via Hive 0.12 from existing Avro data using CREATE TABLE following by an INSERT OVERWRITE. These are partitioned tables - pqt_rdt_snappy has one partition while pqt_segcust_snappy has two partitions. For pqt_segcust_sn

Re: SPARK_SUBMIT_CLASSPATH question

2014-10-15 Thread Greg Hill
I guess I was a little light on the details in my haste. I'm using Spark on YARN, and this is in the driver process in yarn-client mode (most notably spark-shell). I've had to manually add a bunch of JARs that I had thought it would just pick up like everything else does: export SPARK_SUBMIT

Re: Problem executing Spark via JBoss application

2014-10-15 Thread Yana Kadiyska
From this line: Removing executor app-20141015142644-0125/0 because it is EXITED, I would guess that you need to examine the executor log to see why the executor actually exited. My guess would be that the executor cannot connect back to your driver. But check the log from the executor. It should

Serialize/deserialize Naive Bayes model and index files

2014-10-15 Thread jatinpreet
Hi, I am trying to persist the files generated as a result of Naive Bayes training with MLlib. These comprise the model file, the label index (own class) and the term dictionary (own class). I need to save them to an HDFS location and then deserialize them when needed for prediction. How can I do the same wi

Re: RowMatrix.multiply() ?

2014-10-15 Thread Reza Zadeh
Hi, We are currently working on distributed matrix operations. Two RowMatrices cannot currently be multiplied together, nor can they be added. This functionality will be added soon. You can of course achieve this yourself by using IndexedRowMatrix and doing one join per operation you reques
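
One way to read the join-per-operation idea, sketched here over coordinate (row, col, value) entries rather than Reza's exact IndexedRowMatrix code (an assumption, not the announced implementation):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    // sparse product of two matrices given as (row, col, value) entries
    def multiply(a: RDD[(Long, Long, Double)],
                 b: RDD[(Long, Long, Double)]): RDD[(Long, Long, Double)] = {
      val aByCol = a.map { case (i, j, v) => (j, (i, v)) }
      val bByRow = b.map { case (j, k, v) => (j, (k, v)) }
      aByCol.join(bByRow)                                        // join on the inner dimension
        .map { case (_, ((i, av), (k, bv))) => ((i, k), av * bv) }
        .reduceByKey(_ + _)                                      // sum the partial products
        .map { case ((i, k), v) => (i, k, v) }
    }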

spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Anurag Tangri
Hi, I compiled Spark 1.1.0 with CDH 4.6, but when I try to bring the spark-sql CLI up, it gives an error: == [atangri@pit-uat-hdputil1 bin]$ ./spark-sql Spark assembly has been built with Hive, including Datanucleus jars on classpath Java HotSpot(TM) 64-Bit Server VM warning: ignoring option M

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Anurag Tangri
I see the Hive 0.10.0 metastore SQL schema does not have a VERSION table, but Spark is looking for it. Has anyone else faced this issue, or any ideas on how to fix it? Thanks, Anurag Tangri On Wed, Oct 15, 2014 at 10:51 AM, Anurag Tangri wrote: > Hi, > I compiled spark 1.1.0 with CDH 4.6 but when I try to

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Marcelo Vanzin
Hi Anurag, Spark SQL (from the Spark standard distribution / sources) currently requires Hive 0.12; as you mention, CDH4 has Hive 0.10, so that's not gonna work. CDH 5.2 ships with Spark 1.1.0 and is modified so that Spark SQL can talk to the Hive 0.13.1 that is also bundled with CDH, so if that'

Re: spark-sql not coming up with Hive 0.10.0/CDH 4.6

2014-10-15 Thread Anurag Tangri
Hi Marcelo, Exactly. Found it a few minutes ago. I ran the MySQL Hive 0.12 schema SQL on my Hive 0.10 metastore, which created the missing tables, and it seems to be working now. Not sure whether everything else in CDH 4.6/Hive 0.10 would still work though. Looks like we cannot use Spark SQL in a clean way

Exception while reading SendingConnection to ConnectionManagerId

2014-10-15 Thread Jimmy Li
Hi there, I'm running spark on ec2, and am running into an error there that I don't get locally. Here's the error: 11335 [handle-read-write-executor-3] ERROR org.apache.spark.network.SendingConnection - Exception while reading SendingConnection to ConnectionManagerId([IP HERE]) java.nio.channels.

Re: Spark Streaming: Sentiment Analysis of Twitter streams

2014-10-15 Thread SK
You are right. Creating the StreamingContext from the SparkContext instead of SparkConf helped. Thanks for the help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Sentiment-Analysis-of-Twitter-streams-tp16410p16520.html Sent from the Apache

Spark tasks still scheduled after Spark goes down

2014-10-15 Thread pkl
Hi, My setup: Tomcat (running a web app which initializes a SparkContext) and a dedicated Spark cluster (1 master, 2 workers, 1 VM each). I am able to properly start this setup, where the SparkContext properly initializes a connection with the master. I am able to execute tasks and perform required calculation

Re: Spark Streaming Empty DStream / RDD and reduceByKey

2014-10-15 Thread Abraham Jacob
Hi All, I figured out what the problem was. Thank you Sean for pointing me in the right direction. All the jibber jabber about empty DStream / RDD was just pure nonsense. I guess the sequence of events (the fact that spark streaming started crashing just after I implemented the reduceByke

Getting the value from DStream[Int]

2014-10-15 Thread SK
Hi, As a result of a reduction operation, the resultant value "score" is a DStream[Int]. How can I get the simple Int value? I tried score[0] and score._1, but neither worked, and I can't find a getValue() in the DStream API. thanks -- View this message in context: http://apache-spark-user
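
A hedged sketch of the usual answer: a DStream has no single value; each batch yields an RDD, so the Int is pulled out per batch, e.g. via foreachRDD (this assumes the reduction leaves one element per batch, with score as in the post):

    score.foreachRDD { rdd =>
      rdd.collect().foreach(v => println("score = " + v))  // one value per batch
    }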

Spark Streaming is slower than Spark

2014-10-15 Thread Tarun Garg
Hi, I am evaluating Spark Streaming with Kafka, and I found that Spark Streaming is slower than Spark. It took more time processing the same amount of data; as per the Spark console it can process 2300 records per second. Is my assumption correct? Spark Streaming has to do a lot of this along

Re: SPARK_SUBMIT_CLASSPATH question

2014-10-15 Thread Marcelo Vanzin
Hi Greg, I'm not sure exactly what it is that you're trying to achieve, but I'm pretty sure those variables are not supposed to be set by users. You should take a look at the documentation for "spark.driver.extraClassPath" and "spark.driver.extraLibraryPath", and the equivalent options for executo
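
For illustration, those options can go in conf/spark-defaults.conf; the jar path below is a placeholder:

    spark.driver.extraClassPath    /opt/myapp/lib/extra.jar
    spark.executor.extraClassPath  /opt/myapp/lib/extra.jar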

how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-15 Thread eric wong
Hi, I want to check the DEBUG log of a Spark executor on YARN (using yarn-cluster mode), but neither 1. yarn daemonlog setlevel DEBUG YarnChild.class nor 2. setting log4j.properties in the spark/conf folder on the client node works. So how can I set the log level of a Spark executor in a YARN container to

Re: how to set log level of spark executor on YARN(using yarn-cluster mode)

2014-10-15 Thread Marcelo Vanzin
Hi Eric, Check the "Debugging Your Application" section at: http://spark.apache.org/docs/latest/running-on-yarn.html Long story short: upload your log4j.properties using the "--files" argument of spark-submit. (Mental note: we could make the log level configurable via a system property...) On
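
A sketch of that suggestion (the class, jar, and path names are placeholders; the uploaded log4j.properties would set log4j.rootCategory=DEBUG,console):

    spark-submit --master yarn-cluster \
      --files /path/to/log4j.properties \
      --class com.example.MyApp myapp.jar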

Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-15 Thread Ray
Hi Xiangrui, I am using yarn-cluster mode. The current hadoop cluster is configured to only accept "yarn-cluster" mode and not allow "yarn-client" mode. I have no privilege to change that. Without initializing with "k-means||", the job finished in 10 minutes. With "k-means", it just hangs there f

Spark Streaming: Invalid lambda deserialization error

2014-10-15 Thread Chia-Chun Shih
Hi, I am testing Spark Streaming (local mode, with Kafka). The code is as follows: public class LocalStreamTest2 { public static void main(String[] args) { JavaSparkContext sc = new JavaSparkContext("local[4]", "Local Stream Test"); JavaStreamingContext ssc = new JavaStreamingContext(sc, new D

Play framework

2014-10-15 Thread Mohammed Guller
Hi - Has anybody figured out how to integrate a Play application with Spark and run it on a Spark cluster using spark-submit script? I have seen some blogs about creating a simple Play app and running it locally on a dev machine with sbt run command. However, those steps don't work for Spark-su

Sample codes for Spark streaming + Kafka + Scala + sbt?

2014-10-15 Thread Gary Zhao
Hi, Can anyone share a project as a sample? I tried them a couple of days ago but couldn't make it work. Looks like it's due to some Kafka dependency issue. I'm using sbt-assembly. Thanks Gary

Spark's shuffle file size keep increasing

2014-10-15 Thread Haopu Wang
I have a Spark application which runs Spark Streaming and Spark SQL. I observed that the size of the shuffle files under the "spark.local.dir" folder keeps increasing and never decreases. Eventually it will run into an out-of-disk-space error. The question is: when will Spark delete these shuffle files? In the ap

Re: Spark Concepts

2014-10-15 Thread nsareen
Anybody with good hands-on experience with Spark, please do reply. It would help us a lot!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Concepts-tp16477p16536.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: YARN deployment of Spark and Thrift JDBC server

2014-10-15 Thread neeraj
I would like to reiterate that I don't have Hive installed on the Hadoop cluster. I have some queries on the following comment from Cheng Lian-2: "The Thrift server is used to interact with existing Hive data, and thus needs Hive Metastore to access Hive catalog. In your case, you need to build Spark

Re: How to write data into Hive partitioned Parquet table?

2014-10-15 Thread Banias H
I got tipped off by an expert that the "Unsupported language features in query" error I had was due to the fact that Spark SQL does not support dynamic partitions, and that I can do saveAsParquetFile() for each partition. My inefficient implementation is to: //1. run the query without DISTRIBUTE B
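
A sketch of that workaround, assuming a Spark 1.1 SchemaRDD named results, a SQLContext named sqlContext, and a partition column dt (all names hypothetical):

    results.registerTempTable("results")
    // enumerate the distinct partition values, then write each slice separately
    val dts = sqlContext.sql("SELECT DISTINCT dt FROM results")
      .map(_.getString(0)).collect()
    for (dt <- dts) {
      sqlContext.sql("SELECT * FROM results WHERE dt = '" + dt + "'")
        .saveAsParquetFile("/warehouse/mytable/dt=" + dt)  // one file set per partition value
    }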

Re: How to close resources shared in executor?

2014-10-15 Thread Fengyun RAO
Thanks, Ted. Util.Connection.close() should be called only once, so it can NOT be in a map function val result = rdd.map(line => { val table = Util.Connection.getTable("user") ... Util.Connection.close() } As you mentioned: Calling table.close() is the recommended approach. HConnectionMana

Problems with ZooKeeper and key canceled

2014-10-15 Thread Malte
I have a Spark cluster on Mesos, and when I run long-running GraphX processing I receive a lot of the following two errors, and one by one my slaves stop doing any work for the process until it's idle. Any idea what is happening? First type of error message: INFO SendingConnection: Initiating connec

Re: How to close resources shared in executor?

2014-10-15 Thread Fengyun RAO
I may have misunderstood your point. val result = rdd.map(line => { val table = Util.Connection.getTable("user") ... table.close() } Did you mean this is enough, and there’s no need to call Util.Connection.close(), or HConnectionManager.deleteAllConnections()? Where is the documentation th

RE: Problem executing Spark via JBoss application

2014-10-15 Thread Mehdi Singer
Indeed it was a problem on the executor side… I have to figure out how to fix it now ;-) Thanks! Mehdi From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com] Sent: Wednesday, October 15, 2014 18:32 To: Mehdi Singer Cc: user@spark.apache.org Subject: Re: Problem executing Spark via JBoss applicatio