Re: Multi-tenancy for Spark (Streaming) Applications

2014-09-11 Thread Tobias Pfeiffer
Hi, by now I understand a bit better how spark-submit and YARN play together and how the Spark driver and slaves work together on YARN. Now for my use case, as described on <https://spark.apache.org/docs/latest/submitting-applications.html>, I would probably have an end-user-facing gateway that

Re: can fileStream() or textFileStream() remember state?

2014-09-11 Thread vasiliy
When you get a stream from ssc.fileStream(), Spark will process only files with a file timestamp greater than the current timestamp, so existing data on HDFS should not be processed again. You may have another problem, though - Spark will not process files that were moved into your HDFS folder between your application restarts.
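
For illustration, a minimal Scala sketch of the behaviour described above (the directory, batch interval and app name are placeholders, not from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("file-stream-sketch"), Seconds(30))
    // Only files whose timestamp is newer than the stream's start are picked up, so files
    // already present -- or dropped into the directory while the app was down -- are skipped.
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```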

SchemaRDD saveToCassandra

2014-09-11 Thread lmk
Hi, My requirement is to extract certain fields from JSON files, run queries on them and save the result to Cassandra. I was able to parse the JSON, filter the result and save the (regular) RDD to Cassandra. Now, when I try to read the JSON file through sqlContext, execute some queries on the same an

Spark not installed + no access to web UI

2014-09-11 Thread mrm
Hi, I have been launching Spark in the same ways for the past months, but I have only recently started to have problems with it. I launch Spark using spark-ec2 script, but then I cannot access the web UI when I type address:8080 into the browser (it doesn't work with lynx either from the master no

Unpersist

2014-09-11 Thread Deep Pradhan
I want to create a temporary variable in my Spark code. Can I do this? for (i <- num) { val temp = .. { do something } temp.unpersist() } Thank You

Re: Spark not installed + no access to web UI

2014-09-11 Thread Akhil Das
Which version of Spark are you using? Thanks Best Regards On Thu, Sep 11, 2014 at 3:10 PM, mrm wrote: > Hi, > > I have been launching Spark in the same ways for the past months, but I > have > only recently started to have problems with it. I launch Spark using > spark-ec2 script, but then I c

Re: Unpersist

2014-09-11 Thread Akhil Das
like this? var temp = ... for (i <- num) { temp = .. { do something } temp.unpersist() } Thanks Best Regards On Thu, Sep 11, 2014 at 3:26 PM, Deep Pradhan wrote: > I want to create a temporary variables in a spark code. > Can I do this? > > for (i <- num) > { > val temp = ..
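
A fleshed-out version of the loop Akhil sketches, as a hedged illustration (the RDD contents and sizes are invented just to make it runnable):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UnpersistLoopSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unpersist-loop-sketch"))
    var temp: RDD[Int] = null
    for (i <- 1 to 10) {
      temp = sc.parallelize(1 to 1000).map(_ * i).cache()
      println(temp.count()) // stand-in for "do something" with the cached RDD
      temp.unpersist()      // drop the cached blocks before the next iteration reassigns temp
    }
    sc.stop()
  }
}
```

Reusing one var means only a single cached RDD is alive at any time instead of one per iteration.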

Re: Spark not installed + no access to web UI

2014-09-11 Thread mrm
I tried 1.0.0, 1.0.1 and 1.0.2. I also tried the latest github commit. After several hours trying to launch it, now it seems to be working, this is what I did (not sure if any of these steps helped): 1/ clone the spark repo into the master node 2/ run sbt/sbt assembly 3/ copy spark and spark-ec2

JMXSink for YARN deployment

2014-09-11 Thread Vladimir Tretyakov
Hello, we at Sematext (https://apps.sematext.com/) are writing a monitoring tool for Spark and we came across one question: how to enable JMX metrics for a YARN deployment? We put "*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink" into the file $SPARK_HOME/conf/metrics.properties but it doesn't w

Re: How to scale more consumer to Kafka stream

2014-09-11 Thread richiesgr
Thanks to all, I'm going to check both solutions -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-scale-more-consumer-to-Kafka-stream-tp13883p13959.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-11 Thread Aniket Bhatnagar
Hi all, I am trying to run a Kinesis Spark Streaming application on a standalone Spark cluster. The job works fine in local mode, but when I submit it (using spark-submit), it doesn't do anything. I enabled logs for the org.apache.spark.streaming.kinesis package and I regularly get the following in worker

Re: How to scale more consumer to Kafka stream

2014-09-11 Thread Gerard Maas
This pattern works. One note, though: use 'union' only if you need to group the data from all RDDs into one RDD for processing (like a count distinct or a groupBy). If your process can be parallelized over every stream of incoming data, I suggest you just apply the required transformations on

problem in using Spark-Cassandra connector

2014-09-11 Thread Karunya Padala
Hi, I am new to Spark. I encountered an issue when trying to connect to Cassandra using the Spark Cassandra connector. Can anyone help me? Following are the details. 1) The following are the Spark and Cassandra versions I am using on Lubuntu 12.0: i) spark-1.0.2-bin-hadoop2 ii) apache-cassandra-2.0.10 2) In t

Re: problem in using Spark-Cassandra connector

2014-09-11 Thread Reddy Raja
You will have to create the KeySpace and Table. See the message "Table not found: EmailKeySpace.Emails" - it looks like you have not created the Emails table. On Thu, Sep 11, 2014 at 6:04 PM, Karunya Padala < karunya.pad...@infotech-enterprises.com> wrote: > > > Hi, > > > > I am new to spark. I

RE: problem in using Spark-Cassandra connector

2014-09-11 Thread Karunya Padala
I have created a keyspace called EmailKeySpace and a table called Emails and inserted some data into Cassandra. See my Cassandra console screenshot. Regards, Karunya. From: Reddy Raja [mailto:areddyr...@gmail.com] Sent: 11 September 2014 18:07 To: Karunya

Spark on Raspberry Pi?

2014-09-11 Thread Sandeep Singh
Has anyone tried using Raspberry Pi for Spark? How efficient is it to use around 10 Pi's for local testing env ? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Raspberry-Pi-tp13965.html Sent from the Apache Spark User List mailing list archive at N

Re: How to scale more consumer to Kafka stream

2014-09-11 Thread Dibyendu Bhattacharya
I agree Gerard. Thanks for pointing this.. Dib On Thu, Sep 11, 2014 at 5:28 PM, Gerard Maas wrote: > This pattern works. > > One note, thought: Use 'union' only if you need to group the data from all > RDDs into one RDD for processing (like count distinct or need a groupby). > If your process c

Fwd: Spark on Raspberry Pi?

2014-09-11 Thread Chen He
Pi's bus speed, memory size and access speed, and processing ability are limited. The only benefit could be the power consumption. On Thu, Sep 11, 2014 at 8:04 AM, Sandeep Singh wrote: > Has anyone tried using Raspberry Pi for Spark? How efficient is it to use > around 10 Pi's for local testing

RE: JMXSink for YARN deployment

2014-09-11 Thread Shao, Saisai
Hi, I’m guessing the problem is that the driver or executor cannot get the metrics.properties configuration file in the YARN container, so the metrics system cannot load the right sinks. Thanks Jerry From: Vladimir Tretyakov [mailto:vladimir.tretya...@sematext.com] Sent: Thursday, September 11, 2014 7

unable to create new native thread

2014-09-11 Thread arthur.hk.c...@gmail.com
Hi, I am trying the Spark sample program “SparkPi” and I got the error "unable to create new native thread" - how do I resolve this? 14/09/11 21:36:16 INFO scheduler.DAGScheduler: Completed ResultTask(0, 644) 14/09/11 21:36:16 INFO scheduler.TaskSetManager: Finished TID 643 in 43 ms on node1 (progress

Re: JMXSink for YARN deployment

2014-09-11 Thread Vladimir Tretyakov
Hi Shao, thx for explanation, any ideas how to fix it? Where should I put metrics.properties file? On Thu, Sep 11, 2014 at 4:18 PM, Shao, Saisai wrote: > Hi, > > > > I’m guessing the problem is that driver or executor cannot get the > metrics.properties configuration file in the yarn container,

RE: JMXSink for YARN deployment

2014-09-11 Thread Shao, Saisai
I think you can try to use "spark.metrics.conf" to manually specify the path of metrics.properties, but the prerequisite is that each container can find this file on its local FS because this file is loaded locally. Besides, I think this is a kind of workaround; a better solution is t

Re: Unpersist

2014-09-11 Thread Deep Pradhan
After every loop I want the temp variable to cease to exist On Thu, Sep 11, 2014 at 4:33 PM, Akhil Das wrote: > like this? > > var temp = ... > for (i <- num) > { > temp = .. >{ >do something >} > temp.unpersist() > } > > Thanks > Best Regards > > On Thu, Sep 11, 2014 at 3:26 PM

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
Hi, Can you attach more logs to see if there is some entry from ContextCleaner? I met a very similar issue before… but haven’t got it resolved. Best, -- Nan Zhu On Thursday, September 11, 2014 at 10:13 AM, Dibyendu Bhattacharya wrote: > Dear All, > > Not sure if this is a false alarm.

Re: Some Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-11 Thread Nan Zhu
This is my case about broadcast variable: 14/07/21 19:49:13 INFO Executor: Running task ID 4 14/07/21 19:49:13 INFO DAGScheduler: Completed ResultTask(0, 2) 14/07/21 19:49:13 INFO TaskSetManager: Finished TID 2 in 95 ms on localhost (progress: 3/106) 14/07/21 19:49:13 INFO TableOutputFormat:

Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-11 Thread spark
Hi guys, any luck with this issue, anyone? I also tried all the possible exclusion combos, to no avail. Thanks for your ideas, reinis -Original Message- > From: "Stephen Boesch" > To: user > Date: 28-06-2014 15:12 > Subject: Re: HBase 0.96+ with Spark 1.0+ > > Hi Siyuan, Th

Re: JMXSink for YARN deployment

2014-09-11 Thread Vladimir Tretyakov
Hi again, yeah, I've tried to use "spark.metrics.conf" before my question on the ML, with no luck :( Any other ideas from anybody? It seems nobody uses metrics in YARN deployment mode. How about Mesos? I didn't try it, but maybe Spark has the same difficulties on Mesos? PS: Spark is a great thing in general,

Re: JMXSink for YARN deployment

2014-09-11 Thread Kousuke Saruta
Hi Vladimir, how about using the --files option with spark-submit? - Kousuke (2014/09/11 23:43), Vladimir Tretyakov wrote: Hi again, yeah, I've tried to use "spark.metrics.conf" before my question on the ML, with no luck :( Any other ideas from anybody? It seems nobody uses metrics in YARN deployment mode

Python execution support on clusters

2014-09-11 Thread david_allanus
Is there some doc that I missed that describes which execution engines Python is supported for with Spark? If we use spark-submit with a YARN cluster, an error is produced saying 'Error: Cannot currently run Python driver programs on cluster'. Thanks in advance David -- View this message in contex

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-11 Thread alexandria1101
Thank you!! I can do this using saveAsTable with the schemaRDD, right? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Table-not-found-using-jdbc-console-to-query-sparksql-hive-thriftserver-tp13840p13979.html Sent from the Apache Spark User List mailing lis

compiling spark source code

2014-09-11 Thread rapelly kartheek
Hi, Can someone please tell me how to compile the Spark source code so that changes to the source take effect? I was trying to ship the jars to all the slaves, but in vain. -Karthik

Out of memory with Spark Streaming

2014-09-11 Thread Aniket Bhatnagar
I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps but

Re: efficient zipping of lots of RDDs

2014-09-11 Thread Mohit Jaggi
filed jira SPARK-3489 On Thu, Sep 4, 2014 at 9:36 AM, Mohit Jaggi wrote: > Folks, > I sent an email announcing > https://github.com/AyasdiOpenSource/df > > This dataframe is basically a map of RDDs of columns(along with DSL > sugar), as column

Re: Setting up jvm in pyspark from shell

2014-09-11 Thread Davies Liu
The heap size of the JVM cannot be changed dynamically, so you need to configure it before running pyspark. If you run it in local mode, you should configure spark.driver.memory (in 1.1 or master). Or, you can use --driver-memory 2G (should work in 1.0+) On Wed, Sep 10, 2014 at 10:43 PM, Mohit Singh

Re: compiling spark source code

2014-09-11 Thread Daniil Osipov
In the spark source folder, execute `sbt/sbt assembly` On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek wrote: > HI, > > > Can someone please tell me how to compile the spark source code to effect > the changes in the source code. I was trying to ship the jars to all the > slaves, but in vain.

Re: JMXSink for YARN deployment

2014-09-11 Thread Vladimir Tretyakov
Hi Kousuke, can you please explain in a bit more detail what you mean? I am new to Spark; I looked at https://spark.apache.org/docs/latest/submitting-applications.html and it seems there is no '--files' option there. Do I just have to add '--files /path-to-metrics.properties'? An undocumented ability? Thx for the answer.

Re: Spark on Raspberry Pi?

2014-09-11 Thread Daniil Osipov
Limited memory could also cause you some problems and limit usability. If you're looking for a local testing environment, vagrant boxes may serve you much better. On Thu, Sep 11, 2014 at 6:18 AM, Chen He wrote: > > > > Pi's bus speed, memory size and access speed, and processing ability are > li

Spark SQL and running parquet tables?

2014-09-11 Thread DanteSama
I've been under the impression that creating and registering a parquet table will pick up on updates to the table, such as inserts. I have a program running that does the following: // Create Context val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) // Register table sqlContext

Re: JMXSink for YARN deployment

2014-09-11 Thread Kousuke Saruta
Hi, Vladimir You can see about --files option at https://spark.apache.org/docs/latest/running-on-yarn.html If you use --files option like "--files /path-to-metrics.properties", the file is distributed to each executor/driver's working directory, so they can load the file. - Kousuke (2014
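
Putting Jerry's spark.metrics.conf suggestion and Kousuke's --files hint together, a hedged sketch of what the submission could look like (paths, class and jar names are placeholders, it assumes a spark-submit new enough to accept --conf, and the thread itself does not confirm this exact combination):

```sh
# conf/metrics.properties on the submitting machine contains the line from the original post:
#   *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
# Ship it to every container with --files and point the metrics system at the local copy.
spark-submit \
  --master yarn-cluster \
  --files /path/to/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  --class com.example.MyApp my-app.jar
```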

Re: Out of memory with Spark Streaming

2014-09-11 Thread Bharat Venkat
You could set "spark.executor.memory" to something bigger than the default (512mb) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I am running a simple Spark Streaming program that pulls in data from > Kinesis at a batch interval of 10 seconds, windows i
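
As an illustration only (the value and app name are placeholders):

```scala
import org.apache.spark.SparkConf

// Raise executor memory above the 512 MB default; the same can be done with
// --executor-memory on spark-submit.
val conf = new SparkConf()
  .setAppName("kinesis-streaming")
  .set("spark.executor.memory", "2g")
```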

Re: Out of memory with Spark Streaming

2014-09-11 Thread Aniket Bhatnagar
I did change it to 1 GB. It still ran out of memory, just a little later. The streaming job isn't handling a lot of data: every 2 seconds it doesn't get more than 50 records. Each record size is not more than 500 bytes. On Sep 11, 2014 10:54 PM, "Bharat Venkat" wrote: > You could set "spark

Re: Spark on Raspberry Pi?

2014-09-11 Thread Aniket Bhatnagar
Just curious... What's the use case you are looking to implement? On Sep 11, 2014 10:50 PM, "Daniil Osipov" wrote: > Limited memory could also cause you some problems and limit usability. If > you're looking for a local testing environment, vagrant boxes may serve you > much better. > > On Thu, S

Re: Spark on Raspberry Pi?

2014-09-11 Thread Chanwit Kaewkasi
We've found that the Raspberry Pi is not enough for Hadoop/Spark, mainly because of the memory consumption. What we've built is a cluster formed from 22 Cubieboards, each containing 1 GB of RAM. Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit On Thu, Sep 11, 2014 at 8:04 PM, Sandeep Singh

Re: Spark SQL JDBC

2014-09-11 Thread alexandria1101
Even when I comment out those 3 lines, I still get the same error. Did someone solve this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-JDBC-tp11369p13992.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: spark on yarn history server + hdfs permissions issue

2014-09-11 Thread Greg Hill
To answer my own question, in case someone else runs into this. The spark user needs to be in the same group on the namenode, and HDFS caches that information for what seems like at least an hour. It magically started working on its own. Greg From: Greg mailto:greg.h...@rackspace.com>> Date: Tuesd

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-11 Thread Aniket Bhatnagar
Dependency hell... My favourite problem :). I had run into a similar issue with HBase and Jetty. I can't remember the exact fix, but here are excerpts from my dependencies that may be relevant: val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(
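
An illustrative build.sbt fragment in the same spirit (the version and the exact exclusion set are assumptions, not the thread's confirmed fix):

```scala
// Strip the servlet/jetty artifacts pulled in transitively so they cannot clash with the
// (signed) copies that Spark itself ships.
val hadoop2Version = "2.4.0"
val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(
  ExclusionRule(organization = "javax.servlet"),
  ExclusionRule(organization = "org.mortbay.jetty")
)
```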

Re: Spark on Raspberry Pi?

2014-09-11 Thread Chen He
Here is a POC of a Cubieboard Hadoop cluster's performance and power consumption. I suggest the Cubietruck would be a better choice since the Cubieboard2 only has 100 Mbps Ethernet. http://www.slideshare.net/airbots/hadoop-mapreduce-performance-study-on-arm-cluster On Thu, Sep 11, 2014 at 12:50 PM, Chanwit Kaewk

Network requirements between Driver, Master, and Slave

2014-09-11 Thread Jim Carroll
Hello all, I'm trying to run a Driver on my local network with a deployment on EC2 and it's not working. I was wondering if either the master or slave instances (in standalone) connect back to the driver program. I outlined the details of my observations in a previous post but here is what I'm se

SparkSQL HiveContext TypeTag compile error

2014-09-11 Thread Du Li
Hi, I have the following code snippet. It works fine in spark-shell, but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Does anybody know what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext

Re: SchemaRDD saveToCassandra

2014-09-11 Thread Michael Armbrust
This might be a better question to ask on the cassandra mailing list as I believe that is where the exception is coming from. On Thu, Sep 11, 2014 at 2:37 AM, lmk wrote: > Hi, > My requirement is to extract certain fields from json files, run queries on > them and save the result to cassandra. >

Reading from multiple sockets

2014-09-11 Thread Varad Joshi
Still fairly new to Spark so please bear with me. I am trying to write a streaming app that has multiple workers that read from sockets and process the data. Here is a very simplified version of what I am trying to do: val carStreamSeq = (1 to 2).map( _ => ssc.socketTextStream(host, port) ).toArra
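
A runnable sketch of that shape (host, ports and batch interval are placeholders): each socketTextStream gets its own receiver, and the streams are unioned into one DStream before processing.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MultiSocketSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("multi-socket-sketch"), Seconds(2))
    val carStreamSeq = (1 to 2).map(i => ssc.socketTextStream("localhost", 9990 + i))
    val carStream = ssc.union(carStreamSeq) // merge the per-socket streams
    carStream.flatMap(_.split(" ")).count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Keep in mind each receiver occupies a core, so the application needs more cores than receivers to leave room for processing.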

RE: cannot read file form a local path

2014-09-11 Thread Mozumder, Monir
I am seeing this same issue with Spark 1.0.1 (tried with file:// for local file ) : scala> val lines = sc.textFile("file:///home/monir/.bashrc") lines: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at :12 scala> val linecount = lines.count org.apache.hadoop.mapred.InvalidInputEx

Re: Spark SQL and running parquet tables?

2014-09-11 Thread DanteSama
Michael Armbrust wrote > You'll need to run parquetFile("path").registerTempTable("name") to > refresh the table. I'm not seeing that function on SchemaRDD in 1.0.2, is there something I'm missing? SchemaRDD Scaladoc

Re: SparkSQL HiveContext TypeTag compile error

2014-09-11 Thread Du Li
Solved it. The problem occurred because the case class was defined within a test case in FunSuite. Moving the case class definition out of test fixed the problem. From: Du Li mailto:l...@yahoo-inc.com.INVALID>> Date: Thursday, September 11, 2014 at 11:25 AM To: "user@spark.apache.org

single worker vs multiple workers on each machine

2014-09-11 Thread Mike Sam
Hi there, I am new to Spark and I was wondering: when you have a lot of memory on each machine of the cluster, is it better to run multiple workers with limited memory on each machine, or is it better to run a single worker with access to the majority of the machine's memory? If the answer is "it depen

Re: JMXSink for YARN deployment

2014-09-11 Thread Vladimir Tretyakov
Hello again, thx for the doc; I tried different ways with '--files', no luck. Can somebody who already has Spark on YARN try enabling the JMX metrics sink? Maybe the problem is in my hands :) PS, I've also tried to play with 'yarn.nodemanager.local-dirs', no results. On Thu, Sep 11, 2014 at 8:23 PM, Kousuke Saruta

spark sql - create new_table as select * from table

2014-09-11 Thread jamborta
Hi, I am trying to create a new table from a select query as follows: CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/test/new_table' AS select * from table this works in Hive, but in Spark SQL (1.0.2

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Du Li
The implementation of SparkSQL is currently incomplete. You may try it out with HiveContext instead of SQLContext. On 9/11/14, 1:21 PM, "jamborta" wrote: >Hi, > >I am trying to create a new table from a select query as follows: > >CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED F

Re: SparkSQL HiveContext TypeTag compile error

2014-09-11 Thread Du Li
Just moving it out of test is not enough. Must move the case class definition to the top level. Otherwise it would report a runtime error of task not serializable when executing collect(). From: Du Li mailto:l...@yahoo-inc.com.INVALID>> Date: Thursday, September 11, 2014 at 12:33 PM To: "user
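
To make the fix concrete, a minimal sketch of the layout being described (class, field and table names are invented):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

// Defined at the top level of the file -- not inside a method, class or test body -- so the
// compiler can materialize a TypeTag for it and the generated class remains serializable.
case class MySchema(id: Int, name: String)

object SchemaRddSketch {
  def run(sc: SparkContext): Unit = {
    val hc = new HiveContext(sc)
    val rdd = sc.parallelize(Seq(MySchema(1, "a"), MySchema(2, "b")))
    val schemaRdd = hc.createSchemaRDD(rdd)  // needs the TypeTag for MySchema
    schemaRdd.registerTempTable("my_schema") // registerAsTable on Spark 1.0.x
    hc.sql("SELECT count(*) FROM my_schema").collect().foreach(println)
  }
}
```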

Re: spark sql - create new_table as select * from table

2014-09-11 Thread jamborta
thanks. this was actually using hivecontext. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: cannot read file form a local path

2014-09-11 Thread Mozumder, Monir
It seems starting spark-shell in local mode solves this. But it still cannot recognize a file beginning with a '.'. MASTER=local[4] ./bin/spark-shell . scala> val lineCount = sc.textFile("/home/monir/ref").count lineCount: Long = 68 scala> val lineCount2 = sc.tex
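
The likely reason, for reference: the Hadoop FileInputFormat that sc.textFile uses underneath silently filters out paths whose names start with "." or "_", so hidden files yield no input splits. A copy under a non-hidden name (a hypothetical path here) reads normally:

```scala
// Works because the file name no longer starts with "."
val lineCount2 = sc.textFile("file:///home/monir/bashrc_copy").count()
```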

Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-11 Thread spark
Thank you, Aniket, for your hint! Alas, it seems I am facing a really "hellish" situation, because I have integration tests using BOTH Spark and HBase (minicluster). Thus I get either: class "javax.servlet.ServletRegistration"'s signer information does not match signer information of other clas

Re: Out of memory with Spark Streaming

2014-09-11 Thread Tathagata Das
Which version of Spark are you running? If you are running the latest one, could you try running not a window but a simple event count on every 2-second batch, and see if you are still running out of memory? TD On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wr

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-11 Thread Tathagata Das
This is very puzzling, given that this works in the local mode. Does running the kinesis example work with your spark-submit? https://github.com/apache/spark/blob/master/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala The instructions are present i

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-11 Thread Sean Owen
This was already answered at the bottom of this same thread -- read below. On Thu, Sep 11, 2014 at 9:51 PM, wrote: > class "javax.servlet.ServletRegistration"'s signer information does not > match signer information of other classes in the same package > java.lang.SecurityException: class "javax

SparkContext and multi threads

2014-09-11 Thread moon soo Lee
Hi, I'm trying to make Spark work in a multithreaded Java application. What I'm trying to do is: - create a single SparkContext - create multiple SparkILoops and SparkIMains - inject the created SparkContext into the SparkIMain interpreter. A thread is created for every user request and takes a SparkILoop and in

Re: single worker vs multiple workers on each machine

2014-09-11 Thread Sean Owen
As I understand, there's generally not an advantage to running many executors per machine. Each will already use all the cores, and multiple executors just means splitting the available memory instead of having one big pool. I think there may be an argument at extremes of scale where one JVM with a

Re: Out of memory with Spark Streaming

2014-09-11 Thread Tim Smith
I noticed that, by default, in CDH-5.1 (Spark 1.0.0), in both, StandAlone and Yarn mode - no GC options are set when an executor is launched. The only options passed in StandAlone mode are "-XX:MaxPermSize=128m -Xms16384M -Xmx16384M" (when I give each executor 16G). In Yarn mode, even fewer JVM op

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Yin Huai
What is the schema of "table"? On Thu, Sep 11, 2014 at 4:30 PM, jamborta wrote: > thanks. this was actually using hivecontext. > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/spark-sql-create-new-table-as-select-from-table-tp14006p14009.html > Sen

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Yin Huai
Oh, never mind. The support for CTAS queries is pretty limited. Can you try to first create the table and then use INSERT INTO? On Thu, Sep 11, 2014 at 6:45 PM, Yin Huai wrote: > What is the schema of "table"? > > On Thu, Sep 11, 2014 at 4:30 PM, jamborta wrote: > >> thanks. this was actually us
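
Following that suggestion, a hedged sketch of the two-step alternative (schema and table names are invented, and it assumes a HiveContext `hc`):

```scala
// Create the target table up front instead of relying on CTAS support...
hc.sql("""CREATE TABLE IF NOT EXISTS new_table (id INT, name STRING)
          |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
          |LINES TERMINATED BY '\n'
          |STORED AS TEXTFILE
          |LOCATION '/user/test/new_table'""".stripMargin)

// ...then populate it from the source table.
hc.sql("INSERT INTO TABLE new_table SELECT id, name FROM source_table")
```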

Re: Spark SQL and running parquet tables?

2014-09-11 Thread Yin Huai
It is in SQLContext ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext ). On Thu, Sep 11, 2014 at 3:21 PM, DanteSama wrote: > Michael Armbrust wrote > > You'll need to run parquetFile("path").registerTempTable("name") to > > refresh the table. > > I'm not
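
A short sketch of the refresh pattern under discussion (the path and table name are placeholders, and an existing SparkContext `sc` is assumed; note that on 1.0.x the method on SchemaRDD is named registerAsTable, while registerTempTable is the 1.1 name):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Re-read the parquet directory and register it again so rows written since the last
// registration become visible to SQL queries.
sqlContext.parquetFile("hdfs:///warehouse/events").registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
```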

Re: Re: Spark SQL -- more than two tables for join

2014-09-11 Thread Yin Huai
1.0.1 does not have support for outer joins (added in 1.1). Can you try the 1.1 branch? On Wed, Sep 10, 2014 at 9:28 PM, boyingk...@163.com wrote: > Hi,michael : > > I think Arthur.hk.chan isn't here now,I Can > Show something: > 1)my spark version is 1.0.1 > 2) when I use multiple join ,like t

Spark Streaming in 1 hour batch duration RDD files gets lost

2014-09-11 Thread Jeoffrey Lim
Hi, Our spark streaming app is configured to pull data from Kafka in 1 hour batch duration which performs aggregation of data by specific keys and store the related RDDs to HDFS in the transform phase. We have tried checkpoint of 7 days on the DStream of Kafka to ensure that the generated stream

Backwards RDD

2014-09-11 Thread Victor Tso-Guillen
Iterating an RDD gives you each partition in order of their split index. I'd like to be able to get each partition in reverse order, but I'm having difficulty implementing the compute() method. I thought I could do something like this: override def getDependencies: Seq[Dependency[_]] = { Se

Announcing Spark 1.1.0!

2014-09-11 Thread Patrick Wendell
I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is the second release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 171 developers! This release brings operational and performance improvements in Spark core including a new implement

Re: Backwards RDD

2014-09-11 Thread Victor Tso-Guillen
I'm now making the Backwards RDD take the previous RDD's partitions and then using those to iterate. Passes my test. Is it kosher? On Thu, Sep 11, 2014 at 5:00 PM, Victor Tso-Guillen wrote: > Iterating an RDD gives you each partition in order of their split index. > I'd like to be able to get ea
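
For readers following the thread, a rough sketch of that approach (an illustration, not the poster's actual code): child partition i remembers and computes parent partition n - 1 - i, so collecting or iterating the RDD yields the partitions in reverse split order. The implicit one-to-one dependency created by the RDD(prev) constructor only approximates lineage and locality here; a custom NarrowDependency mapping i to n - 1 - i would be more precise.

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Each child partition carries the parent partition it mirrors, so the task has everything
// it needs once its own split is serialized to the executor.
class MirroredPartition(override val index: Int, val parent: Partition) extends Partition

class ReversePartitionOrderRDD[T: ClassTag](prev: RDD[T]) extends RDD[T](prev) {

  // Child partition i maps to parent partition (n - 1 - i).
  override def getPartitions: Array[Partition] = {
    val parents = prev.partitions
    Array.tabulate[Partition](parents.length) { i =>
      new MirroredPartition(i, parents(parents.length - 1 - i))
    }
  }

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    prev.iterator(split.asInstanceOf[MirroredPartition].parent, context)
}
```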

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Nicholas Chammas
Nice work everybody! I'm looking forward to trying out this release! On Thu, Sep 11, 2014 at 8:12 PM, Patrick Wendell wrote: > I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is > the second release on the API-compatible 1.X line. It is Spark's > largest release ever, with co

Configuring Spark for heterogenous hardware

2014-09-11 Thread Victor Tso-Guillen
So I have a bunch of hardware with different core and memory setups. Is there a way to do one of the following: 1. Express a ratio of cores to memory to retain. The spark worker config would represent all of the cores and all of the memory usable for any application, and the application would take

History server: ERROR ReplayListenerBus: Exception in parsing Spark event log

2014-09-11 Thread SK
Hi, I am using Spark 1.0.2 on a mesos cluster. After I run my job, when I try to look at the detailed application stats using a history server@18080, the stats don't show up for some of the jobs even though the job completed successfully and the event logs are written to the log folder. The log f

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
I see the binary packages include Hadoop 1, 2.3 and 2.4. Does Spark 1.1.0 support Hadoop 2.5.0, at the address below? http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Friday, Septembe

coalesce on SchemaRDD in pyspark

2014-09-11 Thread Brad Miller
Hi All, I'm having some trouble with the coalesce and repartition functions for SchemaRDD objects in pyspark. When I run: sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', '{"foo":"baz"}'])).coalesce(1) I get this error: Py4JError: An error occurred while calling o94.coalesce. Trace: py4j.Py4JEx

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Tobias Pfeiffer
Hi, On Fri, Sep 12, 2014 at 9:12 AM, Patrick Wendell wrote: > I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is > the second release on the API-compatible 1.X line. It is Spark's > largest release ever, with contributions from 171 developers! > Great, congratulations!! The

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Denny Lee
I’m not sure if I’m completely answering your question here, but I’m currently working (on OSX) with Hadoop 2.5 and I used the Spark 1.1 build for Hadoop 2.4 without any issues. On September 11, 2014 at 18:11:46, Haopu Wang (hw...@qilinsoft.com) wrote: I see the binary packages include hadoop 1, 2.3

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
Denny, thanks for the response. I raise the question because in Spark 1.0.2, I saw one binary package for hadoop2, but in Spark 1.1.0, there are separate packages for Hadoop 2.3 and 2.4. That implies some difference in Spark according to the Hadoop version. F

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-11 Thread Denny Lee
It sort of depends on the definition of efficiently.  From a work flow perspective I would agree but from an I/O perspective, wouldn’t there be the same multi-pass from the standpoint of the Hive context needing to push the data into HDFS?  Saying this, if you’re pushing the data into HDFS and t

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Denny Lee
Please correct me if I’m wrong but I was under the impression as per the maven repositories that it was just to stay more in sync with the various version of Hadoop.  Looking at the latest documentation (https://spark.apache.org/docs/latest/building-with-maven.html), there are multiple Hadoop v

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
From the web page (https://spark.apache.org/docs/latest/building-with-maven.html) which is pointed out by you, it’s saying “Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you’ll need to build Spark against the specific HDFS version in your environment.”

Applications status missing when Spark HA(zookeeper) enabled

2014-09-11 Thread jason chen
Hi guys, I configured Spark with the configuration in spark-env.sh: export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=host1:2181,host2:2181,host3:2181 -Dspark.deploy.zookeeper.dir=/spark" And I started spark-shell on one master host1(active): MASTER

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Denny Lee
Yes, at least for my query scenarios, I have been able to use Spark 1.1 with Hadoop 2.4 against Hadoop 2.5.  Note, Hadoop 2.5 is considered a relatively minor release (http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available) where Hadoop 2.4 and 2.3 were considered mo

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
Got it, thank you, Denny! From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 11:04 AM To: user@spark.apache.org; Haopu Wang; d...@spark.apache.org; Patrick Wendell Subject: RE: Announcing Spark 1.1.0! Yes, atleast for my query

Re: DistCP - Spark-based

2014-09-11 Thread Nicholas Chammas
I've created SPARK-3499 to track creating a Spark-based distcp utility. Nick On Tue, Aug 12, 2014 at 4:20 PM, Matei Zaharia wrote: > Good question; I don't know of one but I believe people at Cloudera had > some thoughts of porting Sqoop to Spa

Re: Spark SQL JDBC

2014-09-11 Thread Denny Lee
When you re-ran sbt, did you clear out the packages first and ensure that the datanucleus jars were generated within lib_managed? I remember having to do that when I was testing out different configs. On Thu, Sep 11, 2014 at 10:50 AM, alexandria1101 < alexandria.shea...@gmail.com> wrote:

Re: Spark SQL Thrift JDBC server deployment for production

2014-09-11 Thread Denny Lee
Could you provide some context about running this in yarn-cluster mode? The Thrift server that's included within Spark 1.1 is based on Hive 0.12. Hive has been able to work against YARN since Hive 0.10. So when you start the thrift server, provided you copied the hive-site.xml over to the Spark co

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Tim Smith
Thanks for all the good work. Very excited about seeing more features and better stability in the framework. On Thu, Sep 11, 2014 at 5:12 PM, Patrick Wendell wrote: > I am happy to announce the availability of Spark 1.1.0! Spark 1.1.0 is > the second release on the API-compatible 1.X line. It i

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Matei Zaharia
Thanks to everyone who contributed to implementing and testing this release! Matei On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote: Thanks for all the good work. Very excited about seeing more features and better stability in the framework. On Thu, Sep 11, 2014 at 5:

Re: compiling spark source code

2014-09-11 Thread rapelly kartheek
I have been doing that, but none of the modifications to the code are being compiled. On Thu, Sep 11, 2014 at 10:45 PM, Daniil Osipov wrote: > In the spark source folder, execute `sbt/sbt assembly` > > On Thu, Sep 11, 2014 at 8:27 AM, rapelly kartheek > wrote: > >> HI, >> >> >> Can someone please

Re: Announcing Spark 1.1.0!

2014-09-11 Thread Debasish Das
Congratulations on the 1.1 release ! On Thu, Sep 11, 2014 at 9:08 PM, Matei Zaharia wrote: > Thanks to everyone who contributed to implementing and testing this > release! > > Matei > > On September 11, 2014 at 11:52:43 PM, Tim Smith (secs...@gmail.com) wrote: > > Thanks for all the good work. V

RE: Spark SQL JDBC

2014-09-11 Thread Cheng, Hao
I copied the 3 datanucleus jars (datanucleus-api-jdo-3.2.1.jar, datanucleus-core-3.2.2.jar, datanucleus-rdbms-3.2.1.jar) to the folder lib/ manually, and it works for me. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 11:28 AM To: alexandria1101 Cc: u...@spark.incu

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-11 Thread Du Li
SchemaRDD has a method insertInto(table). When the table is partitioned, it would be more sensible and convenient to extend it with a list of partition keys and values. From: Denny Lee mailto:denny.g@gmail.com>> Date: Thursday, September 11, 2014 at 6:39 PM To: Du Li mailto:l...@yahoo-inc.c
