Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Sorry for the trouble. There are two issues here: parsing of repeated nested fields (i.e. something[0].field) is not supported in the plain SQL parser (SPARK-2096), and resolution is broken in the HiveQL parser (SPARK-2483).

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread anyweil
Yes, just as in my last post, using [] to access array data and "." to access nested fields seems not to work. BTW, I have dug into the code of the current master branch: spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala from l

Re: KMeansModel Constructor error

2014-07-14 Thread Xiangrui Meng
I don't think MLlib supports model serialization/deserialization. You got the error because the constructor is private. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2488 and we'll try to make sure it is implemented in v1.1. For now, you can modify the KMeansModel and remove p

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread anyweil
Thank you so much for the reply, here is my code. 1. val conf = new SparkConf().setAppName("Simple Application") 2. conf.setMaster("local") 3. val sc = new SparkContext(conf) 4. val sqlContext = new org.apache.spark.sql.SQLContext(sc) 5. import sqlContext.createSchemaRDD 6. val path1 =

Re: ALS on EC2

2014-07-14 Thread Xiangrui Meng
Could you share the code of RecommendationALS and the complete spark-submit command line options? Thanks! -Xiangrui On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S wrote: > Using properties file: null > Main class: > RecommendationALS > Arguments: > _train.csv > _validation.csv > _test.csv > Syste

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
In general this should be supported using [] to access array data and "." to access nested fields. Is there something you are trying that isn't working? On Mon, Jul 14, 2014 at 11:25 PM, anyweil wrote: > I mean the query on the nested data such as JSON, not the nested query, > sorry > for the
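
For reference, a minimal sketch of the syntax Michael describes, written against Spark SQL in Scala. The file path and field names (address.city, schools, sname) are hypothetical, and later messages in these threads report parser bugs around the [n].field form in 1.0.1 (SPARK-2096, SPARK-2483), so treat this as the intended syntax rather than something guaranteed to run on that release:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("NestedJson"))
    val sqlContext = new SQLContext(sc)

    // e.g. records like {"name": "a", "address": {"city": "x"}, "schools": [{"sname": "y"}]}
    val people = sqlContext.jsonFile("people.json")
    people.registerAsTable("people")

    sqlContext.sql("SELECT address.city FROM people").collect()      // "." steps into a struct
    sqlContext.sql("SELECT schools[0].sname FROM people").collect()  // "[n]" indexes into an array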

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Nicholas Chammas
Hmm, I'd like to clarify something from your comments, Tathagata. Going forward, is Twitter Streaming functionality not supported from the shell? What should users do if they'd like to process live Tweets from the shell? Nick On Mon, Jul 14, 2014 at 11:50 PM, Nicholas Chammas < nicholas.cham...

Re: Spark SQL throws ClassCastException on first try; works on second

2014-07-14 Thread Nicholas Chammas
Ah, good catch, that seems to be it. I'd use 1.0.1, except I've been hitting up against SPARK-2471 with that version, which doesn't let me access my data in S3. :( OK, at least I know this has probably already been fixed. Nick On Tue, Jul 15,

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread anyweil
I mean the query on the nested data such as JSON, not the nested query, sorry for the misunderstanding. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Query-the-nested-JSON-data-With-Spark-SQL-1-0-1-tp9544p9726.html Sent from the Apache Spark User List mail

ALS on EC2

2014-07-14 Thread Srikrishna S
Using properties file: null Main class: RecommendationALS Arguments: _train.csv _validation.csv _test.csv System properties: SPARK_SUBMIT -> true spark.app.name -> RecommendationALS spark.jars -> file:/root/projects/spark-recommendation-benchmark/benchmark_mf/target/scala-2.10/recommendation-bench

Re: hdfs replication on saving RDD

2014-07-14 Thread Andrew Ash
In general it would be nice to be able to configure replication on a per-job basis. Is there a way to do that without changing the config values in the Hadoop conf/ directory between jobs? Maybe by modifying OutputFormats or the JobConf ? On Mon, Jul 14, 2014 at 11:12 PM, Matei Zaharia wrote:

Re: Spark SQL throws ClassCastException on first try; works on second

2014-07-14 Thread Michael Armbrust
You might be hitting SPARK-1994 , which is fixed in 1.0.1. On Mon, Jul 14, 2014 at 11:16 PM, Nick Chammas wrote: > I’m running this query against RDD[Tweet], where Tweet is a simple case > class with 4 fields. > > sqlContext.sql(""" > SELECT u

Re: RACK_LOCAL Tasks Failed to finish

2014-07-14 Thread 洪奇
I am just running PageRank (included in GraphX) on a dataset which has 55876487 edges. I submit the application to YARN with options `--num-executors 30 --executor-memory 30g --driver-memory 10g --executor-cores 8`. Thanks -- From: Ankur Da

Re: SparkR failed to connect to the master

2014-07-14 Thread Shivaram Venkataraman
You'll need to build SparkR to match the Spark version deployed on the cluster. You can do that by changing the Spark version in SparkR's build.sbt [1]. If you are using the Maven build you'll need to edit pom.xml Thanks Shivaram [1] https://github.com/amplab-extras/SparkR-pkg/blob/master/pkg/src
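
For illustration only, the kind of sbt dependency line that would be pinned to the deployed Spark version; the actual layout of SparkR's build.sbt may differ, so this is a hedged sketch rather than a patch to that file:

    // build.sbt (sketch): make the Spark dependency match the cluster's version
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1"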

Spark SQL throws ClassCastException on first try; works on second

2014-07-14 Thread Nick Chammas
I’m running this query against RDD[Tweet], where Tweet is a simple case class with 4 fields. sqlContext.sql(""" SELECT user, COUNT(*) as num_tweets FROM tweets GROUP BY user ORDER BY num_tweets DESC, user ASC ; """).take(5) The first time I run this, it throws the following: 14

Eclipse Spark plugin and sample Scala projects

2014-07-14 Thread buntu
Hi -- I tried searching for eclipse spark plugin setup for developing with Spark and there seems to be some information I can go with but I have not seen a starter app or project to import into Eclipse and try it out. Can anyone please point me to any Scala projects to import into Scala Eclipse ID

KMeansModel Constructor error

2014-07-14 Thread Rohit Pujari
Hello Folks: I have written a simple program to read the already saved model from HDFS and score it. But when I'm trying to read the saved model, I get the following error. Any clues what might be going wrong here .. val x = sc.objectFile[Vector]("/data/model").collect() val y = new KMeansModel(

Re: Scheduling in spark

2014-07-14 Thread Kartheek.R
Thank you Andrew for the updated link. regards Karthik -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Scheduling-in-spark-tp9035p9717.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Scheduling in spark

2014-07-14 Thread Kartheek.R
Thank you so much for the link, Sujeet. regards Karthik -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Scheduling-in-spark-tp9035p9716.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark Streaming Json file groupby function

2014-07-14 Thread srinivas
Hi TD, Thanks for your help... I am able to convert the map to records using a case class. I am left with doing some aggregations. I am trying to do some SQL-type operations on my record set. My code looks like: case class Record(ID:Int,name:String,score:Int,school:String) //val records = jsonf.map(m =>

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Patrick Wendell
Adding new build modules is pretty high overhead, so if this is a case where a small amount of duplicated code could get rid of the dependency, that could also be a good short-term option. - Patrick On Mon, Jul 14, 2014 at 2:15 PM, Matei Zaharia wrote: > Yeah, I'd just add a spark-util that has

Re: SQL + streaming

2014-07-14 Thread hsy...@gmail.com
Actually, I deployed this on a YARN cluster (spark-submit) and I couldn't find any output in the YARN stdout logs. On Mon, Jul 14, 2014 at 6:25 PM, Tathagata Das wrote: > Can you make sure you are running locally on more than 1 local cores? You > could set the master in the SparkConf as conf.setM

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Madabhattula Rajesh Kumar
Hi Team, Is this issue present with the JavaStreamingContext.textFileStream("hdfsfolderpath") API also? Please confirm. If yes, could you please help me fix this issue? I'm using Spark version 1.0.0. Regards, Rajesh On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das wrote: > Oh yes, this was a bug and it

Re: Error when testing with large sparse svm

2014-07-14 Thread crater
(1) What is "number of partitions"? Is it number of workers per node? (2) I already set the driver memory pretty big, which is 25g. (3) I am running Spark 1.0.1 in standalone cluster with 9 nodes, 1 one them works as master, others are workers. -- View this message in context: http://apache-s

Re: RACK_LOCAL Tasks Failed to finish

2014-07-14 Thread Ankur Dave
What GraphX application are you running? If it's a custom application that calls RDD.unpersist, that might cause RDDs to be recomputed. It's tricky to do unpersisting correctly, so you might try not unpersisting and see if that helps. Ankur

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Nicholas Chammas
> > At some point, you were able to access TwitterUtils from spark shell using > Spark 1.0.0+ ? Yep. > If yes, then what change in Spark caused it to not work any more? It still works for me. I was just commenting on your remark that it doesn't work through the shell, which I now understand t

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
Oh right, that could have happened only after Spark 1.0.0. So let me clarify. At some point, you were able to access TwitterUtils from spark shell using Spark 1.0.0+ ? If yes, then what change in Spark caused it to not work any more? TD On Mon, Jul 14, 2014 at 7:52 PM, Nicholas Chammas < nichol

truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-14 Thread Walrus theCat
Hi, I've got a socketTextStream through which I'm reading input. I have three Dstreams, all of which are the same window operation over that socketTextStream. I have a four core machine. As we've been covering lately, I have to give a "cores" parameter to my StreamingSparkContext: ssc = new St

Re: hdfs replication on saving RDD

2014-07-14 Thread Matei Zaharia
You can change this setting through SparkContext.hadoopConfiguration, or put the conf/ directory of your Hadoop installation on the CLASSPATH when you launch your app so that it reads the config values from there. Matei On Jul 14, 2014, at 8:06 PM, valgrind_girl <124411...@qq.com> wrote: > eag
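
A minimal sketch of the first option, assuming an existing SparkContext sc; the replication factor and the output path are placeholders:

    // Set the HDFS replication factor for output written by this job only
    sc.hadoopConfiguration.set("dfs.replication", "2")

    val result = sc.parallelize(1 to 1000).map(_.toString)
    result.saveAsTextFile("hdfs:///tmp/output-replicated-twice")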

Re: hdfs replication on saving RDD

2014-07-14 Thread valgrind_girl
Eager to know about this issue too. Does anyone know how? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hdfs-replication-on-saving-RDD-tp289p9700.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

branch-1.0-jdbc on EC2?

2014-07-14 Thread billk
I'm wondering if anyone has had success with an EC2 deployment of the https://github.com/apache/spark/tree/branch-1.0-jdbc branch that Michael Armbrust referenced in his Unified Data Access with Spark SQL

Re: Announcing Spark 1.0.1

2014-07-14 Thread Tobias Pfeiffer
Hi, congratulations on the release! I'm always pleased to see how features pop up in new Spark versions that I had added for myself in a very hackish way before (such as JSON support for Spark SQL). I am wondering if there is any good way to learn early about what is going to be in upcoming versi

RACK_LOCAL Tasks Failed to finish

2014-07-14 Thread 洪奇
Hi all, when running GraphX applications on Spark, the task scheduler may schedule some tasks to be executed on RACK_LOCAL executors, but the tasks halt in that case, repeatedly printing the following log information: 14-07-14 15:59:14 INFO [Executor task launch worker-6] BlockFetcherIterator$Basic

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Nicholas Chammas
If we're talking about the issue you captured in SPARK-2464 , then it was a newly launched EC2 cluster on 1.0.1. On Mon, Jul 14, 2014 at 10:48 PM, Tathagata Das wrote: > Did you make any updates in Spark version recently, after which you > notic

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
Did you make any updates in Spark version recently, after which you noticed this problem? Because if you were using Spark 0.8 and below, then twitter would have worked in the Spark shell. In Spark 0.9, we moved those dependencies out of the core spark for those to update more freely without raising

Re: ---cores option in spark-shell

2014-07-14 Thread Andrew Or
Yes, the documentation is actually a little outdated. We will get around to fixing it shortly. Please use --driver-cores or --executor-cores instead. 2014-07-14 19:10 GMT-07:00 cjwang : > They don't work in the new 1.0.1 either > > > > -- > View this message in context: > http://apache-spark-user-

Re: jsonRDD: NoSuchMethodError

2014-07-14 Thread Michael Armbrust
Have you upgraded the cluster where you are running this to 1.0.1 as well? A NoSuchMethodError almost always means that the class files available at runtime are different from those that were there when you compiled your program. On Mon, Jul 14, 2014 at 7:06 PM, SK wrote: > Hi, > > I am using Spa

Re: ---cores option in spark-shell

2014-07-14 Thread cjwang
They don't work in the new 1.0.1 either -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cores-option-in-spark-shell-tp6809p9690.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: spark1.0.1 catalyst transform filter not push down

2014-07-14 Thread Victor Sheng
I used queryPlan.queryExecution.analyzed to get the logical plan, and it works. What you explained to me is very useful. Thank you very much. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-catalyst-transform-filter-not-push-down-tp9599p9689.ht

jsonRDD: NoSuchMethodError

2014-07-14 Thread SK
Hi, I am using Spark 1.0.1. I am using the following piece of code to parse a json file. It is based on the code snippet in the SparkSQL programming guide. However, the compiler outputs an error stating: java.lang.NoSuchMethodError: org.apache.spark.sql.SQLContext.jsonRDD(Lorg/apache/spark/rdd/R

SPARK_WORKER_PORT (standalone cluster)

2014-07-14 Thread jay vyas
Hi Spark! What is the purpose of the randomly assigned SPARK_WORKER_PORT? From the documentation it says to "join a cluster", but it's not clear to me how a random port could be used to communicate with other members of a Spark pool. This question might be grounded in my ignorance ... if so plea

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
BTW you can see the number of parallel tasks in the application UI (http://localhost:4040) or in the log messages (e.g. when it says progress: 17/20, that means there are 20 tasks total and 17 are done). Spark will try to use at least one task per core in local mode so there might be more of the

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
I see, so here might be the problem. With more cores, there's less memory available per core, and now many of your threads are doing external hashing (spilling data to disk), as evidenced by the calls to ExternalAppendOnlyMap.spill. Maybe with 10 threads, there was enough memory per task to do

Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
I am running Spark 1.0.1 on a 5 node yarn cluster. I have set the driver memory to 8G and executor memory to about 12G. Regards, Krishna On Mon, Jul 14, 2014 at 5:56 PM, Xiangrui Meng wrote: > Is it on a standalone server? There are several settings worthing checking: > > 1) number of partition

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Nicholas Chammas
On Mon, Jul 14, 2014 at 6:52 PM, Tathagata Das wrote: > The twitter functionality is not available through the shell. > I've been processing Tweets live from the shell, though not for a long time. That's how I uncovered the problem with the Twitter receiver not deregistering, btw. Did I misunde

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
I guess this is not clearly documented. At a high level, any class that is in the package org.apache.spark.streaming.XXX where XXX is in { twitter, kafka, flume, zeromq, mqtt } is not available in the Spark shell. I have added this to the larger JIRA of things-to-add-to-streaming-docs https://

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
Can you make sure you are running locally on more than 1 local core? You could set the master in the SparkConf as conf.setMaster("local[4]"). Then see if there are jobs running on every batch of data in the Spark web UI (running on localhost:4040). If you still don't get any output, try first simpl

Re: SparkR failed to connect to the master

2014-07-14 Thread cjwang
I tried installing the latest Spark 1.0.1 and SparkR couldn't find the master either. I restarted with Spark 0.9.1 and SparkR was able to find the master. So, there seemed to be something that changed after Spark 1.0.0. -- View this message in context: http://apache-spark-user-list.1001560.n3

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
Is it on a standalone server? There are several settings worth checking: 1) number of partitions, which should match the number of cores; 2) driver memory (you can see it from the executor tab of the Spark WebUI and set it with "--driver-memory 10g"); 3) the version of Spark you were running. Best
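
As an illustration of point (1), a hedged sketch of controlling the partition count when loading LIBSVM-formatted data; the path and partition count are placeholders, and sc is an existing SparkContext:

    import org.apache.spark.mllib.util.MLUtils

    val numPartitions = 16   // roughly the total number of cores in the cluster
    val training = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
      .repartition(numPartitions)
      .cache()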

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
Thanks. Can I see somewhere in the API docs that a class is not available in the shell, or do I have to find out by trial and error? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/import-org-apache-spark-streaming-twitter-in-Shell-tp9665p9678.html Sent from

Re: SQL + streaming

2014-07-14 Thread hsy...@gmail.com
No errors but no output either... Thanks! On Mon, Jul 14, 2014 at 4:59 PM, Tathagata Das wrote: > Could you elaborate on what is the problem you are facing? Compiler error? > Runtime error? Class-not-found error? Not receiving any data from Kafka? > Receiving data but SQL command throwing error

running spark from intellj

2014-07-14 Thread jamborta
hi all, I have simple example that reads a file in and counts the number of rows as follows: val conf = new SparkConf() .setMaster("spark://spark-master:7077") .setAppName("Test") .set("spark.executor.memory", "256m") val sc = new SparkContext(conf) val data = sc.textFil

Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Tathagata Das
Oh yes, this was a bug and it has been fixed. Checkout from the master branch! https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC TD On Mon, Jul 7,

Re: Spark-Streaming collect/take functionality.

2014-07-14 Thread Tathagata Das
Why doesn't something like this work? If you want a continuously updated reference to the top counts, you can use a global variable. var topCounts: Array[(String, Int)] = null sortedCounts.foreachRDD (rdd => val currentTopCounts = rdd.take(10) // print currentTopCounts or whatever top
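
A fleshed-out sketch of that pattern, assuming a newline-delimited text source on localhost:9999; the host, port, batch interval and word-count pipeline are placeholders around the foreachRDD/take(10) idea:

    import org.apache.spark.SparkConf
    import org.apache.spark.SparkContext._                  // sortByKey on pair RDDs
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._    // reduceByKey on pair DStreams

    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("TopWords"), Seconds(10))

    val words  = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map((_, 1)).reduceByKey(_ + _)
    val sortedCounts = counts.map { case (w, c) => (c, w) }.transform(_.sortByKey(false))

    // Continuously updated reference to the latest top 10, per the suggestion above
    var topCounts: Array[(Int, String)] = null
    sortedCounts.foreachRDD { rdd =>
      val currentTopCounts = rdd.take(10)
      topCounts = currentTopCounts
      currentTopCounts.foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()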

Re: SQL + streaming

2014-07-14 Thread Tathagata Das
Could you elaborate on what is the problem you are facing? Compiler error? Runtime error? Class-not-found error? Not receiving any data from Kafka? Receiving data but SQL command throwing error? No errors but no output either? TD On Mon, Jul 14, 2014 at 4:06 PM, hsy...@gmail.com wrote: > Hi Al

Re: How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
Before "yarn application -kill" If you do jps You'll have a list of SparkSubmit and ApplicationMaster After you use yarn applicaton -kill you only kill the SparkSubmit On Mon, Jul 14, 2014 at 4:29 PM, Jerry Lam wrote: > Then yarn application -kill appid should work. This is what I did 2 hours

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Then yarn application -kill appid should work. This is what I did 2 hours ago. Sorry I cannot provide more help. Sent from my iPhone > On 14 Jul, 2014, at 6:05 pm, "hsy...@gmail.com" wrote: > > yarn-cluster > > >> On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam wrote: >> Hi Siyuan, >> >> I won

Spark-Streaming collect/take functionality.

2014-07-14 Thread jon.burns
Hello everyone, I'm an undergrad working on a summarization project. I've created a summarizer in normal Spark and it works great; however, I want to write it for Spark Streaming to increase its functionality. Basically I take in a bunch of text and get the most popular words as well as most popul

Re: Stateful RDDs?

2014-07-14 Thread Tathagata Das
Trying to answer your questions as concisely as possible: 1. In the current implementation, the entire state RDD needs to be loaded for any update. It is a known limitation that we want to overcome in the future. Therefore the state DStream should not be persisted to disk, as all the data in the state RD

SQL + streaming

2014-07-14 Thread hsy...@gmail.com
Hi All, A couple of days ago, I tried to integrate SQL and streaming together. My understanding is I can transform an RDD from a DStream to a SchemaRDD and execute SQL on each RDD. But I've had no luck. Would you guys help me take a look at my code? Thank you very much! object KafkaSpark { def main(args: Arr

Change when loading/storing String data using Parquet

2014-07-14 Thread Michael Armbrust
I just wanted to send out a quick note about a change in the handling of strings when loading / storing data using parquet and Spark SQL. Before, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as Strings. 9fe693

Re: import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread Tathagata Das
The twitter functionality is not available through the shell. 1) We separated these non-core functionalities into separate subprojects so that their dependencies do not collide with/pollute those of core spark. 2) A shell is not really the best way to start a long-running stream. It's best to use twitte
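
For reference, a hedged sketch of what a minimal standalone Twitter application looks like, compiled against spark-streaming-twitter and launched with spark-submit; it assumes the twitter4j OAuth credentials are supplied as system properties:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.twitter.TwitterUtils

    object TwitterPrinter {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("TwitterPrinter"), Seconds(10))
        // None = pick up OAuth credentials from twitter4j.oauth.* system properties
        val tweets = TwitterUtils.createStream(ssc, None)
        tweets.map(_.getText).print()
        ssc.start()
        ssc.awaitTermination()
      }
    }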

import org.apache.spark.streaming.twitter._ in Shell

2014-07-14 Thread durin
I'm using spark > 1.0.0 (a three-week-old build of the latest). Along the lines of this tutorial, I want to read some tweets from twitter. When trying to execute in the Spark shell, I get The tutorial

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
In general it may be a better idea to actually convert the records from hashmaps to a specific data structure. Say case class Record(id: Int, name: String, mobile: String, score: Int, test_type: String ... ) Then you should be able to do something like val records = jsonf.map(m => convertMapToR
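
A sketch of that conversion; the field names mirror the records shown elsewhere in this thread, and the helper name convertMapToRecord is purely illustrative:

    case class Record(id: Int, name: String, mobile: String, score: Int, testType: String)

    def convertMapToRecord(m: Map[String, Any]): Record =
      Record(
        m("id").toString.toInt,
        m("name").toString,
        m("mobile").toString,
        m("score").toString.toInt,
        m("test_type").toString)

    // assuming jsonf is a DStream (or RDD) of Map[String, Any]:
    // val records = jsonf.map(convertMapToRecord)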

Re: Number of executors change during job running

2014-07-14 Thread Tathagata Das
Can you give me a screen shot of the stages page in the web ui, the spark logs, and the code that is causing this behavior. This seems quite weird to me. TD On Mon, Jul 14, 2014 at 2:11 PM, Bill Jay wrote: > Hi Tathagata, > > It seems repartition does not necessarily force Spark to distribute

Re: Spark Streaming Json file groupby function

2014-07-14 Thread srinivas
Hi, thanks for your reply... I imported StreamingContext and right now I am getting my DStream as something like map(id -> 123, name -> srini, mobile -> 12324214, score -> 123, test_type -> math) map(id -> 321, name -> vasu, mobile -> 73942090, score -> 324, test_type -> sci) map(id -> 432, name -

Re: How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
yarn-cluster On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam wrote: > Hi Siyuan, > > I wonder if you --master yarn-cluster or yarn-client? > > Best Regards, > > Jerry > > > On Mon, Jul 14, 2014 at 5:08 PM, hsy...@gmail.com > wrote: > >> Hi all, >> >> A newbie question, I start a spark yarn applicat

Parsing Json object definition spanning multiple lines

2014-07-14 Thread SK
Hi, I have a json file where the definition of each object spans multiple lines. An example of one object definition appears below. { "name": "16287e9cdf", "width": 500, "height": 325, "width": 1024, "height": 665, "obj": [ { "x": 395.08, "y": 82.09,

Re: SparkR failed to connect to the master

2014-07-14 Thread cjwang
I restarted Spark Master with spark-0.9.1 and SparkR was able to communicate with the Master. I am using the latest SparkR pkg-e1f95b6. Maybe it has problem communicating to Spark 1.0.0? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-failed-to-co

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Hi Siyuan, I wonder if you --master yarn-cluster or yarn-client? Best Regards, Jerry On Mon, Jul 14, 2014 at 5:08 PM, hsy...@gmail.com wrote: > Hi all, > > A newbie question, I start a spark yarn application through spark-submit > > How do I kill this app. I can kill the yarn app by "ya

Re: pyspark sc.parallelize running OOM with smallish data

2014-07-14 Thread Mohit Jaggi
Continuing to debug with Scala, I tried this on local with enough memory (10g) and it is able to count the dataset. With more memory(for executor and driver) in a cluster it still fails. The data is about 2Gbytes. It is 30k * 4k doubles. On Sat, Jul 12, 2014 at 6:31 PM, Aaron Davidson wrote: >

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
Depending on how your C++ program is designed, maybe you can feed the data from multiple partitions into the same process? Getting the results back might be tricky. But that may be the only way to guarantee you're only using one invocation per node. On Mon, Jul 14, 2014 at 5:12 PM, Matei Zaharia

Re: Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Matei Zaharia
The script should be there, in the spark/bin directory. What command did you use to launch the cluster? Matei On Jul 14, 2014, at 1:12 PM, Josh Happoldt wrote: > Hi All, > > I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on > EC2. It appears that the spark-submit

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Matei Zaharia
Yeah, I'd just add a spark-util that has these things. Matei On Jul 14, 2014, at 1:04 PM, Michael Armbrust wrote: > Yeah, sadly this dependency was introduced when someone consolidated the > logging infrastructure. However, the dependency should be very small and > thus easy to remove, and I

Re: Memory & compute-intensive tasks

2014-07-14 Thread Matei Zaharia
I think coalesce with shuffle=true will force it to have one task per node. Without that, it might be that due to data locality it decides to launch multiple ones on the same node even though the total # of tasks is equal to the # of nodes. If this is the *only* thing you run on the cluster, yo
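
A minimal sketch of the coalesce-with-shuffle idea, assuming an existing SparkContext sc; numNodes, the input path and the per-partition work are placeholders:

    val numNodes = 8
    val data = sc.textFile("hdfs:///input/records.txt")

    // shuffle = true forces a real repartition into exactly numNodes partitions,
    // rather than merging by locality (which can stack several tasks on one node)
    val onePartitionPerNode = data.coalesce(numNodes, shuffle = true)

    val results = onePartitionPerNode.mapPartitions { iter =>
      // one pass over the whole partition, e.g. one invocation of the external program
      Iterator(iter.size)
    }.collect()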

Re: Number of executors change during job running

2014-07-14 Thread Bill Jay
Hi Tathagata, It seems repartition does not necessarily force Spark to distribute the data into different executors. I have launched a new job which uses repartition right after I received data from Kafka. For the first two batches, the reduce stage used more than 80 executors. Starting from the t

How to kill running spark yarn application

2014-07-14 Thread hsy...@gmail.com
Hi all, A newbie question, I start a spark yarn application through spark-submit How do I kill this app. I can kill the yarn app by "yarn application -kill appid" but the application master is still running. What's the proper way to shutdown the entire app? Best, Siyuan

Re: Memory & compute-intensive tasks

2014-07-14 Thread Daniel Siegmann
I don't have a solution for you (sorry), but do note that rdd.coalesce(numNodes) keeps data on the same nodes where it was. If you set shuffle=true then it should repartition and redistribute the data. But it uses the hash partitioner according to the ScalaDoc - I don't know of any way to supply a

Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Soumya Simanta
Please look at the following. https://github.com/ooyala/spark-jobserver http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language https://github.com/EsotericSoftware/kryo You can train your model convert it to PMML and return that to your client OR You can train your model and write that mod
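
A rough sketch of the train-then-serialize idea, using plain Java serialization for brevity (Kryo or PMML, as mentioned above, are the alternatives); it assumes an existing SparkContext sc and uses toy data:

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.5, 1.0)),
      LabeledPoint(2.0, Vectors.dense(1.0, 2.0))))

    val model = LinearRegressionWithSGD.train(training, 100)   // 100 iterations

    // Serialize the model object; a client can deserialize it and call predict()
    val bytes = new ByteArrayOutputStream()
    new ObjectOutputStream(bytes).writeObject(model)
    val payload = bytes.toByteArray   // ship these bytes back to the client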

Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Aris Vlasakakis
Hello Spark community, I would like to write an application in Scala that is a model server. It should have an MLlib Linear Regression model that is already trained on some big set of data, and then is able to repeatedly call myLinearRegressionModel.predict() many times and return the result. Now,

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
Handling of complex types is somewhat limited in SQL at the moment. It'll be more complete if you use HiveQL. That said, the problem here is you are calling .name on an array. You need to pick an item from the array (using [..]) or use something like a lateral view explode. On Sat, Jul 12, 201
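
For illustration, the two alternatives written as queries against a hypothetical table people with an array-of-structs column schools, assuming existing sqlContext and hiveContext instances (the lateral view form is HiveQL):

    // 1) Pick a single element out of the array:
    sqlContext.sql("SELECT schools[0].sname FROM people")

    // 2) Flatten the array into one row per element (HiveQL, via a HiveContext):
    hiveContext.hql("SELECT s.sname FROM people LATERAL VIEW explode(schools) schoolsTable AS s")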

Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Josh Happoldt
Hi All, I've used the spark-ec2 scripts to build a simple 1.0.1 Standalone cluster on EC2. It appears that the spark-submit script is not bundled with a spark-ec2 install. Given that: What is the recommended way to execute spark jobs on a standalone EC2 cluster? Spark-submit provides extrem

Memory & compute-intensive tasks

2014-07-14 Thread Ravi Pandya
I'm trying to run a job that includes an invocation of a memory & compute-intensive multithreaded C++ program, and so I'd like to run one task per physical node. Using rdd.coalesce(# nodes) seems to just allocate one task per core, and so runs out of memory on the node. Is there any way to give the

Re: Nested Query With Spark SQL(1.0.1)

2014-07-14 Thread Michael Armbrust
What sort of nested query are you talking about? Right now we only support nested queries in the FROM clause. I'd like to add support for other cases in the future. On Sun, Jul 13, 2014 at 4:11 AM, anyweil wrote: > Or is it supported? I know I could doing it myself with filter, but if SQL > c

Re: Catalyst dependency on Spark Core

2014-07-14 Thread Michael Armbrust
Yeah, sadly this dependency was introduced when someone consolidated the logging infrastructure. However, the dependency should be very small and thus easy to remove, and I would like catalyst to be usable outside of Spark. A pull request to make this possible would be welcome. Ideally, we'd cre

Re: Ideal core count within a single JVM

2014-07-14 Thread lokesh.gidra
I am only playing with 'N' in local[N]. I thought that by increasing N, Spark would automatically use more parallel tasks. Isn't that so? Can you please tell me how I can modify the number of parallel tasks? For me, there are hardly any threads in BLOCKED state in the jstack output. In 'top' I see my app

Re: can't print DStream after reduce

2014-07-14 Thread Tathagata Das
The problem is not really for local[1] or local. The problem arises when there are more input streams than there are cores. But I agree, for people who are just beginning to use it by running it locally, there should be a check addressing this. I made a JIRA for this. https://issues.apache.org/jir

Re: Spark SQL 1.0.1 error on reading fixed length byte array

2014-07-14 Thread Michael Armbrust
This is not supported yet, but there is a PR open to fix it: https://issues.apache.org/jira/browse/SPARK-2446 On Mon, Jul 14, 2014 at 4:17 AM, Pei-Lun Lee wrote: > Hi, > > I am using spark-sql 1.0.1 to load parquet files generated from method > described in: > > https://gist.github.com/massie/7

Re: Error in JavaKafkaWordCount.java example

2014-07-14 Thread Tathagata Das
Are you compiling it within Spark using Spark's recommended way (see doc web page)? Or are you compiling it in your own project? In the latter case, make sure you are using the Scala 2.10.4. TD On Sun, Jul 13, 2014 at 6:43 AM, Mahebub Sayyed wrote: > Hello, > > I am referring following example

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-14 Thread Tathagata Das
When you are sending data using simple socket code to send messages, are those messages "\n"-delimited? If they're not, then the receiver of socketTextStream won't identify them as separate events, and will keep buffering them. TD On Sun, Jul 13, 2014 at 10:49 PM, kytay wrote: > Hi Tobias > > I have be
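
A tiny sketch of a test sender that emits newline-delimited events; the port and the message contents are placeholders:

    import java.io.PrintWriter
    import java.net.ServerSocket

    val server = new ServerSocket(9999)
    val socket = server.accept()                          // wait for socketTextStream to connect
    val out = new PrintWriter(socket.getOutputStream, true)

    (1 to 100).foreach { i =>
      out.println("event " + i)   // println appends the "\n" the receiver splits on
      Thread.sleep(100)
    }

    out.close(); socket.close(); server.close()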

Re: Ideal core count within a single JVM

2014-07-14 Thread Matei Zaharia
Are you increasing the number of parallel tasks with cores as well? With more tasks there will be more data communicated and hence more calls to these functions. Unfortunately contention is kind of hard to measure, since often the result is that you see many cores idle as they're waiting on a l

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-14 Thread Tathagata Das
That depends on your requirements. If you want to process the 250 GB input file as a "stream" to emulate the stream of data, then it should be split into files (such that event ordering is maintained in those splits, if necessary). And then those splits should be moved one-by-one into the directory mo

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-14 Thread Tathagata Das
Seems like it is related. Possibly those PRs that Andrew mentioned are going to fix this issue. On Fri, Jul 11, 2014 at 5:51 AM, Haopu Wang wrote: > I saw some exceptions like this in driver log. Can you shed some > lights? Is it related with the behaviour? > > > > 14/07/11 20:40:09 ERROR Liv

Re: Number of executors change during job running

2014-07-14 Thread Tathagata Das
After using repartition(300), how many executors did it run on? By the way, repartition(300) means it will divide the shuffled data into 300 partitions. Since there are many cores on each of the executors, these partitions (each requiring a core) may not be spread across all 300 executors. H

Re: Spark Streaming Json file groupby function

2014-07-14 Thread Tathagata Das
You have to import StreamingContext._ to enable groupByKey operations on DStreams. After importing that you can apply groupByKey on any DStream, that is a DStream of key-value pairs (e.g. DStream[(String, Int)]) . The data in each pair RDDs will be grouped by the first element in the tuple as the
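
A minimal sketch of that import in context, assuming lines arrive as "school,score" pairs over a socket; the format, host and port are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // enables groupByKey on pair DStreams

    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("GroupByExample"), Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map(_.split(",")).map(a => (a(0), a(1).toInt))   // (school, score)

    pairs.groupByKey().print()   // grouped scores per school, printed each batch

    ssc.start()
    ssc.awaitTermination()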

Re: Ideal core count within a single JVM

2014-07-14 Thread lokesh.gidra
Thanks a lot for replying. Actually, I am running the SparkPageRank example with a 160GB heap (I am sure the problem is not GC because the excess time is being spent in Java code only). What I have observed in the JProfiler and OProfile outputs is that the amount of time spent in the following 2 funct

Re: writing FLume data to HDFS

2014-07-14 Thread Tathagata Das
Stepping back a bit, if you just want to write flume data to HDFS, you can use Flume's HDFS sink for that. Trying to do this using Spark Streaming and SparkFlumeEvent is unnecessarily complex. And I guess it is tricky to write the raw bytes from the SparkFlumeEvent into a file. If you want to do it

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Jerry Lam
Hi there, I think the question is interesting; a spark of sparks = spark I wonder if you can use the spark job server ( https://github.com/ooyala/spark-jobserver)? So in the spark task that requires a new spark context, instead of creating it in the task, contact the job server to create one and

Re: Error when testing with large sparse svm

2014-07-14 Thread Srikrishna S
That is exactly the same error that I got. I am still having no success. Regards, Krishna On Mon, Jul 14, 2014 at 11:50 AM, crater wrote: > Hi Krishna, > > Thanks for your help. Are you able to get your 29M data running yet? I fixed > the previous problem by setting a larger spark.akka.frameSize, bu
