Re: used cores are less than total no. of cores

2015-02-24 Thread Akhil Das
You can set the following in the conf while creating the SparkContext (if you are not using spark-submit) .set("spark.cores.max", "32") Thanks Best Regards On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya < somnath_pand...@infosys.com> wrote: > Hi All, > > > > I am running a simple word co
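
A minimal sketch of the same setting applied programmatically, assuming a standalone cluster; the master URL and app name below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Cap the total number of cores this application may take on a standalone
// cluster. "spark://master-host:7077" and the app name are hypothetical.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("WordCount")
  .set("spark.cores.max", "32")

val sc = new SparkContext(conf)
```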

Re: Executors dropping all memory stored RDDs?

2015-02-24 Thread Thomas Gerber
I have a strong suspicion that it was caused by a full disk on the executor. I am not sure if the executor was supposed to recover from it that way. I cannot be sure about it; I should have had enough disk space, but I think I had some data skew which could have led some executors to run out of

Re: used cores are less than total no. of cores

2015-02-24 Thread VISHNU SUBRAMANIAN
Try adding --total-executor-cores 5 , where 5 is the number of cores. Thanks, Vishnu On Wed, Feb 25, 2015 at 11:52 AM, Somnath Pandeya < somnath_pand...@infosys.com> wrote: > Hi All, > > > > I am running a simple word count example of spark (standalone cluster) , > In the UI it is showing > > F

Re: Cannot access Spark web UI

2015-02-24 Thread Mukesh Jha
My Hadoop version is Hadoop 2.5.0-cdh5.3.0. From the Driver logs [3] I can see that SparkUI started on a specified port; also my YARN app tracking URL [1] points to that port, which is in turn getting redirected to the proxy URL [2], which gives me java.net.BindException: Cannot assign requested addre

Re: spark streaming: stderr does not roll

2015-02-24 Thread Mukesh Jha
I'm also facing the same issue. I tried the configurations but it seems the executors' spark log4j.properties overrides the passed values, so you have to change /etc/spark/conf/log4j.properties. Let me know if any of you have managed to get this fixed programmatically. I am planning to u

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
That's all you should need to do. Saying this, I did run into an issue similar to this when I was switching Spark versions which were tied to different default Hive versions (eg Spark 1.3 by default works with Hive 0.13.1). I'm wondering if you may be hitting this issue due to that? On Tue, Feb 24,

Re: Unable to run hive queries inside spark

2015-02-24 Thread kundan kumar
Hi Denny, yes the user has all the rights to HDFS. I am running all the spark operations with this user. and my hive-site.xml looks like this hive.metastore.warehouse.dir /user/hive/warehouse location of default database for the warehouse Do I need to do anything explicitly oth

used cores are less than total no. of cores

2015-02-24 Thread Somnath Pandeya
Hi All, I am running a simple word count example of spark (standalone cluster). In the UI it is showing that for each worker the no. of cores available is 32, but while running the jobs only 5 cores are being used. What should I do to increase the no. of used cores, or is it selected based on the jobs? Thanks

Re: Not able to update collections

2015-02-24 Thread Shixiong Zhu
Rdd.foreach runs in the executors. You should use `collect` to fetch data to the driver. E.g., myRdd.collect().foreach { node => { mp(node) = 1 } } Best Regards, Shixiong Zhu 2015-02-25 4:00 GMT+08:00 Vijayasarathy Kannan : > Thanks, but it still doesn't seem to work. > > Bel
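
A slightly fuller sketch of the pattern, with mp and myRdd taken from the thread and assumed to be defined as in the original post:

```scala
import org.apache.spark.graphx.VertexId

// The driver-side map is only mutated after collect() has brought the
// vertex ids back from the executors; a foreach directly on the RDD would
// update copies of the map on the executors instead.
val mp = scala.collection.mutable.Map[VertexId, Int]()

myRdd.collect().foreach { node =>
  mp(node) = 1
}
println(mp.size)
```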

Re: Unable to run hive queries inside spark

2015-02-24 Thread Denny Lee
The error message you have is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) Could you verify that you (the user you are running under) has the rights to create th

SparkStreaming failing with exception Could not compute split, block input

2015-02-24 Thread Mukesh Jha
Hi Experts, My Spark Job is failing with the below error. From the logs I can see that input-3-1424842351600 was added at 5:32:32 and was never purged out of memory. Also the available free memory for the executor is *2.1G*. Please help me figure out why executors cannot fetch this input. Txz for

Help vote for Spark talks at the Hadoop Summit

2015-02-24 Thread Reynold Xin
Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could help vote for Spark talks so that Spark has a good showing at this event. You can make three votes on each track. Below I've listed 3 talks that are important to Spark'

Re: Spark excludes "fastutil" dependencies we need

2015-02-24 Thread Ted Yu
bq. depend on missing fastutil classes like Long2LongOpenHashMap Looks like Long2LongOpenHashMap should be added to the shaded jar. Cheers On Tue, Feb 24, 2015 at 7:36 PM, Jim Kleckner wrote: > Spark includes the clearspring analytics package but intentionally excludes > the dependencies of th

Unable to run hive queries inside spark

2015-02-24 Thread kundan kumar
Hi , I have placed my hive-site.xml inside spark/conf and i am trying to execute some hive queries given in the documentation. Can you please suggest what wrong am I doing here. scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) hiveContext: org.apache.spark.sql.hive.HiveCo

Task not serializable exception

2015-02-24 Thread Kartheek.R
Hi, I run into a Task not serializable exception with the code below. When I remove the threads and run, it works, but with threads I run into the Task not serializable exception. object SparkKart extends Serializable{ def parseVector(line: String): Vector[Double] = { DenseVector(line.split('
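
A generic sketch of one common cause of this error and its local-copy workaround (not necessarily the fix for the thread's thread-based variant, and the class names below are illustrative): referencing a field of a non-serializable enclosing class inside a closure drags the whole instance into the task.

```scala
import org.apache.spark.rdd.RDD

// Imagine Tokenizer is not Serializable.
class Tokenizer(val separator: Char)

class Job(rdd: RDD[String], tokenizer: Tokenizer) {
  def wordCount(): Long = {
    val sep = tokenizer.separator     // primitive local copy, safely serialized
    rdd.flatMap(_.split(sep)).count() // the closure no longer references `this`
  }
}
```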

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread anamika gupta
Hi Akhil I guess it skipped my attention. I would definitely give it a try. While I would still like to know what is the issue with the way I have created schema? On Tue, Feb 24, 2015 at 4:35 PM, Akhil Das wrote: > Did you happen to have a look at > https://spark.apache.org/docs/latest/sql-pro

Spark excludes "fastutil" dependencies we need

2015-02-24 Thread Jim Kleckner
Spark includes the clearspring analytics package but intentionally excludes the dependencies of the fastutil package (see below). Spark includes parquet-column which includes fastutil and relocates it under parquet/ but creates a shaded jar file which is incomplete because it shades out some of t

Re: How to start spark-shell with YARN?

2015-02-24 Thread Denny Lee
It may have to do with the akka heartbeat interval per SPARK-3923 - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-3923 ? On Tue, Feb 24, 2015 at 16:40 Xi Shen wrote: > Hi Sean, > > I launched the spark-shell on the same machine as I started YARN service. > I don't think port

Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Hi Sean, I launched the spark-shell on the same machine as I started YARN service. I don't think port will be an issue. I am new to spark. I checked the HDFS web UI and the YARN web UI. But I don't know how to check the AM. Can you help? Thanks, David On Tue, Feb 24, 2015 at 8:37 PM Sean Owen

Re: reduceByKey vs countByKey

2015-02-24 Thread Jey Kottalam
Hi Sathish, The current implementation of countByKey uses reduceByKey: https://github.com/apache/spark/blob/v1.2.1/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L332 It seems that countByKey is mostly deprecated: https://issues.apache.org/jira/browse/SPARK-3994 -Jey On Tue, Fe
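
A small sketch of the two forms side by side, assuming a spark-shell style sc:

```scala
// Toy pair RDD of (key, value) records.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

// countByKey() returns a Map on the driver, so it only suits small key sets.
val driverSideCounts: scala.collection.Map[String, Long] = pairs.countByKey()

// The reduceByKey form keeps the counts distributed as an RDD, which scales
// to large key sets and can be transformed further before collecting.
val distributedCounts = pairs.mapValues(_ => 1L).reduceByKey(_ + _)
```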

Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Emre Sevinc
Hello, Thanks for adding, but URL seems to have a typo: when I click it tries to open http//www.bigindustries.be/ But it should be: http://www.bigindustries.be/ Kind regards, Emre Sevinç http://http//www.bigindustries.be/ On Feb 25, 2015 12:29 AM, "Patrick Wendell" wrote: > I've added it,

reduceByKey vs countByKey

2015-02-24 Thread Sathish Kumaran Vairavelu
Hello, Quick question. I am trying to understand the difference between reduceByKey and countByKey. Which one gives better performance, reduceByKey or countByKey? While we can perform the same count operation using reduceByKey, why do we need countByKey/countByValue? Thanks Sathish

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
By doing so, I got the following error : Exception in thread "main" org.apache.spark.sql.AnalysisException: GetField is not valid on fields Seems that it doesn't like image.data expression. On Wed, Feb 25, 2015 at 12:37 AM, Xiangrui Meng wrote: > Btw, the correct syntax for alias should be > `

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
Btw, the correct syntax for alias should be `df.select($"image.data".as("features"))`. On Tue, Feb 24, 2015 at 3:35 PM, Xiangrui Meng wrote: > If you make `Image` a case class, then select("image.data") should work. > > On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa wrote: >> Hi all, >> >> I
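
A rough sketch of the case-class route combined with that alias syntax, under Spark 1.3 DataFrame semantics; the field names follow the thread, and the empty input RDD is only there to make the example self-contained:

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.SQLContext

// Image and LabeledImage mirror the types described in the thread.
case class Image(w: Int, h: Int, data: Vector)
case class LabeledImage(label: Int, image: Image)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq.empty[LabeledImage]).toDF()

// Select the nested attribute and give it a new name.
val features = df.select($"image.data".as("features"))
```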

Re: [ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Xiangrui Meng
If you make `Image` a case class, then select("image.data") should work. On Tue, Feb 24, 2015 at 3:06 PM, Jaonary Rabarisoa wrote: > Hi all, > > I have a DataFrame that contains a user defined type. The type is an image > with the following attribute > > class Image(w: Int, h: Int, data: Vector)

Re: Add PredictionIO to Powered by Spark

2015-02-24 Thread Patrick Wendell
Added - thanks! I trimmed it down a bit to fit our normal description length. On Mon, Jan 5, 2015 at 8:24 AM, Thomas Stone wrote: > Please can we add PredictionIO to > https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark > > PredictionIO > http://prediction.io/ > > PredictionIO is a

Re: Can you add Big Industries to the Powered by Spark page?

2015-02-24 Thread Patrick Wendell
I've added it, thanks! On Fri, Feb 20, 2015 at 12:22 AM, Emre Sevinc wrote: > > Hello, > > Could you please add Big Industries to the Powered by Spark page at > https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark ? > > > Company Name: Big Industries > > URL: http://http://www.bigi

[ML][SQL] Select UserDefinedType attribute in a DataFrame

2015-02-24 Thread Jaonary Rabarisoa
Hi all, I have a DataFrame that contains a user defined type. The type is an image with the following attribute *class Image(w: Int, h: Int, data: Vector)* In my DataFrame, images are stored in column named "image" that corresponds to the following case class *case class LabeledImage(label: Int

EventLog / Timeline calculation - Optimization

2015-02-24 Thread syepes
Hello, For the past days I have been trying to process and analyse with Spark a Cassandra eventLog table similar to the one shown here. Basically what I want to calculate is the delta time "epoch" between each event type for all the device ids in the table. Currently it's working as expected but I

Re: Is Ubuntu server or desktop better for spark cluster

2015-02-24 Thread Sebastián Ramírez
Check out the FAQ in the link by Deepak Vohra. The main difference is that the desktop installation includes common end-user packages, such as LibreOffice, while the server installation doesn't. But the server includes server packages, such as apache2. Also, the Desktop has a GUI (a graphical inte

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Davies Liu
Another way to see the Python docs: $ export PYTHONPATH=$SPARK_HOME/python $ pydoc pyspark.sql On Tue, Feb 24, 2015 at 2:01 PM, Reynold Xin wrote: > The official documentation will be posted when 1.3 is released (early > March). > > Right now, you can build the docs yourself by running "jekyll b

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
No problem, Joe. There you go https://issues.apache.org/jira/browse/SPARK-5081 And also there is this one https://issues.apache.org/jira/browse/SPARK-5715 which is marked as resolved On 24 February 2015 at 21:51, Joe Wass wrote: > Thanks everyone. > > Yiannis, do you know if there's a bug report

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Reynold Xin
The official documentation will be posted when 1.3 is released (early March). Right now, you can build the docs yourself by running "jekyll build" in docs. Alternatively, just look at dataframe,py as Ted pointed out. On Tue, Feb 24, 2015 at 6:56 AM, Ted Yu wrote: > Have you looked at python/py

Fair Scheduler Pools

2015-02-24 Thread pnpritchard
Hi, I am trying to use the fair scheduler pools (http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools) to schedule two jobs at the same time. In my simple example, I have configured spark in local mode with 2 cores ("local[2]"). I have also configured two pools in fairsche
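
A minimal sketch of the moving parts, assuming the pool names below match whatever is declared in your fairscheduler.xml (both the pool names and the allocation file path are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// FAIR mode plus per-thread pool assignment.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("fair-pools-demo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // hypothetical path

val sc = new SparkContext(conf)

// Jobs submitted from a thread go to whatever pool that thread has set.
def runInPool(pool: String): Unit = new Thread {
  override def run(): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", pool)
    sc.parallelize(1 to 1000000).map(_ * 2).count()
  }
}.start()

runInPool("pool1")
runInPool("pool2")
```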

Re: New guide on how to write a Spark job in Clojure

2015-02-24 Thread Reynold Xin
Thanks for sharing, Chris. On Tue, Feb 24, 2015 at 4:39 AM, Christian Betz < christian.b...@performance-media.de> wrote: > Hi all, > > Maybe some of you are interested: I wrote a new guide on how to start > using Spark from Clojure. The tutorial covers > >- setting up a project, >- doin

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Joe Wass
Thanks everyone. Yiannis, do you know if there's a bug report for this regression? For some other (possibly connected) reason I upgraded from 1.1.1 to 1.2.1, but I can't remember what the bug was. Joe On 24 February 2015 at 19:26, Yiannis Gkoufas wrote: > Hi there, > > I assume you are usin

Re: Movie Recommendation tutorial

2015-02-24 Thread Sean Owen
It's something like the average error in rating, but a bit different -- it's the square root of average squared error. But if you think of the ratings as 'stars' you could kind of think of 0.86 as 'generally off by 0.86' stars and that would be somewhat right. Whether that's good depends on what t
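
For concreteness, a minimal sketch of the metric being discussed, over a hypothetical RDD of (prediction, actualRating) pairs:

```scala
import org.apache.spark.rdd.RDD

// RMSE = sqrt(mean of squared (prediction - actual) differences).
def rmse(predsAndRatings: RDD[(Double, Double)]): Double = {
  val mse = predsAndRatings
    .map { case (pred, actual) => (pred - actual) * (pred - actual) }
    .mean()
  math.sqrt(mse)
}
```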

Re: Movie Recommendation tutorial

2015-02-24 Thread Krishna Sankar
Yep, much better with 0.1. "The best model was trained with rank = 12 and lambda = 0.1, and numIter = 20, and its RMSE on the test set is 0.869092" (Spark 1.3.0) Question: What is the intuition behind an RMSE of 0.86 vs 1.3? I know the smaller the better. But is it that much better? And what is a good

Re: Not able to update collections

2015-02-24 Thread Vijayasarathy Kannan
Thanks, but it still doesn't seem to work. Below is my entire code. var mp = scala.collection.mutable.Map[VertexId, Int]() var myRdd = graph.edges.groupBy[VertexId](f).flatMap { edgesBySrc => func(edgesBySrc, a, b) } myRdd.foreach { node => { mp(node) = 1 } } Val

Re: Not able to update collections

2015-02-24 Thread Sean Owen
Instead of ...foreach { edgesBySrc => { lst ++= func(edgesBySrc) } } try ...flatMap { edgesBySrc => func(edgesBySrc) } or even more succinctly ...flatMap(func) This returns an RDD that basically has the list you are trying to build, I believe. You can collect() to the driver but be
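
In terms of the thread's own names (graph, f, and func are assumed to be defined as in the original post), the suggested shape is roughly:

```scala
// Build the "list" as an RDD and only then bring it to the driver.
val verticesRdd = graph.edges.groupBy[VertexId](f).flatMap(edgesBySrc => func(edgesBySrc))

// collect() returns the elements to the driver; beware of very large results.
val vertices = verticesRdd.collect()
println(vertices.length)
```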

Re: Not able to update collections

2015-02-24 Thread Vijayasarathy Kannan
I am a beginner to Scala/Spark. Could you please elaborate on how to make RDD of results of func() and collect? On Tue, Feb 24, 2015 at 2:27 PM, Sean Owen wrote: > They aren't the same 'lst'. One is on your driver. It gets copied to > executors when the tasks are executed. Those copies are upda

Re: Not able to update collections

2015-02-24 Thread Sean Owen
They aren't the same 'lst'. One is on your driver. It gets copied to executors when the tasks are executed. Those copies are updated. But the updates will never reflect in the local copy back in the driver. You may just wish to make an RDD of the results of func() and collect() them back to the dr

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Yiannis Gkoufas
Hi there, I assume you are using spark 1.2.1 right? I faced the exact same issue and switched to 1.1.1 with the same configuration and it was solved. On 24 Feb 2015 19:22, "Ted Yu" wrote: > Here is a tool which may give you some clue: > http://file-leak-detector.kohsuke.org/ > > Cheers > > On Tu

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Ted Yu
Here is a tool which may give you some clue: http://file-leak-detector.kohsuke.org/ Cheers On Tue, Feb 24, 2015 at 11:04 AM, Vladimir Rodionov < vrodio...@splicemachine.com> wrote: > Usually it happens in Linux when application deletes file w/o double > checking that there are no open FDs (resou

Not able to update collections

2015-02-24 Thread kvvt
I am working on the below piece of code. var lst = scala.collection.mutable.MutableList[VertexId]() graph.edges.groupBy[VertexId](f).foreach { edgesBySrc => { lst ++= func(edgesBySrc) } } println(lst.length) Here, the final println() always says that the length of the list is 0. The li

Re: Running out of space (when there's no shortage)

2015-02-24 Thread Vladimir Rodionov
Usually it happens in Linux when an application deletes a file w/o double checking that there are no open FDs (resource leak). In this case, Linux holds all the space allocated and does not release it until the application exits (crashes in your case). You check the file system and everything is normal, you have en

Re: Brodcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Sorry for the mistake, I actually have it this way: val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObjectBroadcasted.value.insert(e._1); (e._1,1) }); rdd.cache.count(); //to make sure it is transformed. val rdd2 =

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Colin Kincaid Williams
So back to my original question. I can see the spark logs using the example above: yarn logs -applicationId application_1424740955620_0009 This shows yarn log aggregation working. I can see the std out and std error in that container information above. Then how can I get this information in a we

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Imran Rashid
the spark history server and the yarn history server are totally independent. Spark knows nothing about yarn logs, and vice versa, so unfortunately there isn't any way to get all the info in one place. On Tue, Feb 24, 2015 at 12:36 PM, Colin Kincaid Williams wrote: > Looks like in my tired stat

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Colin Kincaid Williams
Looks like in my tired state, I didn't mention spark the whole time. However, it might be implied by the application log above. Spark log aggregation appears to be working, since I can run the yarn command above. I do have yarn logging setup for the yarn history server. I was trying to use the spar

Re: Sharing Spark Drivers

2015-02-24 Thread John Omernik
I am aware of that, but two things are working against me here with spark-kernel. Python is our language, and we are really looking for a supported way to approach this for the enterprise. I like the concept, it just doesn't work for us given our constraints. This does raise an interesting point

RE: Brodcast Variable updated from one transformation and used from another

2015-02-24 Thread Ganelin, Ilya
You're not using the broadcasted variable within your map operations. You're attempting to modify myObject directly, which won't work because you are modifying the serialized copy on the executor. You want to do myObjectBroadcasted.value.insert and myObjectBroadcasted.value.lookup. Sent with G
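
A minimal sketch of the read-only side of this pattern (the names here are illustrative, not the thread's MyObject): build the structure on the driver, broadcast it once, and only read it through .value inside transformations. Note that executor-side mutations of a broadcast value never propagate back to the driver, which is exactly what the thread is running into.

```scala
// Build a lookup table on the driver and broadcast it.
val lookup = Map("a" -> 1, "b" -> 2)
val lookupBc = sc.broadcast(lookup)

// Read it via .value inside transformations running on the executors.
val rdd = sc.parallelize(Seq("a", "b", "c"))
val resolved = rdd.map(k => (k, lookupBc.value.getOrElse(k, -1))).collect()
```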

[SparkSQL] Number of map tasks in SparkSQL

2015-02-24 Thread Yana Kadiyska
Shark used to have shark.map.tasks variable. Is there an equivalent for Spark SQL? We are trying a scenario with heavily partitioned Hive tables. We end up with a UnionRDD with a lot of partitions underneath and hence too many tasks: https://github.com/apache/spark/blob/master/sql/hive/src/main/sc

Brodcast Variable updated from one transformation and used from another

2015-02-24 Thread Yiannis Gkoufas
Hi all, I am trying to do the following. val myObject = new MyObject(); val myObjectBroadcasted = sc.broadcast(myObject); val rdd1 = sc.textFile("/file1").map(e => { myObject.insert(e._1); (e._1,1) }); rdd.cache.count(); //to make sure it is transformed. val rdd2 = sc.textFile("/file2").map(e

throughput in the web console?

2015-02-24 Thread Josh J
Hi, I plan to run a parameter search varying the number of cores, epoch, and parallelism. The web console provides a way to archive the previous runs, though is there a way to view in the console the throughput? Rather than logging the throughput separately to the log files and correlating the log

Re: Running multiple threads with same Spark Context

2015-02-24 Thread Yana Kadiyska
It's hard to tell. I have not run this on EC2 but this worked for me: The only thing that I can think of is that the scheduling mode is set to - *Scheduling Mode:* FAIR val pool: ExecutorService = Executors.newFixedThreadPool(poolSize) while_loop to get curr_job pool.execute(new ReportJ

Re: How to get yarn logs to display in the spark or yarn history-server?

2015-02-24 Thread Christophe Préaud
Hi Colin, Here is how I have configured my hadoop cluster to have yarn logs available through both the yarn CLI and the _yarn_ history server (with gzip compression and 10 days retention): 1. Add the following properties in the yarn-site.xml on each node manager and on the resource manager:
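
The property list in the archived message is cut off. For orientation only, these are commonly used yarn-site.xml settings for that kind of setup (gz compression, 10-day retention), offered as an assumption rather than a copy of the original list:

```xml
<!-- Commonly used YARN log-aggregation properties; values shown (gz
     compression, 864000 s = 10 days) match the setup described in the
     message, but the original property list is truncated. -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>864000</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.compression-type</name>
  <value>gz</value>
</property>
```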

RE: Union and reduceByKey will trigger shuffle even same partition?

2015-02-24 Thread Shuai Zheng
Hi Imran, I will say your explanation is extremely helpful :) I tested some ideas according to your explanation and it makes perfect sense to me. I modified my code to use cogroup+mapValues instead of union+reduceByKey to preserve the partitioning, which gives me more than 100% performance gain

Running out of space (when there's no shortage)

2015-02-24 Thread Joe Wass
I'm running a cluster of 3 Amazon EC2 machines (small number because it's expensive when experiments keep crashing after a day!). Today's crash looks like this (stacktrace at end of message). org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0 On my thr

Re: Executor size and checkpoints

2015-02-24 Thread Yana Kadiyska
Tathagata, yes, I was using StreamingContext.getOrCreate. My question is about the design decision here. I was expecting that if I have a streaming application that say crashed, and I wanted to give the executors more memory, I would be able to restart, using the checkpointed RDD but with more memo

Re: Accumulator in SparkUI for streaming

2015-02-24 Thread Petar Zecevic
Interesting. Accumulators are shown on Web UI if you are using the ordinary SparkContext (Spark 1.2). It just has to be named (and that's what you did). scala> val acc = sc.accumulator(0, "test accumulator") acc: org.apache.spark.Accumulator[Int] = 0 scala> val rdd = sc.parallelize(1 to 1000)

Re: Pyspark save Decison Tree Module with joblib/pickle

2015-02-24 Thread Sebastián Ramírez
Great to know, thanks Xiangrui. *Sebastián Ramírez* Diseñador de Algoritmos Tel: (+571) 795 7950 ext: 1012 Cel: (+57) 300 370 77 10 Calle 73 No 7 - 06 Piso 4 Linkedin: co.linkedin.com/in/tiangolo/ Twitter: @tiangolo

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You Akhil. Will look into it. Its free, isn't it? I am still a student :) On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das wrote: > If you signup for Google Compute Cloud, you will get free $300 credits for > 3 months and you can start a pretty good cluster for your testing purposes. > :) > > Th

Re: updateStateByKey and invFunction

2015-02-24 Thread Ashish Sharma
But how will I specify my state there? On Tue, Feb 24, 2015 at 12:50 AM Arush Kharbanda wrote: > You can use a reduceByKeyAndWindow with your specific time window. You can > specify the inverse function in reduceByKeyAndWindow. > > On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma > wrote: > >> So

Re: Spark on EC2

2015-02-24 Thread Akhil Das
Yes it is :) Thanks Best Regards On Tue, Feb 24, 2015 at 9:09 PM, Deep Pradhan wrote: > Thank You Akhil. Will look into it. > Its free, isn't it? I am still a student :) > > On Tue, Feb 24, 2015 at 9:06 PM, Akhil Das > wrote: > >> If you signup for Google Compute Cloud, you will get free $300

Re: Spark on EC2

2015-02-24 Thread Akhil Das
If you signup for Google Compute Cloud, you will get free $300 credits for 3 months and you can start a pretty good cluster for your testing purposes. :) Thanks Best Regards On Tue, Feb 24, 2015 at 8:25 PM, Deep Pradhan wrote: > Hi, > I have just signed up for Amazon AWS because I learnt that i

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You All. I think I will look into paying ~$0.07/hr as Sean suggested. On Tue, Feb 24, 2015 at 9:01 PM, gen tang wrote: > Hi, > > I am sorry that I made a mistake on AWS tarif. You can read the email of > sean owen which explains better the strategies to run spark on AWS. > > For your questi

Re: Spark on EC2

2015-02-24 Thread gen tang
Hi, I am sorry that I made a mistake on AWS pricing. You can read the email from Sean Owen, which better explains the strategies to run spark on AWS. For your question: it means that you just download spark and unzip it. Then run the spark shell by ./bin/spark-shell or ./bin/pyspark. It is useful to get f

Re: Sharing Spark Drivers

2015-02-24 Thread Chip Senkbeil
Hi John, This would be a potential application for the Spark Kernel project ( https://github.com/ibm-et/spark-kernel). The Spark Kernel serves as your driver application, allowing you to feed it snippets of code (or load up entire jars via magics) in Scala to execute against a Spark cluster. Alth

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
No, I think I am ok with the time it takes. Just that, with the increase in the partitions along with the increase in the number of workers, I want to see the improvement in the performance of an application. I just want to see this happen. Any comments? Thank You On Tue, Feb 24, 2015 at 8:52 PM,

Re: Spark on EC2

2015-02-24 Thread Sean Owen
You can definitely, easily, try a 1-node standalone cluster for free. Just don't be surprised when the CPU capping kicks in within about 5 minutes of any non-trivial computation and suddenly the instance is very s-l-o-w. I would consider just paying the ~$0.07/hour to play with an m3.medium, which

Re: Spark on EC2

2015-02-24 Thread Charles Feduke
This should help you understand the cost of running a Spark cluster for a short period of time: http://www.ec2instances.info/ If you run an instance for even 1 second of a single hour you are charged for that complete hour. So before you shut down your miniature cluster make sure you really are d

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Thank You Sean. I was just trying to experiment with the performance of Spark Applications with various worker instances (I hope you remember that we discussed about the worker instances). I thought it would be a good one to try in EC2. So, it doesn't work out, does it? Thank You On Tue, Feb 24,

Re: Spark on EC2

2015-02-24 Thread Sean Owen
The free tier includes 750 hours of t2.micro instance time per month. http://aws.amazon.com/free/ That's basically a month of hours, so it's all free if you run one instance only at a time. If you run 4, you'll be able to run your cluster of 4 for about a week free. A t2.micro has 1GB of memory,

Re: Spark on EC2

2015-02-24 Thread Deep Pradhan
Kindly bear with my questions as I am new to this. >> If you run spark on local mode on a ec2 machine What does this mean? Is it that I launch Spark cluster from my local machine,i.e., by running the shell script that is there in /spark/ec2? On Tue, Feb 24, 2015 at 8:32 PM, gen tang wrote: > Hi,

Re: Spark on EC2

2015-02-24 Thread gen tang
Hi, As a real spark cluster needs at least one master and one slave, you need to launch two machines. Therefore the second machine is not free. However, if you run spark in local mode on an ec2 machine, it is free. The charge of AWS depends on how many and what types of machines you launched, bu

Re: Use case for data in SQL Server

2015-02-24 Thread Denny Lee
Hi Suhel, My team is currently working with a lot of SQL Server databases as one of our many data sources and ultimately we pull the data into HDFS from SQL Server. As we had a lot of SQL databases to hit, we used the jTDS driver and SQOOP to extract the data out of SQL Server and into HDFS (smal

Re: Spark 1.3 dataframe documentation

2015-02-24 Thread Ted Yu
Have you looked at python/pyspark/sql/dataframe.py ? Cheers On Tue, Feb 24, 2015 at 6:12 AM, poiuytrez wrote: > Hello, > > I have built Spark 1.3. I can successfully use the dataframe api. However, > I > am not able to find its api documentation in Python. Do you know when the > documentation w

Spark on EC2

2015-02-24 Thread Deep Pradhan
Hi, I have just signed up for Amazon AWS because I learnt that it provides service for free for the first 12 months. I want to run Spark on EC2 cluster. Will they charge me for this? Thank You

Re: Memory problems when calling pipe()

2015-02-24 Thread Juan Rodríguez Hortalá
Hi, I finally solved the problem by setting spark.yarn.executor.memoryOverhead with the option --conf "spark.yarn.executor.memoryOverhead=" for spark-submit, as pointed out in http://stackoverflow.com/questions/28404714/yarn-why-doesnt-task-go-out-of-heap-space-but-container-gets-killed and ht

Re: Get filename in Spark Streaming

2015-02-24 Thread Emre Sevinc
Hello Subacini, Until someone more knowledgeable suggests a better, more straightforward, and simpler approach with a working code snippet, I suggest the following workaround / hack: inputStream.foreachRDD(rdd => val myStr = rdd.toDebugString // process myStr string value, e.g. using

Spark 1.3 dataframe documentation

2015-02-24 Thread poiuytrez
Hello, I have built Spark 1.3. I can successfully use the dataframe api. However, I am not able to find its api documentation in Python. Do you know when the documentation will be available? Best Regards, poiuytrez -- View this message in context: http://apache-spark-user-list.1001560.n3.na

Sharing Spark Drivers

2015-02-24 Thread John Omernik
I have been posting on the Mesos list, as I am looking to see if it's possible or not to share spark drivers. Obviously, in standalone cluster mode, the Master handles requests, and you can instantiate a new sparkcontext to a currently running master. However in Mesos (and perhaps Yarn) I don'

New guide on how to write a Spark job in Clojure

2015-02-24 Thread Christian Betz
Hi all, Maybe some of you are interested: I wrote a new guide on how to start using Spark from Clojure. The tutorial covers * setting up a project, * doing REPL- or Test Driven Development of Spark jobs * Running Spark jobs locally. Just read it on https://gorillalabs.github.io/spa

Re: spark streaming window operations on a large window size

2015-02-24 Thread Avi Levi
OK - thanks a lot On Tue, Feb 24, 2015 at 9:49 AM, Tathagata Das wrote: > Yes. > > On Mon, Feb 23, 2015 at 11:16 PM, Avi Levi wrote: > >> @Tathagata Das so basically you are saying it is supported out of the >> box, but we should expect a significant performance hit - is that right? >> >> >> >>

Re: issue Running Spark Job on Yarn Cluster

2015-02-24 Thread avilevi3
you should fetch the complete logs for the application using 'yarn logs' command, like so: yarn logs -applicationId [the application's id] and look for the real error info -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Clus

Re: Use case for data in SQL Server

2015-02-24 Thread Cheng Lian
There is a newly introduced JDBC data source in Spark 1.3.0 (not the JdbcRDD in Spark core), which may be useful. However, currently there's no SQL Server specific logic implemented. I'd assume standard SQL queries should work. Cheng On 2/24/15 7:02 PM, Suhel M wrote: Hey, I am trying to w

Re: Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread Akhil Das
Did you happen to have a look at https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema Thanks Best Regards On Tue, Feb 24, 2015 at 3:39 PM, anu wrote: > My issue is posted here on stack-overflow. What am I doing wrong here? > > > http://stackover

Use case for data in SQL Server

2015-02-24 Thread Suhel M
Hey, I am trying to work out what is the best way we can leverage Spark for crunching data that is sitting in SQL Server databases. Ideal scenario is being able to efficiently work with big data (10billion+ rows of activity data). We need to shape this data for machine learning problems and want

Facing error while extending scala class with Product interface to overcome limit of 22 fields in spark-shell

2015-02-24 Thread anu
My issue is posted here on stack-overflow. What am I doing wrong here? http://stackoverflow.com/questions/28689186/facing-error-while-extending-scala-class-with-product-interface-to-overcome-limi -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Facing-error

Re: Getting to proto buff classes in Spark Context

2015-02-24 Thread Sean Owen
I assume this is a difference between your local driver classpath and remote worker classpath. It may not be a question of whether the class is there, but classpath visibility issues. Have you looked into settings like spark.files.userClassPathFirst? On Tue, Feb 24, 2015 at 4:43 AM, necro351 . wr

Re: RDD String foreach println

2015-02-24 Thread Sean Owen
println occurs on the machine where the task executes, which may or may not be the same as your local driver process. collect()-ing brings data back to the driver, so printing there definitely occurs on the driver. On Tue, Feb 24, 2015 at 9:48 AM, patcharee wrote: > Hi, > > I would like to print
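
A small sketch, assuming a spark-shell style sc and an input path that exists:

```scala
// The RDD's elements live on the executors; printing them there goes to
// executor stdout. Bringing a sample (or everything) back first prints on
// the driver.
val linesWithSpark = sc.textFile("README.md").filter(_.contains("Spark"))

linesWithSpark.take(10).foreach(println)   // small sample, printed on the driver
linesWithSpark.collect().foreach(println)  // everything, also on the driver
```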

RDD String foreach println

2015-02-24 Thread patcharee
Hi, I would like to print the content of RDD[String]. I tried 1) linesWithSpark.foreach(println) 2) linesWithSpark.collect().foreach(println) I submitted the job by spark-submit. 1) did not print, but 2) did. But when I used the shell, both 1) and 2) printed. Any ideas why 1) behaves differen

Re: On app upgrade, restore sliding window data.

2015-02-24 Thread Arush Kharbanda
I think this could be of some help to you. https://issues.apache.org/jira/browse/SPARK-3660 On Tue, Feb 24, 2015 at 2:18 AM, Matus Faro wrote: > Hi, > > Our application is being designed to operate at all times on a large > sliding window (day+) of data. The operations performed on the window

Re: How to start spark-shell with YARN?

2015-02-24 Thread Sean Owen
I don't think the build is at issue. The error suggests your App Master can't be contacted. Is there a network port issue? did the AM fail? On Tue, Feb 24, 2015 at 9:15 AM, Xi Shen wrote: > Hi Arush, > > I got the pre-build from https://spark.apache.org/downloads.html. When I > start spark-shell

Re: How to start spark-shell with YARN?

2015-02-24 Thread Xi Shen
Hi Arush, I got the pre-build from https://spark.apache.org/downloads.html. When I start spark-shell, it prompts: Spark assembly has been built with Hive, including Datanucleus jars on classpath So we don't have pre-build with YARN support? If so, how the spark-submit work? I checked the YAR

Running multiple threads with same Spark Context

2015-02-24 Thread Harika
Hi all, I have been running a simple SQL program on Spark. To test the concurrency, I have created 10 threads inside the program, all threads using same SQLContext object. When I ran the program on my EC2 cluster using spark-submit, only 3 threads were running in parallel. I have repeated the test

Re: Movie Recommendation tutorial

2015-02-24 Thread Guillaume Charhon
I am using Spark 1.2.1. Thank you Krishna, I am getting almost the same results as you so it must be an error in the tutorial. Xiangrui, I made some additional tests with lambda to 0.1 and I am getting a much better rmse: RMSE (validation) = 0.868981 for the model trained with rank = 8, lambda =

Re: Missing shuffle files

2015-02-24 Thread Anders Arpteg
If you thinking of the yarn memory overhead, then yes, I have increased that as well. However, I'm glad to say that my job finished successfully finally. Besides the timeout and memory settings, performing repartitioning (with shuffling) at the right time seems to be the key to make this large job

Re: updateStateByKey and invFunction

2015-02-24 Thread Arush Kharbanda
You can use a reduceByKeyAndWindow with your specific time window. You can specify the inverse function in reduceByKeyAndWindow. On Tue, Feb 24, 2015 at 1:36 PM, Ashish Sharma wrote: > So say I want to calculate top K users visiting a page in the past 2 hours > updated every 5 mins. > > so here
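
A minimal sketch of the windowed-count part of that suggestion (the top-K ranking is left out), assuming an existing sc and a hypothetical socket source of user ids:

```scala
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(sc, Seconds(30))
ssc.checkpoint("/tmp/streaming-checkpoint") // required for the inverse-function form

// Hypothetical source: one user id per line.
val visits = ssc.socketTextStream("localhost", 9999).map(user => (user, 1L))

// Visits per user over a 2-hour window, sliding every 5 minutes; the inverse
// function subtracts batches leaving the window instead of recomputing it all.
val counts = visits.reduceByKeyAndWindow(
  _ + _,          // add counts entering the window
  _ - _,          // subtract counts leaving the window
  Minutes(120),   // window length
  Minutes(5))     // slide interval
```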
