Spark On Yarn Issue: Initial job has not accepted any resources

2014-11-17 Thread LinCharlie
Hi All: I was submitting a spark_program.jar to a `spark on yarn cluster` from a driver machine in yarn-client mode. Here is the spark-submit command I used: ./spark-submit --master yarn-client --class com.charlie.spark.grax.OldFollowersExample --queue dt_spark ~/script/spark-flume-test-0.1-SNAPSHO

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Ye Xianjin
Hi Prashant Sharma, It's not even OK to build with the scala-2.11 profile on my machine. Just check out master (c6e0c2ab1c29c184a9302d23ad75e4ccd8060242) and run sbt/sbt -Pscala-2.11 clean assembly: .. skipping the normal part .. [info] Resolving org.scalamacros#quasiquotes_2.11;2.0.1 ... [warn] module

Re: Null pointer exception with larger datasets

2014-11-17 Thread Akhil Das
Make sure your list is not null; if it is null then it's more like doing: JavaRDD distData = sc.parallelize(*null*); distData.foreach(println) Thanks Best Regards On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala < npok...@spcapitaliq.com> wrote: > Hi, > > > > I am having list Students a
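For reference, a minimal Scala sketch of the suggested guard (the thread's code is Java, but the check is the same); sc is an existing SparkContext and the student list is a stand-in for whatever is built upstream:

    import org.apache.spark.SparkContext

    // Hypothetical guard before creating the RDD; `students` stands in for the user's list.
    def printStudents(sc: SparkContext, students: java.util.List[String]): Unit = {
      import scala.collection.JavaConverters._
      if (students == null || students.isEmpty) {
        println("Student list is null or empty, nothing to parallelize")
      } else {
        val distData = sc.parallelize(students.asScala)   // list -> RDD
        distData.foreach(println)
      }
    }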

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
Looks like sbt/sbt -Pscala-2.11 is broken by a recent patch for improving the Maven build. Prashant Sharma On Tue, Nov 18, 2014 at 12:57 PM, Prashant Sharma wrote: > It is safe in the sense we would help you with the fix if you run into > issues. I have used it, but since I worked on the patch th

Re: How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Akhil Das
A simple join would do it. val a: List[(Int, Int)] = List((1,2),(2,4),(3,6)) val b: List[(Int, Int)] = List((1,3),(2,5),(3,6), (4,5),(5,6)) val A = sparkContext.parallelize(a) val B = sparkContext.parallelize(b) val ac = new PairRDDFunctions[Int, Int](A) *val C = ac.join(B
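For completeness, a self-contained sketch of the same join (names illustrative); with import org.apache.spark.SparkContext._ the PairRDDFunctions wrapper is picked up implicitly, so a plain join on the keyed RDDs is enough:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // implicit conversion to PairRDDFunctions

    object InnerJoinExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("inner-join").setMaster("local[2]"))
        val a = sc.parallelize(List((1, 2), (2, 4), (3, 6)))
        val b = sc.parallelize(List((1, 3), (2, 5), (3, 6), (4, 5), (5, 6)))
        val c = a.join(b)                    // inner join on the key
        c.collect().foreach(println)         // (1,(2,3)), (2,(4,5)), (3,(6,6)) in some order
        sc.stop()
      }
    }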

Re: Check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-17 Thread lin_qili
I ran into this issue with Spark on YARN version 1.0.2. Are there any hints? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Check-your-cluster-UI-to-ensure-that-workers-are-registered-and-have-sufficient-memory-tp5358p19133.html Sent from the Apache Spar

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Prashant Sharma
It is safe in the sense that we would help you with the fix if you run into issues. I have used it, but since I worked on the patch my opinion may be biased. I am using Scala 2.11 for day-to-day development. You should check out the build instructions here: https://github.com/ScrapCodes/spark-1/blob/pa

Re: Probability in Naive Bayes

2014-11-17 Thread Sean Owen
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there. If you really mean the probability of either A or B, then if your classes are exclusive it is
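A rough sketch of that hack, assuming a multinomial model and assuming the model's labels, pi (log priors) and theta (log conditional probabilities) are reachable from your code (they are public vals in some versions; otherwise reflection or a copied class is needed):

    import org.apache.spark.mllib.classification.NaiveBayesModel
    import org.apache.spark.mllib.linalg.Vector

    // Returns (label, posterior probability) pairs for one feature vector.
    // log P(class_i | x) is proportional to pi(i) + theta(i) . x for multinomial Naive Bayes.
    def posteriors(model: NaiveBayesModel, features: Vector): Array[(Double, Double)] = {
      val x = features.toArray
      val logProbs = model.labels.indices.map { i =>
        model.pi(i) + model.theta(i).zip(x).map { case (t, f) => t * f }.sum
      }
      val maxLog = logProbs.max
      val unnorm = logProbs.map(lp => math.exp(lp - maxLog))   // shift for numerical stability
      val total  = unnorm.sum
      model.labels.zip(unnorm.map(_ / total))
    }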

Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues with using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my Spark application and use a Spark dist built with 2.10? I'm looking forward to migrating to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitte

Probability in Naive Bayes

2014-11-17 Thread Samarth Mailinglist
I am trying to use Naive Bayes for a project of mine in Python and I want to obtain the probability value after having built the model. Suppose I have two classes - A and B. Currently there is an API to find which class a sample belongs to (predict). Now, I want to find the probability of it be

Null pointer exception with larger datasets

2014-11-17 Thread Naveen Kumar Pokala
Hi, I have a list of Students, its size is one lakh (100,000), and I am trying to save it to a file. It is throwing a null pointer exception. JavaRDD distData = sc.parallelize(list); distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); 14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost task

Running PageRank in GraphX

2014-11-17 Thread Deep Pradhan
Hi, I just ran the PageRank code in GraphX with some sample data. What I am seeing is that the total rank changes drastically if I change the number of iterations from 10 to 100. Why is that so? Thank You

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
I see. Agree that lazy eval is not suitable for proper setup and teardown. We also abandoned it due to an inherent incompatibility between implicit and lazy. It was fun to come up with this trick though. Jianshi On Tue, Nov 18, 2014 at 10:28 AM, Tobias Pfeiffer wrote: > Hi, > > On Fri, Nov 14, 2014 at

Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Sean Owen
I'm just using PMML. I haven't hit any limitation of its expressiveness for the model types it supports. I don't think there is a point in defining a new format for models, except that PMML can get very big. Still, just compressing the XML gets it down to a manageable size for just about any re

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Yes, it always appears on a subset of the tasks in a stage (i.e. 100/100 (65 failed)), and sometimes causes the stage to fail. And there is another error that I'm not sure is related: java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.sql.catalyst.expre

Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Manish Amde
Hi Charles, I am not aware of other storage formats. Perhaps Sean or Sandy can elaborate more given their experience with Oryx. There is work by Smola et al at Google that talks about large scale model update and deployment. https://www.usenix.org/conference/osdi14/technical-sessions/presentation

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
Interesting, I believe we have run that query with version 1.1.0 with codegen turned on and not much has changed there. Is the error deterministic? On Mon, Nov 17, 2014 at 7:04 PM, Eric Zhen wrote: > Hi Michael, > > We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. > > On Tue, Nov 18,

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Eric Zhen
Hi Michael, We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4. On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust wrote: > What version of Spark SQL? > > On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote: > >> Hi all, >> >> We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Tobias Pfeiffer
Hi, On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang wrote: > Ok, then we need another trick. > > let's have an *implicit lazy var connection/context* around our code. And > setup() will trigger the eval and initialization. > Due to lazy evaluation, I think having setup/teardown is a bit tricky.
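For comparison, the usual alternative to the lazy-val trick is to do setup and teardown per partition inside mapPartitions; a minimal sketch in which Connection and openConnection() are hypothetical placeholders for whatever resource needs managing:

    import org.apache.spark.rdd.RDD

    object SetupTeardownSketch {
      trait Connection { def send(s: String): Unit; def close(): Unit }
      def openConnection(): Connection = ???   // placeholder: real setup would go here

      def withConnection(rdd: RDD[String]): RDD[String] =
        rdd.mapPartitions { iter =>
          val conn = openConnection()          // setup, once per partition
          val out = iter.map { record =>
            conn.send(record)                  // use the resource for each element
            record
          }.toList                             // force evaluation before the connection closes
          conn.close()                         // teardown, once per partition
          out.iterator
        }
    }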

Re: Spark Custom Receiver

2014-11-17 Thread Tathagata Das
That is not yet on the roadmap. If there is sufficient demand for it, we can think about it. But exposing internal communication details in an API-stable manner is usually a little tricky. TD On Mon, Nov 17, 2014 at 5:50 PM, Jacob Abraham wrote: > Thanks TD for your reply. Hacking is fine, but I

Re: Communication between Driver and Executors

2014-11-17 Thread Tobias Pfeiffer
Hi, so I didn't manage to get the Broadcast variable with a new value distributed to my executors in YARN mode. In local mode it worked fine, but when running on YARN either nothing happened (when unpersist() was called on the driver) or I got a TimeoutException (when called on the executor). I fi
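One driver-side pattern that sidesteps mutating a live broadcast is to create a fresh Broadcast whenever the value changes and unpersist the old one; a hedged sketch in which loadLookup() is a hypothetical loader for the new value:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    // Replaces the broadcast instead of updating it in place; closures should call `get`
    // each batch so they pick up the latest broadcast object.
    class RefreshableBroadcast(sc: SparkContext) {
      @volatile private var current: Broadcast[Map[String, String]] = sc.broadcast(loadLookup())

      def get: Broadcast[Map[String, String]] = current

      def refresh(): Unit = {
        val old = current
        current = sc.broadcast(loadLookup())   // ship a new value under a new broadcast id
        old.unpersist(blocking = false)        // drop stale copies on the executors
      }

      private def loadLookup(): Map[String, String] = Map.empty   // placeholder
    }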

java.lang.ArithmeticException while create Parquet

2014-11-17 Thread Jahagirdar, Madhu
All, This causes the error below. However, if I replace JavaHiveContext with HiveContext (see the commented code below) and replace the Java class with a Scala case class, the same code works OK. Any reason why this could be happening? java.lang.ArithmeticException: / by zero at parquet.hadoop.Intern

Re: Spark Custom Receiver

2014-11-17 Thread Jacob Abraham
Thanks TD for your reply. Hacking is fine, but I am not a fan of hacking into something that I don't fully understand (law of unintended consequences). So, this option is out for me... :) I could set up my own actor (I suppose you mean Akka actors), and here the problem is that I don't understand the

Re: Using data in RDD to specify HDFS directory to write to

2014-11-17 Thread jschindler
Yes, thank you for the suggestion. The error I found below was in the worker logs. AssociationError [akka.tcp://sparkwor...@cloudera01.local.company.com:7078] -> [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]: Error [Association failed with [akka.tcp://sparkexecu...@cloudera01.local.co

RE: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Shao, Saisai
Hi Bill, Would you mind describing what you found a little more specifically? I'm not sure there's a parameter in KafkaUtils.createStream where you can specify the Spark parallelism; also, what are the exception stacks? Thanks Jerry From: Bill Jay [mailto:bill.jaypeter...@gmail.com] Sent: Tuesday,

RDD Blocks skewing to just few executors

2014-11-17 Thread mtimper
Hi I'm running a standalone cluster with 8 worker servers. I'm developing a streaming app that is adding new lines of text to several different RDDs each batch interval. Each line has a well randomized unique identifier that I'm trying to use for partitioning, since the data stream does contain du
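One thing worth double-checking in a case like this is that the RDD is actually hash-partitioned on that identifier; a minimal sketch under the assumption that each line has already been keyed by its unique id:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._   // pair-RDD functions on older versions
    import org.apache.spark.rdd.RDD

    // Spread (id, line) pairs evenly across a fixed number of partitions.
    def spreadByKey(keyed: RDD[(String, String)], numPartitions: Int): RDD[(String, String)] =
      keyed.partitionBy(new HashPartitioner(numPartitions)).persist()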

Re: SparkSQL exception on spark.sql.codegen

2014-11-17 Thread Michael Armbrust
What version of Spark SQL? On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen wrote: > Hi all, > > We run SparkSQL on TPCDS benchmark Q19 with spark.sql.codegen=true, we > got exceptions as below, has anyone else saw these before? > > java.lang.ExceptionInInitializerError > at > org.apache.spa

Re: Exception in spark sql when running a group by query

2014-11-17 Thread Michael Armbrust
You are perhaps hitting an issue that was fixed by #3248 ? On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood wrote: > While testing sparkSQL, we were running this group by with expression > query and got an exception. The same query worked fine on hive. > >

Re: independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Armbrust
This is an unfortunate/known issue that we are hoping to address in the next release: https://issues.apache.org/jira/browse/SPARK-2087 I'm not sure how straightforward a fix would be, but it would involve keeping / setting the SessionState for each connection to the server. It would be great if y

Re: Assigning input files to spark partitions

2014-11-17 Thread Daniel Siegmann
I'm not aware of any such mechanism. On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia wrote: > Hi Daniel, > > Yes that should work also. However, is it possible to setup so that each > RDD has exactly one partition, without repartitioning (and thus incurring > extra cost)? Is there a mechanism si

Re: Assigning input files to spark partitions

2014-11-17 Thread Pala M Muthaia
Hi Daniel, Yes, that should work also. However, is it possible to set things up so that each RDD has exactly one partition, without repartitioning (and thus incurring extra cost)? Is there a mechanism similar to MR where we can ensure each partition is assigned some amount of data by size, by setting some
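A sketch of one way to approximate one-partition-per-file, under the assumption that each file is small enough that collapsing its splits with coalesce(1) (which avoids a shuffle) is acceptable; whether this satisfies the size-based assignment asked about above is a separate question:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical helper: read each file as its own RDD, force one partition per file, union.
    def onePartitionPerFile(sc: SparkContext, paths: Seq[String]): RDD[String] = {
      val perFile = paths.map(p => sc.textFile(p).coalesce(1))
      sc.union(perFile)
    }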

RE: RDD.aggregate versus accumulables...

2014-11-17 Thread lordjoe
I have been playing with using accumulators (despite the possible error with multiple attempts) These provide a convenient way to get some numbers while still performing business logic. I posted some sample code at http://lordjoesoftware.blogspot.com/. Even if accumulators are not perfect today -
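For reference, the basic shape of that approach, with the caveat from elsewhere in this thread that re-executed tasks can make the counts overshoot; names and data are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object AccumulatorCountExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("acc-example").setMaster("local[2]"))
        val badRecords = sc.accumulator(0L)               // driver-visible counter
        val data = sc.parallelize(Seq("1", "2", "oops", "4"))
        val parsed = data.flatMap { s =>
          try Some(s.toInt)
          catch { case _: NumberFormatException => badRecords += 1; None }
        }
        parsed.count()                                    // forces evaluation of the lineage
        println(s"bad records (may over-count on retries): ${badRecords.value}")
        sc.stop()
      }
    }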

independent user sessions with a multi-user spark sql thriftserver (Spark 1.1)

2014-11-17 Thread Michael Allman
Hello, We're running a spark sql thriftserver that several users connect to with beeline. One limitation we've run into is that the current working database (set with "use ") is shared across all connections. So changing the database on one connection changes the database for all connections. T

Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
Hi all, I found the reason for this issue. It seems that in the new version, if I do not specify spark.default.parallelism in KafkaUtils.createStream, there will be an exception at the Kafka stream creation stage. In previous versions, it seems Spark would use the default value. Thanks! Bill On

IOException: exception in uploadSinglePart

2014-11-17 Thread Justin Mills
Spark 1.1.0, running on an AWS EMR cluster using yarn-client as master. I'm getting the following error when I attempt to save an RDD to S3. I've narrowed it down to a single partition that is ~150 MB in size (versus the other partitions that are closer to 1 MB). I am able to work around this by saving

Re: Missing SparkSQLCLIDriver and Beeline drivers in Spark

2014-11-17 Thread Ted Yu
Minor correction: there was a typo in commandline hive-thirftserver should be hive-thriftserver Cheers On Thu, Aug 7, 2014 at 6:49 PM, Cheng Lian wrote: > Things have changed a bit in the master branch, and the SQL programming > guide in master branch actually doesn’t apply to branch-1.0-jdbc.

Re: java.lang.OutOfMemoryError: Requested array size exceeds VM limit

2014-11-17 Thread akhandeshi
"only option is to split you problem further by increasing parallelism" My understanding is by increasing the number of partitions, is that right? That didn't seem to help because it is seem the partitions are not uniformly sized. My observation is when I increase the number of partitions, it c

Re: How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Sean Owen
Just RDD.join() should be an inner join. On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith wrote: > So let us say I have RDDs A and B with the following values. > > A = [ (1, 2), (2, 4), (3, 6) ] > > B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ] > > I want to apply an inner join, such that I get the

RE: RDD.aggregate versus accumulables...

2014-11-17 Thread Segerlind, Nathan L
Thanks for the link to the bug. Unfortunately, using accumulators like this is getting spread around as a recommended practice despite the bug. From: Daniel Siegmann [mailto:daniel.siegm...@velos.io] Sent: Monday, November 17, 2014 8:32 AM To: Segerlind, Nathan L Cc: user Subject: Re: RDD.aggre

Re: How to broadcast a textFile?

2014-11-17 Thread YaoPau
OK then I'd still need to write the code (within my spark streaming code I'm guessing) to convert my text file into an object like a HashMap before broadcasting. How can I make sure only the HashMap is being broadcast while all the pre-processing to create the HashMap is only performed once?

Spark streaming on Yarn

2014-11-17 Thread kpeng1
Hi, I have been using Spark Streaming in standalone mode and now I want to migrate to Spark running on YARN, but I am not sure how you would go about designating a specific node in the cluster to act as an Avro listener, since I am using the Flume-based push approach with Spark. -- View

How do I get the executor ID from running Java code

2014-11-17 Thread Steve Lewis
The Spark UI lists a number of executor IDs on the cluster. I would like to access both the executor ID and the task/attempt IDs from the code inside a function running on a slave machine. Currently my motivation is to examine parallelism and locality, but in Hadoop this aids in allowing code to write non
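One way people peek at this from task code is via SparkEnv; it is an internal/developer API, so treat the following as a diagnostic sketch that may break between versions rather than a stable interface:

    import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

    object WhereAmIExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("where-am-i").setMaster("local[2]"))
        val placements = sc.parallelize(1 to 8, 4).mapPartitionsWithIndex { (partitionId, iter) =>
          val executorId = SparkEnv.get.executorId   // internal API: id of the executor running this task
          iter.map(x => (executorId, partitionId, x))
        }
        placements.collect().foreach(println)
        sc.stop()
      }
    }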

Exception in spark sql when running a group by query

2014-11-17 Thread Sadhan Sood
While testing sparkSQL, we were running this group by with expression query and got an exception. The same query worked fine on hive. SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200), '/MM/dd') as pst_date, count(*) as num_xyzs FROM all_matched_abc GROUP BY

How can I apply such an inner join in Spark Scala/Python

2014-11-17 Thread Blind Faith
So let us say I have RDDs A and B with the following values. A = [ (1, 2), (2, 4), (3, 6) ] B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ] I want to apply an inner join, such that I get the following as a result. C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ] That is, those keys which are not pr

Re: Building Spark with hive does not work

2014-11-17 Thread Ted Yu
Looks like this was where you got that commandline: http://search-hadoop.com/m/JW1q5RlPrl Cheers On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren wrote: > Sry for spamming, > > Just after my previous post, I noticed that the command used is: > > ./sbt/sbt -Phive -Phive-thirftserver clean assembly/asse

Re: Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Sorry for spamming. Just after my previous post, I noticed that the command used is: ./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly thriftserver* The typo is the evil. Stupid me. I believe I just copy-pasted it from somewhere else but never even checked it; meanwhile no error ms

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
As you are using sbt, you need not put it in ~/.m2/repositories for Maven. Include the jar explicitly using the --driver-class-path option while submitting the jar to the Spark cluster. On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User List] wrote: > It's still not working. Keep

Re: Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Hi Cheng, Actually, I am just using the commit 64c6b9b Here is HEAD of the master branch: $ git log commit 64c6b9bad559c21f25cd9fbe37c8813cdab939f2 Author: Michael Armbrust Date: Sun Nov 16 21:55:57 2014 -0800 [SPARK-4410][SQL] Add support for external sort Adds a new operator

Re: How to broadcast a textFile?

2014-11-17 Thread Andy Twigg
Broadcast copies arbitrary objects, so you could read it into an object such as an array of lines and then broadcast that. Andy On Monday, 17 November 2014, YaoPau wrote: > I have a 1 million row file that I'd like to read from my edge node, and > then > send a copy of it to each Hadoop machine's memo
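A sketch of that suggestion, assuming the file is small enough to hold on the driver and (purely as an illustration) that each line is a tab-separated key/value pair; the pre-processing runs once on the driver and only the finished map is broadcast:

    import scala.io.Source
    import org.apache.spark.SparkContext

    // Build the map once on the driver, then broadcast only the finished object.
    def broadcastLookup(sc: SparkContext, path: String) = {
      val lookup: Map[String, String] = Source.fromFile(path).getLines()
        .map { line =>
          val Array(k, v) = line.split("\t", 2)   // assumed tab-separated layout
          k -> v
        }.toMap
      sc.broadcast(lookup)                        // executors fetch the map once per node
    }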

How to broadcast a textFile?

2014-11-17 Thread YaoPau
I have a 1 million row file that I'd like to read from my edge node, and then send a copy of it to each Hadoop machine's memory in order to run JOINs in my Spark Streaming code. I see examples in the docs of how to use broadcast() for a simple array, but how about when the data is in a textFile?

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Surendranauth Hiraman
We use Algebird for calculating things like min/max, stddev, variance, etc. https://github.com/twitter/algebird/wiki -Suren On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann wrote: > You should *never* use accumulators for this purpose because you may get > incorrect answers. Accumulators can

Re: Building Spark with hive does not work

2014-11-17 Thread Cheng Lian
Hey Hao, Which commit are you using? Just tried 64c6b9b with exactly the same command line flags, couldn't reproduce this issue. Cheng On 11/17/14 10:02 PM, Hao Ren wrote: Hi, I am building spark on the most recent master branch. I checked this page: https://github.com/apache/spark/blob/ma

Re: RDD.aggregate versus accumulables...

2014-11-17 Thread Daniel Siegmann
You should *never* use accumulators for this purpose because you may get incorrect answers. Accumulators can count the same thing multiple times - you cannot rely upon the correctness of the values they compute. See SPARK-732 for more info. On Sun,
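For comparison, the same kind of summary can be computed inside the job with RDD.aggregate, which is not affected by task retries; a small sketch computing count, sum and max in one pass:

    import org.apache.spark.{SparkConf, SparkContext}

    object AggregateStatsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("aggregate-stats").setMaster("local[2]"))
        val values = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0, 9.0))

        // Accumulated state: (count, sum, max)
        val zero = (0L, 0.0, Double.MinValue)
        val (count, sum, max) = values.aggregate(zero)(
          (acc, v) => (acc._1 + 1, acc._2 + v, math.max(acc._3, v)),      // fold within a partition
          (a, b)   => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))    // merge partition results
        )
        println(s"count=$count sum=$sum max=$max mean=${sum / count}")
        sc.stop()
      }
    }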

Re: same error of SPARK-1977 while using trainImplicit in mllib 1.0.2

2014-11-17 Thread aaronlin
Thanks. It works for me. -- Aaron Lin On Saturday, November 15, 2014 at 1:19 AM, Xiangrui Meng wrote: > If you use the Kryo serializer, you need to register mutable.BitSet and Rating: > > https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.s

Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Yifan LI
I am not sure there is a direct way (an API in GraphX, etc.) to measure the number of transferred vertex values among nodes during computation. It might depend on: - the operations in your application, e.g. whether each vertex only communicates with its immediate neighbours. - the partition strategy yo

Re: How do you force a Spark Application to run in multiple tasks

2014-11-17 Thread Daniel Siegmann
I've never used Mesos, sorry. On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis wrote: > The cluster runs Mesos and I can see the tasks in the Mesos UI but most > are not doing much - any hints about that UI > > On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann < > daniel.siegm...@velos.io> wrote: >

Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread Ritesh Kumar Singh
Yeah, it works. Although when I try to define a var of type DenseMatrix, like this: var mat1: DenseMatrix[Double], it gives an error saying we need to initialise the matrix mat1 at the time of declaration. I had to initialise it as: var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1,1) A

Re: Understanding spark operation pipeline and block storage

2014-11-17 Thread Hao Ren
Hi Cheng, You are right. =) I checked your article a few days ago. I have some further questions: According to the article, the following code has space complexity O(1). val lines = spark.textFile("hdfs://") val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.sp

Re: RandomGenerator class not found exception

2014-11-17 Thread Ritesh Kumar Singh
It's still not working; I keep getting the same error. I even deleted the commons-math3/* folder containing the jar. And then under the directory "org/apache/commons/" I made a folder called 'math3' and put commons-math3-3.3.jar in it. Still it's not working. I also tried sc.addJar("/path/to/jar")

Building Spark with hive does not work

2014-11-17 Thread Hao Ren
Hi, I am building spark on the most recent master branch. I checked this page: https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md The cmd *./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly* works fine. A fat jar is created. However, when I started the SQL-CLI,

Re: Returning breeze.linalg.DenseMatrix from method

2014-11-17 Thread tribhuvan...@gmail.com
This should fix it -- def func(str: String): DenseMatrix*[Double]* = { ... ... } So, why is this required? Think of it like this -- If you hadn't explicitly mentioned Double, it might have been that the calling function expected a DenseMatrix[SomeOtherType], and performed a SomeOtherType-
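Putting the two replies together, a tiny compilable sketch with the explicit element type on the return value and an initialised var (breeze on the classpath is assumed, as in the original question):

    import breeze.linalg.DenseMatrix

    object DenseMatrixReturnExample {
      // The element type must appear in the return type, otherwise the caller
      // cannot know which DenseMatrix[T] it is getting back.
      def func(str: String): DenseMatrix[Double] =
        DenseMatrix.zeros[Double](str.length max 1, 2)

      def main(args: Array[String]): Unit = {
        // A var of this type has to be initialised at declaration time.
        var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1, 1)
        mat1 = func("hello")
        println(mat1.rows + " x " + mat1.cols)
      }
    }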

Re: Building Spark for Hive The requested profile "hadoop-1.2" could not be activated because it does not exist.

2014-11-17 Thread akshayhazari
Oops, I guess this is the right way to do it: mvn -Phive -Dhadoop.version=1.2.1 clean -DskipTests package -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Building-Spark-for-Hive-The-requested-profile-hadoop-1-2-could-not-be-activated-because-it-does-not--

Re: HDFS read text file

2014-11-17 Thread Hlib Mykhailenko
Hello Naveen, I think you should first override the "toString" method of your sample.spark.test.Student class. -- Kind regards, Hlib Mykhailenko PhD student at INRIA Sophia-Antipolis Méditerranée 2004 Route des Lucioles BP93 06902 SOPHIA ANTIPOLIS cedex - Original Message - > From: "N

Re: HDFS read text file

2014-11-17 Thread Akhil Das
You can use sc.objectFile to read it. It will be of type RDD[Student]. Thanks Best Regards On Mon, Nov 17, 2014 at 4:03 PM, Naveen Kumar Pokala < npok...@spcapitaliq.com> wrote: > Hi, > > > > > > JavaRDD s
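The object-file round trip being referred to looks roughly like this (Student is a placeholder case class, and the HDFS path is illustrative); if a human-readable text file is the goal instead, overriding toString and sticking with saveAsTextFile, as in the other reply, is the alternative:

    import org.apache.spark.{SparkConf, SparkContext}

    object ObjectFileRoundTrip {
      case class Student(id: Int, name: String)   // stand-in for the real Student class

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("object-file").setMaster("local[2]"))
        val students = sc.parallelize(Seq(Student(1, "a"), Student(2, "b")))

        students.saveAsObjectFile("hdfs://master/data/spark/students.obj")   // binary, not text
        val back = sc.objectFile[Student]("hdfs://master/data/spark/students.obj")
        back.collect().foreach(println)
        sc.stop()
      }
    }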

Re: How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Akhil Das
You can use Ganglia to see the overall data transfer across the cluster/nodes. I don't think there's a direct way to get the vertices being transferred. Thanks Best Regards On Mon, Nov 17, 2014 at 4:29 PM, Hlib Mykhailenko wrote: > Hello, > > I use Spark Standalone Cluster and I want to measure

How to measure communication between nodes in Spark Standalone Cluster?

2014-11-17 Thread Hlib Mykhailenko
Hello, I use a Spark Standalone Cluster and I want to somehow measure internode communication. As I understand it, GraphX transfers only vertex values. Am I right? But I do not want to get the number of bytes that were transferred between any two nodes. So is there a way to measure how many values

Building Spark for Hive The requested profile "hadoop-1.2" could not be activated because it does not exist.

2014-11-17 Thread akshayhazari
I am using Apache Hadoop 1.2.1. I wanted to use Spark SQL with Hive, so I tried to build Spark like so: > mvn -Phive,hadoop-1.2 -Dhadoop.version=1.2.1 clean -DskipTests package But I get the following error: The requested profile "hadoop-1.2" could not be activated because it does not ex

HDFS read text file

2014-11-17 Thread Naveen Kumar Pokala
Hi, JavaRDD studentsData = sc.parallelize(list); // list is the Student info list studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt"); The above statements saved the students' information in HDFS as a text file. Each object is a line in the text file, as below.

Spark streaming batch overrun

2014-11-17 Thread facboy
Hi all, In this presentation (https://prezi.com/1jzqym68hwjp/spark-gotchas-and-anti-patterns/) it mentions that Spark Streaming's behaviour is undefined if a batch overruns the polling interval. Is this something that might be addressed in future or is it fundamental to the design? -- View th

Re: How to incrementally compile spark examples using mvn

2014-11-17 Thread Sean Owen
The downloads just happen once so this is not a problem. If you are just building one module in a project, it needs a compiled copy of other modules. It will either use your locally-built and locally-installed artifact, or, download one from the repo if possible. This isn't needed if you are comp

Landmarks in GraphX section of Spark API

2014-11-17 Thread Deep Pradhan
Hi, I was going through the GraphX section in the Spark API at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$ Here, I find the word "landmark". Can anyone explain to me what landmark means? Is it a simple English word or does it mean somethi

Re: Functions in Spark

2014-11-17 Thread Gerard Maas
One rule of thumb is to use rdd.toDebugString and check the lineage for a ShuffledRDD. As long as there's no need for restructuring the RDD, operations can be pipelined on each partition. "rdd.toDebugString" is your friend :-) -kr, Gerard. On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha wrote: >
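A quick illustration of the rule of thumb, run in spark-shell (where sc already exists and the input path is illustrative); only the reduceByKey step introduces a ShuffledRDD into the lineage:

    // Narrow transformations are pipelined within a partition; the shuffle appears only at reduceByKey.
    val words  = sc.textFile("hdfs:///tmp/input.txt").flatMap(_.split(" ")).map(w => (w, 1))
    val counts = words.reduceByKey(_ + _)

    println(words.toDebugString)    // no ShuffledRDD in this lineage
    println(counts.toDebugString)   // lineage now includes a ShuffledRDD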

Re: RandomGenerator class not found exception

2014-11-17 Thread Chitturi Padma
Include commons-math3 3.3 in the classpath while submitting the jar to the Spark cluster, like: spark-submit --driver-class-path maths3.3jar --class MainClass --master spark cluster url appjar On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User List] wrote: > My sbt file for the

Re: RandomGenerator class not found exception

2014-11-17 Thread Akhil Das
Add this jar while creating the SparkContext: sc.addJar("/path/to/commons-math3-3.3.jar") And make sure it is shipped and available in the Environment tab (port 4040). Thanks Best Regards On Mon, Nov 17, 2014 at 1:54 PM, Rite

RandomGenerator class not found exception

2014-11-17 Thread Ritesh Kumar Singh
My sbt file for the project includes this: libraryDependencies ++= Seq( "org.apache.spark" %% "spark-core" % "1.1.0", "org.apache.spark" %% "spark-mllib" % "1.1.0", "org.apache.commons" % "commons-math3" % "3.3" ) =

Re: spark-submit question

2014-11-17 Thread Samarth Mailinglist
I figured it out. I had to use pyspark.files.SparkFiles to get the locations of files loaded into Spark. On Mon, Nov 17, 2014 at 1:26 PM, Sean Owen wrote: > You are changing these paths and filenames to match your own actual > scripts and file locations right? > On Nov 17, 2014 4:59 AM, "Samart