K-Means And Class Tags
Hi All,

I'm trying a simple K-Means example as per the website:

    val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble)))

but I'm trying to write a Java-based validation method first, so that missing values are omitted or replaced with 0:

    public RDD<Vector> prepareKMeans(JavaRDD<String> data) {
        JavaRDD<Vector> words = data.flatMap(new FlatMapFunction<String, Vector>() {
            public Iterable<Vector> call(String s) {
                String[] split = s.split(",");
                ArrayList<Vector> add = new ArrayList<Vector>();
                if (split.length != 2) {
                    add.add(Vectors.dense(0, 0));
                } else {
                    add.add(Vectors.dense(Double.parseDouble(split[0]),
                                          Double.parseDouble(split[1])));
                }
                return add;
            }
        });
        return words.rdd();
    }

When I then call this from Scala:

    val parsedData = dc.prepareKMeans(data);
    val p = parsedData.collect();

I get:

    Exception in thread "main" java.lang.ClassCastException: [Ljava.lang.Object;
    cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector;

Why is the class tag Object rather than Vector?

1) How do I get this working correctly using the Java validation example above, or
2) How can I modify val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))) so that a line is ignored when s.split produces fewer than 2 fields, or
3) Is there a better way to do input validation first?

Using Spark and MLlib:

    libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.2.0"
    libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "1.2.0"

Many thanks in advance
Dev
Re: K-Means And Class Tags
Thanks for the suggestion. Can anyone offer any advice on the ClassCastException going from Java to Scala? Why does calling JavaRDD.rdd() and then collect() result in this exception?

On Thu, Jan 8, 2015 at 4:13 PM, Yana Kadiyska wrote:
> How about
>
>     data.map(s=>s.split(",")).filter(_.length>1).map(good_entry=>Vectors.dense((Double.parseDouble(good_entry[0]),
>       Double.parseDouble(good_entry[1]))
>
> (full disclosure, I didn't actually run this). But after the first map you
> should have an RDD[Array[String]], then you'd discard everything shorter
> than 2 and convert the rest to dense vectors. In fact, if you're expecting
> length exactly 2 you might want to filter == 2...
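For reference, a corrected plain-Scala version of that sketch (also untested) would use Scala indexing and toDouble rather than the Java-style calls mixed into the snippet above:

    // Untested sketch of the filter-then-convert idea suggested above.
    val parsedData = data.map(_.split(","))
      .filter(_.length == 2)                                  // keep only well-formed lines
      .map(a => Vectors.dense(a(0).toDouble, a(1).toDouble))  // RDD[Vector], no Java bridge needed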
Re: K-Means And Class Tags
Hi Joseph,

Thanks for the suggestion. However, retag is a private method, and when I call it from Scala:

    val retaggedInput = parsedData.retag(classOf[Vector])

I get:

    Symbol retag is inaccessible from this place

However, I can do this from Java, and it then works in Scala:

    return words.rdd().retag(Vector.class);

Dev

On Thu, Jan 8, 2015 at 9:35 PM, Joseph Bradley wrote:
> I believe you're running into an erasure issue which we found in
> DecisionTree too. Check out:
>
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala#L134
>
> That retags RDDs which were created from Java to prevent the exception
> you're running into.
>
> Hope this helps!
> Joseph
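With the Java-side retag in place, the Scala caller from the original post should work unchanged; a minimal sketch of the end-to-end usage (the k and iteration values are placeholders, not from this thread):

    // Assumes prepareKMeans now ends with: return words.rdd().retag(Vector.class);
    val parsedData = dc.prepareKMeans(data)
    val p = parsedData.collect()                 // no ClassCastException once retagged
    val model = KMeans.train(parsedData, 2, 20)  // k = 2, maxIterations = 20 (placeholders)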
Re: LinearRegressionWithSGD accuracy
Thanks, that helps a bit, at least with the NaN, but the MSE is still very high even with that step size and 10k iterations:

    training Mean Squared Error = 3.3322561285919316E7

Does this method need, say, 100k iterations?

On Thu, Jan 15, 2015 at 5:42 PM, Robin East wrote:
> -dev, +user
>
> You'll need to set the gradient descent step size to something small - a
> bit of trial and error shows that 0.0001 works.
>
> You'll need to create a LinearRegressionWithSGD instance and set the step
> size explicitly:
>
>     val lr = new LinearRegressionWithSGD()
>     lr.optimizer.setStepSize(0.0001)
>     lr.optimizer.setNumIterations(100)
>     val model = lr.run(parsedData)
>
> On 15 Jan 2015, at 16:46, devl.development wrote:
>
> From what I gather, you use LinearRegressionWithSGD to predict y, the
> response variable, given a feature vector x.
>
> In a simple example I used a perfectly linear dataset such that x = y:
>
>     y,x
>     1,1
>     2,2
>     ...
>
> Using the out-of-the-box example from the website (with and without scaling):
>
>     val data = sc.textFile(file)
>
>     val parsedData = data.map { line =>
>       val parts = line.split(',')
>       LabeledPoint(parts(1).toDouble, Vectors.dense(parts(0).toDouble)) // y and x
>     }
>     val scaler = new StandardScaler(withMean = true, withStd = true)
>       .fit(parsedData.map(x => x.features))
>     val scaledData = parsedData
>       .map(x =>
>         LabeledPoint(x.label,
>           scaler.transform(Vectors.dense(x.features.toArray))))
>
>     // Building the model
>     val numIterations = 100
>     val model = LinearRegressionWithSGD.train(parsedData, numIterations)
>
>     // Evaluate model on training examples and compute training error
>     // * tried using both scaledData and parsedData
>     val valuesAndPreds = scaledData.map { point =>
>       val prediction = model.predict(point.features)
>       (point.label, prediction)
>     }
>     val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
>     println("training Mean Squared Error = " + MSE)
>
> Both scaled and unscaled attempts give:
>
>     training Mean Squared Error = NaN
>
> I've even tried x, y + (sample noise from a normal with mean 0 and stddev 1);
> it still comes up with the same thing.
>
> Is this not supposed to work for x and y, i.e. 2-dimensional plots? Is there
> something I'm missing or wrong in the code above? Or is there a limitation
> in the method?
>
> Thanks for any advice.
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/LinearRegressionWithSGD-accuracy-tp10127.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: LinearRegressionWithSGD accuracy
It was a bug in the code; however, adding the step parameter as well got the results to work:

    Mean Squared Error = 2.610379825794694E-5

I've also opened a JIRA to put the step parameter in the examples, so that people new to MLlib have a way to improve the MSE:

https://issues.apache.org/jira/browse/SPARK-5273

On Thu, Jan 15, 2015 at 8:23 PM, Joseph Bradley wrote:
> It looks like you're training on the non-scaled data but testing on the
> scaled data. Have you tried training & testing on only the scaled data?
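Putting the two fixes from this thread together (an explicit step size, and training and evaluating on the same data), the relevant part of the example would look roughly like the sketch below; the values are the ones suggested above, and train(input, numIterations, stepSize) is the MLlib overload that exposes the step size directly:

    // Sketch only: train on scaledData with a small step size, evaluate on the same RDD.
    val numIterations = 100
    val stepSize = 0.0001
    val model = LinearRegressionWithSGD.train(scaledData, numIterations, stepSize)

    val valuesAndPreds = scaledData.map { point =>
      (point.label, model.predict(point.features))
    }
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    println("training Mean Squared Error = " + MSE)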
Some praise and comments on Spark
Hi Spark Developers,

First, apologies if this doesn't belong on this list, but the comments/praise are relevant to all developers. This is just a small note about what we really like about Spark. I/we don't mean to start a whole long discussion thread in this forum, just to share our positive experiences with Spark so far.

To start, as you can tell, we think that the Spark project is amazing and we love it! Having put nearly half a decade's worth of sweat and tears into production Hadoop and MapReduce clusters and application development, it's so refreshing to see something arguably simpler and more elegant supersede it.

These are the things we love about Spark and hope these principles continue:

- The one-command build, make-distribution.sh. Simple, clean and ideal for deployment, devops, and rebuilding on different environments and nodes.
- Not having too much runtime and deploy config. As admins and developers we are sick of setting properties like io.sort and mapred.job.shuffle.merge.percent, dfs file locations, temp directories and so on, again and again, every time we deploy a job, a new cluster, an environment, or even change company.
- A fully built-in stack: one global project for SQL, DataFrames, MLlib etc., so there is no need to bolt on extra projects as with Hive, Hue, HBase etc. This makes life easier and keeps everything in one place.
- Single (global) user-based operation - no creation of hdfs/mapred unix users - which makes life much simpler.
- Simple quick-start daemons: master and slaves. Not having to worry about JT, NN, DN, TT, RM, HBase master... and doing netstat and jps on hundreds of clusters makes life much easier.
- Proper code versioning, feature releases and release management.
- Good, well-organised documentation with good examples.

In addition to the comments above, this is where we hope Spark never ends up:

- Tonnes of configuration properties and "go faster" type flags. For example, Hadoop and HBase users will know that there is a whole catalogue of properties for regions, caches, network properties, block sizes, etc. Please don't end up like https://hadoop.apache.org/docs/r1.0.4/mapred-default.html, for example; it is painful having to configure all of this, create a set of properties for each environment, and then tie this into CI and deployment tools.
- More daemons and processes to have to monitor, manipulate, restart and watch crash.
- A project that penalises the developers (who will ultimately help promote Spark to their managers and budget holders) with expensive training, certification, books and accreditation. Ideally this open source project should stay free: free training = more users = more commercial uptake.

Anyway, those are our thoughts for what they are worth. Keep up the good work - we just had to mention it. Again, sorry if this is not the right place or if there is another forum for this stuff.

Cheers
Re: Integrating Spark with Ignite File System
Hi Dmitriy,

Thanks for the input. As per my previous email, I think it would be good to have a bridge project that, for example, creates an IgniteFS RDD, similar to the JDBC or HDFS ones, from which we can extract blocks and populate RDD partitions. I'll post this proposal on your list.

Thanks
Devl

On Sat, Apr 11, 2015 at 9:28 AM, Reynold Xin wrote:
> Welcome, Dmitriy, to the Spark dev list!
>
> On Sat, Apr 11, 2015 at 1:14 AM, Dmitriy Setrakyan wrote:
>
> > Hello Everyone,
> >
> > I am one of the committers to Apache Ignite and have noticed some talks on
> > this dev list about integrating Ignite In-Memory File System (IgniteFS)
> > with Spark. We definitely like the idea. If you have any questions about
> > Apache Ignite at all, feel free to forward them to the Ignite dev list. We
> > are going to be monitoring this list as well.
> >
> > Ignite mailing list: dev-subscr...@ignite.incubator.apache.org
> >
> > Regards,
> > Dmitriy
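Purely as an illustration of that bridge idea, such a project might expose a custom RDD along the lines below. The IgniteFsClient calls are hypothetical placeholders, not a real Ignite API; only the Spark RDD extension points (getPartitions and compute) are real:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One RDD partition per file-system block (blockId is a placeholder notion).
    class IgniteFsPartition(val index: Int, val blockId: String) extends Partition

    // Hypothetical bridge RDD: list the blocks of a path when planning partitions,
    // read one block per partition at compute time. IgniteFsClient is a made-up
    // stand-in for whatever client API the bridge project would wrap.
    class IgniteFsRDD(sc: SparkContext, path: String) extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        IgniteFsClient.listBlocks(path).zipWithIndex.map {
          case (blockId, i) => new IgniteFsPartition(i, blockId): Partition
        }.toArray

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        IgniteFsClient.readLines(split.asInstanceOf[IgniteFsPartition].blockId)
    }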
Stop Master and Slaves without SSH
Hey All,

start-slaves.sh and stop-slaves.sh make use of SSH to connect to remote hosts. Are there alternative methods to do this without SSH? For example, using:

    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT

is fine, but there is no way to kill the Worker without using stop-slave(s).sh or using ps -ef and then kill. Are there alternatives available, such as Hadoop's hadoop-daemon.sh start|stop xyz?

I noticed spark-daemon.sh exists, but maybe we need to increase the documentation around it. For instance:

    Usage: spark-daemon.sh [--config <conf-dir>] (start|stop|status) <spark-command> <spark-instance-number> <args...>

What are the valid spark-commands? Can this be used to start and stop workers on the current node?

Many thanks
Devl
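For what it's worth, spark-daemon.sh can be driven directly on a node, without SSH, in the same way the start-slave scripts invoke it; a hedged sketch (the master URL and instance number are placeholders):

    # Start worker instance 1 on this node, pointing it at the master:
    ./sbin/spark-daemon.sh start org.apache.spark.deploy.worker.Worker 1 spark://MASTER_IP:7077

    # Later, stop the same instance, again locally and without SSH:
    ./sbin/spark-daemon.sh stop org.apache.spark.deploy.worker.Worker 1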
Recreating JIRA SPARK-8142
Hi All,

We are having some trouble with:

    sparkConf.set("spark.driver.userClassPathFirst", "true");
    sparkConf.set("spark.executor.userClassPathFirst", "true");

and would appreciate some independent verification. The issue comes down to this: Spark 1.3.1 built for Hadoop 2.6 is deployed on the cluster. In my application code I use Maven to bring in:

    hadoop-common 2.6.0    - provided
    hadoop-client 2.6.0    - provided
    hadoop-hdfs 2.6.0      - provided
    spark-sql_2.10         - provided
    spark-core_2.10        - provided
    hbase-client 1.1.0     - included/packaged
    hbase-protocol 1.1.0   - included/packaged
    hbase-server 1.1.0     - included/packaged

When I set the userClassPathFirst options to true I get a ClassCastException. Full details are in https://issues.apache.org/jira/browse/SPARK-8142

Can someone help verify this? I.e. if you have Spark 1.3.1, Hadoop and HBase, can you create a simple Spark job, say to read an HBase table into an RDD, then set the Spark and Hadoop dependencies above as "provided", set:

    sparkConf.set("spark.driver.userClassPathFirst", "true");
    sparkConf.set("spark.executor.userClassPathFirst", "true");

and repeat the job? Do you get the same exception as in the JIRA, or missing classes, or even run into https://issues.apache.org/jira/browse/SPARK-1867?

Please comment on the JIRA; it would be useful to have a second verification of this. I know the userClassPathFirst options are experimental, but it's good to know what's going on.

Cheers
Devl
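For anyone trying to reproduce this, a rough sbt equivalent of the dependency layout above (the email says Maven; the Scala suffixes and exact HBase artifact IDs below are assumptions based on the names listed, not taken from the actual build file):

    // Sketch only - adjust versions and artifact names to your build.
    libraryDependencies ++= Seq(
      "org.apache.spark"  % "spark-core_2.10" % "1.3.1" % "provided",
      "org.apache.spark"  % "spark-sql_2.10"  % "1.3.1" % "provided",
      "org.apache.hadoop" % "hadoop-common"   % "2.6.0" % "provided",
      "org.apache.hadoop" % "hadoop-client"   % "2.6.0" % "provided",
      "org.apache.hadoop" % "hadoop-hdfs"     % "2.6.0" % "provided",
      "org.apache.hbase"  % "hbase-client"    % "1.1.0",
      "org.apache.hbase"  % "hbase-protocol"  % "1.1.0",
      "org.apache.hbase"  % "hbase-server"    % "1.1.0"
    )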
SparkR installation not working
Hi All,

I've built Spark 1.5.0 with Hadoop 2.6 from a fresh download:

    build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package

When I try to run SparkR, it launches plain R without the Spark add-ons:

    ./bin/sparkR --master local[*]
    Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar

    R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
    Copyright (C) 2014 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    Natural language support but running in an English locale

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

    >

with no "Welcome to SparkR" banner. Also:

    > sc <- sparkR.init()
    Error: could not find function "sparkR.init"
    > sqlContext <- sparkRSQL.init(sc)
    Error: could not find function "sparkRSQL.init"

spark-shell and the other components are fine. Using Scala 2.10.6 and Java 1.8.0_45 on Ubuntu 15.04.

Please can anyone give me any pointers? Is there a Spark Maven profile I need to enable?

Thanks
Devl
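One thing worth checking (an assumption on my part, not verified against 1.5.0 specifically): when running from a plain source build rather than a binary distribution, the SparkR package itself may need to be built first with the script in the R directory of the source tree:

    # Hedged suggestion only - builds the SparkR package into R/lib, then retry sparkR:
    ./R/install-dev.sh
    ./bin/sparkR --master local[*]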
Spark Avro Generation
Hi,

So far I've been managing to build Spark from source, but since a change in spark-streaming-flume I have no idea how to generate the classes (e.g. SparkFlumeProtocol) from the Avro schema.

I have used sbt to run avro:generate (from the top-level Spark dir) but it produces nothing - it just says:

    > avro:generate
    [success] Total time: 0 s, completed Aug 11, 2014 12:26:49 PM

Please can someone send me their build.sbt, or just tell me how to build Spark so that all Avro files get generated as well?

Sorry for the noob question but I really have tried my best on this one!
Cheers
Re: Spark Avro Generation
Thanks very much, that helps - not having to generate the entire build.

On Mon, Aug 11, 2014 at 6:09 PM, Ron Gonzalez wrote:
> If you don't want to build the entire thing, you can also do
>
>     mvn generate-sources
>
> in externals/flume-sink.
>
> Thanks,
> Ron
>
> Sent from my iPhone
>
> > On Aug 11, 2014, at 8:32 AM, Hari Shreedharan wrote:
> >
> > Just running sbt compile or assembly should generate the sources.
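As a concrete sketch of the two suggestions above (the module path is assumed to be external/flume-sink in the Spark source tree, and the sbt launcher path may differ between Spark versions):

    # Full sbt compile from the top of the tree; generates the Avro sources as a side effect:
    sbt/sbt compile

    # Or with Maven, generate sources for the flume-sink module only:
    cd external/flume-sink
    mvn generate-sources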
Compile error with XML elements
When compiling the master checkout of Spark, the IntelliJ compile fails with:

    Error:(45, 8) not found: value $scope
           ^

which is caused by the HTML (XML literal) elements in classes like HistoryPage.scala:

    val content = <div> ... </div>

How can I compile these classes that have HTML node elements in them?

Thanks in advance.