Re: mLIb solving linear regression with sparse inputs
Thank you! Would you happen to have this code in Java? This is extremely helpful!

Iman

On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User List]" wrote:

Here's a way of creating sparse vectors in MLlib:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd = sc.textFile("A.txt").map(line => line.split(",")).
  map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))

val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))

val create = (first: (Int, Int, Double)) => (Array(first._2), Array(first._3))
val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double)) => (head._1 :+ tail._2, head._2 :+ tail._3)
val merge = (a: (Array[Int], Array[Double]), b: (Array[Int], Array[Double])) => (a._1 ++ b._1, a._2 ++ b._2)

val A = pairRdd.combineByKey(create, combine, merge).map(el => Vectors.sparse(3, el._2._1, el._2._2))

If you have a separate file of b's then you would need to manipulate this slightly to join the b's to the A RDD and then create LabeledPoints. I guess there is a way of doing this using the newer ML interfaces but it's not particularly obvious to me how.

One point: in the example you give, the b's are exactly the same as column 2 of the A matrix. I presume this is just a quick hacked-together example, because that would give a trivial result.

---
Robin East
Spark GraphX in Action  Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action

On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] wrote:

I would like to use it. But how do I do the following:
1) Read sparse data (from text or database)
2) Pass the sparse data to the LinearRegression class?

For example:

Sparse matrix A
row, column, value
0,0,.42
0,1,.28
0,2,.89
1,0,.83
1,1,.34
1,2,.42
2,0,.23
3,0,.42
3,1,.98
3,2,.88
4,0,.23
4,1,.36
4,2,.97

Sparse vector b
row, column, value
0,2,.89
1,2,.42
3,2,.88
4,2,.97

Solve Ax = b?
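A rough Java equivalent of the Scala above might look like the following. This is an untested sketch, not code from the thread: it reads the same i,j,value triples from "A.txt", assumes sc is a JavaSparkContext, and uses groupByKey plus an explicit sort of the column indices instead of combineByKey, purely for brevity.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

import scala.Tuple2;

// key each (row, col, value) triple of A by its row index
JavaPairRDD<Integer, Tuple2<Integer, Double>> entriesByRow = sc.textFile("A.txt")
    .mapToPair(line -> {
        String[] p = line.split(",");
        return new Tuple2<>(Integer.parseInt(p[0]),
            new Tuple2<>(Integer.parseInt(p[1]), Double.parseDouble(p[2])));
    });

// collapse each row's (col, value) pairs into one sparse vector of size 3
JavaPairRDD<Integer, Vector> aByRow = entriesByRow.groupByKey().mapValues(cols -> {
    List<Tuple2<Integer, Double>> sorted = new ArrayList<>();
    cols.forEach(sorted::add);
    sorted.sort((x, y) -> Integer.compare(x._1(), y._1())); // sparse vectors expect ascending indices
    int[] indices = new int[sorted.size()];
    double[] values = new double[sorted.size()];
    for (int k = 0; k < sorted.size(); k++) {
        indices[k] = sorted.get(k)._1();
        values[k] = sorted.get(k)._2();
    }
    return Vectors.sparse(3, indices, values);
});

// same RDD of sparse row vectors as the Scala version's A
JavaRDD<Vector> A = aByRow.values();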
Re: mLIb solving linear regression with sparse inputs
Hi Robin,

It looks like the linear regression model takes in a dataset, not a matrix? It would be helpful for this example if you could set up the whole problem end to end, using one of the columns of the matrix as b. So A is a sparse matrix and b is a sparse vector.

Best regards,
Iman
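One hedged way to go end to end with the RDD-based API (an untested sketch, not from the thread): carry over aByRow, the (row index -> sparse row of A) pair RDD from the Java sketch earlier in this thread, join it with the b values by row index, build LabeledPoints, and train with LinearRegressionWithSGD, which takes an RDD of LabeledPoint rather than a matrix. Rows of A with no matching b entry (row 2 in the example) are simply dropped by the join.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionModel;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;

import scala.Tuple2;

// b.txt holds (row, col, value) triples; keep only (row -> value) as the label
JavaPairRDD<Integer, Double> bByRow = sc.textFile("b.txt").mapToPair(line -> {
    String[] p = line.split(",");
    return new Tuple2<>(Integer.parseInt(p[0]), Double.parseDouble(p[2]));
});

// join labels and features on the row index: (row -> (label, sparse features))
JavaRDD<LabeledPoint> training = bByRow.join(aByRow)
    .map(pair -> new LabeledPoint(pair._2()._1(), pair._2()._2()))
    .cache();

// the RDD-based regression API takes labelled points, not a matrix
LinearRegressionModel model = LinearRegressionWithSGD.train(training.rdd(), 100);
System.out.println("weights: " + model.weights());

As Robin noted, with b taken verbatim from column 2 of A the problem is trivial (the exact solution is x = (0, 0, 1)), so a genuinely independent b would make a better test.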
Re: mLIb solving linear regression with sparse inputs
In Java as well, please. Thanks again!

Iman
TallSkinnyQR
I am getting the correct rows but they are out of order. Is this a bug or am I doing something wrong?

public class CoordinateMatrixDemo {

  public static void main(String[] args) {
    // boilerplate needed to run locally
    SparkConf conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SparkSession spark = SparkSession
        .builder()
        .appName("CoordinateMatrix")
        .getOrCreate()
        .newSession();
    run(sc, "Data/sparsematrix.txt");
  }

  private static void run(JavaSparkContext sc, String file) {
    // Read coordinate matrix from text or database
    JavaRDD<String> fileA = sc.textFile(file);

    // map text file with coordinate data (sparse matrix) to JavaRDD of MatrixEntry
    JavaRDD<MatrixEntry> matrixA = fileA.map(new Function<String, MatrixEntry>() {
      public MatrixEntry call(String x) {
        String[] indeceValue = x.split(",");
        long i = Long.parseLong(indeceValue[0]);
        long j = Long.parseLong(indeceValue[1]);
        double value = Double.parseDouble(indeceValue[2]);
        return new MatrixEntry(i, j, value);
      }
    });

    // coordinate matrix from sparse data
    CoordinateMatrix cooMatrixA = new CoordinateMatrix(matrixA.rdd());

    // create block matrix
    BlockMatrix matA = cooMatrixA.toBlockMatrix();

    // create block matrix after matrix multiplication (square matrix)
    BlockMatrix ata = matA.transpose().multiply(matA);

    // print out the original dense matrix
    System.out.println(matA.toLocalMatrix().toString());
    // print out the transpose of the dense matrix
    System.out.println(matA.transpose().toLocalMatrix().toString());
    // print out the square matrix (after multiplication)
    System.out.println(ata.toLocalMatrix().toString());

    JavaRDD<MatrixEntry> entries = ata.toCoordinateMatrix().entries().toJavaRDD();

    // QR decomposition DEMO
    // Convert it to an IndexedRowMatrix whose rows are sparse vectors.
    IndexedRowMatrix indexedRowMatrix = cooMatrixA.toIndexedRowMatrix();

    // Drop its row indices.
    RowMatrix rowMat = indexedRowMatrix.toRowMatrix();

    // QR decomposition
    QRDecomposition<RowMatrix, Matrix> result = rowMat.tallSkinnyQR(true);

    System.out.println("Q: " + result.Q().toBreeze().toString());
    System.out.println("R: " + result.R().toString());

    Vector[] collectPartitions = (Vector[]) result.Q().rows().collect();
    System.out.println("Q factor is:");
    for (Vector vector : collectPartitions) {
      System.out.println("\t" + vector);
    }

    // compute Qt
    // need to compute d = Qt*b where b is the experimental
    // then solve for d using Gaussian elimination

    // Extract Q values and create matrix
    // TODO: the array will be HUGE
    String Qm = result.Q().toBreeze().toString();
    String[] Qmatrix = Qm.split("\\s+");
    int rows = (int) result.Q().numRows();
    int cols = (int) result.Q().numCols();
    try {
      PrintWriter pw = new PrintWriter("Data/qMatrix.txt");
      pw.write(Qm);
      pw.close();
      PrintWriter pw1 = new PrintWriter("Data/qMatrix1.txt");
      // write coordinate matrix to file
      int k = 0;
      for (int i = 0; i < (int) result.Q().numRows(); i++) {
        for (int j = 0; j < (int) result.Q().numCols(); j++) {
          pw1.println(i + "," + j + "," + Qmatrix[k]);
          k++;
        }
      }
      pw1.close();
    } catch (FileNotFoundException e) {
      e.printStackTrace();
    }

    // Read coordinate matrix from text or database
    JavaRDD<String> fileQ = sc.textFile("Data/qMatrix1.txt");
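One hedged note, not from the original thread: a RowMatrix is documented as having no meaningful row indices, so the rows of Q collected from result.Q().rows() are not guaranteed to come back in the row order of A, which may be exactly the reordering seen here. On the related point of the Q array being huge, the sketch below (untested, with an assumed output directory "Data/qMatrixCoordinate") writes Q as (i, j, value) triples straight from the distributed rows instead of materialising the whole matrix on the driver with toBreeze(); the same ordering caveat applies to the indices it assigns.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;

// write Q as coordinate triples without collecting the whole matrix on the driver;
// the row index comes from zipWithIndex(), i.e. partition order, not the row order of A
JavaRDD<String> qEntries = result.Q().rows().toJavaRDD()
    .zipWithIndex()
    .flatMap(rowWithIndex -> {
        long i = rowWithIndex._2();
        double[] rowValues = rowWithIndex._1().toArray();
        List<String> lines = new ArrayList<>();
        for (int j = 0; j < rowValues.length; j++) {
            lines.add(i + "," + j + "," + rowValues[j]);
        }
        return lines.iterator();
    });
qEntries.saveAsTextFile("Data/qMatrixCoordinate");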
RDD flatmap to multiple key/value pairs
Here is a MapReduce example implemented in Java. It reads each line of text and, for each word in the line, determines whether it starts with an upper-case letter. If so, it creates a key-value pair.

public class CountUppercaseMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable lineNumber, Text line, Context context)
      throws IOException, InterruptedException {
    for (String word : line.toString().split(" ")) {
      if (Character.isUpperCase(word.charAt(0))) {
        context.write(new Text(word), new IntWritable(1));
      }
    }
  }
}

What is the equivalent Spark implementation?

A more use-case-specific example with objects is below. In this case, the mapper emits multiple key:value pairs that are (String, String). What is the equivalent Spark implementation?

import java.io.IOException;

public class IsotopeClusterMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    System.out.println("Inside Isotope Cluster Map !");
    String line = value.toString();
    // Get isotope clusters here and write out to text
    Detector detector = new Detector();
    ArrayList clusters = detector.GetClusters(line);
    for (int i = 0; i < clusters.size(); i++) {
      String cKey = detector.WriteClusterKey(clusters.get(i));
      String cValue = detector.WriteClusterValue(clusters.get(i));
      context.write(new Text(cKey), new Text(cValue));
    }
  }
}
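A hedged sketch of one Spark equivalent (not from the original thread; "input.txt" and sc, a JavaSparkContext, are assumed): flatMapToPair lets a single input line produce zero or more key-value pairs, which mirrors the multiple context.write() calls in the Hadoop mapper above.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

// each input line can emit zero or more (word, 1) pairs
JavaPairRDD<String, Integer> upperCaseCounts = sc.textFile("input.txt")
    .flatMapToPair(line -> {
        List<Tuple2<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                pairs.add(new Tuple2<>(word, 1));
            }
        }
        return pairs.iterator();
    })
    .reduceByKey((a, b) -> a + b); // optional: sums the 1s per word, like a reducer would

The same flatMapToPair shape would apply to the isotope-cluster mapper: build the Tuple2<String, String> pairs from WriteClusterKey and WriteClusterValue inside the lambda and return the iterator of pairs.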
Running spark from Eclipse and then Jar
Hello,

I have a simple word count example in Java and I can run it in Eclipse (code at the bottom). I then create a jar file from it and try to run it from the command line:

java -jar C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt

But I get the error below. I think the main error is:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: text

Any advice on how to run this jar file in Spark would be appreciated.

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/12/07 15:16:41 INFO SparkContext: Running Spark version 2.0.2
16/12/07 15:16:42 INFO SecurityManager: Changing view acls to: Owner
16/12/07 15:16:42 INFO SecurityManager: Changing modify acls to: Owner
16/12/07 15:16:42 INFO SecurityManager: Changing view acls groups to:
16/12/07 15:16:42 INFO SecurityManager: Changing modify acls groups to:
16/12/07 15:16:42 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(Owner); groups with view permissions: Set(); users with modify permissions: Set(Owner); groups with modify permissions: Set()
16/12/07 15:16:44 INFO Utils: Successfully started service 'sparkDriver' on port 10211.
16/12/07 15:16:44 INFO SparkEnv: Registering MapOutputTracker
16/12/07 15:16:44 INFO SparkEnv: Registering BlockManagerMaster
16/12/07 15:16:44 INFO DiskBlockManager: Created local directory at C:\Users\Owner\AppData\Local\Temp\blockmgr-b4b1960b-08fc-44fd-a75e-1a0450556873
16/12/07 15:16:44 INFO MemoryStore: MemoryStore started with capacity 1984.5 MB
16/12/07 15:16:45 INFO SparkEnv: Registering OutputCommitCoordinator
16/12/07 15:16:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/12/07 15:16:45 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.19.2:4040
16/12/07 15:16:45 INFO Executor: Starting executor ID driver on host localhost
16/12/07 15:16:45 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 10252.
16/12/07 15:16:45 INFO NettyBlockTransferService: Server created on 192.168.19.2:10252
16/12/07 15:16:45 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.19.2, 10252)
16/12/07 15:16:45 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.19.2:10252 with 1984.5 MB RAM, BlockManagerId(driver, 192.168.19.2, 10252)
16/12/07 15:16:45 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.19.2, 10252)
16/12/07 15:16:46 WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
16/12/07 15:16:46 INFO SharedState: Warehouse path is 'file:/C:/Users/Owner/spark-warehouse'.
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: text.
Please find packages at https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:148)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
    at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:504)
    at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:540)
    at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:513)
    at JavaWordCount.main(JavaWordCount.java:57)
Caused by: java.lang.ClassNotFoundException: text.DefaultSource
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5$$anonfun$apply$1.apply(DataSource.scala:132)
    at scala.util.Try$.apply(Try.scala:192)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$5.apply(DataSource.scala:132)
    at scala.util.Try.orElse(Try.scala:84)
    at org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:132)
    ... 8 more
16/12/07 15:16:46 INFO SparkContext: Invoking stop() from shutdown hook
16/12/07 15:16:46 INFO SparkUI: Stopped Spark web UI at http://192.168.19.2:4040
16/12/07 15:16:46 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/12/07 15:16:46 INFO MemoryStore: MemoryStore cleared
16/12/07
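A hedged note, not from the original thread: Spark's built-in "text" data source is registered through service files under META-INF/services in the spark-sql jar, so launching with a plain "java -jar" against a jar that does not carry the full Spark distribution on its classpath (or an uber-jar whose META-INF/services entries were not merged, for example via the Maven shade plugin's services transformer) can fail in exactly this way, falling back to the non-existent text.DefaultSource class seen in the stack trace. One way to sidestep it, assuming a local Spark 2.0.2 installation is available and JavaWordCount is the main class shown in the trace, is to launch through spark-submit so the distribution's jars are on the classpath:

spark-submit --class JavaWordCount --master local[*] C:\Users\Owner\Desktop\wordcount.jar Data/testfile.txt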
flatmap pair
The class 'Detector' has a function 'DetectFeature(clusters)'. However, the method has changed to return a list of features as opposed to the single feature it returns below. How do I change this so it emits a list of feature objects instead?

// creates key-value pairs of isotope cluster ID and isotope cluster,
// groups them by key and passes
// each collection of isotopes to the feature detector
JavaPairRDD<String, String> features = scans.flatMapToPair(keyData).groupByKey().mapValues(values -> {
    System.out.println("Inside Feature Detection Reduce !");
    Detector d = new Detector();
    ArrayList clusters = (ArrayList) StreamSupport
        .stream(values.spliterator(), false)
        .map(d::GetClustersfromKey)
        .collect(Collectors.toList());
    return d.WriteFeatureValue(d.DetectFeature(clusters));
});
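A hedged sketch of one way to do this with Spark 2.x's Java API, not from the original thread: switch from mapValues to flatMapValues, which takes a function returning an Iterable per key and emits one output record per element. DetectFeatures (the assumed name of the list-returning variant) and the Cluster and Feature element types are stand-ins for the Detector API, which is not shown above.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

import org.apache.spark.api.java.JavaPairRDD;

// one output record per detected feature, instead of one record per key
JavaPairRDD<String, String> features = scans.flatMapToPair(keyData).groupByKey()
    .flatMapValues(values -> {
        Detector d = new Detector();
        List<Cluster> clusters = StreamSupport
            .stream(values.spliterator(), false)
            .map(d::GetClustersfromKey)
            .collect(Collectors.toList());
        List<String> out = new ArrayList<>();
        // DetectFeatures is the assumed list-returning replacement for DetectFeature
        for (Feature feature : d.DetectFeatures(clusters)) {
            out.add(d.WriteFeatureValue(feature));
        }
        return out;
    });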