Thanks for all your feedback! I'm a little new to Scala/Spark, so hopefully you'll bear with me while I try to explain how I plan to go about this, and give me advice as to why this may or may not work. My terminology may be a little off as well. Any feedback would be greatly appreciated, even if it's just to tell me I'm doing it wrong :)
Below is a multilabel classification implementation attempt using the RandomForest algorithm. Hopefully you'll bear with the semi-pseudocode below. If it is unreadable, you can find the code here: https://www.refheap.com/100596

def magicMultiLabelClassification(): Unit = {
  val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
  val sc = new SparkContext(conf)

  val data = MLUtils.loadLibSVMFile(sc, "moo.txt")
  val splits = data.randomSplit(Array(0.7, 0.3))
  val (trainingData, testData) = (splits(0), splits(1))

  // Inputs for random forest
  val numClasses = 1 // Because I'm only training one class at a time.
  val categoricalFeaturesInfo = Map[Int, Int]()
  val numTrees = 3 // Use more in practice.
  val featureSubsetStrategy = "auto"
  val impurity = "gini"
  val maxDepth = 4
  val maxBins = 32

  // This gives me the data for each item I'm trying to classify in its own collection.
  val groupedData = trainingData.groupBy(d => d.label)

  // I then need to iterate over each key and get the values for each label (they
  // should be LabeledPoints), and train an individual model for each class
  // (hopefully I'm correct in calling it that here). I do this by mapping over each
  // key, getting the collection of LabeledPoints, and feeding it to the model. See below.
  val models = groupedData.map { d =>
    val trainingDataForClass = d._2
    // This is an org.apache.spark.rdd.RDD[Iterable[org.apache.spark.mllib.regression.LabeledPoint]]
    // (MapPartitionsRDD), as opposed to the
    // org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint]
    // (PartitionwiseSampledRDD) that trainClassifier expects, so I'm not very
    // confident it will work.
    RandomForest.trainClassifier(trainingDataForClass, numClasses, categoricalFeaturesInfo,
      numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)
  }

  // From here I'm not entirely sure how to proceed. My initial idea is to run predict
  // on each of the models and take only those with a good enough score.
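For comparison, here is a rough sketch of how I understand the one-versus-all idea could be done without nesting RDD operations: collect the distinct labels to the driver, relabel the full training set once per label, and train a binary model for each. The function names (trainOneVsRest, predictAll) and the hyperparameter values are just mine for illustration; corrections welcome.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.rdd.RDD

// Train one binary "this label vs. rest" RandomForest per distinct label.
// The labels are collect()-ed to the driver first, because trainClassifier
// cannot be called from inside another RDD's map().
def trainOneVsRest(trainingData: RDD[LabeledPoint]): Map[Double, RandomForestModel] = {
  val labels = trainingData.map(_.label).distinct().collect()
  labels.map { label =>
    // Relabel: 1.0 for the current class, 0.0 for everything else.
    val binaryData = trainingData.map { p =>
      LabeledPoint(if (p.label == label) 1.0 else 0.0, p.features)
    }
    val model = RandomForest.trainClassifier(
      binaryData,
      numClasses = 2, // binary: the current label vs. the rest
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 3,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 4,
      maxBins = 32)
    label -> model
  }.toMap
}

// models is a plain driver-side Map (not an RDD), so it can be closed over
// inside testData.map without nesting RDDs. Each test point gets one
// prediction per label's model.
def predictAll(models: Map[Double, RandomForestModel],
               testData: RDD[LabeledPoint]): RDD[(Double, Map[Double, Double])] =
  testData.map { p =>
    (p.label, models.map { case (label, m) => label -> m.predict(p.features) })
  }
```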
  testData.map { point =>
    models.map { m => (point.label, m.predict(point.features)) }
  }
}

On Tue, May 5, 2015 at 9:27 PM, Ulanov, Alexander <alexander.ula...@hp.com> wrote:
> If you are interested in multilabel (not multiclass), you might want to take
> a look at SPARK-7015: https://github.com/apache/spark/pull/5830/files. It is
> supposed to perform a one-versus-all transformation on classes, which is
> usually how multilabel classifiers are built.
>
> Alexander
>
> From: Joseph Bradley [mailto:jos...@databricks.com]
> Sent: Tuesday, May 05, 2015 3:44 PM
> To: DB Tsai
> Cc: peterg; user@spark.apache.org
> Subject: Re: Multilabel Classification in spark
>
> If you mean "multilabel" (predicting multiple label values), then MLlib does
> not yet support that. You would need to predict each label separately.
>
> If you mean "multiclass" (1 label taking >2 categorical values), then MLlib
> supports it via LogisticRegression (as DB said), as well as DecisionTree and
> RandomForest.
>
> Joseph
>
> On Tue, May 5, 2015 at 1:27 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
> LogisticRegression in the MLlib package supports multilabel classification.
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> Blog: https://www.dbtsai.com
>
> On Tue, May 5, 2015 at 1:13 PM, peterg <pe...@garbers.me> wrote:
>> Hi all,
>>
>> I'm looking to implement a multilabel classification algorithm, but I am
>> surprised to find that there aren't any in the spark-mllib core library.
>> Am I missing something? Would someone point me in the right direction?
>>
>> Thanks!
>>
>> Peter
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org