Hi Yanbo,

Appreciate the response. I might not have phrased this correctly, but what I really wanted to know is how to convert the PipelinedRDD into a DataFrame. I have seen the example you posted; however, I need to transform all of my data, not just one line. I did successfully use map to apply the ChiSqSelector and keep the chosen features of my data. I just want to convert the result to a DataFrame so I can apply a logistic regression model from spark.ml.
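Concretely, here is roughly what I am trying to end up with. This is only a sketch on my end, assuming model is the ChiSqSelectorModel fit as in your snippet, vectorizedTestPar is my RDD of LabeledPoint, and sc is the usual SparkContext:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Select features for the whole RDD on the driver (as in your example),
# rather than calling model.transform inside a map on the workers.
labels = vectorizedTestPar.map(lambda lp: lp.label)
selected = model.transform(vectorizedTestPar.map(lambda lp: lp.features))

# Zip the labels back with the selected features and build a DataFrame
# with label/features columns for spark.ml's LogisticRegression.
filteredData = labels.zip(selected).map(lambda x: LabeledPoint(x[0], x[1]))
filteredDataDF = sqlContext.createDataFrame(filteredData)

Does that look like a reasonable way to go from the PipelinedRDD to a DataFrame, or is there a better approach?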
Trust me, I would use the DataFrame API if I could, but the ChiSqSelector functionality is not available to me in the Python Spark 1.4 API.

Regards,
Tobi

On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote:

> Hi Tobi,
>
> The MLlib RDD-based API does support applying the transformation to both a
> Vector and an RDD, but you did not use it in the appropriate way.
> Suppose you have an RDD with a LabeledPoint in each line; you can refer to
> the following code snippet to train a ChiSqSelectorModel and do the
> transformation:
>
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.feature import ChiSqSelector
>
> data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
>         LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
>         LabeledPoint(1.0, [0.0, 9.0, 8.0]),
>         LabeledPoint(2.0, [8.0, 9.0, 5.0])]
> rdd = sc.parallelize(data)
> model = ChiSqSelector(1).fit(rdd)
> filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
> filteredRDD.collect()
>
> However, we strongly recommend you migrate to the DataFrame-based API,
> since the RDD-based API has been switched to maintenance mode.
>
> Thanks
> Yanbo
>
> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
>
>> Hi everyone,
>>
>> I am trying to filter my features based on the spark.mllib ChiSqSelector.
>>
>> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>>     model.transform(lp.features)))
>>
>> However, when I do the following I get the error below. Is there any other
>> way to filter my data to avoid this error?
>>
>> filteredDataDF = filteredData.toDF()
>>
>> Exception: It appears that you are attempting to reference SparkContext
>> from a broadcast variable, action, or transformation. SparkContext can
>> only be used on the driver, not in code that it run on workers. For more
>> information, see SPARK-5063.
>>
>> I would directly use the spark.ml ChiSqSelector and work with DataFrames,
>> but I am on Spark 1.4 and using PySpark, so spark.ml's ChiSqSelector is
>> not available to me. filteredData is of type PipelinedRDD, if that helps.
>> It is not a regular RDD. I think that may be part of why calling toDF()
>> is not working.
>>
>> Thanks,
>>
>> Tobi