Hi Yanbo,

Appreciate the response. I might not have phrased this correctly, but what I really wanted to know is how to convert the PipelinedRDD into a DataFrame. I have seen the example you posted; however, I need to transform all of my data, not just one line. I did successfully use map to apply the ChiSqSelector and keep the chosen features of my data. I just want to convert the result to a DataFrame so I can apply a logistic regression model from spark.ml.
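Concretely, here is roughly what I am trying to end up with. This is only a sketch on my end, assuming model is the ChiSqSelectorModel fit as in your snippet, vectorizedTestPar is my RDD of LabeledPoint, and sc is the usual SparkContext:

from pyspark.mllib.regression import LabeledPoint
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Select features for the whole RDD on the driver (as in your example),
# rather than calling model.transform inside a map on the workers.
labels = vectorizedTestPar.map(lambda lp: lp.label)
selected = model.transform(vectorizedTestPar.map(lambda lp: lp.features))

# Zip the labels back with the selected features and build a DataFrame
# with label/features columns for spark.ml's LogisticRegression.
filteredData = labels.zip(selected).map(lambda x: LabeledPoint(x[0], x[1]))
filteredDataDF = sqlContext.createDataFrame(filteredData)

Does that look like a reasonable way to go from the PipelinedRDD to a DataFrame, or is there a better approach?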
Trust me, I would use the DataFrame API if I could, but the ChiSqSelector functionality is not available to me in the Python Spark 1.4 API.

Regards,
Tobi

On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote:

> Hi Tobi,
>
> The MLlib RDD-based API does support applying the transformation to both a
> Vector and an RDD, but you did not use it in the appropriate way.
> Suppose you have an RDD with a LabeledPoint in each line; you can refer to
> the following code snippet to train a ChiSqSelectorModel and do the
> transformation:
>
> from pyspark.mllib.linalg import SparseVector
> from pyspark.mllib.regression import LabeledPoint
> from pyspark.mllib.feature import ChiSqSelector
>
> data = [LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
>         LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
>         LabeledPoint(1.0, [0.0, 9.0, 8.0]),
>         LabeledPoint(2.0, [8.0, 9.0, 5.0])]
> rdd = sc.parallelize(data)
> model = ChiSqSelector(1).fit(rdd)
> filteredRDD = model.transform(rdd.map(lambda lp: lp.features))
> filteredRDD.collect()
>
> However, we strongly recommend you migrate to the DataFrame-based API,
> since the RDD-based API has been switched to maintenance mode.
>
> Thanks
> Yanbo
>
> 2016-07-14 13:23 GMT-07:00 Tobi Bosede <ani.to...@gmail.com>:
>
>> Hi everyone,
>>
>> I am trying to filter my features based on the spark.mllib ChiSqSelector.
>>
>> filteredData = vectorizedTestPar.map(lambda lp: LabeledPoint(lp.label,
>>     model.transform(lp.features)))
>>
>> However, when I do the following I get the error below. Is there any other
>> way to filter my data to avoid this error?
>>
>> filteredDataDF = filteredData.toDF()
>>
>> Exception: It appears that you are attempting to reference SparkContext
>> from a broadcast variable, action, or transformation. SparkContext can
>> only be used on the driver, not in code that it run on workers. For more
>> information, see SPARK-5063.
>>
>> I would directly use the spark.ml ChiSqSelector and work with DataFrames,
>> but I am on Spark 1.4 and using PySpark, so spark.ml's ChiSqSelector is
>> not available to me. filteredData is of type PipelinedRDD, if that helps.
>> It is not a regular RDD. I think that may be part of why calling toDF()
>> is not working.
>>
>> Thanks,
>>
>> Tobi