Makes sense :)

On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:
> Yes, perhaps we could use SQLTransformer as well.
>
> http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
>
> On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar <pralabhku...@gmail.com>
> wrote:
>
>> Hi Yan
>>
>> Yes, SQL is a good option, but if we have to create an ML Pipeline, then
>> having transformers and setting them as pipeline stages would be a better
>> option.
>>
>> Regards
>> Pralabh Kumar
>>
>> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>> wrote:
>>
>>> To filter data, how about using SQL?
>>>
>>> df.createOrReplaceTempView("df")
>>> val sqlDF = spark.sql("SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>>
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
>>>
>>> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com>
>>> wrote:
>>>
>>>> Hi Saatvik
>>>>
>>>> You can write your own transformer to make sure that the column contains
>>>> only the values you provided, and to filter out the rows that don't.
>>>>
>>>> Something like this:
>>>>
>>>> import org.apache.spark.ml.Transformer
>>>> import org.apache.spark.ml.param.ParamMap
>>>> import org.apache.spark.sql.{DataFrame, Dataset}
>>>> import org.apache.spark.sql.types.StructType
>>>>
>>>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>>>   // Keep only the rows whose col1 value is in the allowed set.
>>>>   // (Spark 2.x signature; Spark 1.x takes a DataFrame instead.)
>>>>   override def transform(dataset: Dataset[_]): DataFrame =
>>>>     dataset.toDF().filter("col1 in ('happy')")
>>>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>>>   // Filtering rows does not change the schema.
>>>>   override def transformSchema(schema: StructType): StructType = schema
>>>> }
>>>>
>>>> Usage:
>>>>
>>>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>>>> val trans = new CategoryTransformer("1")
>>>> data.show()
>>>> trans.transform(data).show()
>>>>
>>>> This transformer makes sure that col1 only ever contains the values you
>>>> specified.
>>>>
>>>> Regards
>>>> Pralabh Kumar
>>>>
>>>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <saatvikshah1...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Pralabh,
>>>>>
>>>>> I want the ability to create a column such that its values are
>>>>> restricted to a specific set of predefined values.
>>>>> For example, suppose I have a column called EMOTION: I want to ensure
>>>>> each row value is one of HAPPY, SAD, ANGRY, NEUTRAL, NA.
>>>>>
>>>>> Thanks and Regards,
>>>>> Saatvik Shah
>>>>>
>>>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <pralabhku...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Saatvik
>>>>>>
>>>>>> Can you please provide an example of what exactly you want?
>>>>>>
>>>>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Yan,
>>>>>>>
>>>>>>> Basically, the reason I was looking for the categorical datatype is
>>>>>>> as given here
>>>>>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>>>>>> the ability to fix column values to specific categories. Is it
>>>>>>> possible to create a user-defined data type which could do so?
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Saatvik Shah
>>>>>>>
>>>>>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You can use some Transformers to handle categorical data.
>>>>>>>> For example, StringIndexer encodes a string column of labels to a
>>>>>>>> column of label indices:
>>>>>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
>>>>>>>>
>>>>>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <saatvikshah1...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I'm trying to convert a Pandas dataframe to a Spark dataframe. One
>>>>>>>>> of the columns I have is of the Category type in Pandas, but there
>>>>>>>>> does not seem to be support for this type in Spark. What is the
>>>>>>>>> best alternative?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-alternative-for-Category-Type-in-Spark-Dataframe-tp28764.html
>>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> --
>>>>>>> *Saatvik Shah,*
>>>>>>> *1st Year,*
>>>>>>> *Masters in the School of Computer Science,*
>>>>>>> *Carnegie Mellon University*
>>>>>>>
>>>>>>> *https://saatvikshah1994.github.io/*
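
For anyone reading this thread later, here is a rough sketch of how the SQLTransformer suggestion could be combined with the StringIndexer suggestion in a single ML Pipeline. It assumes Spark 2.x on the classpath. The object name EmotionPipelineSketch, the EMOTION_INDEX output column, and the toy data are made up for illustration and are not from the thread; treat it as one possible arrangement, not the recommended one.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{SQLTransformer, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object name, used only for this sketch.
object EmotionPipelineSketch {
  // Allowed categories, taken from the EMOTION example in the thread.
  val allowed: Seq[String] = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")

  // Two stages: drop rows outside `allowed`, then index the remaining labels.
  def buildPipeline(): Pipeline = {
    val inList = allowed.map(v => s"'$v'").mkString(", ")
    // __THIS__ is SQLTransformer's placeholder for the input dataset.
    val keepAllowed = new SQLTransformer()
      .setStatement(s"SELECT * FROM __THIS__ WHERE EMOTION IN ($inList)")
    val indexer = new StringIndexer()
      .setInputCol("EMOTION")
      .setOutputCol("EMOTION_INDEX") // assumed output column name
    new Pipeline().setStages(Array[PipelineStage](keepAllowed, indexer))
  }

  // Fits and applies the pipeline on a small made-up DataFrame.
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // "CONFUSED" is not an allowed value, so the first stage filters it out.
    val df = Seq("HAPPY", "SAD", "CONFUSED", "NA").toDF("EMOTION")
    buildPipeline().fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("emotion-pipeline-sketch")
      .getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```

Because the filter is itself a pipeline stage, the StringIndexer is only ever fitted on, and applied to, rows that passed the check. Note that this silently drops invalid rows rather than rejecting them, which is weaker than what a Pandas Category column enforces.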