Re: Best alternative for Category Type in Spark Dataframe

Pralabh Kumar Sat, 17 Jun 2017 20:16:34 -0700

Makes sense :)

On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai) <facai....@gmail.com> wrote:


> Yes, perhaps we could use SQLTransformer as well.
>
> http://spark.apache.org/docs/latest/ml-features.html#sqltransformer
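
For reference, a minimal sketch of the SQLTransformer approach, assuming a local SparkSession and made-up sample rows (the app name, the sample data and the value names are illustrative, not from this thread). `__THIS__` is SQLTransformer's placeholder for the input dataset.

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("category-sqltransformer")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; "LOUD" is not one of the allowed categories.
val df = Seq("HAPPY", "SAD", "LOUD").toDF("EMOTION")

// Keep only the rows whose EMOTION is in the allowed set.
val categoryFilter = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")

val filtered = categoryFilter.transform(df)
filtered.show()
```

Because SQLTransformer is itself a Transformer, it can be used as a pipeline stage without writing a custom class.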
>
> On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar <pralabhku...@gmail.com>
> wrote:
>
>> Hi Yan
>>
>> Yes, SQL is a good option, but if we have to create an ML Pipeline, then
>> having transformers and setting them as pipeline stages would be a better
>> option.
>>
>> Regards
>> Pralabh Kumar
>>
>> On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>> wrote:
>>
>>> To filter data, how about using SQL?
>>>
>>> df.createOrReplaceTempView("df")
>>> val sqlDF = spark.sql(
>>>   "SELECT * FROM df WHERE EMOTION IN ('HAPPY','SAD','ANGRY','NEUTRAL','NA')")
>>>
>>> https://spark.apache.org/docs/latest/sql-programming-guide.html#sql
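
The same filter can also be written with the DataFrame API instead of a SQL string. A minimal sketch, assuming a local SparkSession and made-up sample rows (the names `allowed`, `valid` and `invalid` are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("category-isin")
  .getOrCreate()
import spark.implicits._

// The allowed categories, kept in one place so they can be reused.
val allowed = Seq("HAPPY", "SAD", "ANGRY", "NEUTRAL", "NA")

// Hypothetical sample data; "LOUD" is not an allowed category.
val df = Seq("HAPPY", "ANGRY", "LOUD").toDF("EMOTION")

// Rows that satisfy the restriction, and rows that violate it.
// Note: rows with a null EMOTION would match neither filter.
val valid = df.filter(col("EMOTION").isin(allowed: _*))
val invalid = df.filter(!col("EMOTION").isin(allowed: _*))

valid.show()
invalid.show()
```

Keeping the violating rows in a separate DataFrame makes it possible to log or inspect them rather than dropping them silently.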
>>>
>>>
>>>
>>> On Fri, Jun 16, 2017 at 11:28 PM, Pralabh Kumar <pralabhku...@gmail.com>
>>> wrote:
>>>
>>>> Hi Saatvik
>>>>
>>>> You can write your own transformer to make sure that the column contains
>>>> only the values you provide, and to filter out rows that don't.
>>>>
>>>> Something like this
>>>>
>>>>
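>>>> import org.apache.spark.annotation.DeveloperApi
>>>> import org.apache.spark.ml.Transformer
>>>> import org.apache.spark.ml.param.ParamMap
>>>> import org.apache.spark.sql.{DataFrame, Dataset}
>>>> import org.apache.spark.sql.types.StructType
>>>>
>>>> case class CategoryTransformer(override val uid: String) extends Transformer {
>>>>   // Keep only the rows whose col1 value is an allowed category.
>>>>   override def transform(dataset: Dataset[_]): DataFrame =
>>>>     dataset.toDF().filter("col1 in ('happy')")
>>>>
>>>>   // The transformer has no params, so the default copy is sufficient.
>>>>   override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
>>>>
>>>>   // Filtering rows leaves the schema unchanged.
>>>>   @DeveloperApi
>>>>   override def transformSchema(schema: StructType): StructType = schema
>>>> }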
>>>>
>>>>
>>>> Usage
>>>>
>>>> val data = sc.parallelize(List("abce","happy")).toDF("col1")
>>>> val trans = new CategoryTransformer("1")
>>>> data.show()
>>>> trans.transform(data).show()
>>>>
>>>>
>>>> This transformer makes sure that col1 only ever contains the values
>>>> you provided.
>>>>
>>>>
>>>> Regards
>>>> Pralabh Kumar
>>>>
>>>> On Fri, Jun 16, 2017 at 8:10 PM, Saatvik Shah <
>>>> saatvikshah1...@gmail.com> wrote:
>>>>
>>>>> Hi Pralabh,
>>>>>
>>>>> I want the ability to create a column such that its values be
>>>>> restricted to a specific set of predefined values.
>>>>> For example, suppose I have a column called EMOTION: I want to ensure
>>>>> each row value is one of HAPPY,SAD,ANGRY,NEUTRAL,NA.
>>>>>
>>>>> Thanks and Regards,
>>>>> Saatvik Shah
>>>>>
>>>>>
>>>>> On Fri, Jun 16, 2017 at 10:30 AM, Pralabh Kumar <
>>>>> pralabhku...@gmail.com> wrote:
>>>>>
>>>>>> Hi Saatvik
>>>>>>
>>>>>> Can you please provide an example of what exactly you want?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 16-Jun-2017 7:40 PM, "Saatvik Shah" <saatvikshah1...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Yan,
>>>>>>>
>>>>>>> Basically the reason I was looking for the categorical datatype is
>>>>>>> as given here
>>>>>>> <https://pandas.pydata.org/pandas-docs/stable/categorical.html>:
>>>>>>> ability to fix column values to specific categories. Is it possible to
>>>>>>> create a user defined data type which could do so?
>>>>>>>
>>>>>>> Thanks and Regards,
>>>>>>> Saatvik Shah
>>>>>>>
>>>>>>> On Fri, Jun 16, 2017 at 1:42 AM, 颜发才(Yan Facai) <facai....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You can use Transformers to handle categorical data. For example,
>>>>>>>> StringIndexer encodes a string column of labels to a column of
>>>>>>>> label indices:
>>>>>>>> http://spark.apache.org/docs/latest/ml-features.html#stringindexer
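
A minimal sketch of the StringIndexer suggestion, assuming a local SparkSession and made-up sample rows (the output column name `EMOTION_INDEX` and the sample data are illustrative, not from this thread):

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Local session for illustration only; master and app name are assumptions.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("category-stringindexer")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data; "HAPPY" is the most frequent label.
val df = Seq("HAPPY", "SAD", "HAPPY", "NA").toDF("EMOTION")

// fit() learns the set of labels; transform() maps each label to a
// double-valued index, with the most frequent label getting index 0.0.
val indexer = new StringIndexer()
  .setInputCol("EMOTION")
  .setOutputCol("EMOTION_INDEX")

val model = indexer.fit(df)
val indexed = model.transform(df)

indexed.show()
println(model.labels.mkString(", "))
```

The fitted model only knows the labels it saw during `fit`. With the default `handleInvalid` setting ("error"), transforming a row with an unseen label raises an exception, which is loosely similar to a fixed category set, although the column itself remains an ordinary string column.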
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jun 15, 2017 at 10:19 PM, saatvikshah1994 <
>>>>>>>> saatvikshah1...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> I'm trying to convert a Pandas -> Spark dataframe. One of the
>>>>>>>>> columns I have
>>>>>>>>> is of the Category type in Pandas. But there does not seem to be
>>>>>>>>> support for
>>>>>>>>> this same type in Spark. What is the best alternative?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Saatvik Shah,*
>>>>>>> *1st  Year,*
>>>>>>> *Masters in the School of Computer Science,*
>>>>>>> *Carnegie Mellon University*
>>>>>>>
>>>>>>> *https://saatvikshah1994.github.io/
>>>>>>> <https://saatvikshah1994.github.io/>*
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
