Up to Spark 1.3, you have to prepare the DataFrame appropriately before aggregating, by adding the condition as an extra column:
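from pyspark.sql import Row

# Add a helper column col4 that flags rows where col2 > 100.
def setupCondition(t):
    if t[1] > 100:
        v = 1
    else:
        v = 0
    return Row(col1=t[0], col2=t[1], col3=t[2], col4=v)

d1 = [[1001, 100, 50], [1001, 200, 100], [1002, 100, 99]]
d1RDD = sc.parallelize(d1).map(setupCondition)
d1DF = ssc.createDataFrame(d1RDD)  # ssc is assumed to be a SQLContext
d1DF.printSchema()
d1DF.show()

# Summing the 0/1 flag counts the rows that met the condition.
res = d1DF.groupBy("col1").agg({'col3': 'min', 'col4': 'sum'})
print("\n\n")
res.show()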
root
 |-- col1: long (nullable = true)
 |-- col2: long (nullable = true)
 |-- col3: long (nullable = true)
 |-- col4: long (nullable = true)

col1 col2 col3 col4
1001 100  50   0
1001 200  100  1
1002 100  99   0

col1 SUM(col4) MIN(col3)
1001 1         50
1002 0         99
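
The aggregated numbers can be sanity-checked without Spark. Below is a minimal sketch in plain Python that reproduces the same conditional sum and minimum per col1 on the same three rows. The function name conditional_agg and its threshold parameter are made up for illustration and are not part of any Spark API.

```python
from collections import defaultdict

# Same rows as d1: [col1, col2, col3]
rows = [[1001, 100, 50], [1001, 200, 100], [1002, 100, 99]]

def conditional_agg(rows, threshold=100):
    """Per col1: (number of rows with col2 > threshold, min of col3)."""
    mins = {}
    flags = defaultdict(int)
    for col1, col2, col3 in rows:
        mins[col1] = min(col3, mins.get(col1, col3))
        flags[col1] += 1 if col2 > threshold else 0
    return {key: (flags[key], mins[key]) for key in mins}

result = conditional_agg(rows)
for key in sorted(result):
    print(key, result[key][0], result[key][1])
```

Each printed line has the same three values as a row of the SUM(col4) / MIN(col3) table.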
The good news is that since 1.4, the DataFrame API has functions like when and
otherwise (and a LOT more)... can't wait to get my hands on 1.4 :)
On Wed, May 27, 2015 at 5:12 PM, Masf <[email protected]> wrote:
> Yes. I think that your SQL solution will work, but I was looking for a
> solution with the DataFrame API. I had thought of using a UDF such as:
>
> val myFunc = udf {(input: Int) => {if (input > 100) 1 else 0}}
>
> I'd still like to know whether it's possible to do it directly in the
> aggregation, by inserting a lambda function or something similar.
>
> Thanks!!!!
>
> Regards.
> Miguel.
>
>
> On Wed, May 27, 2015 at 1:06 AM, ayan guha <[email protected]> wrote:
>
>> For this, I can give you a SQL solution:
>>
>> joinedData.registerTempTable('j')
>>
>> res = ssc.sql('select col1, col2, count(1) counter, min(col3) minimum, '
>>               'sum(case when endrscp > 100 then 1 else 0 end) test '
>>               'from j group by col1, col2')
>>
>> Let me know if this works.
>> On 26 May 2015 23:47, "Masf" <[email protected]> wrote:
>>
>>> Hi
>>> I don't know how it works. For example:
>>>
>>> val result = joinedData.groupBy("col1", "col2").agg(
>>>   count(lit(1)).as("counter"),
>>>   min("col3").as("minimum"),
>>>   sum("case when endrscp> 100 then 1 else 0 end").as("test")
>>> )
>>>
>>> How can I do it?
>>>
>>> Thanks!!!!
>>> Regards.
>>> Miguel.
>>>
>>> On Tue, May 26, 2015 at 12:35 AM, ayan guha <[email protected]> wrote:
>>>
>>>> Case when col2>100 then 1 else col2 end
>>>> On 26 May 2015 00:25, "Masf" <[email protected]> wrote:
>>>>
>>>>> Hi.
>>>>>
>>>>> In a DataFrame, how can I execute a conditional expression in an
>>>>> aggregation? For example, can I translate this SQL statement to the
>>>>> DataFrame API?
>>>>>
>>>>> SELECT name, SUM(CASE WHEN table.col2 > 100 THEN 1 ELSE table.col1 END)
>>>>> FROM table
>>>>> GROUP BY name
>>>>>
>>>>> Thanks!!!!
>>>>> --
>>>>> Regards.
>>>>> Miguel
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> Regards.
>>> Miguel Ángel
>>>
>>
>
>
> --
>
>
> Regards.
> Miguel Ángel
>
--
Best Regards,
Ayan Guha